Stripped HTML parsers

This parser extracts data from a single element of an HTML document. You pick the element by specifying a “CSS Selector”, which can be any valid selector that the QueryPath library understands (QueryPath selectors work much like jQuery selectors). Common examples are selecting an element by its ID (e.g. “#my_element_id”) or by its class (e.g. “.my_element_class”). You can also combine selectors by separating them with a space: “#my_element_id .my_element_class” matches elements with that class that are nested inside the element with that ID.
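The selector rules above can be illustrated with a small sketch. This is not QueryPath: it is a simplified matcher written with Python's standard library, and the helper names `matches` and `select`, as well as the sample markup, are assumptions made for the illustration. It only understands “#id”, “.class” and plain tag names, and it requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Sample markup (hypothetical), kept well-formed so the XML parser accepts it.
PAGE = """
<html><body>
  <div id="my_element_id">
    <p class="my_element_class">inside</p>
  </div>
  <p class="my_element_class">outside</p>
</body></html>
""".strip()


def matches(element, simple_selector):
    """Check one selector part: '#id', '.class', or a plain tag name."""
    if simple_selector.startswith("#"):
        return element.get("id") == simple_selector[1:]
    if simple_selector.startswith("."):
        return simple_selector[1:] in element.get("class", "").split()
    return element.tag == simple_selector


def select(root, selector):
    """Resolve a space-separated selector; each space means 'nested inside'."""
    parts = selector.split()
    current = [el for el in root.iter() if matches(el, parts[0])]
    for part in parts[1:]:
        found = []
        for scope in current:
            for el in scope.iter():
                # iter() also yields the scope element itself, so skip it.
                if el is not scope and matches(el, part) and el not in found:
                    found.append(el)
        current = found
    return current


root = ET.fromstring(PAGE)
by_id = select(root, "#my_element_id")        # the single <div>
by_class = select(root, ".my_element_class")  # both <p> elements
combined = select(root, "#my_element_id .my_element_class")  # only the nested <p>
```

Note how the combined selector narrows the result to the paragraph inside the div, while the class selector on its own matches both paragraphs.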

After selecting an element, choose a “Parser Mode” to control which data is retrieved from it. There are three modes:

  • Attribute: This selects the value of a given attribute, such as “href” or “src”. Choosing this mode reveals an additional text field where you enter the name of the attribute.
  • Inner HTML: This selects the HTML content inside the element, without the element’s own tag. For example, given “<div id='my_div'><p>hello world</p></div>”, you would get only “<p>hello world</p>”.
  • Outer HTML: This selects the HTML content including the element’s own tag. For the same example, you would get the whole “<div id='my_div'><p>hello world</p></div>”.
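To make the difference between the three modes concrete, here is a minimal sketch using Python's standard library in place of QueryPath. The function names `attribute`, `inner_html` and `outer_html` are hypothetical and only mirror the mode names; the snippet is the div example from the list, and the sketch assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

SNIPPET = "<div id='my_div'><p>hello world</p></div>"


def attribute(element, name):
    """Attribute mode: the value of one named attribute (None if absent)."""
    return element.get(name)


def outer_html(element):
    """Outer HTML mode: the element's own tag plus everything inside it."""
    return ET.tostring(element, encoding="unicode")


def inner_html(element):
    """Inner HTML mode: leading text and child markup, without the element's tag."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in element)
    return (element.text or "") + children


div = ET.fromstring(SNIPPET)

attr_value = attribute(div, "id")  # 'my_div'
inner = inner_html(div)            # '<p>hello world</p>'
outer = outer_html(div)            # '<div id="my_div"><p>hello world</p></div>'
```

The serializer normalizes attribute quotes to double quotes, so the outer result is equivalent to the input rather than byte-for-byte identical.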

There is also a “Character Encoding” setting. The “auto” option usually works, but if garbled characters appear in your import, change this setting to match the character encoding of your source data.
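The garbled characters mentioned above typically come from decoding bytes with the wrong encoding. The following sketch reproduces the effect in Python; the sample word is an assumption chosen for the illustration, and the parser's own detection logic is not shown here.

```python
# Bytes as they might arrive from a UTF-8 encoded source page.
raw = "Français".encode("utf-8")

# Decoding with the wrong encoding turns each accented letter into two characters.
wrong = raw.decode("latin-1")  # 'FranÃ§ais'

# Decoding with the encoding the source actually uses restores the text.
right = raw.decode("utf-8")    # 'Français'
```

If your import shows sequences like “Ã§” or “Ã©” where accented letters should be, the source is most likely UTF-8 being read as a single-byte encoding.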

Stripped HTML parsers automatically populate both an English and a French field: the same parsing logic is applied to the English URL and to the French URL in your sitecopy page object. This means you do not need Language Aware mappers (see the section on Mappers) to get language-aware properties into your object.
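The idea of running one rule against both language versions can be sketched as follows. Everything here is an assumption made for the illustration: the in-memory `PAGES` dictionary stands in for the content behind the English and French URLs, and `extract_title` and `parse_both_languages` are hypothetical names, not part of the tool.

```python
import xml.etree.ElementTree as ET

# Stand-ins for the markup behind the English and French URLs.
PAGES = {
    "en": "<div><h1 id='title'>Welcome</h1></div>",
    "fr": "<div><h1 id='title'>Bienvenue</h1></div>",
}


def extract_title(markup):
    """One extraction rule: the text of the element whose id is 'title'."""
    root = ET.fromstring(markup)
    for element in root.iter():
        if element.get("id") == "title":
            return element.text
    return None


def parse_both_languages(pages, rule):
    """Apply the same rule to every language version and key results by language."""
    return {language: rule(markup) for language, markup in pages.items()}


fields = parse_both_languages(PAGES, extract_title)
# fields == {'en': 'Welcome', 'fr': 'Bienvenue'}
```

Because the rule is defined once and reused, both language fields stay in sync without any per-language configuration.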
