An extractor pattern lets you pinpoint the snippets of data you want extracted from a web page. It is a block of text (usually HTML) containing special tokens that match the pieces of data you're interested in extracting. Each token is a text label surrounded by the delimiters ~@ and @~ (e.g. ~@NAME@~). The identifier between the delimiters can contain only alphanumeric characters and underscores.
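To make the token syntax concrete, here is a minimal Python sketch that locates tokens in a pattern using the rules described above. It is only an illustration: the function name `find_tokens` and the sample pattern are made up for this example, and this is not how the product itself parses patterns.

```python
import re
from typing import List

# Token syntax as described above: ~@IDENTIFIER@~, where the identifier
# contains only alphanumeric characters and underscores.
TOKEN_RE = re.compile(r"~@([A-Za-z0-9_]+)@~")


def find_tokens(pattern_text: str) -> List[str]:
    """Return the token identifiers found in a pattern, in order (hypothetical helper)."""
    return TOKEN_RE.findall(pattern_text)


# Hypothetical pattern text, invented for illustration.
example_pattern = "<td>~@NAME@~</td><td>~@PRICE_USD@~</td>"
print(find_tokens(example_pattern))  # ['NAME', 'PRICE_USD']
```

A label containing any other character, such as a hyphen, does not satisfy this expression and so is not recognized as a token in the sketch.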
Extractor patterns are added to scrapeable files under the extractor patterns tab.
You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. When you place a stencil over a piece of paper, apply paint to it, and then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page. The tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied, it reveals only the portions of the web page you'd like to extract.
Extractor tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:
you would extract the piece of text by creating an extractor pattern with a token positioned like so:
The extracted text could then be accessed via the identifier EXTRACTED_TEXT.
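The stencil idea can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the product's actual matching engine: it treats the literal text of a pattern as fixed and lets each token capture whatever sits in its place. The helper names (`compile_pattern`, `extract`) and the sample HTML and pattern are hypothetical and invented for this example.

```python
import re

TOKEN_RE = re.compile(r"~@([A-Za-z0-9_]+)@~")


def compile_pattern(pattern_text):
    """Turn a token pattern into (compiled regex, list of token names)."""
    parts, names, pos = [], [], 0
    for m in TOKEN_RE.finditer(pattern_text):
        # Literal text between tokens must match exactly.
        parts.append(re.escape(pattern_text[pos:m.start()]))
        # Assumption: a token captures as little text as possible.
        parts.append("(.*?)")
        names.append(m.group(1))
        pos = m.end()
    parts.append(re.escape(pattern_text[pos:]))
    return re.compile("".join(parts), re.DOTALL), names


def extract(pattern_text, html):
    """Return {token name: captured text} for the first match, or None."""
    regex, names = compile_pattern(pattern_text)
    m = regex.search(html)
    return dict(zip(names, m.groups())) if m else None


# Hypothetical input, for illustration only.
html = "<p>Some text worth capturing.</p>"
pattern = "<p>~@EXTRACTED_TEXT@~</p>"
print(extract(pattern, html))  # {'EXTRACTED_TEXT': 'Some text worth capturing.'}
```

In this sketch the captured value is looked up by its identifier, mirroring how the extracted text is accessed via EXTRACTED_TEXT.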
If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.
If an extractor pattern takes too long to match a block of text, it will time out. The timeout setting may be adjusted from the general tab of the Settings located in the menu. If you find that your extractor pattern is timing out, you might try adjusting it by using more precise regular expressions.
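As a rough illustration of what "more precise" can mean, the Python sketch below compares a permissive expression with a tighter one on a made-up table row. The sample HTML and both expressions are assumptions chosen for this example. The tighter version cannot wander across tag boundaries, so it has far fewer ways to attempt a match and rejects unsuitable text quickly.

```python
import re

# Hypothetical HTML, for illustration only.
html = "<td>Widget</td><td>19.99</td>"

# Permissive: each capture may match anything, including other tags.
loose = re.compile(r"<td>(.*?)</td><td>(.*?)</td>", re.DOTALL)

# More precise: the first capture cannot contain angle brackets and
# the second must look like a price with two decimal places.
tight = re.compile(r"<td>([^<>]*)</td><td>(\d+\.\d{2})</td>")

print(loose.search(html).groups())  # ('Widget', '19.99')
print(tight.search(html).groups())  # ('Widget', '19.99')

# The precise expression refuses a row that is not a price.
print(tight.search("<td>Widget</td><td>n/a</td>"))  # None
```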