Using Extractor Patterns

Overview

Extractor patterns allow you to pinpoint the snippets of data that you want extracted from a web page. An extractor pattern is a block of text (usually HTML) containing special tokens that match the pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~ (e.g. ~@EXTRACTED_TEXT@~). The identifier between the delimiters can contain only alphanumeric characters and underscores.

Extractor patterns are added to scrapeable files under the extractor patterns tab.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page. The tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals only the portions of the web page you'd like to extract.

Extractor tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:

<p>This is the <b>piece of text</b> I'm interested in.</p>

you would extract piece of text by creating an extractor pattern with a token positioned like so:

<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>

The extracted text could then be accessed via the identifier EXTRACTED_TEXT.
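To make the token mechanics concrete, here is a minimal Python sketch of one way a pattern like the one above could be applied. It is not how the tool itself is implemented: the helper name pattern_to_regex is hypothetical, the opening delimiter is assumed to be ~@ (the mirror of @~), and each token is assumed to behave like a non-greedy "match anything" capture, with the rest of the pattern matched literally.

```python
import re

# Assumed token syntax: ~@IDENTIFIER@~, where IDENTIFIER is made of
# alphanumeric characters and underscores.
TOKEN = re.compile(r"~@(\w+)@~")


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical helper: turn an extractor pattern into a regex.

    Literal text is escaped; each token becomes a named, non-greedy
    capture group. Python group names must be valid identifiers, so a
    token name starting with a digit would not work in this sketch.
    """
    parts = []
    pos = 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text before the token
        parts.append(f"(?P<{m.group(1)}>.*?)")           # the "hole" in the stencil
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))               # literal text after the last token
    return re.compile("".join(parts), re.DOTALL)


html = "<p>This is the <b>piece of text</b> I'm interested in.</p>"
pattern = "<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>"

match = pattern_to_regex(pattern).search(html)
extracted = match.groupdict() if match else {}
print(extracted)
```

In this sketch the extracted value is looked up by the token's identifier, which mirrors accessing the text via EXTRACTED_TEXT.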

If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.

Tips/Suggestions

  • Test your patterns frequently. Extractor patterns take some practice, so test them as you work, especially when you're first trying them out. It often helps to test a pattern after every couple of tokens you insert.
  • Use regular expressions to make your extractor patterns more precise. One of the most common problems is an extractor pattern that matches too much data, usually including a lot of HTML. There are a couple of ways to address this problem. One is to extend the pattern outward, that is, to include HTML that falls before and after the block you're trying to match. The second approach, which is generally the easier of the two, is to include regular expressions. We've included a number of common regular expressions that you can select from the drop-down list. In general, the more precise your regular expressions are, the less HTML you need in the extractor pattern. Doing so makes your patterns more resilient to changes that might be made to the web site you're scraping.

    If an extractor pattern takes too long to match a block of text it will time out. The timeout setting may be adjusted from the general tab of the Settings located in the Options menu. If you find that your extractor pattern is timing out, try using more precise regular expressions.

  • Ensure that the pattern extracts the number of data records you expect it to. Oftentimes your pattern might not be as flexible as you think it is. Test it out to make sure it matches as many times as you think it should.
  • Try tidying the HTML. This will ensure that white space is handled consistently and will often clean up extraneous characters. The setting that determines whether or not HTML gets tidied is adjusted under the advanced tab of the scrapeable file.
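
The tips on precise regular expressions and on checking the record count can be illustrated with a short Python sketch. Python's re module is used here only as a stand-in; the tool's own regular expression options and drop-down entries are not reproduced, and the sample HTML and the PRICE token name are made up for illustration. A loose "match anything" token swallows neighbouring HTML, while a token restricted to non-tag characters captures only the intended cell.

```python
import re

# Hypothetical page fragment with two data records.
html = (
    "<table>"
    "<tr><td>Widget</td><td>$4.50</td></tr>"
    "<tr><td>Gadget</td><td>$12.00</td></tr>"
    "</table>"
)

# Loose token: anything at all, so the match starts at the first <td>
# in each row and drags the intervening HTML along with it.
loose = re.compile(r"<td>(?P<PRICE>.*?)</td></tr>", re.DOTALL)

# Precise token: any run of characters that are not angle brackets,
# so the captured value cannot contain tags.
precise = re.compile(r"<td>(?P<PRICE>[^<>]*)</td></tr>")

loose_prices = [m.group("PRICE") for m in loose.finditer(html)]
precise_prices = [m.group("PRICE") for m in precise.finditer(html)]

print(loose_prices)
print(precise_prices)

# Sanity check on the number of records: the sample has two rows.
expected_records = 2
print(len(precise_prices) == expected_records)
```

The precise version needs no extra surrounding HTML to stay on target, which is the same trade-off described in the tip: a tighter regular expression lets the pattern carry less markup.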