Extractor Patterns
Overview
Extractor patterns allow you to pinpoint snippets of data that you want extracted from a web page. They are made up of text (usually HTML), extractor tokens, and possibly even session variables. The text and session variables give context to the tokens that represent the data that you want to extract from the page.
Extractor patterns can be difficult to understand at first. We recommend that you read about using extractor patterns or go through our first tutorial before continuing.
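To make the idea concrete, here is a minimal Python sketch of how literal text plus tokens can pinpoint data. It is not screen-scraper's implementation. It assumes tokens are written as `~@NAME@~`, and the helper names `pattern_to_regex` and `extract` are hypothetical. It also ignores session variables and assumes each token name appears only once per pattern.

```python
import re

# Assumed token syntax for this sketch: ~@NAME@~
TOKEN = re.compile(r"~@([A-Za-z_]\w*)@~")

def pattern_to_regex(pattern):
    """Turn literal text plus tokens into a regex with one named group per token."""
    parts = []
    pos = 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal context around the token
        parts.append("(?P<%s>.*?)" % m.group(1))          # the data to extract
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))                # trailing literal text
    return re.compile("".join(parts), re.DOTALL)

def extract(pattern, html):
    """Return one dict per match, keyed by token name."""
    return [m.groupdict() for m in pattern_to_regex(pattern).finditer(html)]

html = "<li><b>Widget</b> $4.50</li><li><b>Gadget</b> $12.00</li>"
records = extract("<li><b>~@NAME@~</b> $~@PRICE@~</li>", html)
print(records)
```

The surrounding HTML acts as the context, and each token becomes a named slot, so every match yields one record with a value per token.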
Managing Extractor Patterns
When creating extractor patterns, use the HTML found under the last response tab of the scrapeable file. By default, screen-scraper tidies the HTML once it has been scraped, formatting it in a consistent way that makes it easier to work with. HTML taken from your web browser's view-source window will likely differ from the HTML that screen-scraper generates.
Adding
- Click the Add Extractor Pattern button in the extractor patterns tab of the scrapeable file.
- Select the desired text in the last response tab of the scrapeable file, right-click, and select Generate extractor pattern from selected text.
Removing
- Click the Delete Extractor Pattern button on the desired extractor pattern.
Extractor Pattern: Main tab
Main Tab
- Test Pattern: Opens a DataSet window with the results of the extractor pattern matches applied to the HTML that appears in the last response tab.
- Highlight Extracted Data (professional and enterprise editions only): Opens the last response tab and places a colored background on all text that matches the extractor tokens.
- Delete Extractor Pattern: Deletes the current extractor pattern.
- Copy Pattern (professional and enterprise editions only): Copies the extractor pattern so that it can be pasted into a different scrapeable file.
- Identifier: A name used to identify the pattern. You'll use this when invoking the extractData and extractOneValue methods.
- Sequence: Determines the order in which the extractor pattern will be applied to the HTML.
- Pattern text: Holds the text of the extractor pattern, including the extractor tokens, which are analogous to the holes in a stencil.
- Scripts: This table lets you indicate scripts that should be run in relation to the extractor pattern's match results. As in many programming environments, screen-scraper can invoke code in response to specified events. In this case, you can invoke scripts before the pattern is applied, after each match it finds, after all matches have been made, once if the pattern matches, or once if the pattern doesn't match. For example, if your pattern finds 10 matches and you designate a script to be run After each pattern match, that script will be invoked 10 separate times.
- Add Script: Adds a script association to the extractor pattern.
- Script Name: Specifies which script should be run.
- Sequence: The order in which the script should be run.
- When to Run: When the scrapeable file should request to run the script.
- Enabled: A flag to determine which scripts should be run and which shouldn't be.
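The script associations above can be pictured as a small event dispatcher. The following Python sketch only models the behavior described here and is not screen-scraper code. The event labels, the tuple layout, and the `apply_pattern` helper are hypothetical, and the relative order of the "after all matches" and "once if" events is an assumption.

```python
def apply_pattern(matches, scripts):
    """Run associated scripts around a list of match records.

    `scripts` is a list of (sequence, when_to_run, enabled, callable) tuples,
    mirroring the Sequence, When to Run, Enabled and Script Name columns.
    """
    def fire(event, *args):
        # Lower sequence numbers run first; disabled associations are skipped.
        for _seq, when, enabled, fn in sorted(scripts, key=lambda s: s[0]):
            if enabled and when == event:
                fn(*args)

    fire("before_pattern_applied")
    for record in matches:
        fire("after_each_match", record)   # once per match
    fire("after_pattern_applied")
    fire("once_if_matches" if matches else "once_if_no_matches")

calls = []
scripts = [
    (1, "after_each_match", True, lambda record: calls.append(record["NAME"])),
    (2, "after_each_match", False, lambda record: calls.append("disabled")),
    (1, "once_if_no_matches", True, lambda: calls.append("no matches")),
]

matches = [{"NAME": "item %d" % i} for i in range(10)]
apply_pattern(matches, scripts)
print(len(calls))  # 10: one call per match from the enabled script
```

Calling `apply_pattern([], scripts)` afterwards would trigger only the "no matches" script, once.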
Extractor Pattern: Sub-Extractor Patterns tab
Sub-Extractor Patterns Tab
- Add Sub-Extractor Pattern: Adds a sub-extractor pattern.
- Paste Sub-Extractor Pattern (professional and enterprise editions only): Pastes a previously copied sub-extractor pattern.
The buttons specific to the sub-extractor pattern are discussed in more detail later in this documentation.
Extractor Pattern: Advanced tab
Advanced tab (professional and enterprise editions only)
- Automatically save the data set generated by this extractor pattern in a session variable (professional and enterprise editions only): If this box is checked, screen-scraper will place the dataSet object generated when this extractor pattern is applied into a session variable, using the identifier as the key (i.e., the session variable name). For example, if your extractor pattern were named PRODUCTS and you checked this box, screen-scraper would apply the pattern and place the resulting dataSet into a session variable named PRODUCTS.
It is recommended that you avoid checking this box unless it's absolutely needed, because of the memory issues it may cause. If this box is checked, screen-scraper will continue to append data to the dataSet, and all of that data will be kept in memory. The preferred method is to save data as it's being extracted, generally by invoking a script with an After each pattern match script association that pulls the data from dataRecord objects or session variables.
- If a data set by the same name has already been saved in a session variable do the following: The action that should be taken when conflicts occur. If this page is on an iterator, you might want to append so that you don't lose previous data, but this can make your variable very large.
- Filter duplicate records (enterprise edition only): When this box and the Cache the data set box are both checked, screen-scraper will filter duplicates from extracted records. See the Filtering duplicate records section for more details.
- Cache the data set (enterprise edition only): In some cases you'll want to store extracted data in a session variable, but the dataSet will potentially grow to be very large. The Cache the data set checkbox causes the extracted data to be written out to the file system as it's being extracted so that it doesn't consume RAM. When you attempt to access the data set from a script or external code, it will be read from the disk into RAM temporarily so that it can be used. You'll also need to check this box if you want to filter duplicates.
- This extractor pattern will be invoked manually from a script (professional and enterprise editions only): If you check this box, the extractor pattern will not be invoked automatically by screen-scraper. Instead, you'll invoke it in a script using the extractData or extractOneValue method.
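The recommendation above to save data as it is extracted, rather than accumulating a large data set in a session variable, can be illustrated with a short Python sketch. It is a standalone model and does not use screen-scraper's scripting objects. The `make_saver` helper and the field names are hypothetical, and an in-memory `io.StringIO` buffer stands in for a real file or database.

```python
import csv
import io

def make_saver(out, fieldnames):
    """Return a callback that writes one record as soon as it is extracted."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()

    def after_each_match(record):
        # The record goes straight to the output; no growing list is kept.
        writer.writerow(record)

    return after_each_match

out = io.StringIO()  # stand-in for an open file
save = make_saver(out, ["NAME", "PRICE"])

# Each dict plays the role of the record produced by one pattern match.
for record in ({"NAME": "Widget", "PRICE": "4.50"},
               {"NAME": "Gadget", "PRICE": "12.00"}):
    save(record)

print(out.getvalue())
```

Because each record is written and then discarded, memory use stays roughly constant no matter how many matches the pattern produces.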