Using Sub-Extractor Patterns

Overview

Sub-extractor patterns allow you to extract data within the context of an extractor pattern, giving you significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns, you would use a single pattern to extract the data from all columns of a single row. In many cases this works just fine; however, the process gets more complicated when the rows differ significantly from one another. For example, certain cells may be in different colors, or their contents may be missing entirely. With a normal extractor pattern it would be difficult to account for this variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells.

When using sub-extractor patterns, only the first match will be used. That is, even if a sub-extractor pattern could match multiple times, only the data corresponding to the first match will be extracted. Because of this, sub-extractor patterns are not always the correct method for getting data within a larger context. To get multiple matches in a larger context, such as all rows in a table, you would instead use manual extractor patterns.
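
The first-match rule can be pictured with ordinary regular expressions. The Python sketch below is only an analogy and not how screen-scraper is implemented; the sample text and all variable names are made up for illustration. `re.search` stops at the first match, much like a sub-extractor pattern, while `re.findall` returns every match.

```python
import re

# Hypothetical block of text that a main extractor pattern might have captured.
block = '<td class="Phone">111-222-3333</td><td class="Phone">444-555-6666</td>'

# Regex stand-in for a sub-extractor pattern: capture the text inside a Phone cell.
phone_cell = re.compile(r'<td class="Phone">(.*?)</td>')

# Sub-extractor-like behaviour: only the first match is kept.
first_match = phone_cell.search(block)
first_phone = first_match.group(1) if first_match else None

# For comparison, collecting every match requires a different approach.
all_phones = phone_cell.findall(block)
```

Here `first_phone` holds only the first phone number even though the block contains two, which is why a different kind of pattern is needed when every match matters.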

Example

Consider the following HTML table:

Name          Phone                       Address
Juan Ferrero  111-222-3333                123 Elm St.
Joe Bloggs    No contact information available
Sherry Lloyd  234-5678 (needs area code)  456 Maple Rd.

Here is the corresponding HTML source:

<table cellpadding="2" border="1">
    <tr>
        <th>Name</th>
        <th>Phone</th>
        <th>Address</th>
    </tr>
    <tr>
        <td class="Name">Juan Ferrero</td>
        <td class="Phone">111-222-3333</td>
        <td class="Address">123 Elm St.</td>
    </tr>
    <tr class="even">
        <td class="Name">Joe Bloggs</td>
        <td colspan="2">No contact information available</td>
    </tr>
    <tr>
        <td class="Name">Sherry Lloyd</td>
        <td class="Phone warning">234-5678 (needs area code)</td>
        <td class="Address">456 Maple Rd.</td>
    </tr>
</table>

It would be difficult to write a single extractor pattern that extracts the information for each row because the contents of the cells differ so significantly. The differently colored cells and the cell spanning two columns make the data too inconsistent to extract easily with a single pattern; doing so would require many regular expressions and might still prove impossible or unreliable.

Consider this extractor pattern:

<tr~@DATARECORD@~</tr>

The ~@DATARECORD@~ extractor pattern token is special in that it defines the block of data to which you wish to apply sub-extractor patterns. Sub-extractor patterns cannot be applied to a token with a name other than DATARECORD.

If applied to the HTML above the extractor pattern would produce the following three matches:

1.  ><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td>
2.  class="even"><td class="Name">Joe Bloggs</td><td colspan="2">No contact information available</td>
3.  ><td class="Name">Sherry Lloyd</td><td class="Phone warning">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td>
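
As a rough illustration of how these row matches arise, the following Python sketch uses a regular expression as a stand-in for the main extractor pattern. It assumes the pattern captures everything between `<tr` and `</tr>`; the variable names, the `normalize` helper, and the step that skips the header row are all choices made for this sketch, not screen-scraper behavior.

```python
import re

html = """<table cellpadding="2" border="1">
    <tr>
        <th>Name</th>
        <th>Phone</th>
        <th>Address</th>
    </tr>
    <tr>
        <td class="Name">Juan Ferrero</td>
        <td class="Phone">111-222-3333</td>
        <td class="Address">123 Elm St.</td>
    </tr>
    <tr class="even">
        <td class="Name">Joe Bloggs</td>
        <td colspan="2">No contact information available</td>
    </tr>
    <tr>
        <td class="Name">Sherry Lloyd</td>
        <td class="Phone warning">234-5678 (needs area code)</td>
        <td class="Address">456 Maple Rd.</td>
    </tr>
</table>"""

# Assumed stand-in for the main pattern: capture everything between "<tr" and "</tr>".
row_pattern = re.compile(r'<tr(.*?)</tr>', re.DOTALL)

def normalize(fragment):
    # Remove the whitespace between tags so the output resembles the matches listed above.
    return re.sub(r'>\s+<', '><', fragment.strip())

# The header row also sits between <tr and </tr>; this sketch skips it
# because it contains no <td> data cells.
data_records = [normalize(m) for m in row_pattern.findall(html) if '<td' in m]

for number, record in enumerate(data_records, start=1):
    print(number, record)
```

Each entry in `data_records` plays the role of one DATARECORD block, which is the text that the sub-extractor patterns are then applied to.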

Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:

<td class="Name">~@NAME@~</td>

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:

1.  Juan Ferrero
2.  Joe Bloggs
3.  Sherry Lloyd

This is a simple case. Now consider the sub-extractor pattern for the phone number:

<td class="Phone">~@PHONE@~</td>

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:

1.  111-222-3333
2.
3.

In the case of Sherry Lloyd this presents a serious problem because she does have a phone number listed. It is not selected because of the additional class. Let's adjust the sub-extractor pattern slightly:

The extra token in the adjusted pattern uses the Non-double quotes regular expression, [^"]*, which matches everything from the token's position up to the next double quote. With this change, Sherry's phone number is also extracted.
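
To see why the [^"]* expression helps, here is a small Python sketch that compares a strict phone pattern with a more tolerant one. It assumes the expression is placed directly after Phone inside the class attribute; that placement, the regex stand-ins, and the helper name `first_capture` are assumptions made for illustration.

```python
import re

# The three rows produced by the main extractor pattern.
rows = [
    '><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td>',
    'class="even"><td class="Name">Joe Bloggs</td><td colspan="2">No contact information available</td>',
    '><td class="Name">Sherry Lloyd</td><td class="Phone warning">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td>',
]

# Strict version: the class attribute must be exactly "Phone".
strict = re.compile(r'<td class="Phone">(.*?)</td>')

# Tolerant version: [^"]* absorbs any extra class names up to the closing quote.
tolerant = re.compile(r'<td class="Phone[^"]*">(.*?)</td>')

def first_capture(pattern, text):
    # Return the captured text of the first match, or None when nothing matches.
    match = pattern.search(text)
    return match.group(1) if match else None

strict_results = [first_capture(strict, row) for row in rows]
tolerant_results = [first_capture(tolerant, row) for row in rows]
```

The strict pattern finds a phone number only in the first row, while the tolerant pattern also picks up the cell whose class is "Phone warning". Neither matches the second row, which has no Phone cell at all.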

We now have the case of the cell in the second row that spans two columns, which would not get extracted by our current sub-extractor patterns. We may still want this information, however, so we create the following sub-extractor pattern, just in case the cell exists:

<td colspan="2">~@PHONE@~<

If applied to our data we'd get the following results:

1.
2.  No contact information available
3.

When multiple sub-extractor patterns hold a token with the same name (in this case, PHONE), the last one to match determines the value of the token. In this example only one of the two patterns will match a given row. If both could match, we would order the first phone sub-extractor pattern after the pattern for the no-data-available cell so that the actual phone number takes precedence.

Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:

Data record #    Name          Phone
Data record #1   Juan Ferrero  111-222-3333
Data record #2   Joe Bloggs    No contact information available
Data record #3   Sherry Lloyd  234-5678 (needs area code)
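
The way the results combine into this data set can be sketched in Python as follows. This is an approximation of the rules described here, not screen-scraper's actual code: the regular expressions stand in for the sub-extractor patterns, the token name NAME and the function `extract_record` are assumptions, and the tolerant phone pattern assumes [^"]* sits right after Phone in the class attribute.

```python
import re

# The three rows produced by the main extractor pattern.
rows = [
    '><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td>',
    'class="even"><td class="Name">Joe Bloggs</td><td colspan="2">No contact information available</td>',
    '><td class="Name">Sherry Lloyd</td><td class="Phone warning">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td>',
]

# Ordered regex stand-ins for the sub-extractor patterns: (token name, pattern).
# The no-data pattern comes before the phone pattern so a real phone number would win.
sub_patterns = [
    ("NAME", re.compile(r'<td class="Name">(.*?)</td>')),
    ("PHONE", re.compile(r'<td colspan="2">(.*?)<')),
    ("PHONE", re.compile(r'<td class="Phone[^"]*">(.*?)</td>')),
]

def extract_record(row):
    record = {}
    for token, pattern in sub_patterns:
        match = pattern.search(row)  # only the first match is considered
        if match:
            # A later pattern that matches overwrites an earlier value;
            # a pattern that does not match leaves the record untouched.
            record[token] = match.group(1)
    return record

data_set = [extract_record(row) for row in rows]

for number, record in enumerate(data_set, start=1):
    print(f"Data record #{number}", record.get("NAME"), record.get("PHONE"))
```

Keeping the patterns in an ordered list makes the precedence rule explicit: reordering the two PHONE entries would change which value survives whenever both match the same row.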

Important Notes

  • When two sub-extractor patterns hold a token with the same name, the one that doesn't match anything will have no effect. Sub-extractor patterns are applied in sequence, and those that match something will take precedence over those that don't.
  • ~@DATARECORD@~ is the extractor token identifier that defines the block of data to which you wish to apply sub-extractor patterns. You cannot use sub-extractor patterns without using this token name in the main extractor pattern.
  • When using sub-extractor patterns only the first match will be used. That is, even if it could match multiple times, only the data corresponding to the first match will be extracted.