Sub-Sub Extractors?

I am having some issues figuring out the best way to do this. I am scraping content that is essentially a table laid out like this:

Circuit City (Level 1)
Plasma (Level 2)
50"(Level 3)
46"(Level 3)
42"(Level 3)

LCD (Level 2)
55"(Level 3)
37"(Level 3)

Best Buy (Level 1)
Plasma (Level 2)
42"(Level 3)
50"(Level 3)
58"(Level 3)

I can get a pattern to match Level 1 and then the first Level 2 under that section, but that's it.

I need a way to match a varying number of Level 2 entries under each Level 1, and then a varying number of Level 3 entries under each Level 2.

Any suggestions?

Thanks,
Andrew

Sub Extraction

That's one of the differences between main extractor patterns and sub-extractor patterns: a main pattern matches as many times as possible, while a sub-extractor matches only once. The sub-extractor patterns provided in the Extractor Patterns tab aren't recursive (it'd be tough to fill a data record with recursive data). Recursively matching all possible sub-extractor matches has to be done in a script. There are a couple of ways to go about this.

The first way is to invoke an extractor pattern from a script. To do this:

  1. Create the extractor pattern like any other, then click on the advanced tab of the extractor pattern (next to the sub-extractor pattern tab) and check the bottom box ("This extractor pattern will be invoked manually...").
  2. Create a script to use the extractor pattern and run the extractor pattern with this function call:

scrapeableFile.extractData( levelOneString, "Level 2 Extractor" );

(Documentation: http://community.screen-scraper.com/API/extractData)

Another way we sometimes use here utilizes the Pattern and Matcher classes in Java. Scaffolding for that might look like this:

import java.util.regex.*;

Matcher m1 = Pattern.compile( "L2 Regex" ).matcher( levelOneString );

while ( m1.find() )
{
  String levelTwoString = m1.group();

  Matcher m2 = Pattern.compile( "L3 Regex" ).matcher( levelTwoString );

  while ( m2.find() )
  {
    String sizeOfTV = m2.group();
    // Do something with the data, write it to a database, maybe.
  }
}

More on regex: http://community.screen-scraper.com/java_regex
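To make the scaffolding concrete, here's a self-contained sketch run against the sample listing from the original question. The regexes here are illustrative only; they're written for the plain-text layout shown above, and you'd replace them with expressions matching the site's actual HTML.

```java
import java.util.*;
import java.util.regex.*;

public class NestedMatchDemo {
    // Finds each Level 2 section, then each Level 3 size inside it,
    // and returns "type size" pairs.
    static List<String> extractSizes(String levelOneString) {
        List<String> results = new ArrayList<>();
        // Illustrative "L2 Regex": a heading plus its run of size lines
        Matcher m1 = Pattern.compile("(\\w+) \\(Level 2\\)\\n((?:\\d+\"\\(Level 3\\)\\n?)+)")
                            .matcher(levelOneString);
        while (m1.find()) {
            String tvType = m1.group(1);
            // Illustrative "L3 Regex": one size line within the section
            Matcher m2 = Pattern.compile("(\\d+\")\\(Level 3\\)").matcher(m1.group(2));
            while (m2.find()) {
                results.add(tvType + " " + m2.group(1));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // One Level 1 block, as it might arrive in levelOneString
        String levelOneString =
            "Plasma (Level 2)\n50\"(Level 3)\n46\"(Level 3)\n42\"(Level 3)\n\n" +
            "LCD (Level 2)\n55\"(Level 3)\n37\"(Level 3)\n";
        for (String line : extractSizes(levelOneString)) {
            System.out.println(line); // e.g. Plasma 50"
        }
    }
}
```

In a screen-scraper script you'd do something with each match inside the inner loop (write it to a file or database) instead of collecting it into a list.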

These two methods are pretty similar, and both work. Using the extractor pattern allows you to more easily test your regular expressions.

Either way, the script that calls the extractor pattern or uses Pattern and Matcher is best executed on each pattern match of the extractor pattern that matches the Level 1 data. In that case, this line shows where levelOneString comes from in both of the examples above:

levelOneString = dataRecord.get( "LEVEL_ONE_STRING" );

Thank you for the response!

Thank you for the response!

I am having some problems however with this.

When doing the initial scrape I have a token that is going to be scraped, called X. This token doesn't always pull all the information it needs; it always at least gets the beginning. Is there a limit to how much information a single token can hold? The problem is that my subsequent scrape relies on all of this information being in that token.

Thanks again,

Andrew

Limits

There's no set limit to what a token can hold. There may be hardware or memory limitations, but those are generally pretty big.

By default, tokens aren't greedy. If you're using a blank regular expression, the token will match as little text as it can. You can make it greedy by changing the token's regular expression to '.*' instead of leaving it empty, though that could make it match too much, or, if the file is big, make the pattern take so long that the extractor times out.

I'm guessing that you could change the regular expressions in your tokens to fix the problem, though.
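The greedy-versus-lazy difference is easy to see with plain Java regexes. In this hypothetical sketch, a reluctant `.*?` stands in for a blank token and a greedy `.*` stands in for the '.*' token regex:

```java
import java.util.regex.*;

public class GreedyDemo {
    // Extracts the token part of "start TOKEN end" using the given token regex
    static String match(String tokenRegex, String input) {
        Matcher m = Pattern.compile("start (" + tokenRegex + ") end").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "start one end start two end";
        // Lazy, like a blank token: stops at the first "end"
        System.out.println(match(".*?", page)); // one
        // Greedy '.*': runs to the last "end", swallowing everything between
        System.out.println(match(".*", page));  // one end start two
    }
}
```

The greedy version matching all the way to the last occurrence of the end text is exactly why '.*' can pull more data into the token, and also why it can pull far too much.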

Thanks again

Ok, changing the regex to '.*' allows the token to pull all the data I need. However, now instead of matching all of my first-level patterns, my session only grabs the first one it finds. I have it set to run after each pattern match, but it still only uses the first one and none of the others. Without a regular expression it matches over 15 per page. Any idea what's going on there?

This also happens when using the Test Pattern button. With the regex, one match; without it, 19.

Thank you again, you have been an amazing resource and I would have given up on this a while ago without your help!

Andrew

Regex

Yeah, the problem now is probably that the token is matching too much text. Here's an example to illustrate the situation:

Let's say the HTML you're interested in looks like this:

<!-- search results start -->

<div id="result">Result 1

    <div>Detail 1</div>

    <a href="detailspage?num=1">Go to details 1</a>

</div>


<div id="result">Result 2

    <div>Detail 2</div>

    <a href="detailspage?num=2">Go to details 2</a>

</div>


<div id="result">Result 3

    <div>Detail 3</div>

    <a href="detailspage?num=3">Go to details 3</a>

</div>

<!-- search results end -->

Now let's say that we want to get all of the text for each result, including the title Result 1, text Detail 1, and link detailspage?num=1 to get to the details page. Perhaps your extractor pattern looks like this:

<div id="result">~@RESULT@~</div>

Here are the two different regular expressions we've talked about for the RESULT token:

  1. (Nothing—leave it blank), and
  2. .*

Unfortunately, neither is going to do what we want. Here's why:

  1. (Nothing—leave it blank)
    RESULT matches as little text as it can. In this case, the first match would have:
    Result 1

        <div>Detail 1
    in it and miss the link because the token would stop matching once it found the first </div> it encountered. I think this was your original problem.
  2. .*
    Here, RESULT matches as much text as it can, so the first and only match includes:
    Result 1

        <div>Detail 1</div>

        <a href="detailspage?num=1">Go to details 1</a>

    </div>


    <div id="result">Result 2

        <div>Detail 2</div>

        <a href="detailspage?num=2">Go to details 2</a>

    </div>


    <div id="result">Result 3

        <div>Detail 3</div>

        <a href="detailspage?num=3">Go to details 3</a>
    in it and we miss two of the results. This would be your current problem.

There are a couple of ways to overcome this. You could make the regular expression in the RESULT token more complicated, like this:

.*?</div>.*?

This regular expression still matches as little text as possible, but now it has to have the extra </div> tag inside what it matches, which was the tag that was causing our extractor pattern to finish a match too soon.

Another solution would be to leave the token regular expression blank and change the text in the extractor pattern after the token to look for a different tag: something other than </div>, like </a>. For this example, the extractor pattern would look like this:

<div id="result">~@RESULT@~</a>

(RESULT regex is blank)

This has the token matching as little as possible, but continuing to something that doesn't occur in the middle of the text we're trying to extract.
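To see both fixes concretely, here's a small standalone Java sketch using the example HTML above. The regexes stand in for the extractor pattern, with the RESULT token written as a capture group:

```java
import java.util.regex.*;

public class ResultTokenDemo {
    // The three-result example HTML, condensed
    static final String HTML =
        "<div id=\"result\">Result 1\n  <div>Detail 1</div>\n  <a href=\"detailspage?num=1\">Go to details 1</a>\n</div>\n" +
        "<div id=\"result\">Result 2\n  <div>Detail 2</div>\n  <a href=\"detailspage?num=2\">Go to details 2</a>\n</div>\n" +
        "<div id=\"result\">Result 3\n  <div>Detail 3</div>\n  <a href=\"detailspage?num=3\">Go to details 3</a>\n</div>\n";

    // First captured group of the first match, or null
    static String firstMatch(String pattern, String html) {
        Matcher m = Pattern.compile(pattern, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Total number of matches
    static int countMatches(String pattern, String html) {
        Matcher m = Pattern.compile(pattern, Pattern.DOTALL).matcher(html);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        // Blank (lazy) token: stops at the first </div>, so the link is lost
        System.out.println(firstMatch("<div id=\"result\">(.*?)</div>", HTML).contains("detailspage")); // false
        // Token regex .*?</div>.*? forces the inner </div> into the match, link included
        System.out.println(firstMatch("<div id=\"result\">(.*?</div>.*?)</div>", HTML).contains("detailspage")); // true
        // Blank token, but ending the pattern on </a>: three matches, each with its link
        System.out.println(countMatches("<div id=\"result\">(.*?)</a>", HTML)); // 3
    }
}
```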

There could certainly be other solutions to this problem, as well. Hope that helps.

RE: Regex

This actually doesn't seem to be the problem. I have dug through everything, and the tags I am using at the end of each pattern don't occur within the token itself, unless I use the '.*' regex, in which case it pulls the entire page.

Also, on a separate note, for some reason I cannot get an extractor pattern that I have called manually to run a script itself. For example, once EP1 is done, it is set to run a script that calls EP2. EP2 is set to run EP3 after each match and EP4 when there are no matches. Is there something I need to put in the scripts themselves, since they are being called manually?

Thanks again!

Maybe it's not the problem,

Maybe it's not the problem, but be careful: if the token's regex is left blank, the end text won't appear in the token's match at all. That's different from whether or not the end text exists within the data that you want the token to match. Note that in the example I gave, the </div> wasn't in the matched text of the RESULT token either, when the regex was blank.

You can't see this sort of thing from the extractor pattern alone. If the token isn't grabbing enough text, then something is causing it to match early. Use the highlight extracted data button and check: the text you end on has to be at the end of the highlighted region for the pattern to have matched.

So if this isn't your problem... hmm. I may need to know more to help. If you can't figure it out from the highlighted text, maybe include in a post your extractor pattern text as well as the text that you want it to match (including some before and after what you want, and surround it with <code> tags so it stands out).

Scripts aren't fired from manually applied extractor patterns. Session variables get set, if you've checked that box, but scripts aren't executed. You'll need to do something like this:

String text = dataRecord.get("PAGE_SECTION");

DataSet ds = scrapeableFile.extractData(text, "manual ep");

for (int i = 0; i < ds.getNumDataRecords(); i++)
{
    // Code to be run after each pattern match
    // Either get the DataRecord to access the data:

    DataRecord record = ds.getDataRecord(i);
    record.get("DATA_VAR");

    // or access the data directly from the DataSet

    ds.get(i, "DATA_VAR");
}