make optional part of scraping pattern?
Note: this question has been copied from the screen-scraper FAQ to this forum.
I'm trying to extract data using an extractor pattern, but in some cases certain pieces of data don't appear in the block I'm trying to scrape. Can I make certain elements optional?
Re: make optional part of scraping pattern?
Brent said: "I'm trying to extract data using an extractor pattern, but in some cases certain pieces of data don't appear in the block I'm trying to scrape. Can I make certain elements optional?"
This situation is actually fairly common. There are two principal approaches for handling it. Suppose you have a record that you want to extract that sometimes shows up like this:
But also may show up like this, if a phone number is present:
You could write a single extractor pattern that grabs the city/state/zip and the phone number with a single token, like this:
The problem with this approach, though, is that you have to do some post-processing work (i.e., after the data has been extracted) to divide the city/state/zip from the phone number.
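To give a feel for what that post-processing involves, here is a rough sketch in Python (not screen-scraper's own scripting language). It assumes the single token captured something like "Springfield, IL 62704<br>(555) 123-4567"; the sample value, the phone format, and the function name `split_location_and_phone` are all made up for illustration, so adjust them to your own data.

```python
import re

# Assumed phone shape: ten digits such as (555) 123-4567 or 555-123-4567,
# sitting at the very end of the captured value.
PHONE_AT_END = re.compile(r"\s*(\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})\s*$")


def split_location_and_phone(raw):
    """Split one captured value into (city_state_zip, phone_or_None)."""
    # Replace leftover tags (e.g. <br>) with spaces, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", raw)
    text = re.sub(r"\s+", " ", text).strip()

    match = PHONE_AT_END.search(text)
    if match is None:
        # No phone number present: the whole value is the city/state/zip.
        return text, None
    return text[:match.start()].strip(), match.group(1)


location, phone = split_location_and_phone("Springfield, IL 62704<br>(555) 123-4567")
print(location, "|", phone)
```

Even for this simple case you end up maintaining a second pattern for the phone number, which is why the single-token approach tends to be more work.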
The easier approach would be to simply create two extractor patterns--one that gets all of the information if the phone number is not present, and another that extracts all of the information if it is present. For example, the first might look like this:
And the second might look like this:
Which solution works best for you depends on what the data you're trying to extract looks like; however, the second will generally be the easiest to work with.
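The two-pattern idea can also be sketched outside of screen-scraper. The Python below uses ordinary regular expressions as a stand-in for extractor patterns (it is not extractor pattern syntax), and the table markup, field names, and `extract_records` helper are assumptions made up for the example. Each pattern is written so that it only matches its own variant of the record.

```python
import re

# Hypothetical markup: name in one cell, city/state/zip in the next,
# with the phone number after a <br> when it is present.
WITH_PHONE = re.compile(
    r"<td>(?P<name>[^<]+)</td>\s*"
    r"<td>(?P<city_state_zip>[^<]+)<br>(?P<phone>[^<]+)</td>"
)
WITHOUT_PHONE = re.compile(
    r"<td>(?P<name>[^<]+)</td>\s*"
    r"<td>(?P<city_state_zip>[^<]+)</td>"
)


def extract_records(html):
    """Apply both patterns to the page and collect every match."""
    records = []
    for pattern in (WITH_PHONE, WITHOUT_PHONE):
        for match in pattern.finditer(html):
            record = match.groupdict()
            # Give every record the same fields, phone or not.
            record.setdefault("phone", None)
            records.append(record)
    return records


page = (
    "<tr><td>Ann Lee</td><td>Provo, UT 84601</td></tr>"
    "<tr><td>Bob Ray</td><td>Orem, UT 84057<br>801-555-0100</td></tr>"
)
for record in extract_records(page):
    print(record)
```

Note that the records come back grouped by pattern rather than in page order, which is the merging issue to keep in mind when you use more than one pattern.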
Bear in mind that whenever you use multiple extractor patterns to extract similar types of records from a single page, you're going to have to merge that information together at some point. Depending on what you do with the data after it gets extracted, there are two ways you'll probably want to approach this.
If you're simply writing the data to a file or database using a screen-scraper script (as in the Slashdot example), have that script invoked after each of the two extractor patterns is applied.
If you're accessing the extracted data from an external source, such as an ASP or PHP script, the easiest way is to give both extractor patterns the same name and check the box labelled "Automatically save the data set generated by this extractor pattern in a session variable" on the "Extractor Patterns" tab for a scrapeable file. Then, in the "If a data set by the same name has already been saved in a session variable do the following:" drop-down list, select "append". The data from both patterns will be merged into one long list and stored in a DataSet object bearing the name of the extractor patterns.
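If it helps to picture what "append" does, here is a small conceptual sketch in Python. It is not screen-scraper's DataSet class or API; the `session` dictionary, the `save_data_set` function, and the "replace" alternative are hypothetical stand-ins that only mimic the idea of two same-named patterns feeding one list.

```python
def save_data_set(session, name, rows, if_exists="append"):
    """Store rows under a name, appending to any rows already saved there."""
    if if_exists == "append" and name in session:
        # A data set with this name already exists: extend it.
        session[name] = session[name] + list(rows)
    else:
        # First save under this name (or a non-append policy): start fresh.
        session[name] = list(rows)
    return session[name]


session = {}
# First pattern (records without a phone number) saves its rows...
save_data_set(session, "CONTACTS", [{"name": "Ann Lee", "phone": None}])
# ...then the second pattern, sharing the same name, adds to them.
save_data_set(session, "CONTACTS", [{"name": "Bob Ray", "phone": "801-555-0100"}])
print(len(session["CONTACTS"]))
```

The external script then only has to read one data set instead of knowing how many extractor patterns produced it.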
These pages from the documentation may also help with this question:
Using extractor patterns
Using scripts