What if the extractor pattern changes somewhere down the list of records being scraped?

There is a set of pages containing records in which one of the fields is a number. I created an extractor pattern to snatch this number. The numbers start at 1 and increase sequentially, so I know I could fill in this information automatically after the fact, but I would like to address the issue I'm running into: the markup around the number changes when the numbers go from 2 digits to 3 digits, and my extractor pattern stops matching. Here's what I mean:

The above 2 code lines appear in two different records on a page. The target values I wanna snatch from these lines in those 2 records are 99 and 100. My extractor pattern:

~@RANK_PIN@

works fine for the first 99 records. But it fails to snatch integers greater than 99 because of the " small" addition in the leading string pattern (that's space-small). The trailing string pattern does not change. Can I modify my extractor pattern somehow to make it compensate for the " small" change? Please don't tell me I'm gonna have to rewrite a whole new scraping session to handle this quirk.

Attached is a TAB delimited flat ASCII file containing the records scraped from http://www.yellowpages.com/thomasville-nc/auto-repair-service?g=Thomasville%2C+NC&page=1&q=Auto+Repair+&refinements[radius]=10.0 and related pages. One look at it and you'll immediately see what I'm talking about.

Attachment	Size
YPoutput.txt	19.26 KB

MikeSpike on 07/10/2010 at 12:23 pm

That happens all the time, and the extractor won't match. I usually just add some more tokens so that it can match either format.
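
To illustrate the idea of matching either format, here is a minimal sketch in Python rather than in screen-scraper itself. The actual lines from the page are not shown in this thread, so the `<span class="rank">` markup, the `RANK_PATTERN` name, and the `extract_ranks` helper below are all hypothetical stand-ins. The point is only that the " small" part of the leading string is made optional, so one pattern covers both the 2-digit and 3-digit records.

```python
import re

# Hypothetical markup: the real HTML from the page is not shown in the thread.
# The group (?: small)? makes the " small" addition optional, so the same
# pattern matches both the 2-digit and the 3-digit variants.
RANK_PATTERN = re.compile(r'<span class="rank(?: small)?">(\d+)</span>')


def extract_ranks(html):
    """Return every rank number found in html as an int."""
    return [int(number) for number in RANK_PATTERN.findall(html)]


# Two made-up records, one in each format.
sample = (
    '<span class="rank">99</span>\n'
    '<span class="rank small">100</span>\n'
)

print(extract_ranks(sample))
```

In an extractor pattern, the equivalent would presumably be an extra token in the leading string that is allowed to match either nothing or " small".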

jason on 07/12/2010 at 7:34 am
