What if the extractor pattern changes somewhere down the list of records being scraped?
There is a set of pages containing records in which one of the fields is a number. I created an extractor pattern to snatch this number. The numbers start at 1 and increment sequentially. I know that I could fill in this known information automatically after the fact, but I would like to address the issue I'm running into. The text my extractor pattern has to match changes when the numbers I'm extracting go from 2 digits to 3 digits. Here's what I mean:
The above 2 code lines appear in two different records on a page. The target values I wanna snatch from these lines in those 2 records are 99 and 100. My extractor pattern:
works fine for the first 99 records. But it fails to snatch integers greater than 99 because of the " small" addition in the leading string pattern (that's space-small). The trailing string pattern does not change. Can I modify my extractor pattern somehow to make it compensate for the " small" change? Please don't tell me I'm gonna have to rewrite a whole new scraping session to handle this quirk.
Attached is a TAB delimited flat ASCII file containing the records scraped from http://www.yellowpages.com/thomasville-nc/auto-repair-service?g=Thomasville%2C+NC&page=1&q=Auto+Repair+&refinements[radius]=10.0 and related pages. One look at it and you'll immediately see what I'm talking about.
Attachment | Size |
---|---|
YPoutput.txt | 19.26 KB |
That happens all the time: the page markup shifts partway through and the extractor stops matching. I usually just add some more tokens so that the pattern can match either format.
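To show the "match either format" idea outside the tool, here's a rough sketch using a plain Python regex instead of extractor-pattern tokens. The markup in `SAMPLE` is made up for illustration (the actual lines from the page aren't shown in this thread), and `extract_numbers` is just a hypothetical helper name. The point is only that the part of the leading string that sometimes appears (" small") is made optional, which is roughly the job an extra token would do in the extractor pattern.

```python
import re

# Hypothetical markup: the real lines from the page are not reproduced here.
# The only thing taken from the question is that " small" shows up in the
# leading string once the number reaches 3 digits.
SAMPLE = (
    '<span class="number">99</span>\n'
    '<span class="number small">100</span>\n'
)

# (?: small)? makes the extra " small" optional, so one pattern covers
# both the 2-digit and the 3-digit variant. (\d+) captures the number.
PATTERN = re.compile(r'<span class="number(?: small)?">(\d+)</span>')


def extract_numbers(html):
    """Return every captured number in the text as an int."""
    return [int(value) for value in PATTERN.findall(html)]


print(extract_numbers(SAMPLE))
```

If the variable part could be something other than " small", a looser piece like `[^"]*` in place of `(?: small)?` would tolerate any extra text up to the closing quote, at the cost of being less strict about what it accepts.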