scrapeable files with same URL present different patterns in screen-scraper

We have run into a case where we created the extractor patterns for a scrapeable file, but the patterns do not match when the same URL is called programmatically while running the scraping engine.

We can even copy the programmatically generated page URL from the logs and paste it into the first scrapeable file, and the original extractor patterns still work. However, they do not work on the file scraped during the scraping session.

If we examine the HTML of both files, we do find subtle differences. In some cases the spacing of elements is slightly different, and in others one page uses double quotes where the other uses single quotes. Again, both pages have the same URL and both were scraped during a scraping session.
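To pin such differences down, the two responses can be compared line by line. This is a minimal sketch in Python, assuming both responses have been saved as strings; the function name, the labels, and the sample snippets are hypothetical and not taken from our session.

```python
import difflib


def html_diff(first: str, second: str) -> list[str]:
    """Return unified-diff lines between two HTML strings."""
    return list(
        difflib.unified_diff(
            first.splitlines(),
            second.splitlines(),
            fromfile="first_file",      # hypothetical label
            tofile="scraping_session",  # hypothetical label
            lineterm="",
        )
    )


# Hypothetical snippets showing the kind of difference described above:
# different quote style and different spacing around the value.
sample_a = '<td class="price">10</td>'
sample_b = "<td class='price'> 10 </td>"

for line in html_diff(sample_a, sample_b):
    print(line)
```

An empty result means the two responses are identical, so any mismatch would have to come from somewhere other than the markup.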

We can change the extractor patterns to match those we find when the URLs are created programmatically, but we wanted to understand what is causing this issue.

Thank you

mbss on 11/12/2012 at 10:38 am



I would suspect that it has to do with the HTML tidy that is on by default. If you disable that (on the scrapeable file > advanced tab) and there are still distortions in the response, it pretty much has to be inconsistencies in the way the site is giving it to you. If that's the case, you might adjust your user-agent to make sure the site isn't tweaking something for you.
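To check the user-agent theory outside of screen-scraper, something like the Python sketch below could request the same URL with two different User-Agent headers and compare the raw markup. It uses only the standard library; the helper names are made up for illustration, and the URL and user-agent strings you pass in are yours to supply.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str, timeout: float = 30.0) -> str:
    """Download the page and decode it using the charset the server reports."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def same_markup(url: str, agent_a: str, agent_b: str) -> bool:
    """True if the server returns identical markup for both user-agents."""
    return fetch(url, agent_a) == fetch(url, agent_b)
```

Calling `same_markup(url, browser_agent, other_agent)` with the URL from your log and two user-agent strings of your choosing would show whether the site varies its output by user-agent. Keep in mind that a dynamic page can differ between any two requests (timestamps, session tokens), so a `False` result is a hint to look closer and not proof on its own.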

jason on 11/12/2012 at 4:15 pm
