Scraping Large Numbers of Websites
I need to scrape 100+ broadly similar commerce-type websites. Screen-Scraper seems like it would be suitable and has worked in testing. However, I can't work out how to make it usable with large numbers of URLs. I have read the FAQ at [url]http://www.screen-scraper.com/support/faq/faq.php#LargeNumberScrapingSessions[/url] but can't make much sense of how it would operate in practice.
When I click export on a scraping session it creates a single wrapped file containing the session and all the scripts it uses. If I click export on a script it saves that script separately (for general scripts). When I import a session it also brings back the general scripts.
Altogether it seems incredibly unwieldy for more than a few websites. Can anyone suggest an easier way to cope with high numbers of URLs? It seems a shame, as otherwise the software works well.
Also, what is the practical maximum number of websites (with about 2 web pages / 5 scripts per website) that the software can cope with?
Thanks
Scraping Large Numbers of Websites
jonno,
Oh, and the only reason you would need to export your scraping sessions or scripts is if you're moving between machines, say for deployment. There's nothing useful in the exported files themselves, and they're not meant to be interacted with other than to re-import them into another copy of screen-scraper.
The scripts automatically get exported with their related scraping sessions 99% of the time. The only exception would be if a given script is called only from within another script using the session.executeScript() method.
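For example, a script invoked only from inside another script like this won't get bundled automatically (a minimal sketch in Interpreted Java; the script name "Normalize prices" is hypothetical):

// Inside some other script in the scraping session.
// Because the reference is just a string resolved at runtime,
// the exporter has no way to see this dependency.
session.executeScript( "Normalize prices" );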
-Scott
Scraping Large Numbers of Websites
jonno,
If the sites are similar enough, you could conceivably reuse parts and share scrapeable files and scripts between the URLs you're scraping. Just to demonstrate, at a very basic level, you could have the URL for a scrapeable file be nothing but a session variable.
Instead of...
http://www.mydomain.com/and-where-i/plan2scrape.html
You'd have...
~#SESSION_VARIABLE_REFERENCE_FOR_BUNCHES_OF_URLS#~
Or, as you've probably already thought of:
Instead of a hundred scrapeable files with some variation like this...
http://www.mydomain.com/and-where-i/plan2scrapeB.html
http://www.mydomain.com/and-where-i/plan2scrapeC.html
You'd have...
http://www.mydomain.com/and-where-i/plan2scrape~#THAT_LETTER_THAT_KEEPS_CHANGING#~.html
This way you can programmatically iterate through your list, either with a for loop over that changing letter in the URL or with a text file that contains nothing but your big ol' list of URLs (see the sketch below). This relies heavily on each site being very similar, which is unlikely in the real world, but I hope it demonstrates the kind of control you have.
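Here's a minimal sketch of a screen-scraper script (Interpreted Java) along the lines of the text file approach; the variable name BUNCHES_OF_URLS, the file name urls.txt, and the scrapeable file name "Product page" are all assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;

// Read one URL per line from a plain text file.
BufferedReader reader = new BufferedReader( new FileReader( "urls.txt" ) );
String url;
while ( ( url = reader.readLine() ) != null )
{
    url = url.trim();
    if ( url.length() == 0 )
        continue;

    // The scrapeable file's URL field contains nothing but
    // ~#BUNCHES_OF_URLS#~, so each pass requests a new site.
    session.setVariable( "BUNCHES_OF_URLS", url );
    session.scrapeFile( "Product page" );
}
reader.close();

The for loop version for that changing letter is the same pattern: loop over the letters, set the session variable to the current one, and call session.scrapeFile() each time through.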
We use this technique in [url=http://www.screen-scraper.com/support/tutorials/tutorial3/scraping_pages_from_scripts.php]Tutorial 3[/url].
-Scott