Automatically Scraping Pages listed on an Index
Hi,
I have just downloaded screen-scraper and gone through a tutorial--it looks like an awesome application.
I was wondering if anybody could tell me how I would go about screen-scraping all of the websites listed on an index page?
I would normally think that I could scrape the index for all of the anchor tags (which I was successfully able to do thanks to the tutorial), but then I'd like to use a script to create new Scrape jobs with the URLs that were scraped from the index.
If anybody knows how this can be done easily, please let me know. I checked the API for ways to programmatically create Scrape Jobs but to no avail.
tylertrussell,
There are two approaches you could take, depending on your needs. The first and easier way would be to simply extract the URLs from the index page and use each one as the URL of the page to scrape. You would put something like the following in the address field:
~#URL#~
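To make the first approach concrete, here is a minimal standalone sketch in plain Python. It is not screen-scraper's scripting API: the names `AnchorCollector`, `extract_urls`, and `fill_tokens` are hypothetical and exist only for illustration. It shows the same idea, which is to pull every anchor's href out of the index page and then substitute each one into a `~#URL#~` token.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag found in the index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip anchors that have no href or an empty one.
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(index_html, base_url):
    """Return absolute URLs for all links on the index page, in page order."""
    collector = AnchorCollector()
    collector.feed(index_html)
    collector.close()
    # Relative links are resolved against the index page's own URL.
    return [urljoin(base_url, href) for href in collector.hrefs]


def fill_tokens(template, variables):
    """Replace each ~#NAME#~ token in the template with its value.

    This only imitates the token syntax shown above; it is an assumption
    about how the substitution behaves, not the tool's own code.
    """
    result = template
    for name, value in variables.items():
        result = result.replace("~#" + name + "#~", value)
    return result
```

Calling `extract_urls(html, index_url)` and then `fill_tokens("~#URL#~", {"URL": url})` for each result gives the address that would be requested for that link.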
The second approach should only be used if you plan on scraping multiple pages of data from each link on the index page. For each link you can create a RunnableScrapingSession, which basically means that one scraping session creates another. From there the child scraping session will scrape as it normally would. You can then retrieve the saved data from the parent scraping session using the methods available.
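The parent/child flow can be pictured with a small plain-Python analogy. This is only a sketch of the pattern, not the RunnableScrapingSession API: `ChildSession`, its methods, and `run_parent` are hypothetical stand-ins, and the `fetch` function is passed in so the example runs without network access. It assumes the parent reads back whatever each child stored, which is one reading of the last step described above.

```python
class ChildSession:
    """Hypothetical stand-in for a child scraping session with its own variables."""

    def __init__(self, name):
        self.name = name
        self.variables = {}

    def set_variable(self, key, value):
        self.variables[key] = value

    def get_variable(self, key):
        return self.variables.get(key)

    def scrape(self, fetch):
        # A real child session could request several pages per link;
        # this sketch fetches just the one URL it was given.
        url = self.variables["URL"]
        self.variables["PAGE"] = fetch(url)


def run_parent(urls, fetch):
    """For each link from the index, start a child session and collect its data."""
    results = {}
    for url in urls:
        child = ChildSession("detail")    # one child per index link
        child.set_variable("URL", url)    # hand the link to the child
        child.scrape(fetch)               # child scrapes as it normally would
        results[url] = child.get_variable("PAGE")  # parent reads the saved data
    return results
```

A usage example with a fake fetcher would be `run_parent(urls, lambda u: "body of " + u)`, which returns a dictionary mapping each URL to what its child session saved.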
I hope this helps.
-Scott