Multiple Scraping Sessions of One Site in Parallel

I would like to take advantage of the enterprise edition by running multiple scraping sessions of one site in parallel from the Web interface. Specifically, I would like to pass in variables to control which page each scraping session starts from and which page it stops at.

I'm using the following to loop through the pages:

//Insert name.
String fileToScrape = "Search Results";

/* Generally no need to edit below here */

hasNextPage = "true"; // dummy value to allow the first page to be scraped
for (int currentPage = 1; hasNextPage != null; currentPage++)
{
    // Clear this out, so the next page can find its own value for this variable.
    session.setVariable("HAS_NEXT_PAGE", null);
    session.setVariable("PAGE", currentPage);
    session.scrapeFile(fileToScrape);
    hasNextPage = session.getVariable("HAS_NEXT_PAGE");
}

How do I do this? What scripts would I use? Since I'm a newbie, the more detail the better. Thanks in advance for your help.

Adrianjay on 05/05/2009 at 11:24 am

I'm assuming that you're building this idea off of one of your other questions, so that's the direction I think I'll go-- if I'm heading the wrong way, just stop me.

Assuming you already have some variables to tell you which page ranges you want, I would alter the script to read more like this:

// Interpreted Java

String fileToScrape = "Search Results";
/* Generally no need to edit below here */

int startPage = Integer.parseInt(session.getVariable("SEARCH_START_PAGE").toString());
int endPage = Integer.parseInt(session.getVariable("SEARCH_END_PAGE").toString());

hasNextPage = "true"; // dummy value to allow the first page to be scraped
for (int currentPage = startPage; hasNextPage != null && currentPage <= endPage; currentPage++)
{
    // Clear this out, so the next page can find its own value for this variable.
    session.setVariable("HAS_NEXT_PAGE", null);
    session.setVariable("PAGE", currentPage);
    session.scrapeFile(fileToScrape);
    hasNextPage = session.getVariable("HAS_NEXT_PAGE");
}

That code would successfully handle any page range automatically, and would also stop if it ran out of pages.

The major change is that I've added 'local' variables called 'startPage' and 'endPage', which serve as the boundaries of the range. At the beginning of the for (... line, I changed int currentPage = 1 to int currentPage = startPage. This way, the page iteration will always begin where the range starts. I also added && currentPage <= endPage to the same line as a second condition for page iteration, so the loop exits once the last page in the range has been scraped.
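If you'd like to see the two exit conditions in action before touching a real scrape, here is a rough stand-alone sketch in plain Java. It is not screen-scraper code: the HashMap stands in for the session variables, the class name RangeLoopDemo, the pagesScraped method, and the lastSitePage parameter are all made up for illustration, and the call to session.scrapeFile is replaced by a line that simply records the page number.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RangeLoopDemo {
    /**
     * Simulates the ranged loop. lastSitePage is a hypothetical stand-in for
     * "the site has no next page after this one".
     */
    static List<Integer> pagesScraped(int startPage, int endPage, int lastSitePage) {
        Map<String, Object> vars = new HashMap<>(); // stand-in for session variables
        List<Integer> visited = new ArrayList<>();

        Object hasNextPage = "true"; // dummy value to allow the first page to be scraped
        for (int currentPage = startPage; hasNextPage != null && currentPage <= endPage; currentPage++) {
            vars.put("HAS_NEXT_PAGE", null);
            vars.put("PAGE", currentPage);

            // Stand-in for session.scrapeFile(...): record the page and set
            // HAS_NEXT_PAGE only when the simulated site has another page.
            visited.add(currentPage);
            if (currentPage < lastSitePage) {
                vars.put("HAS_NEXT_PAGE", "true");
            }

            hasNextPage = vars.get("HAS_NEXT_PAGE");
        }
        return visited;
    }

    public static void main(String[] args) {
        // The range ends first: pages 1 through 10 of a 50-page site.
        System.out.println(pagesScraped(1, 10, 50));
        // The site runs out first: the range asks for 11 through 20, but the site ends at 14.
        System.out.println(pagesScraped(11, 20, 14));
    }
}
```

The first call stops because currentPage passes endPage; the second stops early because HAS_NEXT_PAGE comes back null after page 14.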

If the page range for one scrape were set to 1 and 10, then you would want a second scrape to have 11 through 20. In other words, the range you give it will be "inclusive", meaning that if you say that the endPage is 10, it'll scrape page 10 and *then* stop. (Side note: an "exclusive" range would mean that it would scrape page 9 and then stop, because it 'excludes' the 10.)
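If you know roughly how many pages there are in total, the inclusive ranges for each parallel scrape can be worked out ahead of time. The following is only a sketch in plain Java, outside of screen-scraper: the class name PageRanges and the split method are hypothetical helpers, and it assumes you already know the total page count, which the loop above does not require.

```java
import java.util.ArrayList;
import java.util.List;

class PageRanges {
    /**
     * Splits pages 1..totalPages into at most `sessions` inclusive
     * {start, end} ranges that do not overlap and leave no gaps.
     */
    static List<int[]> split(int totalPages, int sessions) {
        if (totalPages < 1 || sessions < 1) {
            throw new IllegalArgumentException("totalPages and sessions must be at least 1");
        }
        List<int[]> ranges = new ArrayList<>();
        int base = totalPages / sessions;   // pages every session gets
        int extra = totalPages % sessions;  // leftover pages, one each to the first sessions
        int start = 1;
        for (int i = 0; i < sessions && start <= totalPages; i++) {
            int size = base + (i < extra ? 1 : 0);
            int end = start + size - 1;     // inclusive end page
            ranges.add(new int[] {start, end});
            start = end + 1;                // the next range begins right after this one
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 25 pages over 3 sessions gives 1-9, 10-17 and 18-25.
        for (int[] r : split(25, 3)) {
            System.out.println("SEARCH_START_PAGE=" + r[0] + ", SEARCH_END_PAGE=" + r[1]);
        }
    }
}
```

Each pair would then become the SEARCH_START_PAGE and SEARCH_END_PAGE values for one copy of the scrape. Because the ranges are inclusive, each start is exactly one more than the previous end.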

The only thing to do after this script would be to make sure you have two session variables in place for each copy of the scrape that you run: SEARCH_START_PAGE and SEARCH_END_PAGE.

You can hard-code those variables into a test scrape if you want. It'd probably be best if we confirm that it does its job, before we get ahead of ourselves.

Hope that was easy enough to follow-- just ask if there are any confusing points.

Tim

timv on 05/05/2009 at 4:42 pm
