processing multiple pages in one session

I am a novice at SS and Interpreted Java.

I am trying to scrape multiple pages of search results.

I have tried to make use of the NEXT script posted in the API section, without success.

How do I achieve this goal?

Well, this process is very dependent on which website it is. There is no single way to perform the task, and most of the time the bulk of the work is done in SS itself rather than in a script.

The process for deriving the next page is always the same, though:

Browse to the spot on the website where you want to click the next page button. Turn on SS's proxy, point your browser's proxy settings at it, and then push that "next page" button.

Once you see it come through in the SS proxy, make a scrapeableFile out of it and then examine the "Parameters" tab on the scrapeableFile.

In order to get to the next page, you're now going to make one or more extractor patterns on your *first* page that capture the variables you're seeing on the "Parameters" tab of the *next* page. The names of the variables you extract will be identical to the names on that Parameters tab.

So, comb through the HTML bit by bit for those variable names. They could be found in the value="something" part of the various <input> tags that HTML uses, or they could just be hard-coded into the links, so that the variables are all nice and together like so:

<a href="http://www.somesite.com/somepage.php?q=michaeljackson&page=2">Next Page</a>
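
If the site uses a form instead of a link, those same names would show up in <input> tags instead. Purely as an illustration (using the same made-up names and values as the link above), that might look like:

<form action="http://www.somesite.com/somepage.php" method="get">
<input type="hidden" name="q" value="michaeljackson" />
<input type="hidden" name="page" value="2" />
</form>

We'll stick with the link case from here on, though.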

In the link above, the parameter names are "q" and "page", and "michaeljackson" is the value of "q". What you could do in this scenario is have a single extractor pattern like this:

somepage.php?~@PARAMETERS@~">Next Page
Make the regular expression on the "PARAMETERS" token [^"]*, and make sure the token is being saved as a session variable. Applied to the link above, that captures everything between the "?" and the closing quote, i.e. q=michaeljackson&page=2 (or q=michaeljackson&amp;page=2 if the site HTML-encodes its ampersands; the script further down takes care of that).
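
If you want to sanity-check what the pattern is grabbing, you could hang a quick throwaway script on it that runs "After each pattern application" and just logs the value:

// Interpreted Java
// Debugging sketch only: log what the PARAMETERS token captured.
session.log("Captured PARAMETERS: " + session.getVariable("PARAMETERS"));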

Then, all you have to do is remove the parameters from the "Parameters" tab on the "next page" scrapeableFile that we made, and make the URL textbox on the first tab look like this:

http://www.somesite.com/somepage.php?~#PARAMETERS#~

When the file gets requested, SS substitutes the session variable, so with the value captured above the URL resolves to http://www.somesite.com/somepage.php?q=michaeljackson&page=2.

Then, to link the two scrapeableFiles together, make an extractor pattern on the "first page" file that detects whether there is a next page. This will likely be the extractor pattern I showed you above (the one with the PARAMETERS token). Make a script that runs "After each pattern application" on that extractor pattern. The script should look like this:


// Interpreted Java

// Websites often HTML-encode the ampersands in links as "&amp;",
// which would break the query string; decode them before using it.
String params = session.getVariable("PARAMETERS").toString().replaceAll("&amp;", "&");
session.setVariable("PARAMETERS", params);

// Now request the next page, which picks up the updated PARAMETERS value.
session.scrapeFile("Whatever you've named your 'next page' scrapeableFile");

And then just make sure that the "next page" file has the correct extractor patterns on it as well, including the one that detects yet another next page, so the whole thing repeats until the pattern stops matching.
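
One caution on that repetition: because the "next page" file keeps invoking itself, a pattern that matches when it shouldn't will loop forever. If you want a safety net, you could wrap the scrapeFile call in a counter along these lines (the PAGE_COUNT variable name and the limit of 50 are just made up for this sketch):

// Interpreted Java
// Hypothetical guard: stop paginating after 50 pages no matter what.
Object raw = session.getVariable("PAGE_COUNT");
int count = (raw == null) ? 1 : Integer.parseInt(raw.toString());

if (count < 50)
{
    session.setVariable("PAGE_COUNT", String.valueOf(count + 1));
    session.scrapeFile("Whatever you've named your 'next page' scrapeableFile");
}
else
{
    session.log("Hit the page limit; stopping pagination.");
}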

Does that help? Let me know if you've got more questions!