Input from CSV: how do I iterate when scrape results have a next page?

I'm scraping using a file of zipcodes to build the URL, and the scrape works when the results fit on a single page. When there is a next page, I can't figure out how to get the script to scrape all of the next pages BEFORE moving on to the next zipcode in the CSV. So I only get one page scraped per zipcode, even when a zipcode has many pages of results.

Here is the code that I'm working with so far. If anyone can help, please jump in!

// If using an offset, this number should be the first search results page's
// offset, be it 0 or 1.
int initialOffset = 1;

// ... and this number is the amount that the offset increases by each
// time you push the "next page" link on the search results.
int offsetStep = 1;

String fileToScrape = "ScrapeFile";

/* Generally no need to edit below here */

// Dummy value to allow the first page to be scraped.
String hasNextPage = "true";

for (int currentPage = 1; hasNextPage != null; currentPage++)
{
    // Clear this out, so the next page can find its own value for this variable.
    session.setVariable("HAS_NEXT_PAGE", null);
    session.setVariable("NUMBER", currentPage);
    session.setVariable("OFFSET", (currentPage - 1) * offsetStep + initialOffset);
    session.scrapeFile(fileToScrape);
    hasNextPage = (String) session.getVariable("HAS_NEXT_PAGE");
}
import java.io.*;

////////////////////////////////////////////
// Make sure you have the "input/ny1.csv" in your screen-scraper installation directory
session.setVariable("INPUT_FILE", "input/ny1.csv");
////////////////////////////////////////////

BufferedReader buffer = new BufferedReader(new FileReader(session.getVariable("INPUT_FILE")));
String line = "";

while ((line = buffer.readLine()) != null)
{
    // Set the next zipcode to be searched.
    session.setVariable("ZIP", line);

    // Output to the log for debugging.
    session.log("***Now scraping: " + line);

    // Scrape the next search results with the new zipcode.
    session.scrapeFile("ScrapeFile");
}

buffer.close();
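For reference, the input file is assumed to be one zipcode per line with no header row; the values below are just placeholders:

10001
10002
10003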


The first part of the script detects a next page and signals the loop to cycle to it. The second part takes the input from ny1.csv and inserts each zipcode into the URL to be scraped.
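For anyone adapting this: the loop assumes that "ScrapeFile" contains an extractor pattern that only matches when a next-page link is present, with a small script fired "After pattern is applied" to set the flag. A minimal sketch of that script (the pattern itself is an assumption about the setup):

// Fired "After pattern is applied" on a pattern that matches the next-page
// link. Any non-null value tells the loop above to scrape another page.
session.setVariable("HAS_NEXT_PAGE", "true");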

Thanks Scott! But I couldn't get it to work, so I did this...

Thanks for the reply, Scott. I really appreciate the help.

I originally had the single script set up as two scripts, as you suggested, but when that failed to work I thought combining them might be an option. The iterative next-page script works like a champ when I run it "after scraping session" with a single URL as the input source, but it fails when I try to feed in the zips from the file. So I came up with a workaround that is not quite as elegant but gets the job done for the site I'm scraping. The site has a limit of 100 results pages, after which it begins to display duplicate information even if more results are available.

I am posting my workaround here for others who may be having difficulty using .csv or .txt input as a source when there are multiple result pages to scrape. Thanks go to the original coders; I just hacked these bits together.

This code changes the page session variable one page at a time until it reaches 100, then takes the next input value from "yourcsvfile.csv" and repeats the page-by-page cycle. It should be called before the scraping session begins. In many ways it is like an old used car... it ain't pretty, but it'll get you where you need to go.

The downside to this solution is that if the site you are scraping does not actually have 100 pages of results (you can change the page count to whatever you need; I just happened to need 100), it will cycle through them anyway, producing 401 errors, which on some sites may get you booted off the server. The other downside is that if you break out of this solution, it stops scraping, but the script will continue to load and cycle through the 100 pages until you kill the server. In other words, it does not have a graceful exit.

If anyone would like to recommend improvements, or has other methods for cycling through multiple result pages when using data input from a .txt or .csv file, please reply to this post.

import java.io.*;

////////////////////////////////////////////
// Make sure you have the "input/yourcsvfile.csv" in your screen-scraper installation directory
session.setVariable("INPUT_FILE", "input/yourcsvfile.csv");
////////////////////////////////////////////

BufferedReader buffer = new BufferedReader(new FileReader(session.getVariable("INPUT_FILE")));
String line = "";

while ((line = buffer.readLine()) != null)
{
    // Set the next zipcode to be searched.
    session.setVariable("ZIPCODE", line);

    // Output to the log for debugging.
    session.log("***Now scraping: " + line);

    /*
        USE THE TWO INTEGERS BELOW TO SET THE PAGE NUMBERS TO SCRAPE
        n = ENDING page
        i = STARTING page
    */
    // Set the starting page number.
    int i = 1;
    // Set the ending page number.
    int n = 100;

    // Seed the session variable with the starting page.
    session.setVariable("NUMBER", i);

    // Loop through pages i through n for this zipcode.
    while (i <= n)
    {
        // Send progress to the log.
        session.log("+++Scraping page #" + i);
        // Scrape the results page.
        session.scrapeFile("scrapefilename");
        // Advance to the next page and update the session variable.
        i++;
        session.setVariable("NUMBER", i);
    }
}

buffer.close();
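One refinement worth trying, sketched on the assumption that "scrapefilename" has an extractor pattern that sets HAS_NEXT_PAGE only when a next-page link is matched: check that flag on each pass so the loop can break early instead of always running to page 100.

// Inside the per-zipcode loop, replacing the fixed 100-page cycle.
// Assumes an extractor pattern in "scrapefilename" sets HAS_NEXT_PAGE.
int i = 1;
String hasNextPage = "true"; // dummy value so the first page gets scraped

while (hasNextPage != null && i <= 100)
{
    // Clear the flag so each page must re-assert that another page exists.
    session.setVariable("HAS_NEXT_PAGE", null);
    session.setVariable("NUMBER", i);

    session.log("+++Scraping page #" + i);
    session.scrapeFile("scrapefilename");

    hasNextPage = (String) session.getVariable("HAS_NEXT_PAGE");
    i++;
}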

Use the shouldStopScraping()

Use the shouldStopScraping() method to prevent runaway loops.
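For instance, dropped into the inner page loop of the workaround above (a sketch reusing that script's variable names):

while (i <= n)
{
    // Exit cleanly if a stop has been requested.
    if (session.shouldStopScraping())
    {
        session.log("---Stop requested; breaking out of the page loop.");
        break;
    }

    session.log("+++Scraping page #" + i);
    session.scrapeFile("scrapefilename");
    i++;
    session.setVariable("NUMBER", i);
}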

bcb, You might need to

bcb,

You might need to separate the next-page part of your script from the read-in-zips part.

With two scripts, fire your first script at the beginning of your scraping session. Then fire your next-page script from an extractor pattern that attempts to match some indication that there is a next page to navigate to.

You may need a distinct scrapeable file just to handle the pagination extractor pattern matching... then one or more scrapeable files to extract the data you're after.
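For illustration, roughly how that split might look (the scrapeable file name "SearchResults" and the variable names are placeholders, not your actual setup):

// Script 1 -- fired before the scraping session begins: read the zips and
// request the first results page for each one.
import java.io.*;

BufferedReader buffer = new BufferedReader(new FileReader("input/ny1.csv"));
String line;

while ((line = buffer.readLine()) != null)
{
    session.setVariable("ZIP", line);
    session.setVariable("NUMBER", 1);     // start each zipcode on page 1
    session.scrapeFile("SearchResults");  // placeholder scrapeable file
}

buffer.close();

// Script 2 -- fired "After pattern is applied" on the next-page extractor
// pattern: each match means another page exists, so advance and re-scrape.
int nextPage = Integer.parseInt(session.getVariable("NUMBER").toString()) + 1;
session.setVariable("NUMBER", nextPage);
session.scrapeFile("SearchResults");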

-Scott