Iteration problem with research scrape
I'm using the "memory conscious next page" example script to scrape case law from a website for quantitative research. For some reason the VOLGPAGE variable does not increase from its initial value, so the page sequence never gets going. To my knowledge I have not changed anything that should interfere with the logic in the example below. The HAS_NEXT_PAGE variable is saved to a session variable, and I am calling the script "after each pattern match" on both the Search result page and the following Next search result page. I have read the posts about Techniques for Scraping Large Datasets and other relevant posts over and over, but I can't find a solution. Is there something missing in the example script, or am I missing something?
// If using an offset, this number should be the first search results page's offset, be it 0 or 1.
int initialOffset = 0;
// ... and this number is the amount that the offset increases by each
// time you push the "next page" link on the search results.
int offsetStep = 20;
String fileToScrape = "Next search results";
/* Generally no need to edit below here */
hasNextPage = "true"; // dummy value to allow the first page to be scraped
for (int currentPage = 1; hasNextPage != null; currentPage++)
{
// Clear this out, so the next page can find its own value for this variable.
session.setVariable("HAS_NEXT_PAGE", null);
session.setVariable("VOLGPAGE", currentPage);
session.setVariable("OFFSET", (currentPage - 1) * offsetStep + initialOffset);
session.scrapeFile(fileToScrape);
hasNextPage = session.getVariable("HAS_NEXT_PAGE");
}
I am having exactly the same problem
I am (clearly) no expert, but I have also come across this iteration problem.
Here is the code I started with:
// If using an offset, this number should be the first search results page's offset, be it 0 or 1.
int initialOffset = 0;
// ... and this number is the amount that the offset increases by each
// time you push the "next page" link on the search results.
int offsetStep = 1;
String fileToScrape = "VansDirectSearchResults";
/* Generally no need to edit below here */
hasNextPage = "true"; // dummy value to allow the first page to be scraped
for (int currentPage = 1; hasNextPage != null; currentPage++)
{
// Clear this out, so the next page can find its own value for this variable.
session.setVariable("HAS_NEXT_PAGE", null);
session.setVariable("PAGE", currentPage);
session.setVariable("OFFSET", (currentPage - 1) * offsetStep + initialOffset);
session.scrapeFile(fileToScrape);
hasNextPage = session.getVariable("HAS_NEXT_PAGE");
}
I am using an extractor pattern called HAS_NEXT_PAGE that saves a session variable (which happens to be a number) if there is a next page.
This pattern then calls the above script after each pattern match.
The initialisation script sets the first page to 0.
The page number then increases to 1 but stays at 1 throughout and never gets as far as 2.
Here is the log:
TITLE=Citroen Nemo 1.4 LX
VARIANT=Citroen Nemo 1.4 LX 70ps
DETAILS=The Citroen Nemo 1.4 LX 70ps - Hurry while stocks last!
DESCRIPTION=Includes as Standard: Near-Side Side Loading Door, Remote Central Locking, Electric Windows, CD Player, Trip Computer and much more. Plus Pack and Comfort Pack now available.
PAYLOAD=Gross Payload of 610kg
CASHPRICE=A small van for only £149 a month!
IMAGE1=files/car-image/Citroen_Nemo_Mini_Van.jpg
ID=/content/citroen-nemo
DetailsPageVansDirect: DATARECORD: Processing scripts after a pattern application.
Processing script: "ZuniversalCSVWriter"
DetailsPageVansDirect: DATARECORD: Processing scripts once if pattern matches.
DetailsPageVansDirect: DATARECORD: Processing scripts after all pattern applications.
VansDirectSearchResults: PRODUCTID: Processing scripts once if pattern matches.
VansDirectSearchResults: PRODUCTID: Processing scripts after all pattern applications.
VansDirectSearchResults: Processing scripts before all pattern applications.
VansDirectSearchResults: Applying extractor pattern: HAS_NEXT_PAGE
VansDirectSearchResults: Extracting data for pattern "HAS_NEXT_PAGE"
VansDirectSearchResults: The following data elements were found:
HAS_NEXT_PAGE--DataRecord 0:
junk=Go to page 3
HAS_NEXT_PAGE=3
Storing this value in a session variable.
VansDirectSearchResults: HAS_NEXT_PAGE: Processing scripts after a pattern application.
Processing script: "IterativeNextVansDirect"
Scraping file: "VansDirectSearchResults"
VansDirectSearchResults: Resolved URL: http://www.vansdirect.co.uk/custom-search?types=Van&tide=New&sort=ASC&page=1
Setting referer to: http://www.vansdirect.co.uk/content/citroen-nemo
VansDirectSearchResults: Sending request.
VansDirectSearchResults: Processing scripts before all pattern applications.
VansDirectSearchResults: Applying extractor pattern: PRODUCTID
VansDirectSearchResults: Extracting data for pattern "PRODUCTID"
VansDirectSearchResults: The following data elements were found:
PRODUCTID--DataRecord 0:
PRODUCTID=/cheap-vans/nissan-vans/nissan-nv200-van
Thanks in advance for your help!
I think that you are making the same error. If you run that script, it sets the page to 1, and scrapes again. But on that page request, the script runs again, and it also sets page to 1 and tries again.
You need to make sure that your script only starts on the first page.
int initialOffset = 0;
// ... and this number is the amount that the offset increases by each
// time you push the "next page" link on the search results.
int offsetStep = 1;
String fileToScrape = "VansDirectSearchResults";
/* Generally no need to edit below here */
hasNextPage = "true"; // dummy value to allow the first page to be scraped
while (hasNextPage != null && !session.shouldStopScraping())
{
// Clear this out, so the next page can find its own value for this variable.
session.setVariable("HAS_NEXT_PAGE", null);
session.addToVariable("PAGE", 1);
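// getv returns an Object, so convert it to an int before doing arithmetic.
int page = Integer.parseInt(String.valueOf(session.getv("PAGE")));
session.setVariable("OFFSET", (page - 1) * offsetStep + initialOffset);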
session.scrapeFile(fileToScrape);
hasNextPage = session.getVariable("HAS_NEXT_PAGE");
}
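To make the restart problem concrete, here is a rough, self-contained sketch in plain Java. It is only a toy model and does not use the real screen-scraper API: the class name, the `vars` map standing in for session variables, the three-page site and the request cap are all assumptions made up for illustration. It shows what happens when the next-page script is attached "after each pattern match" to the same file it scrapes, so every request starts a fresh loop at page 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the restart problem; not the real screen-scraper API. */
final class NestedRestartDemo {
    static final int LAST_PAGE = 3;     // hypothetical site with 3 result pages
    static final int MAX_REQUESTS = 5;  // safety cap so the demo terminates

    final Map<String, Object> vars = new HashMap<>();   // stand-in for session variables
    final List<Integer> requested = new ArrayList<>();  // pages actually requested

    /** Stand-in for scrapeFile: requests PAGE, then fires the pattern's script. */
    void scrapeFile() {
        if (requested.size() >= MAX_REQUESTS) return;
        int page = (Integer) vars.get("PAGE");
        requested.add(page);
        if (page < LAST_PAGE) {
            // The HAS_NEXT_PAGE pattern matched, so its script runs again.
            vars.put("HAS_NEXT_PAGE", String.valueOf(page + 1));
            nextPageScript();
        }
    }

    /** The loop from the thread: it starts from page 1 on every invocation. */
    void nextPageScript() {
        Object hasNextPage = "true";
        for (int currentPage = 1;
             hasNextPage != null && requested.size() < MAX_REQUESTS;
             currentPage++) {
            vars.put("HAS_NEXT_PAGE", null);
            vars.put("PAGE", currentPage);
            scrapeFile();
            hasNextPage = vars.get("HAS_NEXT_PAGE");
        }
    }

    static List<Integer> run() {
        NestedRestartDemo demo = new NestedRestartDemo();
        demo.nextPageScript(); // as if triggered by the first results page
        return demo.requested;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [1, 1, 1, 1, 1]
    }
}
```

Every nested call resets the counter before the outer loop gets a chance to increment it, which matches the log above, where the resolved URL keeps ending in page=1.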
Since there is no standard way that sites iterate pages, I can't see if there's a problem in the script. Could you describe the problem? Are you seeing an error, and if so what is it? Is the script running and not behaving as expected?
Thank you, Jason, for your support.
I don't see an error. The Search results page (1) is scraped fine, but the URL of the Next Search Results page is the same every time. The VOLGPAGE variable is 1 when the script has this line:
for (int currentPage = 1; hasNextPage != null; currentPage++)
The VOLGPAGE variable is 4 when the script has this line:
for (int currentPage = 4; hasNextPage != null; currentPage++)
And so on. The VOLGPAGE variable does not increase from its initial value by itself, and the page sequence never gets going. Hopefully this makes it a bit clearer.
I think I know what's up. Try:
{
for (currentPage=2; hasNextPage!=null; currentPage++)
{
session.setv("PAGE", currentPage);
// Whatever else you do
}
}
So the script doesn't start over every time you call it.
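One way to picture "only start once" is a guard flag. The sketch below is plain Java and a toy model only, not the real screen-scraper API: the `vars` map stands in for session variables, and the `PAGING_STARTED` flag, the class name and the three-page site are hypothetical names and values chosen for illustration. The script starts its loop at page 2 the first time it is triggered and returns immediately on any nested call, so the outer loop keeps counting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a guarded next-page script; not the real screen-scraper API. */
final class GuardedPagingDemo {
    static final int LAST_PAGE = 3; // hypothetical site with 3 result pages

    final Map<String, Object> vars = new HashMap<>();   // stand-in for session variables
    final List<Integer> requested = new ArrayList<>();  // pages actually requested

    /** Stand-in for scrapeFile: requests PAGE, then fires the pattern's script. */
    void scrapeFile() {
        int page = (Integer) vars.get("PAGE");
        requested.add(page);
        if (page < LAST_PAGE) {
            vars.put("HAS_NEXT_PAGE", String.valueOf(page + 1));
            nextPageScript();
        }
    }

    /** Starts the loop only once; nested invocations return immediately. */
    void nextPageScript() {
        if (vars.get("PAGING_STARTED") != null) {
            return; // a loop is already running further up the call stack
        }
        vars.put("PAGING_STARTED", "true"); // hypothetical guard flag
        Object hasNextPage = "true";
        // Page 1 has already been scraped, so continue from page 2.
        for (int currentPage = 2; hasNextPage != null; currentPage++) {
            vars.put("HAS_NEXT_PAGE", null);
            vars.put("PAGE", currentPage);
            scrapeFile();
            hasNextPage = vars.get("HAS_NEXT_PAGE");
        }
    }

    static List<Integer> run() {
        GuardedPagingDemo demo = new GuardedPagingDemo();
        demo.vars.put("PAGE", 1);
        demo.scrapeFile(); // the first results page triggers the script
        return demo.requested;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [1, 2, 3]
    }
}
```

In a real scraping session the same effect might be had by attaching the script only to the first search results file rather than to the file the loop itself scrapes; the flag is just one way to express that.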