Next page
I have been working on a scrape for about 3 days straight and am a little confused about how to get it to go to the next page. Below is the HTML that renders the page links, but I cannot figure out how to write an extractor pattern and/or script to proceed to the next page. I am also attaching my most current scrape work.
<div class="none">...</div>
<div class="none"><a href="/ProcessLibrary/S/1/">1</a></div>
<div class="none"><a href="/ProcessLibrary/S/2/">2</a></div>
<div class="none"><a href="/ProcessLibrary/S/3/">3</a></div>
<div class="none"><a href="/ProcessLibrary/S/4/">4</a></div>
<div class="none"><a href="/ProcessLibrary/S/5/">5</a></div>
<div class="none"><a href="/ProcessLibrary/S/6/">6</a></div>
<div class="none"><a href="/ProcessLibrary/S/7/">7</a></div>
<div class="none"><a href="/ProcessLibrary/S/8/">8</a></div>
<div class="none"><a href="/ProcessLibrary/S/9/">9</a></div>
<div class="none">...</div>
<div class="other"><a href="/ProcessLibrary/S/107/">Last</a></div>
</div>
| Attachment | Size |
| --- | --- |
| JJ54 Process Library (Scraping Session)_edited2.sss_.txt | 15.18 KB |
| jj54.com (Scraping Session).sss | 15.66 KB |
Pages are just HTTP requests, but since there is no standard way to implement paging, you have to look at each site and figure out its trick.
I think this one looks pretty straightforward. You can see from the "Last" link that there are 107 pages. You will need to scrape that number, then run a script like:
// Function to convert string to integer
makeNum(num)
{
    if (num != null)
    {
        num = String.valueOf(num);
        num = num.replaceAll("\\D", "");
        if (num.length() > 0)
        {
            num = Integer.parseInt(num);
        }
        else
            num = 0;
    }
    else
        num = 0;
    return num;
}
// This converts the scraped number to a useable integer
totalPages = makeNum(dataRecord.get("LAST"));
// Start iterator on 2 as we're already on page 1.
for (i=2; i <= totalPages; i++)
{
    session.setVarible("PAGE", i);
    session.log("+++Starting page " + i + " of " + totalPages);
    session.scrapeFile("Next search results");
}
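You will also need an extractor pattern that captures that page count from the "Last" link so dataRecord.get("LAST") has something to return. Something along these lines should work, based on the HTML you posted (the exact pattern text may need adjusting for your pages, and the LAST token name is just what the script above expects):

<div class="other"><a href="/ProcessLibrary/S/~@LAST@~/">Last</a></div>

With that in place, ~@LAST@~ would match 107 for the category above.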
Jason
Thanks for the reply. I typed up something similar the first time I was working on this scrape, but kept getting errors on it for some reason. I tried running what you had and keep getting this error: "The error message was: Attempt to invoke method: get() on undefined variable or class name: dataRecord : at Line: 22." I think I have everything set up correctly on the scrape of the last page. Any ideas where to look?
J
For the dataRecord to be in scope, you need to have the script run after each extractor pattern match.
Jason-
First off, thanks for all the help on this. I tried running through it again with no luck. I'm still getting the same error, along with another one now:
"Storing this value in a session variable.
JJ54 Search results: Processing scripts after a pattern application.
Processing script: "jj54 Loop Pages"
An error occurred while processing the script: jj54 Loop Pages
The error message was: Error in method invocation: Method setVarible( java.lang.String, int ) not found in class 'com.screenscraper.scraper.ScrapingSession' : at Line: 27.
JJ54 Search results: Processing scripts after all pattern applications.
Processing scripts after scraping session has ended.
Processing script: "jj54 Loop Pages"
An error occurred while processing the script: jj54 Loop Pages
The error message was: Attempt to invoke method: get() on undefined variable or class name: dataRecord : at Line: 22.
Scraping session "JJ54 Process Library" finished."
I am attaching my updated scrape so you can see where I'm at. I think the problem might be with the "PAGE" variable, but I'm not sure.
Thanks again
J
It says that it is processing the script "jj54 Loop Pages" after the scraping session has ended, so the dataRecord can't be in scope there.
Jason-
Thanks for catching that. I took the script out of that spot in the scrape, but I'm still not getting the desired result. I have 3 extractor patterns that run across the two scrapeable files. Do I need to have the script run on each extractor pattern as "After each pattern is applied"?
Correct.
Jason-
That's what I thought. I have it running after each extractor pattern and still get the error. For some reason it keeps hanging on setting the PAGE variable. This is the URL I have in my search results scrapeable file: http://www.jj54.com/ProcessLibrary/~#CATS#~/~#PAGE#~/, which with CATS set to S and PAGE set to 2 should resolve to /ProcessLibrary/S/2/. Any other ideas?
Thanks again
J
What error are you getting now? I see one in the log you posted:
The error message was: Error in method invocation: Method setVarible( java.lang.String, int ) not found in
That's a typo: setVarible should be setVariable.
Thanks for finding that error! I failed spelling in school... lol. I looked through everything and it seems to be running fine except for a few things.
1. Every time it runs, it loops through the pages twice for some reason, and I can't figure out where to look to find the problem. It scrapes page one with no problem and proceeds to page 2, but then hangs there (scraping that page over and over) and never goes to page three and beyond.
2. Also, is there a way in the script to make sure the last page gets scraped? Right now the loop page code shows i < totalPages, so I'm not sure the final page ever gets requested.
This scrape is definitely teaching me a lot. Thanks for guiding me so far.
I attached the updated scrape that I am using so you can see for yourself.
Thanks for all your help. You need a pay raise for putting up with me!!!!....lol
Jason
Are you calling the same scrapeableFile recursively? That is, you scrape it, find the next page link, then scrape the same page again so it too finds the next page link.
You can do that, but for it to work you need to wrap the iterator script so it won't keep recurring like that. Make sure you set PAGE to 1 when you start, and just add:
if (session.getVariable("PAGE")=1)
{
// Iterate pages
}
This way, if the extractor matches again on the next page, it won't try to start the iteration again.
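For the "set PAGE to 1 when you start" part, a minimal sketch would be a one-line script set to run "Before scraping session begins" (where you attach it is up to you; anything that runs before the first request of the search results file would do):

// Seed the page counter so the first request is page 1
// and the check above passes exactly once.
session.setVariable("PAGE", 1);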
Jason-
I looked through my scrape again and can't tell whether I am calling it recursively or not. I'm new to programming and still learning. I added the if statement, but it keeps throwing a BeanShell error at me. I think I am putting it in the right place, but I don't know for sure. Can you take a look?
// Function to convert string to integer
makeNum(num)
{
    if (num != null)
    {
        num = String.valueOf(num);
        num = num.replaceAll("\\D", "");
        if (num.length() > 0)
        {
            num = Integer.parseInt(num);
        }
        else
            num = 0;
    }
    else
        num = 0;
    return num;
}
// This converts the scraped number to a useable integer
totalPages = makeNum(dataRecord.get("LAST"));
// Start iterator on 2 as we're already on page 1.
if (session.getVariable("PAGE")=1)
{
    for (i=2; i < totalPages; i++)
    {
        session.setVariable("PAGE", i);
        session.log("+++Starting page " + i + " of " + totalPages);
        session.scrapeFile("JJ54 Search results");
    }
}
Jason
Thanks for all your help. I am not sure I follow what you're saying. I looked through my script, but I can't tell if I am calling it recursively. I would like to just scrape each page once, find the next link, and move on to the next page if possible. Is that possible, or does it have to scrape the same page again?
Thanks
J
You may call whatever page you like. If all the pages of search results look the same, though, it's often easier to call the same page recursively so you don't have to maintain duplicate scrapeable files and extractor patterns.
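As a rough sketch of that recursive style (the NEXT_PAGE token and the "JJ54 Search results" file name here are placeholders, not taken from your session): have an extractor pattern on the pager capture the next page number into a NEXT_PAGE session variable, then run something like this "After each pattern match":

// Hypothetical recursion script: if the pager pattern found a next page,
// request the same scrapeable file again for that page.
nextPage = session.getVariable("NEXT_PAGE");
if (nextPage != null)
{
    session.setVariable("PAGE", nextPage);
    // Clear the token so the recursion stops once a page has no next link;
    // session variables would otherwise keep their old value.
    session.setVariable("NEXT_PAGE", null);
    session.scrapeFile("JJ54 Search results");
}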
Jason,
I've attached a slight variation on the approach Jason was taking. I noticed that the site includes the total number of results in the title tag of each category page. Since there are 90 results per page, I take that number and divide it by 90 to get the number of pages, then check the modulus to see whether it divides evenly; if it doesn't, I add one more page.
All the while, the scrape loops through all of the categories and calls this script for each one.
// Function to convert string to integer
makeNum(num)
{
    if (num != null)
    {
        num = String.valueOf(num);
        num = num.replaceAll("\\D", "");
        if (num.length() > 0)
        {
            num = Integer.parseInt(num);
        }
        else
            num = 0;
    }
    else
        num = 0;
    return num;
}
totalResults = makeNum(dataRecord.get("TOTAL_RESULTS"));
totalPages = (totalResults/90);
// If totalResults does not divide evenly by 90,
// then add an additional page for the remaining results.
if (totalResults % 90 > 0)
{
    totalPages++;
}
session.log("totalPages: " + totalPages);
for (i=1; i <= totalPages; i++)
{
    session.setVariable("PAGE", i);
    session.log("+++Starting page " + i + " of " + totalPages);
    session.log("Number of scripts on stack: " + session.getNumScriptsOnStack());
    //session.breakpoint();
    // To prevent screen-scraper from ignoring your "stop scraping session" request
    if (session.shouldStopScraping() == false)
    {
        session.scrapeFile("Search results");
    }
}
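As a worked example with made-up numbers: a title reporting 9,600 results gives 9600 / 90 = 106 by integer division, and since 9600 % 90 = 60 is greater than zero, totalPages gets bumped to 107.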
-Scott