Page incrementing by 20
Hello,
I'm very happy to find this tool and I would like to thank its creators.
I'm following the tutorial to scrape a site.
I've completed almost all.
My only problem is that I don't need search criteria, and the page parameter increments by 20. (It isn't really a page number; it counts products. If my search returns 100 products, the site shows the first 20, then 20 to 40, and so on. So if I put page=20 in the URL, it shows the products between 20 and 40.)
So I've found 2 approaches, 1 on the forum and the other in the tutorial.
The one on the forum tries to increment the page variable in the initialization script (it's from 2005 and I couldn't run it)
http://community.screen-scraper.com/node/262
The other one is in the tutorial (Tutorial 7: Page 3: Altering the Scraping Session). It reads search criteria from a text file. I've changed it so that the text file holds the page variable values (20 40 60 80 100, etc.) and I try to read the variable from this file.
Which one is the better way to go, and how can I solve this problem?
Thank you
Actually, I would try to use
Actually, I would try to use an initialization script that resembles the code below. On your scrapeableFile, put a checkmark in the box that says "This scrapeable file will be invoked manually" (the first tab of the scrapeableFile). Then make an Initialize script that does the following:
// Interpreted Java
// Start at 20 so the loop runs once before the extractor pattern overwrites TOTAL_RESULTS.
session.setVariable("TOTAL_RESULTS", "20");
for (int offset = 0; offset < Integer.parseInt(session.getVariable("TOTAL_RESULTS").toString()); offset += 20)
{
    session.setVariable("OFFSET", offset);
    session.scrapeFile("Your scrapeableFile name here");
}
Then, add an extractor pattern on your scrapeableFile (the same one that you call in the "for" loop above) which looks something like this:
Search found ~@TOTAL_RESULTS@~ results
The pattern to use in the "TOTAL_RESULTS" token should be "\d+". Make sure you put a checkmark in the box that makes it save to a session variable.
You'll have to adapt that extractor pattern to match whatever is on the search results page.
If your search results page doesn't tell you how many results there are in total, then try the following instead:
// Interpreted Java
session.setVariable("TOTAL_PAGES","1");
for (int currentPage = 1; currentPage <= Integer.parseInt(session.getVariable("TOTAL_PAGES")); currentPage++)
{
session.setVariable("OFFSET", (currentPage - 1) * 20);
session.scrapeFile("Your scrapeableFile name here");
}
And then your extractor pattern on the scrapeableFile would look more like:
Showing page ~@junk@~ of ~@TOTAL_PAGES@~
The pattern to use in the "TOTAL_PAGES" and "junk" tokens should both be "\d+". Only put a checkmark to save to a session variable on the token for "TOTAL_PAGES".
Does that all make sense? In both cases, I'm using a For loop to increment a variable automatically. The middle part of the For statement is the part that checks to see if it should keep going. It determines if your current offset/page number is smaller than the maximum.
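For anyone who wants to see that loop condition on its own, here is a minimal plain-Java sketch of which offsets get requested for a given total. It doesn't touch screen-scraper at all; the class and method names (OffsetLoopSketch, offsets) are made up for illustration, and the page size of 20 is just the value from this thread. The bound is written as offset < total so that a partial last page is still requested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the screen-scraper API.
class OffsetLoopSketch {
    // Lists every offset the for loop would request for a given result count.
    static List<Integer> offsets(int totalResults, int pageSize) {
        List<Integer> result = new ArrayList<>();
        // The middle part of the for statement decides whether to keep going.
        for (int offset = 0; offset < totalResults; offset += pageSize) {
            result.add(offset);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(offsets(100, 20)); // prints [0, 20, 40, 60, 80]
        System.out.println(offsets(95, 20));  // prints [0, 20, 40, 60, 80]
    }
}
```

With 95 results the last request still starts at offset 80, so the final 15 products aren't skipped.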
Let me know if you need more help! :)
Tim
I have the same issue and
I have the same issue and tried your suggestions, but they did not work for me. The site doesn't list total pages anywhere, just a NEXT link to the page 20 rows further on. Right now, as I have it configured, I am stuck in a loop: it keeps setting the variable back to 1, so it grabs page one over and over.
Here is the URL structure:
http://www.websiteurl.com/results/results.cfm?startrow=1
http://www.websiteurl.com/results/results.cfm?startrow=21
http://www.websiteurl.com/results/results.cfm?startrow=41
I am already extracting the startrow number as PAGE. Any suggestions?
Also, what is the "offset" variable you mention in your example above?
I'm a newbie with the same issue
and I can't get this to work for me. I've tried two approaches and both end up bringing back the same page over and over.
My first approach was to scrape the first_row variable. The scrape worked successfully because I see NEXTROW=26 in the log. However, the parameter value ~#NEXTROW#~ for Key of first_row apparently is not replaced because I still see :
Query result - temp for testing: POST data: first_row=1
in the log.
After the NEXTROW approach didn't work I tried the solution above and that did not work for me either.
This time I used a variable of ~#PAGE#~ as the value of first_row and used the following script executed 'after pattern is applied'
session.setVariable("NEXT_PAGE","nextpage");
for (int currentPage = 1; session.getVariable("NEXT_PAGE") != null; currentPage++)
{
session.setVariable("PAGE", (currentPage - 1) * 20 + 1);
session.setVariable("NEXT_PAGE", null);
session.scrapeFile("Query result - temp for testing");
}
Again this led to the same page being pulled over and over.
Please help!
Thanks
"PAGE" in this case will have
"PAGE" in this case will have the values 1, 21, 41, 61, etc... Do you need actual page numbers, or these 'offset' values?
There's no way that the "PAGE" variable in the code snippet you gave isn't working, since it's just a simple 'for' construct, where 'currentPage' is most definitely incrementing each time. Either you've got a bad variable name in your Parameters on the scrapeableFile, or you're handing it 'offset' numbers (as you're doing in the code) when you actually mean to give it 'page' numbers.
more info
Maybe if I type enough info here I'll figure this thing out myself or give you enough information to see the problem.
From the beginning...
Scraping Session has one script 'Initialize Session' executed Before scraping session begins. Script is:
session.setVariable( "NEXTR", "2" );
Sequence 1 scrapeable file is Login. This logs in (successfully) then executes script "Scrape All Pages" after file is scraped.
Scrape All Pages looks like the script I posted before except for the one variable name change to NEXTR:
session.setVariable("NEXT_PAGE","nextpage");
for (int currentPage = 1; session.getVariable("NEXT_PAGE") != null; currentPage++)
{
session.setVariable("NEXTR", (currentPage - 1) * 20 + 1);
session.setVariable("NEXT_PAGE", null);
session.scrapeFile("Query result - temp for testing");
}
Scrapeable File "Query result - temp for testing" is flagged to be invoked manually from script. All variables are set with desired values except "first_row" which is set to ~#NEXTR#~
That's it. I set NEXTR to '2' instead of '1' in initialize to see if that would actually start me on the 2nd row. It does not, and I see in the log "POST data: first_row=". (i.e. no value was assigned to first_row)
This is my first screen-scraper and it was so exciting to grab all the data I wanted from page 1 of the website and now I'm so flummoxed trying to get to page 2 of the website! Thanks for your help.
You've almost got it right,
You've almost got it right, except that setting NEXTR to 2 in an initialize script won't do anything for you, because the script that we've posted here (in its variants) only runs the scrapeableFile "Query result - temp for testing" *after* overwriting NEXTR with '(currentPage - 1) * 20 + 1'. You could put the word 'moo' in there to prime it, but it would be overwritten by that for loop before the scrapeableFile ever runs.
POST parameters will actually *not* show up as part of the URL. POST data isn't part of the request URI; it travels in the body of the request, for use by the server receiving it. GET parameters *are* a part of the URI, because their intended purpose is very different from that of POST. (Don't ask a .NET web designer that question, because he'll lie to your face about how POST should be used at all times :P) GET and POST serve very different purposes, although in practice both are used simply to send variables to the target page.
So as for the issue at hand... I'm running a dummy scrape modeled after yours, with that exact script copied and pasted, except that I log the value of "NEXTR" right after the 'scrapeFile' call. (I had to call '.toString()' on it, since the value is an integer.) I made it post to a local page of mine with a NEXTR variable, and it resolves just fine. Here's my entire log:
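To make the GET/POST difference concrete, here is a small plain-Java sketch that only builds strings; it sends nothing over the network. The class and method names (GetVersusPostSketch, getUrl, postBody) are invented for illustration, and the URL and parameter name are simply the ones that appear in this thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration; it builds strings only and makes no request.
class GetVersusPostSketch {
    // URL-encodes one key=value pair.
    static String encode(String key, String value) {
        try {
            return URLEncoder.encode(key, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    // With GET, the parameter becomes part of the URL itself.
    static String getUrl(String baseUrl, String key, String value) {
        return baseUrl + "?" + encode(key, value);
    }

    // With POST, the URL stays as it is and the parameter goes in the request body.
    static String postBody(String key, String value) {
        return encode(key, value);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080/johan.php";
        System.out.println("GET URL:   " + getUrl(base, "first_row", "21"));
        System.out.println("POST URL:  " + base);
        System.out.println("POST body: " + postBody("first_row", "21"));
    }
}
```

That is why a POST parameter shows up on the "POST data:" line of the log while the "Resolved URL:" line stays unchanged.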
Starting scraper.
Running scraping session: ~~Dummy
Processing scripts before scraping session begins.
Processing script: "~~init"
Scraping file: "Query result - temp for testing"
Query result - temp for testing: Preliminary URL: http://localhost:8080/johan.php
Query result - temp for testing: Using strict mode.
Query result - temp for testing: POST data: NEXTR=1
Query result - temp for testing: Resolved URL: http://localhost:8080/johan.php
Query result - temp for testing: Sending request.
Query result - temp for testing: Extracting data for pattern "detect 'next page' link"
Query result - temp for testing: The following data elements were found:
detect 'next page' link--DataRecord 0:
NEXT_PAGE=
Storing this value in a session variable.
1
Scraping file: "Query result - temp for testing"
Query result - temp for testing: Preliminary URL: http://localhost:8080/johan.php
Query result - temp for testing: Using strict mode.
Query result - temp for testing: POST data: NEXTR=21
Query result - temp for testing: Resolved URL: http://localhost:8080/johan.php
Query result - temp for testing: Sending request.
Query result - temp for testing: Extracting data for pattern "detect 'next page' link"
Query result - temp for testing: The following data elements were found:
detect 'next page' link--DataRecord 0:
NEXT_PAGE=
Storing this value in a session variable.
21
Scraping file: "Query result - temp for testing"
Query result - temp for testing: Preliminary URL: http://localhost:8080/johan.php
Query result - temp for testing: Using strict mode.
Query result - temp for testing: POST data: NEXTR=41
Query result - temp for testing: Resolved URL: http://localhost:8080/johan.php
Query result - temp for testing: Sending request.
Query result - temp for testing: Extracting data for pattern "detect 'next page' link"
Query result - temp for testing: The following data elements were found:
detect 'next page' link--DataRecord 0:
NEXT_PAGE=
Storing this value in a session variable.
41
Scraping file: "Query result - temp for testing"
Query result - temp for testing: Preliminary URL: http://localhost:8080/johan.php
Query result - temp for testing: Using strict mode.
Query result - temp for testing: POST data: NEXTR=61
Query result - temp for testing: Resolved URL: http://localhost:8080/johan.php
Query result - temp for testing: Sending request.
Query result - temp for testing: Extracting data for pattern "detect 'next page' link"
Query result - temp for testing: The following data elements were found:
detect 'next page' link--DataRecord 0:
NEXT_PAGE=
Storing this value in a session variable.
61
Scraping file: "Query result - temp for testing"
Query result - temp for testing: Preliminary URL: http://localhost:8080/johan.php
Query result - temp for testing: Using strict mode.
Query result - temp for testing: POST data: NEXTR=81
Query result - temp for testing: Resolved URL: http://localhost:8080/johan.php
Query result - temp for testing: Sending request.
Query result - temp for testing: Extracting data for pattern "detect 'next page' link"
Query result - temp for testing: The following data elements were found:
detect 'next page' link--DataRecord 0:
NEXT_PAGE=
Storing this value in a session variable.
81
Scraping file: "Query result - temp for testing"
Query result - temp for testing: Preliminary URL: http://localhost:8080/johan.php
Query result - temp for testing: Using strict mode.
Query result - temp for testing: POST data: NEXTR=101
Query result - temp for testing: Resolved URL: http://localhost:8080/johan.php
Query result - temp for testing: Sending request.
Query result - temp for testing: Extracting data for pattern "detect 'next page' link"
Query result - temp for testing: The following data elements were found:
detect 'next page' link--DataRecord 0:
NEXT_PAGE=
Storing this value in a session variable.
101
Scraping session "~~Dummy" finished.
I made it break in the middle. I made it always detect a next page by just making it match a '<ul>' tag, so that it just kept on going and going until I made it quit. As you can see, it's posting the data correctly, incrementing the value by 20 each time, starting from 1.
I'm not sure what could be wrong with your setup, since that script performs completely as expected. Semantically speaking, though, you could theoretically be NOT matching your 'NEXT_PAGE' variable, and so it just moves on to the next category or something, which makes it reset to '1' every time that script is executed. Does that make sense? Are you iterating 'categories' of sorts, where each 'category' needs to deal with pages?
It just feels like either *something* is not getting saved as a session variable, or you're not matching a 'NEXT_PAGE' indicator, or nothing is actually wrong and you're just seeing it attempting the first page for every category, and there simply *isn't* a next page for the categories you're seeing scraped. (In that latter case, you should eventually see it hit some 'NEXT_PAGE' action, once it comes upon a category that does in fact have a next page.)
Is any of this helping? If you can, post a log (or the relevant segment thereof) from the scrape. You can copy it out of the window in screen-scraper. Please be sure to put "<code>" and "</code>" around the log that you post, so that any HTML in the log doesn't get interpreted as actual HTML.
Hope we can sort out what's going wrong!
Tim
I've got it working
Tim
You called it correctly... I was not matching a NEXT_PAGE indicator. I misunderstood how getVariable works. Also, after an untimely close of screen-scraper I had to restore the C:\Program Files\screen-scraper basic edition\resource\db files from a backup from a few days ago, and after the restore I had a misplaced, extraneous call of my scraping script. (It is great that screen-scraper automatically puts those backup folders out there!) I'm not sure if I had that same issue when I last logged and didn't realize it, or whether I introduced it today, but I do know that your posts were a big help in getting me to look in the right direction. Many thanks!
Sorry about the delay in posting an all clear reply. I only get a chance to work with this sporadically.
Thanks
Alan
still not working but maybe a new clue
The values 1, 21, 41, 61, etc are what I want and I don't understand why i'm not getting them.
The variable name that I used in Parameters was ~#PAGE#~ but when I look at it now I notice that the ~#PAGE#~ name did not save and the name ~#NEXTROW#~ was listed in the Parameters instead. Odd, since I'm sure I saved it with PAGE. But instead of changing the parameter back to PAGE I instead changed the script to use NEXTROW in place of PAGE but I'm still seeing
POST data: first_row=1
in the log. But this is not showing in the "Resolved URL:" in the log. Shouldn't the parameters show up as part of the URL? Why would it not?
Thanks!
Yeah, I realize that I didn't
Yeah, I realize that I didn't fully explain the offset variable.
The for loop is automatically incrementing a fake page number, starting at 1. Thus, to know what result to start the search at, do (currentPage-1)*20+1 to figure it out. If page is 1, then the starting number is 1, if page is 2, then the starting number should be 21, etc, etc. The example I gave above lacked the last '+1', but you'll need it in your example.
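If it helps, that arithmetic can be checked on its own with a couple of lines of plain Java. This is only a sketch; the names StartRowSketch, startRow and pageOf are made up here, and the page size of 20 is the one from this thread.

```java
// Hypothetical helper for illustration; not part of the screen-scraper API.
class StartRowSketch {
    // Converts a 1-based page number into the first row shown on that page.
    static int startRow(int currentPage, int pageSize) {
        return (currentPage - 1) * pageSize + 1;
    }

    // Goes the other way: which 1-based page does a given start row belong to?
    static int pageOf(int startRow, int pageSize) {
        return (startRow - 1) / pageSize + 1;
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 4; page++) {
            // prints 1 -> 1, 2 -> 21, 3 -> 41, 4 -> 61 on separate lines
            System.out.println(page + " -> " + startRow(page, 20));
        }
    }
}
```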
The point of the OFFSET variable is to represent the "startrow", which you've said is "PAGE" in your scrape.
I'll repost the code, but this time with some alterations for your case... Since you don't know how many pages will be there, then your scrapeableFile should have an extractor pattern on it which matches the "next page" link. Make it use the variable "NEXT_PAGE", and make sure that it saves to a session variable. That way, the scrape will continue so long as your scrapeableFile can find a "next page" link.
You shouldn't need to scrape the "PAGE" (startrow) on your file, since we'll set it manually in this script.
Here's that code... all you have to do is change that last line (above the last closing brace) to the name of your file.
session.setVariable("NEXT_PAGE","nextpage");
for (int currentPage = 1; session.getVariable("NEXT_PAGE") != null; currentPage++)
{
session.setVariable("PAGE", (currentPage - 1) * 20 + 1);
session.setVariable("NEXT_PAGE", null);
session.scrapeFile("Your scrapeableFile name here");
}
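To see why that loop stops on its own, here is a plain-Java simulation you can run outside screen-scraper. FakeSession is a hypothetical stand-in that I'm assuming behaves like the real session only as far as storing variables goes; its scrapeFile just pretends the "next page" pattern matched on every page except the last one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NextPageLoopSketch {
    // Hypothetical stand-in for the session object; it only mimics variables.
    static class FakeSession {
        final Map<String, Object> variables = new HashMap<>();
        final List<Integer> requestedRows = new ArrayList<>();
        final int pagesOnSite;

        FakeSession(int pagesOnSite) { this.pagesOnSite = pagesOnSite; }

        void setVariable(String name, Object value) { variables.put(name, value); }
        Object getVariable(String name) { return variables.get(name); }

        // Pretends to request a page. The fake "extractor pattern" saves
        // NEXT_PAGE only when there is a later page to go to.
        void scrapeFile(String fileName) {
            int startRow = (Integer) getVariable("PAGE");
            requestedRows.add(startRow);
            int pageNumber = (startRow - 1) / 20 + 1;
            if (pageNumber < pagesOnSite) {
                setVariable("NEXT_PAGE", "Next");
            }
        }
    }

    // Same loop shape as the script above, run against the fake session.
    static List<Integer> run(int pagesOnSite) {
        FakeSession session = new FakeSession(pagesOnSite);
        session.setVariable("NEXT_PAGE", "nextpage");
        for (int currentPage = 1; session.getVariable("NEXT_PAGE") != null; currentPage++) {
            session.setVariable("PAGE", (currentPage - 1) * 20 + 1);
            session.setVariable("NEXT_PAGE", null);
            session.scrapeFile("Your scrapeableFile name here");
        }
        return session.requestedRows;
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // prints [1, 21, 41]
    }
}
```

If the pattern never matches, NEXT_PAGE stays null after the first request and only row 1 is ever fetched, which is the "same page over and over" symptom when the script gets called repeatedly.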
you are good.. that worked
you are good.. that worked perfect... that helps me understand some other things i am thinking about doing.
The longer I work for the
The longer I work for the company the more I see the need for good, efficient, non-recursive, easy-to-fix-when-it's-broken code. This is one of the best ways I've been able to do it, where the 'next' page is simply detected by a single pattern's variable.
The advantage I see in doing this is that you can debug easily. You could add a maximum page limit by adding in an if statement at the end of the for loop, something like..
if (currentPage == 3)
break;
and then that way you can control how many pages are done, so that during development you can focus on the process in general.
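Here's a rough, standalone illustration of that page cap in plain Java. The names (PageCapSketch, pagesVisited, pagesAvailable, maxPages) are invented for this sketch, and the simple counter stands in for the real "next page" check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of capping a paging loop during development.
class PageCapSketch {
    // Visits pages in order but stops early once maxPages have been handled.
    static List<Integer> pagesVisited(int pagesAvailable, int maxPages) {
        List<Integer> visited = new ArrayList<>();
        for (int currentPage = 1; currentPage <= pagesAvailable; currentPage++) {
            visited.add(currentPage); // stands in for the scrapeFile call
            if (currentPage == maxPages)
                break; // development-time limit
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(pagesVisited(10, 3)); // prints [1, 2, 3]
        System.out.println(pagesVisited(2, 3));  // prints [1, 2]
    }
}
```

The cap only kicks in when there are more pages than the limit, so short result sets still finish normally.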
thank you tim, I'll try it
thank you tim,
I'll try it ASAP