newbie - selecting only first instance of pattern
Hi,
I'm working on my first project and I'm trying to deal with a next link. From the tutorial (excellent btw) I've managed to dig down into multiple sub pages from a list page, but I'm having a little difficulty with the next link.
I've got two issues. Firstly, the next link appears twice on the page (top and bottom); how do I scrape just the first instance?
Secondly, I'm having a little bit of fun trying to use the next link. The link records the last record displayed (e.g. record 61), and when activated the next page displays 20 more records and records the new last record displayed. How do I parse this and request the next link?
I think I need to store the last displayed record which I can find with an extractor pattern and then call the link with this stored value.
I assume that I need to run some form of script but I'm getting a little lost. Something like the following.
// If this page returned any records, request the next page of results.
if( dataSet.getNumDataRecords() > 0 )
{
    session.scrapeFile( "Search Results" );
}
newbie - selecting only first instance of pattern
Hi, Alex!
Of your two questions, the second looms a little more imposing than the first. Solving the second problem will likely solve the first one for you! It's a little tricky to explain in a text-based forum, but I'll try to be clear.
So...
[quote]Secondly, I'm having a little bit of fun trying to use the next link. The link records the last record displayed (e.g. record 61), and when activated the next page displays 20 more records and records the new last record displayed. How do I parse this and request the next link?[/quote]
You are right in your assumption that using a script is the way to go.
One solution is to place the whole session under the control of scripts, rather than allowing screen-scraper to run through scrapeableFiles in sequence. For example, go through each of your scrapeableFiles and, on each one's [i]Properties[/i] tab, check the box that reads "[color=darkred]This scrapeable file will be invoked manually from a script[/color]"
You'll need an "initilize" script to set a variable at the start of the session which will track which page you are on, and call it something like "PAGE_COUNTER". I tend to initlized it to "1" (yes, a String---it's up to you how to do it, but I like dealing with all string variables). Then, in that script, you can start scraping the first page of results with a quick "[color=darkred]session.scrapeFile("Results Page");[/color]". Set this little initilize script to run on your scraping session's [i]Scripts[/i] tab, and set it to run [color=darkred]"Before scraping session begins"[/color].
Now, as for that extractor pattern to find the "next page" link, you'll want to edit it a little bit so that you basically just detect your current page and save it to a session variable. Upon matching this pattern, you should launch a script (like the one I've written just below, for example) set to run "[color=darkred]After pattern is applied[/color]". (This also gets you around the fact that there are 2 instances of the page numbers on the page: a script triggered "After pattern is applied" only fires once per scrapeable file, no matter how many times the pattern matches.) I usually make this my very last extractor pattern on my "results page" scrapeable file.
As for what that script does, make it follow a basic flow like the following code:
// Read the current page number, as a String, and bump it by one.
int page = Integer.parseInt( session.getVariable( "PAGE_COUNTER" ).toString() ) + 1;
// Poke it back into the same page variable (still as a String).
session.setVariable( "PAGE_COUNTER", Integer.toString( page ) );
// Call the results page again; this time the page variable is different, so you'll get a new set of results.
session.scrapeFile( "Results Page" );
Depending on the website you're scraping, this may or may not be enough to finish the setup. If you hand the website a page number of 55 and there are only 54 pages, and it gives you a "page not found" or a 404 error, then you'll be just fine, because your page number extractor pattern won't match, and so it shouldn't launch the "page increment" script that we just went over. However, from time to time you'll have to add an extra extractor pattern to your results scrapeableFile which ONLY matches something unique about the very last page of results. It could be anything, so long as it matches only on the last page. This little extractor pattern should execute a script ("[color=darkred]After pattern is applied[/color]") that simply terminates the scraping session. Place its sequence number just above the other extractor pattern that matches for the current page number, so that the last page will cause the session to stop scraping before it has a chance to try to increment page numbers into the indefinite realms of auto-generated php pages :)
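If it helps, that terminating script can be a one-liner (again, just a sketch):
// Runs "After pattern is applied" on the pattern that only matches the last page of results.
// Ends the session before the page counter gets incremented again.
session.stopScraping();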
Note that this certainly won't be the "best" way to handle every website you encounter that needs special attention, but I find this to be a thorough and reusable method.