Page Range Not to Exceed Total

What script would I need to scrape a range of pages without exceeding the total number of pages? The page number is in the URL, such as:

http://www.website/1_p/

For example, I wish to scrape pages 1 through 5,000, but if there are only 4,800 pages in total, it should scrape 1 through 4,800.

You'll have to find a way to gather that information on each page.

If you want, you can use a script that I put together a little while ago, found in our Tips -> Script Repository section: Next Page - Memory Conscious

I explain how to use it in that post, but in short, you need to put an extractor pattern on your page's scrapeableFile that matches only when a next page exists and fails to match when there are no more pages.

Usually this is done with a "next" link:

Next Page

... where you can replace the text "Next Page" with an extractor pattern variable:
Extractor pattern:

~@HAS_NEXT_PAGE@~

The pattern of that HAS_NEXT_PAGE token should just be the text "Next Page" (which was the text that you replaced with the token). Save that variable as a session variable, and make sure to follow the other directions on the link I gave you.
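As a rough illustration of the same idea outside of screen-scraper, here is a minimal Python sketch of a "does a next page exist?" check. It is not the script from the repository and does not use screen-scraper's extractor pattern engine; the function name and the regular expression are assumptions, and the expression assumes the pager is a plain link whose text is exactly "Next Page".

```python
import re

# Hypothetical pattern: assumes the pager is an <a> element whose visible
# text is "Next Page". Adjust it to whatever the real page uses.
NEXT_PAGE_PATTERN = re.compile(r"<a\b[^>]*>\s*Next Page\s*</a>", re.IGNORECASE)


def has_next_page(html: str) -> bool:
    """Return True when the page source contains a "Next Page" link."""
    return NEXT_PAGE_PATTERN.search(html) is not None
```

The result plays the same role as the HAS_NEXT_PAGE session variable: it is set when the link is present and absent on the last page.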

If your page doesn't have a very explicit "Next page" link like that, try to find something else that clues you in. If you've got no other good ways of doing that, let me know, and we can work out a slightly different approach.

Tim

Two Computers versus One

Thanks for the reply, Tim. I don't want to use the "Next Page" script because I may have two computers gathering the data records. One computer would scrape pages 1 through 2,500 and the other would scrape pages 2,501 through 5,000, but if there are only 4,800 pages, it should not scrape pages 4,801 through 5,000. However, the total number of pages will change, so I would like to use an extractor pattern to isolate the total number of pages. Thanks in advance for your help.
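To make the split concrete, here is a minimal Python sketch of the arithmetic. It assumes the total is shown somewhere on the page in a form like "Page 1 of 4,800"; that format, the regular expression, and both function names are hypothetical and would need to match the real site.

```python
import re


def parse_total_pages(html: str) -> int:
    """Pull the total page count out of text such as "Page 1 of 4,800".

    The "of N" wording is an assumption about the site, not a known fact.
    """
    match = re.search(r"\bof\s+([\d,]+)", html)
    if match is None:
        raise ValueError("total page count not found")
    return int(match.group(1).replace(",", ""))


def clamp_range(first: int, last: int, total_pages: int) -> range:
    """Return the pages in first..last (inclusive) that actually exist."""
    return range(first, min(last, total_pages) + 1)
```

With a total of 4,800, the first computer would get pages 1 through 2,500 and the second would get 2,501 through 4,800. If a computer's range starts beyond the total, the result is simply empty.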

Since that's the case, I would still suggest the script I linked to, but you could modify it so that it loops only within a given min/max page range. If the loop tries to go beyond that range, you can break out of it early. And if there are no more pages (for instance, at page 4,801), it would stop on its own anyway.
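The modification could look roughly like the Python sketch below. It is not the repository script, only an outline of the control flow: the function name is made up, and `fetch` and `has_next` are placeholder callables standing in for whatever downloads a page and checks for a next-page link. The URL pattern is the one from the original question.

```python
from typing import Callable, Iterator, Tuple


def scrape_range(
    first: int,
    last: int,
    fetch: Callable[[str], str],
    has_next: Callable[[str], bool],
) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, html) for pages first..last, stopping early
    if the site runs out of pages before `last` is reached."""
    page = first
    while page <= last:  # never go past this computer's maximum
        html = fetch(f"http://www.website/{page}_p/")
        yield page, html
        if not has_next(html):  # the site has no more pages
            break
        page += 1
```

Each computer would call this with its own range, for example 1 to 2,500 on one and 2,501 to 5,000 on the other. The second loop ends at the last real page without needing to know the total in advance.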