duplicate entries

Hello, I have a problem ;-)

I scrape a page and extract the next "PAGE".

This works flawlessly until the last page (for example, no. 8) is reached.
Once page 8 has been reached, the value 8 is stored in the variable again and again, and page 8 is scanned over and over.

Unfortunately, I can't find a way to make sure the variable "PAGE" never contains duplicate entries.

In my Enterprise Edition, I unfortunately can't find the "Filter duplicate records" option under "Advanced".

Can anyone help me with my problem?

Thank you very much
Boergi

Boergi on 04/26/2012 at 4:19 am

screen-scraper support for licensed users

The "filter duplicate records" option doesn't do anything when you're writing to a file.

I don't understand the problem with the page iteration. I suspect that the scrapeable file has a sequence number and is also being called from a script. Go to the scrapeable file and, on the first tab, check the box "This scrapeable file will be invoked manually".

jason on 04/27/2012 at 9:17 am

Thanks for the reply.

Let me try to explain it more precisely ...

Say, for example, that a search returns 8 pages of results.

First, the results of page 1 are scanned, and the value 2 is read into the variable "PAGE" for the next page.
"http://test.com?page=~#PAGE#~"
"http://test.com?page=2"

Second, the results of page 2 are scanned, and the value 3 is read into the variable "PAGE" for the next page.
"http://test.com?page=~#PAGE#~"
"http://test.com?page=3"

and so on ...

Page 8 is read as well, but no new value for the variable "PAGE" is found there.

So, unfortunately, page 8, page "NULL", or page "0" is scanned over and over.

"http://test.com?page=~#PAGE#~"
"http://test.com?page=8"
"http://test.com?page=NULL"
"http://test.com?page=0"

What I want, however, is for screen-scraper never to scan a page twice.

Thank you for your help
Boergi

Boergi on 04/27/2012 at 9:52 pm

Page iteration

Sorry it's taken so long to respond.

If I understand correctly, you are scraping a site where the "Next Page" link just has the next page number. On the last page of results (page 8), the "Next Page" link just points back to page 8 again.

The easiest way I have found to fix this is with a simple script that checks the page number before scraping the "Next Page" again. For example:

int lastPage = -1;
int currentPage = 0;

// Keep going only while the next page number is higher than the page just scraped
while (lastPage < currentPage)
{
    lastPage = currentPage;
    session.setVariable("PAGE", currentPage);
    // The "Search Results" page needs an extractor pattern that extracts the next page number
    session.scrapeFile("Search Results");

    Object nextPage = session.getVariable("NEXT_PAGE_NUMBER");
    if (nextPage != null)
    {
        // Parse the extracted value, not the name of the variable
        currentPage = Integer.parseInt(nextPage.toString().trim());
    }
}
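
To see why this loop stops on the last page, here is a minimal standalone sketch in plain Java that can be run outside screen-scraper. It is only a simulation under stated assumptions: `PageLoopSketch`, `crawl`, and `nextPageLink` are made-up names, and `nextPageLink` stands in for whatever the extractor pattern would read from the "Next Page" link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Standalone simulation of the loop above; nothing here is screen-scraper API.
class PageLoopSketch {

    // nextPageLink is a hypothetical stand-in for the extractor pattern:
    // given the page just scraped, it returns the number found in the
    // "Next Page" link of that page.
    static List<Integer> crawl(int firstPage, IntUnaryOperator nextPageLink) {
        List<Integer> scraped = new ArrayList<>();
        int lastPage = firstPage - 1;
        int currentPage = firstPage;

        // Continue only while the link points to a higher page number.
        while (lastPage < currentPage) {
            lastPage = currentPage;
            scraped.add(currentPage); // stands in for session.scrapeFile(...)
            currentPage = nextPageLink.applyAsInt(currentPage);
        }
        return scraped;
    }

    public static void main(String[] args) {
        // Assumed site: 8 pages, and the last page links back to itself.
        System.out.println(crawl(1, page -> Math.min(page + 1, 8)));
    }
}
```

With the assumed 8-page site, each page is visited once. A link value of 0 on the last page would end the loop in the same way, because 0 is not greater than the page just scraped.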

Make sure this script isn't called by the "Search Results" scrapeable file, as it would continually scrape the first page of results. If you need to call this from the Search Results file, you could modify the script like so:

if(session.getVariable("ALREADY_IN_SCRIPT") != null)
return;
session.setVariable("ALREADY_IN_SCRIPT", "yes");

/* Put the script body here */

// Reset the session variable, so we can use the script again later
session.setVariable("ALREADY_IN_SCRIPT", null);
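
One thing to watch with a guard flag like this is that an error in the script body would leave the flag set. A possible variation, sketched here in plain Java, resets it in a finally block. This is an illustration only: `GuardSketch`, `runGuarded`, and the `vars` map are hypothetical, with the map standing in for the session variables.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of a re-entrancy guard; not screen-scraper API.
class GuardSketch {
    // Hypothetical stand-in for screen-scraper session variables.
    static final Map<String, Object> vars = new HashMap<>();
    // Counts how often the guarded body actually ran.
    static int bodyRuns = 0;

    static void runGuarded(Runnable body) {
        if (vars.get("ALREADY_IN_SCRIPT") != null) {
            return; // nested call while the script is already running: do nothing
        }
        vars.put("ALREADY_IN_SCRIPT", "yes");
        try {
            bodyRuns++;
            body.run();
        } finally {
            // Clear the flag even if the body throws, so the script can run again later.
            vars.remove("ALREADY_IN_SCRIPT");
        }
    }
}
```

A nested call made from inside the body returns immediately, while a later, separate call runs normally.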

mikes on 05/25/2012 at 11:39 am