filter duplicates - cannot get it to work!
Hi,
Please advise on the following issue I am facing.
There is this page I am scraping using the following extractor pattern:
"details link extractor":
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Pattern text: a href="?id=~@ID@~&code=~@CODE@~">
On the Advanced tab, this extractor pattern has the following checkboxes checked:
- Automatically save the dataset generated by this extractor in a session variable
- If a data set by the same name is found: "Overwrite"
- Filter duplicates
- Cache the data set
The extracted tokens are configured as follows:
~@ID@~ : store in session variable, use to filter duplicates, trim white spaces
~@CODE@~ : store in session variable, trim white spaces
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
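For readers less familiar with extractor patterns, the pattern above behaves roughly like a regular expression with two named groups. The Python sketch below is only an approximation: the regex used for each token (any run of characters other than `&` and `"`) and the names `DETAILS_LINK` and `extract_records` are assumptions made for illustration, not screen-scraper's actual token defaults or API.

```python
import re

# Rough regex equivalent of: a href="?id=~@ID@~&code=~@CODE@~">
# Assumption: each token matches a run of characters that are not '&' or '"'.
DETAILS_LINK = re.compile(r'a href="\?id=(?P<ID>[^&"]*)&code=(?P<CODE>[^&"]*)">')

def extract_records(html):
    """Return one dict per match, with surrounding white space trimmed."""
    return [
        {name: value.strip() for name, value in match.groupdict().items()}
        for match in DETAILS_LINK.finditer(html)
    ]

sample = '<a href="?id=22&code=DVD">'
print(extract_records(sample))
```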
Within the page there are the following hrefs I need to get:
<a href="?id=21&code=DVD">
<a href="?id=22&code=DVD">
<a href="?id=22&code=DVD">
As expected, the pattern matches all 3 hrefs and stores the tokens in session variables.
My problem is that, according to the documentation, I should get only 2 of them in the dataset.
When I press "Test Pattern", I get the following table:
Sequence  ID  CODE
- - - - - - - - - - - - - - - -
0         21  DVD
1         22  DVD
2         22  DVD
ID is something unique and this is why I need to filter by "ID".
Shouldn't the data set contain only the first 2 entries (dropping the duplicate one)?
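To make the expected behaviour concrete, here is a minimal Python sketch of what filtering duplicates on ID would amount to. It is not screen-scraper code: the function name `filter_duplicates` is hypothetical, and keeping the first record seen for each ID is an assumption about how the feature is meant to work.

```python
def filter_duplicates(records, key):
    """Keep the first record for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] in seen:
            continue  # a record with this key value was already kept
        seen.add(record[key])
        unique.append(record)
    return unique

# The three rows shown by "Test Pattern".
rows = [
    {"ID": "21", "CODE": "DVD"},
    {"ID": "22", "CODE": "DVD"},
    {"ID": "22", "CODE": "DVD"},
]

for sequence, row in enumerate(filter_duplicates(rows, "ID")):
    print(sequence, row["ID"], row["CODE"])
# prints:
# 0 21 DVD
# 1 22 DVD
```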
Please advise.
Best regards.
mmmmmm,
This turned out to be a bug in screen-scraper. We appreciate you pointing it out so much that we would like to offer you a discount on your future purchase of a screen-scraper license. Please contact our office and mention this bug for your discount.
If you update to the latest alpha release, you will have the fix for this bug.
Thank you,
Scott