Grabbing Incremental Session Variable Changes
Hi everyone, I'm a newbie to Screen Scraper, though not to the concept. I have to say it's the best product I've found out there, and the community support I've seen in these forums is really good too!
I'm struggling to get some basic information though: I need to grab and use a couple of session variables during a scrape, but I can't work out how to reference them in VB. I'll explain the two things I'm having difficulty doing:
1st problem: Grabbing up to 10 pages of results, including when there are fewer than 10 pages
I'm scraping a site's search result pages, passing keyword and page variables in the URL. The keyword variable is pulled from a .txt file line by line, as in Tutorial 2, and the page variable is pulled from the URL string of the 'Next' page link. Each result page has 10 results (I can't change this). I don't want to grab details for all the results, just the first 10 pages. Also, when there are fewer than 10 pages of results, I need to move on to the next keyword. This is the VB code I'm using at the moment:
If session.getVariable( "PAGE" ) <= 91 Then
    Call session.ScrapeFile( "Search Results" )
Else
    Call session.setVariable( "PAGE", NULL )
End If
But when there are fewer than ten pages of results, it gets stuck repeating the last page variable in a loop. The PAGE variable is grabbed from the 'Next' link, which of course isn't present when there are fewer than ten pages of results. My guess is that I need to change the logic of the code around, but I think I need to reference either the fact that the 'Next' page link was not found or the fact that the PAGE variable will be the same. How would I do that?
2nd problem: Numbering the order of my results
As I scrape the listings, I write to a file (again, as in Tutorial 2). The listings are given in order, but I need to assign the order they were scraped in across all of the pages scraped for the keyword. So, the fifth result on page 3 would be 35, and so on. This seems simple, and the most obvious solution would be to grab whatever running count the session keeps for the keyword, but I can't find the name of that session value anywhere. An alternative would be to do the calculation on the fly, but this will add to the processing time (which is already quite long).
Any help appreciated!
Thanks for your questions!
Thanks for your questions! (You're helping us to build a little wealth of knowledge in these forums!)
For your first problem, of repeating the same page over and over again when there are fewer than 10 pages of results...
I would say that the best way to tell your scrape to bail and move to the next keyword would be to add an extractor pattern to your scrapeableFile which tries to resolve what the final page number for the results is. For instance, if I were trying to scrape Google's Image Search, each results page clearly lists several page numbers so that I can jump to whichever I prefer. I could make a pattern for my scrapeableFile which tries to match:
// extractor pattern regular expressions for the following tokens could be:
// SOME_URL: [^"]*
// SOME_PAGE_NUMBER: \d+
~@SOME_PAGE_NUMBER@~[***now match something in the HTML that always comes after the last listed page number***]
For that last bracketed part, about something that appears after the last page number... it could be the "next page" link, it could be a simple "</div>", whatever marks the end of the page listings. This way, this pattern will only ever match the last page number listed.
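To make that concrete, suppose (purely hypothetically) the paging links on your results page ended like this:

<a href="search?kw=widgets&page=9">9</a> <a href="search?kw=widgets&page=10">10</a></div>

A pattern along the lines of

>~@SOME_PAGE_NUMBER@~</a></div>

would then only match the "10", since that's the only page number immediately followed by the closing </div>. Your site's markup will certainly differ, so adjust the trailing text to whatever reliably follows the final page number on your pages.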
You can then save the "SOME_PAGE_NUMBER" variable, and add it into your page-shifting if statement:
' As quoted from your question, with an addition to the code.
' And... I don't think you meant to write "<= 91" for the page comparison :) but I left it anyway.
If session.getVariable( "PAGE" ) = session.getVariable( "SOME_PAGE_NUMBER" ) Then
    Call session.setVariable( "PAGE", NULL )
ElseIf session.getVariable( "PAGE" ) <= 91 Then
    Call session.ScrapeFile( "Search Results" )
Else
    Call session.setVariable( "PAGE", NULL )
End If
(mind you, I'm no good at VB... I've hardly used it, so I'm only making an educated guess at string-comparing syntax :D )
This way, the session will cut off the current keyword once the page it's scraping hits that "SOME_PAGE_NUMBER" maximum, and otherwise it will continue with the keyword so long as the page number is under your desired limit.
As for your second question... I think the easiest thing to do would be to perform that simple calculation. I don't think it would add too much to the processing time. My guess is that you'd want to do something like... (pseudo code with possible Java influence)
currentPage = get the session variable for "PAGE"
currentEntry = dataSet.getNumDataRecords()
keywordEntryNumber = ( currentPage * 10 ) + currentEntry
You could condense that into a single line, if you want, to avoid the extra variable assignments. The second line of that code can be used assuming you have a dataSet in scope (which isn't hard... just don't go into nested scripts within scripts, and don't run this calculation "After file is scraped"). All it does is grab the total number of entries made so far in the dataSet (the cumulative collection of dataRecords during the current execution of the scrapeableFile), which will conveniently be equal to the number of entries you've found so far on the current page.
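In script form, that might look something like the following (a rough VBScript sketch, meant to run after each pattern application; the "ENTRY_NUMBER" name is just an example I made up, and as I said, double-check my VB syntax):

' Work out this record's position across all pages for the current keyword.
currentPage = CInt( session.getVariable( "PAGE" ) )
currentEntry = dataSet.getNumDataRecords()
keywordEntryNumber = ( currentPage * 10 ) + currentEntry

' Save the running position onto this dataRecord so it gets written out with the listing.
Call dataRecord.put( "ENTRY_NUMBER", keywordEntryNumber )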
Usage of the above code might need to be tweaked if you've got some unusual scope stuff going on. Let me know if this helps you out, or if you encounter more problems with this approach!
Tim
Great Support Guys
Hi Tim (& Rich),
Thanks for all the help - it's really impressive how much support you guys offer.
I've got the session singing now - I got some additional advice via email to use 'After Each Application' for my 'Next Page' extractor pattern to get it to recognise when there is no next page. Semantically odd, but working a treat.
The calculation steps above won't work, I think - getNumDataRecords() returns 10 and is written to each DataRecord, so I simply have 10 next to all my results instead of 1-10. I can't see how to get the actual data record number in there, so instead of performing the calculation 'on the fly' I'll perform it when I process the exported data in VB. That should be no issue, as the results are posted sequentially and I'm grabbing the keyword and page variables.
Cheers for all the help guys!
Well, the idea is that
Well, the idea is that "getNumDataRecords()" is slowly growing. Yes, the end result will always be "10" once the page is done scraping, but while the page is in the process of being scraped, that total will still be growing. So long as the calculation is being made "After each pattern application", the number yielded by "getNumDataRecords()" will increment each time: at the first call there is only 1 record in the set, at the second call there are 2, and so on. On page 3, for example, the calculation would yield 31, 32, ... up to 40 as the records are matched, which gives your fifth result on that page the 35 you were after.