Single page / Next page / Previous page problem

I am scraping from a yellow-pages-type site. In my start script I provide values for the location and category variables. When my script retrieves the first page, I scrape the details from the list of companies. I want to move to the next page if there is one. In the spot where I retrieve the value I need as a session variable for the next page, one of two things can happen:

For a Single page with no Next page:

Search Results: 1-9 of 9

shockeymoe on 10/26/2006 at 9:58 pm


Again I'm not entirely sure I understand which p_str you're trying to get and when, but, the way I understand it, what you're trying to do is get the first one unless the "Previous" link is present, in which case you want the second p_str. If that's the case, I believe you could solve the problem by changing the patterns like so:
Search Results:~@JUNK@~

String pStr1 = session.getVariable("P_STR1"); // corresponds to first pattern
String pStr2 = session.getVariable("P_STR2"); // corresponds to second pattern

// If pStr2 isn't null and isn't blank, use it for P_STR
// Otherwise, use pStr1
if ( pStr2 != null && !(pStr2.equals("")) ) {
  session.setVariable("P_STR", pStr2);
} else {
  session.setVariable("P_STR", pStr1);
}

// Clear P_STR1 and P_STR2 so we don't accidentally reuse them if the
// next time their patterns run there is no match
session.setVariable("P_STR1", "");
session.setVariable("P_STR2", "");
// **********



That way, P_STR2 will only have a value if its pattern matches (meaning that the word "Previous" is present) and therefore P_STR2 will be used for P_STR. Otherwise, P_STR1 will be used for P_STR.
If this doesn't quite hit the mark, it should be easy to add or change some if statements in the script that will catch all your conditions for when you want to use the first p_str, the second, or neither.
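For what it's worth, here is a minimal sketch of how those extra if statements might look once all four cases are covered. It is plain Java rather than a screen-scraper script, so it can be tried outside the tool. It assumes a third, hypothetical flag, hasNext, that is true only when the word "Next" was matched on the page; that flag is not one of the patterns above. The class and method names are made up for illustration.

```java
public class PStrChooser {

    /**
     * Returns the p_str to use for the next page, or "" when there is none.
     *
     * pStr1   value from the first pattern (first p_str on the page)
     * pStr2   value from the second pattern (the p_str after "Previous");
     *         null or "" when that pattern did not match
     * hasNext hypothetical flag: true only if a "Next" link was found
     */
    public static String choose(String pStr1, String pStr2, boolean hasNext) {
        // No "Next" link at all: single page, or the last page of results.
        if (!hasNext) {
            return "";
        }
        // "Previous" and "Next" both present: the second p_str is the Next one.
        if (pStr2 != null && !pStr2.equals("")) {
            return pStr2;
        }
        // Only "Next" present: the first p_str is the Next one.
        return pStr1 == null ? "" : pStr1;
    }

    public static void main(String[] args) {
        System.out.println(choose("2", "", true));    // Next only
        System.out.println(choose("1", "3", true));   // Previous and Next
        System.out.println(choose("1", "", false));   // Previous only
        System.out.println(choose(null, null, false)); // neither
    }
}
```

The hasNext check comes first because, with only two patterns, a last page (a "Previous" link but no "Next" link) may otherwise fall through to pStr1 and send the scrape back to the previous page.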
Let me know if that does the trick.



       Alan on 11/01/2006 at 10:31 am
  

    
      Thanks for your attention to this matter, Alan. I really appreciate the use of your brain.

It is the second instance of the p_str that I am trying to get at. I can see that your pattern would work when a "Previous" link is present. The problem your solution presents is that the "Previous" text only appears some of the time. When it doesn't appear, this pattern fails.

Please look at the samples I posted previously. Some pages I am scraping are the only page for a particular category, while results for other categories span multiple pages. So there are four cases: sometimes I have just a "Next" p_str, which I want to extract; sometimes I have both a "Previous" p_str and a "Next" p_str, and I want only the "Next" one; sometimes I have just a "Previous" p_str and no "Next" p_str, so I want nothing extracted; and sometimes I have neither, and again I want nothing extracted.

If I include too much in the extractor pattern it fails because the text doesn't match anything. If I have too little, it will find the first instance it comes across which is often not correct.

I need an "if...  then" type of pattern construction to tell the extractor pattern to ignore the first "p_str" if the text "Previous" is present. That would solve my problem.
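One way to picture that "if... then" behaviour is to anchor on the word "Next" itself rather than on the position of the p_str. The sketch below uses plain java.util.regex, not a screen-scraper extractor pattern. It assumes hypothetical markup of the form <a href="...p_str=3">Next</a>, because the real page source is not shown in this thread. The class name, method name and sample URLs are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NextLinkFinder {

    // Assumed (hypothetical) link markup: <a href="...p_str=N">Next</a>
    // The p_str digits are captured only when the link text starts with "Next".
    private static final Pattern NEXT_LINK =
            Pattern.compile("p_str=(\\d+)[^>]*>\\s*Next");

    /** Returns the p_str of the "Next" link, or "" if the page has none. */
    public static String nextPStr(String html) {
        Matcher m = NEXT_LINK.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String both = "<a href=\"list?p_str=1\">Previous</a> | "
                + "<a href=\"list?p_str=3\">Next</a>";
        System.out.println(nextPStr(both));
        System.out.println(nextPStr("Search Results: 1-9 of 9"));
    }
}
```

Because the match requires "Next" right after the link's opening tag, a "Previous" link is skipped whether or not it is present, and a page with no "Next" link yields an empty string.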
          
  

       shockeymoe on 10/31/2006 at 9:31 am
  

    
      If you're trying to get the first instance of p_str (the one after "Search Results" and before "Previous" -- should be 1 if you use the code you pasted above) try this pattern:
Search Results:~@JUNK@~
Previous |

       Alan on 10/30/2006 at 9:42 am
  
      

    
      If I understand, the only thing you can't get that you need to get to the next page is the number that comes after the "p_str=" parameter. Could you post your extractor pattern that is trying to get that variable?
Also, I've found that using a bogus token name for an extractor pattern usually works better than using IGNORE when I just want to ignore some text. For instance, I'll use ~@JUNK@~ instead of ~@IGNORE@~ and if I need to I set the JUNK variable to use a regular expression that fits the text that I just want to skip over. IGNORE tends to be somewhat greedy.
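To illustrate the greediness point, here is a small sketch in plain java.util.regex. It is only an analogy: it does not show how screen-scraper handles its tokens internally, and the sample page text and class name are made up. An unrestricted ".*" runs as far as it can and lands on the last p_str, while a restricted class such as "[^<]*" stops at the first tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {

    /** Returns capture group 1 of the first match, or "" if nothing matches. */
    public static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : "";
    }

    // Hypothetical page fragment; the real markup is not shown in the thread.
    public static final String PAGE = "Search Results: 11-20 of 25 "
            + "<a href=\"list?p_str=1\">Previous</a> | "
            + "<a href=\"list?p_str=3\">Next</a>";

    // Greedy: ".*" swallows everything it can, so the LAST p_str is captured.
    public static final String GREEDY = "Search Results:.*p_str=(\\d+)";

    // Restricted: "[^<]*" cannot cross a tag, so the FIRST p_str is captured.
    public static final String RESTRICTED =
            "Search Results:[^<]*<a href=\"[^\"]*p_str=(\\d+)";

    public static void main(String[] args) {
        System.out.println(firstGroup(GREEDY, PAGE));
        System.out.println(firstGroup(RESTRICTED, PAGE));
    }
}
```

On this sample the greedy expression captures 3 and the restricted one captures 1, which is the kind of difference a regular expression on the JUNK token is meant to control.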
          
  

       Alan on 10/27/2006 at 10:01 am
  