Action on Error code in screen-scraper
I occasionally get annoying 502 (Bad Gateway) errors that make my scrapes skip ahead when using Tor on Ubuntu Linux. Is there a way for screen-scraper to take action on a specific error code like this? If possible, I would like to pause the session and re-scrape the current scrapeable file when I receive a 502 error.
Is that doable?
/Johan
Check out the 'scrapeableFile' section in the API documentation. You can check the HTTP response code of the last scraped file, but you have to do it while the scrapeable file is in scope. I have a separate script that runs 'after file is scraped' and saves all the status values to session variables.
Once you've got those, your master script can act on them after the page is scraped:
session.setVariable("SCRAPED_FILE",scrapeableFile);
session.setVariable("SCRAPED_FILE_NAME",scrapeableFile.getName());
session.setVariable("SCRAPED_FILE_URL",scrapeableFile.getCurrentURL());
session.setVariable("SCRAPED_FILE_NO_PATTERNS_MATCHED",scrapeableFile.noExtractorPatternsMatched());
session.setVariable("SCRAPED_FILE_WAS_ERROR_ON_REQUEST",scrapeableFile.wasErrorOnRequest());
session.setVariable("SCRAPED_FILE_STATUS_CODE",scrapeableFile.getStatusCode());
session.setVariable("SCRAPE_FAILED",false);
if (scrapeableFile.noExtractorPatternsMatched() || scrapeableFile.wasErrorOnRequest()) {
    if (session.getVariable("NO_ITEMS") == null)
        session.setVariable("SCRAPE_FAILED",true);
    else
        session.setVariable("NO_ITEMS",null);
}
scrapeableFile.getStatusCode() is the specific one you're looking for...
Yeah, even in some of the complex things we've tried to do with Tor, such as automatically changing identities when blocked, it all revolves around scrapeableFile.getStatusCode(), which returns an int. In cases like the 502 error, you might just have to wait a few seconds with session.pause(seconds * 1000) and then do something like this:
session.scrapeFile(scrapeableFile.getName());
That will call the same scrapeableFile again. Just be careful with that line of code, because each retry is another nested call and it gets recursive if the error happens too frequently.
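One way to keep that retry from recursing without limit is to count the attempts in a session variable and give up after a few. What follows is only a rough sketch and has not been run inside screen-scraper itself. The Session and ScrapeableFile interfaces are hypothetical stand-ins that mirror just the methods mentioned in this thread, and RETRY_COUNT, MAX_RETRIES, the 5-second pause and simulate() are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class RetryOn502 {
    // Hypothetical stand-ins for screen-scraper's objects, limited to
    // the methods discussed in this thread.
    public interface ScrapeableFile {
        String getName();
        int getStatusCode();
    }

    public interface Session {
        Object getVariable(String name);
        void setVariable(String name, Object value);
        void pause(long milliseconds);
        void scrapeFile(String name);
    }

    // Assumed limit; tune it to how flaky the proxy is.
    public static final int MAX_RETRIES = 3;

    // Body of a script that would run "after file is scraped".
    public static void afterFileIsScraped(Session session, ScrapeableFile scrapeableFile) {
        if (scrapeableFile.getStatusCode() != 502) {
            // Any other status resets the counter for the next file.
            session.setVariable("RETRY_COUNT", Integer.valueOf(0));
            return;
        }
        Object stored = session.getVariable("RETRY_COUNT");
        int retries = (stored == null) ? 0 : ((Integer) stored).intValue();
        if (retries >= MAX_RETRIES) {
            // Give up instead of nesting another call.
            session.setVariable("SCRAPE_FAILED", Boolean.TRUE);
            return;
        }
        session.setVariable("RETRY_COUNT", Integer.valueOf(retries + 1));
        session.pause(5 * 1000L);                      // wait 5 seconds
        session.scrapeFile(scrapeableFile.getName());  // nested retry
    }

    // Replays a list of status codes through a fake session and returns
    // how many requests were made. The last code repeats once the list
    // runs out.
    public static int simulate(final int[] statusCodes) {
        final int[] requests = {0};
        final Map<String, Object> vars = new HashMap<String, Object>();
        Session session = new Session() {
            public Object getVariable(String name) { return vars.get(name); }
            public void setVariable(String name, Object value) { vars.put(name, value); }
            public void pause(long milliseconds) { /* no real sleep in the simulation */ }
            public void scrapeFile(final String name) {
                final int code = statusCodes[Math.min(requests[0], statusCodes.length - 1)];
                requests[0]++;
                afterFileIsScraped(this, new ScrapeableFile() {
                    public String getName() { return name; }
                    public int getStatusCode() { return code; }
                });
            }
        };
        session.scrapeFile("Search results");
        return requests[0];
    }

    public static void main(String[] args) {
        System.out.println(simulate(new int[] {502, 502, 200})); // prints 3
        System.out.println(simulate(new int[] {502}));           // prints 4
    }
}
```

With a proxy that keeps answering 502, the file is requested once plus MAX_RETRIES more times and then SCRAPE_FAILED is set, so the nesting never goes deeper than the limit.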
Beautiful solutions and great examples.
Thanks!