RetryPolicy
Overview
Retry policies are objects that tell a scrapeable file how to check for errors and, optionally, what to do before retrying the download. For example, a policy can execute scripts or run Runnables when a page loads incorrectly; typical actions include requesting a new proxy, outputting some helpful information, or simply stopping the scrape. RetryPolicy is an interface that can be implemented to create a custom retry policy, and the RetryPolicyFactory class can be used to create some standard policies.
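For example, a standard policy can be created and attached to the current file from a script. This is a minimal sketch: the getBasicPolicy method shown here is an assumption, so check RetryPolicyFactory for the factory methods your version actually provides.

import com.screenscraper.util.retry.RetryPolicyFactory;

// In a script run "Before file is scraped": retry this file up to
// 5 times before giving up. getBasicPolicy(int) is assumed; verify
// the available factory methods in your version of screen-scraper.
scrapeableFile.setRetryPolicy( RetryPolicyFactory.getBasicPolicy( 5 ) );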
This policy is checked AFTER all the extractors have been run. This allows checks on whether extractor patterns matched, and it also allows a page's error status to be based on another page (since extractor patterns can execute scripts that scrape other files, and those files can set a variable that acts as a flag for an earlier retry policy). It can also cause problems if the scrape isn't built to handle the fact that a page's extractors run even before the error checking occurs.
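As a brief sketch of that flag pattern (the PAGE_HAD_ERROR variable name is hypothetical), a script run by an extractor pattern on the other file could set a session variable, which the retry policy's isError() method can later check, as shown in the example near the end of this page.

// Script run by an extractor pattern on the other scraped file:
// flag the error so a retry policy elsewhere can detect it.
session.setVariable( "PAGE_HAD_ERROR", "true" );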
This interface is in the com.screenscraper.util.retry package.
Interface Implementation
If you need a custom retry policy, you can implement your own version of the interface. Be aware that you will need to ensure that its references to the scrapeable file point to the correct scrapeable file. This can be tricky if you use the session.setDefaultRetryPolicy method; when using the scrapeableFile.setRetryPolicy method, the scrapeable file will be the correct object. The interface is given below.
To help you create custom retry policies that have access to the scraping session and the scrapeable file currently being checked, there is an AbstractRetryPolicy class in the same package as the interface. This class defines some default behavior and adds protected fields for the session and the scrapeable file, which are set before the policy is run. If you extend this abstract class, you can access the session and scrapeable file through this.scrapingSession and this.theScrapeableFile. Due to some oddities with the interpreter, it is best to reference these variables with 'this.' to avoid problems that arise in a few specific cases.
package com.screenscraper.util.retry;

import java.util.Map;

public interface RetryPolicy
{
    /**
     * Checks to see if the page loaded incorrectly
     *
     * @return True on errors, false otherwise
     * @throws Exception If something goes wrong while executing this method
     */
    public boolean isError() throws Exception;

    /**
     * Runs this code when the page had an error. This could include things such as rotating the proxy.
     *
     * @throws Exception If something goes wrong while executing this method
     */
    public void runOnError() throws Exception;

    /**
     * Returns a map that can be used to output an error message indicating which checks failed. For instance,
     * the key "Status Code" could map to the value '200', or the key "Valid Page" to the value 'false'
     *
     * @return Map of error checks, or null if no values are indicated
     * @throws Exception If something goes wrong while executing this method
     */
    public Map getErrorChecksMap() throws Exception;

    /**
     * Returns true if the session variables should be reset before attempting to rescrape the file, if there
     * was an error. This can be useful especially if extractors null session variables when they don't match,
     * but the value is needed to rescrape the file.
     *
     * @return True if session variables should be reset if there was an error, false otherwise
     */
    public boolean resetSessionVariablesBeforeRescrape();

    /**
     * Returns true if the referrer should be reset before attempting to rescrape the file, if there was an
     * error. This can be useful so the referrer doesn't show the page you just requested.
     *
     * @return True if the referrer should be reset if there was an error, false otherwise
     */
    public boolean resetReferrerBeforeRescrape();

    /**
     * Returns true if errors should be logged to the log/web interface when they occur
     *
     * @return True if errors should be logged to the log/web interface when they occur
     */
    public boolean shouldLogErrors();

    /**
     * Returns the maximum number of times this policy allows for a retry before terminating in an error
     *
     * @return The maximum number of times to allow the ScrapeableFile to be rescraped before resulting in an error
     */
    public int getMaxRetryAttempts();

    /**
     * This will be called if all the retry attempts for the scrapeable file failed. In other words, if the
     * policy said to retry 25 times, after 25 failures this method will be called. Note that {@link #runOnError()}
     * will be called just before this, as it is called after each time the scrapeable file fails to load
     * correctly, including the last time it fails to load.
     * <p/>
     * This should only contain code that handles the final error. Any proxy rotating, cookie clearing, etc.
     * should generally be done in the {@link #runOnError()} method, especially since it will still be called
     * after the final error.
     */
    public void runOnAllAttemptsFailed();
}
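As a minimal sketch of a custom policy that extends AbstractRetryPolicy (it assumes the abstract class supplies usable defaults for the methods not shown, and it reuses the hypothetical PAGE_HAD_ERROR flag from above):

import com.screenscraper.util.retry.AbstractRetryPolicy;

public class CustomRetryPolicy extends AbstractRetryPolicy
{
    public boolean isError() throws Exception
    {
        // Treat a non-200 status code, or the hypothetical flag set by
        // another page's script, as an error.
        return this.theScrapeableFile.getStatusCode() != 200
                || "true".equals( this.scrapingSession.getVariable( "PAGE_HAD_ERROR" ) );
    }

    public void runOnError() throws Exception
    {
        // Output some helpful information before the file is rescraped.
        this.scrapingSession.log( "Page loaded incorrectly; retrying..." );
    }
}

The policy could then be attached with scrapeableFile.setRetryPolicy( new CustomRetryPolicy() ) in a script that runs before the file is scraped.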