Automatic Anonymization: Setup
Controlling your Account
The anonymous proxy servers are set up so that they only accept connections from your IP address, which means no one else can use any of the proxies without your authorization. This configuration is tied to your password. For more on restricting connections, see the documentation on managing the screen-scraper server.
If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on and you are using the workbench, you can click the Get the IP address for this computer button to determine your current IP address.
screen-scraper Setup
Using Workbench
Anonymization settings can be configured using screen-scraper's workbench. They are found in the anonymous proxy settings section of the settings dialog box.
When you sign up for the anonymization service you'll be given the password that allows your instance of screen-scraper to manage anonymous proxies for you. You'll enter it into the Password textbox in the settings.
Because proxy servers are spawned and terminated as your scraping session runs, it's a good idea to set the maximum number of running proxy servers you'd like to allow. This is done via the Max running servers setting. You pay for proxy servers by the hour, so if your scraping session isn't set up to shut them down automatically when it finishes, use the Terminate all running proxy servers button to shut them down yourself.
We find that between five and 10 proxy servers are adequate for most situations.
Using screen-scraper.properties File
If you're configuring anonymization in a GUI-less environment (i.e., a server with no graphical interface), you'll want to set the following properties in the resource/conf/screen-scraper.properties file (if a property is not already in the file, you'll want to add it).
- AnonymousProxyPassword: The password that you were sent.
- AnonymousProxyAllowedIPs: The IP addresses permitted to access anonymous sessions.
- AnonymousProxyMaxRunning: Maximum number of proxy servers used to do the scrape.
- AnonymizationURLPrepend: Which server to use for anonymization. By default http://anon.screen-scraper.com will be used.
Acceptable values are http://anon.screen-scraper.com and http://anon2.screen-scraper.com.
Be sure to modify the resource/conf/screen-scraper.properties file only when screen-scraper is not running.
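As a rough illustration, the anonymization entries in resource/conf/screen-scraper.properties might look like the fragment below. The property names come from the list above; every value is a placeholder (the password, the 203.0.113.10 address, and the limit of 5 are assumptions for illustration only). A single allowed IP address is shown because the format for listing several addresses is not described here.

```properties
# Placeholder values only; substitute the password you were sent and your own IP address.
AnonymousProxyPassword=your-password-here
AnonymousProxyAllowedIPs=203.0.113.10
AnonymousProxyMaxRunning=5
AnonymizationURLPrepend=http://anon.screen-scraper.com
```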
Scraping Session Setup
Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the anonymization tab of your scraping session.
Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.
As your anonymous scraping session runs, you'll notice that screen-scraper automatically regulates the pool of proxy servers. For example, if screen-scraper gets a timed-out connection or a 403 (Forbidden) response, it will terminate the current proxy server and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script. When this method is called, the current proxy server will be shut down and replaced by another.
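Some sites block a proxy without returning a 403, for instance by serving a normal page that contains a block notice. The sketch below shows one way a script might decide when to call session.currentProxyServerIsBad(). It is a standalone illustration, not screen-scraper's own code: the ProxySession interface is a hypothetical stand-in for the session object that screen-scraper supplies to scripts, and the marker strings are assumptions you would replace with text the target site actually shows.

```java
import java.util.List;
import java.util.Locale;

public class BlockedProxyCheck {

    /** Hypothetical stand-in for the one session method used in this sketch. */
    interface ProxySession {
        void currentProxyServerIsBad();
    }

    // Assumed block-page phrases; adjust to match the target site.
    static final List<String> BLOCK_MARKERS = List.of("access denied", "unusual traffic");

    /** Returns true if the page text contains any of the block markers. */
    static boolean looksBlocked(String pageContent) {
        if (pageContent == null) {
            return false;
        }
        String lower = pageContent.toLowerCase(Locale.ROOT);
        for (String marker : BLOCK_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** Reports the current proxy as bad when the page looks blocked. */
    static boolean reportIfBlocked(String pageContent, ProxySession session) {
        if (looksBlocked(pageContent)) {
            session.currentProxyServerIsBad();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] reports = {0};
        // Fake session that only counts how often a proxy was reported.
        ProxySession fake = () -> reports[0]++;
        System.out.println(reportIfBlocked("<h1>Access Denied</h1>", fake));
        System.out.println(reportIfBlocked("<h1>Welcome</h1>", fake));
        System.out.println("Proxies reported: " + reports[0]);
    }
}
```

Inside an actual screen-scraper script, the same check would be applied to the page just requested, and the call would go to the real session object rather than the fake used here.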