Automatic Anonymization

Overview

The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. As a result, the site you're scraping sees your requests as coming from several different IP addresses rather than from your actual IP address. These HTTP proxy servers are virtual machines that are spawned and terminated as you need them. You can use screen-scraper to spawn and terminate the proxy servers either manually or automatically.

Steps to take

Cost

  • $150 setup
  • 25 cents per proxy per hour
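To get a feel for what these rates add up to, here is a minimal sketch of the arithmetic in Java. The class and method names are hypothetical, and the sketch assumes the setup fee is charged once and that usage is billed in whole proxy-hours; how partial hours are billed is not stated here.

```java
// Hypothetical helper for estimating anonymization costs from the rates listed above.
class ProxyCostEstimate {
    static final long SETUP_FEE_CENTS = 15_000;    // $150 setup (assumed to be a one-time fee)
    static final long CENTS_PER_PROXY_HOUR = 25;   // 25 cents per proxy per hour

    // Usage charge for a number of proxies each running for a number of whole hours.
    static long usageCents(int proxies, int hours) {
        return CENTS_PER_PROXY_HOUR * proxies * hours;
    }

    // Usage charge plus the setup fee.
    static long totalCents(int proxies, int hours) {
        return SETUP_FEE_CENTS + usageCents(proxies, hours);
    }

    // Formats a non-negative amount in cents as dollars, e.g. 1000 -> "$10.00".
    static String formatDollars(long cents) {
        return String.format("$%d.%02d", cents / 100, cents % 100);
    }

    public static void main(String[] args) {
        // Five proxies running for eight hours each.
        System.out.println("Usage: " + formatDollars(usageCents(5, 8)));              // Usage: $10.00
        System.out.println("Including setup: " + formatDollars(totalCents(5, 8)));    // Including setup: $160.00
    }
}
```

Working in whole cents avoids floating-point rounding when the amounts are summed.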

Note: When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.

Limitations

While the automatic anonymization service provides an excellent way to cloak your IP address, it is still possible that the target web site will block enough of the anonymized IP addresses that the anonymization could fail. Unfortunately, we can't guarantee that you won't get blocked; however, using the automatic anonymization service dramatically reduces the chances of getting blocked.

Miscellaneous

  • Anonymization REST Interface
  • Workbench Interface: Scraping Session: Anonymization tab
  • Automatic Anonymization: Setup

    Controlling your Account

    The anonymous proxy servers will be set up in such a way that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. This configuration is tied to your password. For more on restricting connections, see the documentation on managing the screen-scraper server.

    If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on and you are using the workbench, you can click the Get the IP address for this computer button to determine your current IP address.

    screen-scraper Setup

    Using Workbench

    Anonymization settings can be configured using screen-scraper's workbench. They are set in the anonymous proxy settings of the settings dialog box.

    When you sign up for the anonymization service you'll be given the password that allows your instance of screen-scraper to manage anonymous proxies for you. You'll enter it into the Password textbox in the settings.

    Because proxy servers are spawned and terminated as your scraping session runs, it's a good idea to cap the number of running proxy servers you'd like to allow. This is done via the Max running servers setting. Because you pay for proxy servers by the hour, if your scraping session isn't set up to shut them down automatically at the end, use the Terminate all running proxy servers button to do so.

    We find that between five and 10 proxy servers are adequate for most situations.

    Using screen-scraper.properties File

    If you're running in a GUI-less environment (i.e., a server with no graphical interface), you'll want to set these values in the resource/conf/screen-scraper.properties file (if any of these properties is not already in the file, you'll want to add it).

    • AnonymousProxyPassword: The password that you were sent.
    • AnonymousProxyAllowedIPs: The IP addresses permitted to access anonymous sessions.
    • AnonymousProxyMaxRunning: Maximum number of proxy servers used to do the scrape.
    • AnonymizationURLPrepend: Which server to use for anonymization. By default http://anon.screen-scraper.com will be used.

      Acceptable values are http://anon.screen-scraper.com and http://anon2.screen-scraper.com.

    Be sure to modify the resource/conf/screen-scraper.properties file only when screen-scraper is not running.
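As an illustration, the relevant lines of resource/conf/screen-scraper.properties might look like the fragment below. Every value is a placeholder: the password is made up, 203.0.113.10 is a documentation-only address, and the format for listing more than one allowed IP address is not described here, so only a single address is shown.

```properties
# Placeholder values; substitute the password you were sent and your own IP address.
AnonymousProxyPassword=your-password-here
AnonymousProxyAllowedIPs=203.0.113.10
AnonymousProxyMaxRunning=5
AnonymizationURLPrepend=http://anon.screen-scraper.com
```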

    Scraping Session Setup

    Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the anonymization tab of your scraping session.

    Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.

    As your anonymous scraping session runs, you'll notice that screen-scraper automatically regulates the pool of proxy servers. For example, if screen-scraper gets a timed-out connection or a 403 (Forbidden) response, it will terminate the current proxy server and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script. When this method is called, the current proxy server will be shut down and replaced by another.
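Manual reporting is mostly useful when a site blocks you without a timeout or a 403, for example by returning a normal page that contains a "blocked" notice. The sketch below shows one way such a check could be structured. Only session.currentProxyServerIsBad() comes from the text above; the ProxySession interface is a hypothetical stand-in for screen-scraper's session object so the example runs on its own, and the marker phrases are assumptions you would replace with whatever the target site actually displays.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper that decides whether a page looks like a block notice.
class BlockDetector {
    // Assumed marker phrases; adjust these to match the target site.
    static final List<String> BLOCK_MARKERS = List.of("access denied", "unusual traffic");

    // Returns true if the page text contains any marker, ignoring case.
    static boolean looksBlocked(String pageText) {
        if (pageText == null) {
            return false;
        }
        String lower = pageText.toLowerCase(Locale.ROOT);
        for (String marker : BLOCK_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    // Reports the current proxy as bad when the page looks blocked; returns whether it reported.
    static boolean reportIfBlocked(ProxySession session, String pageText) {
        if (looksBlocked(pageText)) {
            session.currentProxyServerIsBad();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] reports = {0};
        ProxySession fakeSession = () -> reports[0]++;   // counts calls instead of touching real proxies
        reportIfBlocked(fakeSession, "<h1>Access Denied</h1>");
        reportIfBlocked(fakeSession, "<h1>Product list</h1>");
        System.out.println("Proxies reported as blocked: " + reports[0]);   // Proxies reported as blocked: 1
    }
}

// Hypothetical stand-in for the one method of screen-scraper's session object used here.
interface ProxySession {
    void currentProxyServerIsBad();
}
```

Inside a real screen-scraper script you would call session.currentProxyServerIsBad() directly on the session object that screen-scraper provides, and the page text would come from wherever your script has access to the response.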