Settings

Overview

This section contains a description of each of the screens found in the Settings window, which can be displayed by selecting Settings from the Options menu, or by clicking the wrench icon in the button bar.

General Settings

  • Connection timeout: At times remote web servers will experience problems after screen-scraper has made a connection. When this happens the server will often hold on to the connection to screen-scraper, causing it to appear to freeze. Designating a connection timeout avoids this situation. Generally around 30 seconds is sufficient.
  • Data extractor timeout: In certain cases complex extractor patterns can take an abnormally long time when being applied. You'll likely want to designate a timeout so that screen-scraper doesn't get stuck while applying a pattern. Typically it should not take longer than 2 or 3 seconds to apply a pattern.
  • Maximum number of concurrent running scraping sessions (professional and enterprise editions only): When screen-scraper is running as a server you'll often want to limit the number of scraping sessions that can be run simultaneously, so as to avoid consuming too many resources on a machine. This setting controls how many will be allowed to run at a time. Note that this only applies when a lazy scrape is being performed.
  • Maximum application memory allocation in megabytes: This setting controls the amount of memory screen-scraper will be allowed to consume on your computer. In cases where you notice sluggish behavior or OutOfMemoryError messages appearing in the error.log file (found in the log directory for your screen-scraper installation folder), you'll likely want to increase this number.
  • Default proxy session to use when running in server mode (enterprise edition only): When screen-scraper is running as a server it can also run the proxy server. If you designate a proxy session in this drop-down box screen-scraper will make use of its scripts.
  • Installation directory: In virtually all cases this setting can be left untouched. If you move the screen-scraper installation directory you may need to manually set this.
  • Automatically check for updates on startup (professional and enterprise editions only): If this box is checked screen-scraper will automatically check for updates and notify you if one is available.
  • Allow upgrading to unstable versions (professional and enterprise editions only): If this box is checked, selecting Check for updates from the Options menu will give you the option to download alpha/unstable versions of the software.
  • Default character set (professional and enterprise editions only): Indicates the character set that should be used when one is not designated by the remote server. When scraping sites that use a Roman character set you'll likely want to use ISO-8859-1; otherwise, UTF-8 is probably what you'll want to use. A comprehensive list of supported character sets can be found here. Your web browser will also generally be able to tell you what character set a particular site is using. Even so, when scraping international character sets it can sometimes take trial and error to determine which character set works best. For more information see
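The trial and error mentioned for the default character set can be tried outside screen-scraper. The following Python sketch is not part of screen-scraper; it decodes the same bytes with several candidate character sets so the results can be compared by eye. The function name decode_candidates and the sample text are assumptions made for illustration.

```python
def decode_candidates(raw, charsets=("utf-8", "iso-8859-1")):
    """Decode raw bytes with each candidate character set.

    Returns a dict mapping charset name to the decoded text, or to None
    when the bytes are not valid in that character set.
    """
    results = {}
    for name in charsets:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: the word "café" as a server might send it in UTF-8.
sample = "café".encode("utf-8")
for name, text in decode_candidates(sample).items():
    print(name, "->", text)
```

Bytes that were sent as UTF-8 but read as ISO-8859-1 typically show up as pairs of accented characters in place of a single one, which is a useful sign that the wrong character set was selected.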

    Server Settings

    Server Settings (professional and enterprise editions only)

    Server (professional and enterprise editions only)

    These settings apply when screen-scraper is running in server mode.

    • Port: Sets the port screen-scraper will listen on when running as a server.
    • Generate log files: If checked, a log file will be generated in the log folder each time a scraping session is run.
    • Hosts to allow to connect: A comma-delimited list of host names and IP addresses that should be allowed to connect to screen-scraper. Caution should be exercised whenever a network service is running on a computer, and screen-scraper is no exception. If this box is blank screen-scraper will allow any machine to connect to it. This is not recommended unless the machine on which screen-scraper is running is protected by external firewalls. For example, if localhost is designated, screen-scraper will only allow connections from the local machine. Portions of IP addresses can also be designated. For example, if 192.168 were designated, the following IP addresses would be allowed to connect: 192.168.2.4, 192.168.4.93, etc. Note that this setting applies both to the proxy server and to screen-scraper running in server mode.
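The matching rules for the allowed hosts list can be sketched in a few lines of Python. This is an illustration of the behavior described above, not screen-scraper's own code: the function names are hypothetical, and the assumption that a partial IP address matches on whole dotted segments is ours.

```python
def parse_allowed_hosts(setting):
    """Split the comma-delimited setting into a list of non-empty entries."""
    return [entry.strip() for entry in setting.split(",") if entry.strip()]


def is_allowed(remote, setting):
    """Return True if `remote` may connect under the given setting.

    Assumed rules, taken from the description above:
      * a blank setting allows every host;
      * an entry matches a host name or IP address exactly;
      * a numeric entry such as "192.168" also matches any IP address
        that begins with those dotted segments.
    """
    entries = parse_allowed_hosts(setting)
    if not entries:
        return True
    for entry in entries:
        if remote == entry:
            return True
        prefix = entry.rstrip(".")
        # Only numeric entries are treated as leading portions of an IP.
        looks_like_ip = all(part.isdigit() for part in prefix.split("."))
        if looks_like_ip and remote.startswith(prefix + "."):
            return True
    return False


print(is_allowed("192.168.2.4", "localhost, 192.168"))
print(is_allowed("10.0.0.1", "localhost, 192.168"))
```

The first call prints True and the second prints False. An empty setting returns True for every caller, which is why leaving the box blank is discouraged.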

    Proxy Server (professional and enterprise editions only)

    These settings apply only to the proxy server portion of screen-scraper.

    • Port: Sets the port screen-scraper's proxy server should listen on.
    • Don't log binary files: If this box is checked screen-scraper will not log any binary files (e.g., images and Flash files) in the HTTP Transactions table for proxy sessions.

    Mail Server (professional and enterprise editions only)

    These settings are used with the sutil.sendMail method in screen-scraper scripts.

    • Host: The host the mail should be sent through.
    • Username: The username required to authenticate to the mail server in order to send mail through it. Note that this may not be required by the mail server.
    • Password: The password required to authenticate to the mail server in order to send mail through it. Note that this may not be required by the mail server.
    • Port: The port that should be used when connecting to the host (corresponding setting in resource/conf/screen-scraper.properties file: MailServerPort=PortNumber).
    • Use TLS/SSL: Whether or not TLS/SSL encryption should be used when communicating with the host (corresponding setting in resource/conf/screen-scraper.properties file: MailServerUsesTLS=true).
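As a rough illustration of the two property names mentioned above, the following Python sketch reads MailServerPort and MailServerUsesTLS from properties-style text. It is a simplified reader written for this example (plain key=value lines only, without the escapes or line continuations that Java properties files allow), and the sample value 587 is an assumption, not a documented default.

```python
def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def mail_settings(props):
    """Extract the two mail keys named above; None when a key is absent."""
    port = props.get("MailServerPort")
    tls = props.get("MailServerUsesTLS")
    return {
        "port": int(port) if port is not None else None,
        "use_tls": (tls.lower() == "true") if tls is not None else None,
    }


# Hypothetical file contents; 587 is an example value only.
sample = """
# mail settings
MailServerPort=587
MailServerUsesTLS=true
"""
print(mail_settings(read_properties(sample)))
```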

    Web/SOAP Server (professional and enterprise editions only)

    These settings apply only to the web interface and SOAP server features of screen-scraper.

    • Port: Sets the port screen-scraper's web/SOAP server should listen on. When accessing the web interface, this number will determine what goes after the colon in the URL. For example, if this number is left at the default value (8779), you would access screen-scraper's web interface with this URL: http://localhost:8779/.

External Proxy Settings

Unless you normally connect to the Internet through an external proxy server, you don't need to modify these settings.

  • External proxy authentication: These text boxes are used in cases where you need to connect to the Internet via an external proxy server.
    • Username: Your username on the proxy server.
    • Password: Your password on the proxy server.
    • Host: The host/domain of the proxy server.
    • Port: The port to connect to on the proxy server.
  • External NT proxy authentication: These text boxes are used in cases where you need to connect to the Internet via an external NT proxy server.

    If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard proxy as well as the NTLM one.

    • Username: Your username on the NT proxy server.
    • Password: Your password on the NT proxy server.
    • Domain: The domain/group name that the NT proxy server uses.
    • Host: The host of the proxy server.
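To show how the four standard proxy values (username, password, host, and port) fit together, here is a minimal Python sketch using the standard library's urllib. It is a general illustration of connecting through an authenticating proxy, not a description of how screen-scraper does it, and it does not cover NTLM authentication. The function names and the sample host, port, and credentials are hypothetical.

```python
from urllib.parse import quote
from urllib.request import ProxyHandler, build_opener


def proxy_url(host, port, username=None, password=None):
    """Build a proxy URL of the form http://user:pass@host:port."""
    credentials = ""
    if username:
        # Percent-encode so characters such as "@" or ":" stay unambiguous.
        credentials = quote(username, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"http://{credentials}{host}:{port}"


def opener_for_proxy(host, port, username=None, password=None):
    """Return a urllib opener that routes HTTP and HTTPS through the proxy."""
    url = proxy_url(host, port, username, password)
    return build_opener(ProxyHandler({"http": url, "https": url}))


# Hypothetical values for illustration only.
print(proxy_url("proxy.example.com", 3128, "alice", "p@ss"))
```

Building the opener does not contact the network; requests are only sent through the proxy when the opener's open method is called with a URL.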

Anonymous Proxy Settings

Anonymous Proxy Settings (professional and enterprise editions only)

  • Password: The password you received from screen-scraper when you set up your anonymization account.

    This setting is available in the screen-scraper.properties file as AnonymousProxyPassword.

  • Allowed IP addresses: The IP addresses of the machine(s) you wish to allow to connect to your screen-scraper server.

    This field expects a comma-delimited list of IP addresses that screen-scraper should accept connections from. You can also specify just the beginning portions of IP addresses. For example, if you enter 111.22.333 screen-scraper would accept connections from 111.22.333.1, 111.22.333.2, 111.22.333.3, etc.

    If nothing is entered into this text box screen-scraper will accept connections from any IP address. This is not generally encouraged.

    This setting is available in the screen-scraper.properties file as AnonymousProxyAllowedIPs.

  • Get the IP address for this computer: Retrieves the IP address of the computer that screen-scraper is running on. This is provided to help you specify the correct IP address for the Allowed IP addresses field.
  • Max running servers: The maximum number of proxy servers to keep running. Servers whose IP addresses are blocked will be replaced, up to the number indicated. A value greater than 5 and less than 10 is recommended.

    This setting is available in the screen-scraper.properties file as AnonymousProxyMaxRunning.

  • Number of running instances: The total number of proxy servers running anonymous scrapes.
  • Refresh: Retrieves the current number of running proxy servers.
  • Terminate all running proxy servers: Shuts down all running proxy servers.

    Because you pay for proxy servers by the hour, you will need to use this button to shut them down if your scraping session is not set up to terminate them automatically when it finishes.
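The three property names given above (AnonymousProxyPassword, AnonymousProxyAllowedIPs, and AnonymousProxyMaxRunning) can be checked against the recommendations in this section with a short Python sketch. The function check_anonymous_proxy and the sample values are hypothetical, and the checks simply restate the guidance above rather than anything screen-scraper enforces.

```python
def check_anonymous_proxy(props):
    """Return a list of warnings for the anonymous proxy properties."""
    warnings = []
    if not props.get("AnonymousProxyPassword", "").strip():
        warnings.append("AnonymousProxyPassword is empty")
    if not props.get("AnonymousProxyAllowedIPs", "").strip():
        # A blank list means connections are accepted from any IP address.
        warnings.append("AnonymousProxyAllowedIPs is empty: any IP address may connect")
    raw_max = props.get("AnonymousProxyMaxRunning", "").strip()
    if not raw_max.isdigit():
        warnings.append("AnonymousProxyMaxRunning is not a whole number")
    elif not 5 < int(raw_max) < 10:
        warnings.append("AnonymousProxyMaxRunning is outside the recommended range")
    return warnings


# Hypothetical settings; the password and values are placeholders.
settings = {
    "AnonymousProxyPassword": "secret",
    "AnonymousProxyAllowedIPs": "111.22.333",
    "AnonymousProxyMaxRunning": "7",
}
print(check_anonymous_proxy(settings))
```

With these sample values the list of warnings is empty. Clearing AnonymousProxyAllowedIPs or setting AnonymousProxyMaxRunning to 12 would each add one warning.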

Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site is blocking too many requests from a given IP address.

There are a few different ways to go about this using screen-scraper. We will discuss how to set up anonymization in screen-scraper later in the documentation.