Documentation

Getting started


screen-scraper Quick Start Guide
Learn how to use screen-scraper in under 3 minutes.

Typical steps:

  1. Use the proxy server to determine which files to scrape. It's frequently necessary to request a few files before you can get at the file that contains the data you need (e.g. you may need to log in to the site first). The proxy server allows you to surf a site as you normally would, then easily select files you need to have scraped.
  2. Organize and configure files to be scraped. Once you've selected the files to scrape you'll typically need to organize and sequence them. You'll also usually tweak information related to the files, such as POST data to be sent or authentication tokens.
  3. Create extractor patterns. Extractor patterns provide an intuitive way to selectively identify snippets of data you want extracted from individual pages.
  4. Create scripts. Scripts let you do something with the data that gets extracted. This might be writing the data out to a formatted file or inserting the information into a database.

The best way to learn to use screen-scraper is by going through our tutorials.

Helpful Links

These links allow you to get a general feel for screen-scraper. They are not representative of all that can be done. Each link simply jumps you to another section of this documentation.

On the proxy server:

On the scraping engine:

On extractor patterns:

On scripts:

Installation

Overview

screen-scraper will run on any operating system that supports version 1.8 or higher of the Java Virtual Machine. The installation process is almost always very simple, but you may want to read through these pages if you run into trouble or would like to know some related details.

Installation Requirements

  • Support for Java Runtime Environment of 1.8 or higher

Explanation

The only specific requirement for installing screen-scraper is that your operating system support a Java Runtime Environment of 1.8 or higher. screen-scraper has been tested on Microsoft Windows, Linux, Mac OS X, and other platforms that support such a JRE, and all of them have run the software without any major changes.

The Windows installer comes with a runtime environment included. Mac OS X and Linux should already have a Java Runtime Environment installed. For help installing screen-scraper on other platforms (e.g., Solaris, FreeBSD) please contact us.

See also:
How much memory and what type of CPU is recommended for screen-scraper?
Scaling & Optimizing screen-scraper

Installation Instructions

Overview

To download screen-scraper, see the page for the edition you'd like to install. Run the installer to set installation options and install screen-scraper. For headless servers, run the installer with the -C flag to indicate that the system is headless and that the installer shouldn't try to open graphical popups for installation options.

You may want to compare editions before choosing which to install.

Troubleshooting Linux Installs

screen-scraper ships with a Java Runtime Environment that should work in most distributions of Linux. Because the distributions can vary quite a bit, however, you may need to install a separate Java Runtime Environment and then point screen-scraper to it. If you're able to successfully install screen-scraper, but are having trouble starting it, try downloading the latest Java Runtime Environment from www.java.com for your particular distribution. Once you've installed the JRE, you can point screen-scraper to it by modifying the screen-scraper and server start scripts located in the screen-scraper installation folder.

  1. Open both files in a text editor.
  2. Locate the INSTALL4J_JAVA_HOME_OVERRIDE property (near the top of both files).
  3. Uncomment the property by removing the pound (#) in front of it.
  4. Set the value to the location of the JRE you installed on your system.

For example, the property might look like this:

 INSTALL4J_JAVA_HOME_OVERRIDE=/usr/java/jre_1.8.0_u183/

This tells screen-scraper to use the JRE located at the given path rather than the one it ships with, or some other JRE that might be on your system.

screen-scraper License Agreement

screen-scraper License Agreement Copyright © 2002-2014 by ekiwi, LLC.
All Rights Reserved.

YOUR AGREEMENT TO THIS LICENSE

After reading this agreement carefully, if you ("Customer") do not agree to all of the terms of this End-User License Agreement ("EULA"), you may not use this Software (hereafter referred to as "Software Product"). Unless you have a different license agreement signed by ekiwi, LLC (hereafter referred to as "ekiwi") that covers this copy of the Software Product, your use of this Software Product indicates your acceptance of this EULA. All updates to the Software Product shall be considered part of the Software Product and subject to the terms of this EULA. Changes to this EULA may accompany updates to the Software Product, in which
case by installing such update Customer accepts the terms of the EULA as changed. The EULA is not otherwise subject to addition, amendment, modification, or exception unless in writing signed by an officer of both Customer and ekiwi. A software license and a license key ("Software Product License"), issued to a designated user only by ekiwi, is required for each concurrent user of the Software Product. By explicitly accepting this EULA you are acknowledging and agreeing to be bound by the following terms:

1. EVALUATION PERIOD

This Software Product may be used in conjunction with a free evaluation Software Product License. You may use the evaluation copy of the Software Product for only thirty (30) days in order to determine whether to purchase the Software Product, after which the Software Product will cease to function. ekiwi bears no liability for any damages resulting from use of the Software Product, and has no duty to provide any support before or after the expiration date of an evaluation license.

2. GRANT OF NON-EXCLUSIVE LICENSE

You may not tamper with, alter, or use the Software Product in a way that disables, circumvents, or otherwise defeats its built-in licensing verification and enforcement capabilities. You may not modify or create derivative copies of the Software Product or this EULA. All rights not expressly granted to you are retained by ekiwi.

ekiwi grants the non-exclusive, non-transferable right for a single user to use this Software Product. Each additional concurrent user of the Software Product must obtain an additional Software Product License. You may install the Software Product on as many computer systems as desired, so long as two copies of the same Software Product License never come into concurrent use.

3. INTELLECTUAL PROPERTY

The Software Product is owned by ekiwi and is protected by international copyright laws and treaties, as well as other intellectual property laws and treaties. You must not remove or alter any copyright notices on any copies of the Software Product. This Software Product copy is licensed, not sold. You may not use, copy, or distribute the Software Product, except as granted by this EULA, without written authorization from ekiwi. ekiwi reserves all intellectual property rights, including copyrights, patents, and trademarks.

4. TRANSFERABILITY

Customer may not rent, lease, lend, or in any way distribute or transfer any rights in this EULA or the Software Product to third parties without ekiwi's written approval, and subject to written agreement by the recipient of the terms of this EULA.

5. PROHIBITION ON REVERSE ENGINEERING AND DECOMPILATION

You may not reverse engineer, decompile, defeat license encryption mechanisms, or disassemble the Software Product or Software Product License except and only to the extent that such activity is expressly permitted by applicable law notwithstanding this limitation.

6. INDEMNIFICATION

You hereby agree to indemnify ekiwi against and hold harmless ekiwi from any claims, lawsuits, liability or other losses that arise out of your breach of any provision of this EULA.

7. THIRD PARTY SOFTWARE

Any software provided along with the Software Product that is associated with a separate license agreement is licensed to you under the terms of that license agreement (which license is provided with the Software Product). This license does not apply to those portions of the Software Product.

8. SUPPORT SERVICES

ekiwi may provide you with support services related to the Software Product. Use of any such support services is governed by ekiwi policies and programs described in online documentation and/or other ekiwi-provided materials.

As part of these support services, ekiwi may make available bug lists, planned feature lists, and other supplemental informational materials. ekiwi makes no warranty of any kind for these materials and assumes no liability whatsoever for damages resulting from any use of these materials. Furthermore, you may not use any materials provided in this way to support any claim made against ekiwi.

Any supplemental software code or related materials that ekiwi provides to you as part of the support services, in periodic updates to the Software Product or otherwise, is to be considered part of the Software Product and is subject to the terms and conditions of this EULA.

With respect to any technical information you provide to ekiwi as part of the support services, ekiwi may use such information for its business purposes without restriction, including for product support and development. ekiwi will not use such technical information in a form that personally identifies you without first obtaining your permission.

9. TERMINATION

This EULA terminates on the date of the first occurrence of either of the following events: (1) The expiration of one (1) month from written notice of termination from Customer to ekiwi; or (2) One party materially breaches any terms of this EULA or any terms of any other agreement between Customer and ekiwi, that are either uncorrectable or that the breaching party fails to correct within one (1) month after written notification by the other party.

10. NO WARRANTIES

YOU ACCEPT THE SOFTWARE PRODUCT AND SOFTWARE PRODUCT LICENSE "AS IS," AND EKIWI MAKES NO WARRANTY AS TO ITS USE, PERFORMANCE, OR OTHERWISE. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, EKIWI DISCLAIMS ALL OTHER REPRESENTATIONS, WARRANTIES, AND CONDITIONS, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE, INCLUDING, BUT NOT LIMITED TO, IMPLIED WARRANTIES OR CONDITIONS OF MERCHANTABILITY, SATISFACTORY QUALITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, AND NON-INFRINGEMENT. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE PRODUCT REMAINS WITH YOU.

11. LIMITATION OF CONSEQUENTIAL DAMAGES

NEITHER EKIWI NOR ANYONE INVOLVED IN THE CREATION, PRODUCTION, OR DELIVERY OF THIS SOFTWARE SHALL BE LIABLE FOR ANY INDIRECT, CONSEQUENTIAL, OR INCIDENTAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE SUCH SOFTWARE EVEN IF EKIWI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES OR CLAIMS. IN NO EVENT SHALL EKIWI'S LIABILITY FOR ANY DAMAGES EXCEED THE PRICE PAID FOR THE LICENSE TO USE THE SOFTWARE, REGARDLESS OF THE FORM OF CLAIM. EKIWI SHALL IN NO WAY BE HELD LIABLE OR RESPONSIBLE FOR ANY UNLAWFUL OR ILLEGAL USE OF THE SOFTWARE PRODUCT, INCLUDING, BUT NOT LIMITED TO, THE EXTRACTION AND USE OF COPYRIGHTED DATA FROM EXTERNAL SOURCES (E.G. WEB PAGES). THE PERSON USING THE SOFTWARE BEARS ALL RISK AND RESPONSIBILITY AS TO THE USE, QUALITY, AND PERFORMANCE OF THE SOFTWARE.

12. HIGH RISK ACTIVITIES

The Software Product is not fault-tolerant and is not designed, manufactured or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, including, but not limited to, in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, and weapons systems, in which the failure of the Software Product, or any software, tool, process, or service that was developed using the Software Product, could lead directly to death, personal injury, or severe physical or environmental damage ("High Risk Activities"). Accordingly, ekiwi and its suppliers and licensors specifically disclaim any express or implied warranty of fitness for High Risk Activities. You agree that ekiwi and its suppliers and licensors will not be liable for any claims or damages arising from the use of the Software Product, or any software, tool, process, or service that was developed using the Software Product, in such applications.

13. GENERAL

This EULA is the complete statement of the agreement between the parties on the subject matter, and merges and supersedes all other or prior understandings, purchase orders, agreements and arrangements.

This EULA shall be governed by the laws of the State of Utah. Exclusive jurisdiction and venue for all matters relating to this EULA shall be in courts located in the State of Utah, and you consent to such jurisdiction and venue. If any action is brought by either party to this EULA against the other party regarding the subject matter hereof, the prevailing party shall be entitled to recover, in addition to any other relief granted, reasonable attorney fees and expenses of litigation.

You acknowledge that, in the event of your breach of any of the foregoing provisions, ekiwi will not have an adequate remedy in money or damages. ekiwi shall therefore be entitled to obtain an injunction against such breach from any court of competent jurisdiction immediately upon request. ekiwi's right to obtain injunctive relief shall not limit its right to seek further remedies.

There are no third party beneficiaries of any promises, obligations or representations made by ekiwi, LLC herein. Any waiver by ekiwi, LLC of any violation of this EULA by you shall not constitute or contribute to a waiver of any other or future violation by you of the same provision, or any other provision, of this EULA.

14. CONTACT INFORMATION

If you have any questions about this EULA, or if you want to contact ekiwi for any reason, please direct correspondence to [email protected]

Browser Configuration

Overview

screen-scraper's proxy server is a valuable tool for manipulating server interactions and building scrapes. In order for the proxy server to gather information from a browser, the browser must be configured to make requests through it. As many people have never done this before, we have provided instructions on how to set up some common browsers to use a proxy server. It is a simple process, but one that might be a little foreign.

When you are done developing your sessions you will likely want to reset the proxy settings back to their normal state. This is not required so long as the proxy server is still running, but once the proxy server is stopped the proxy settings will cause configured browsers to stop working until the settings are reset.

Alternate Proxy

We've also added support to import proxy sessions from Charles proxy. You can proxy using Charles and then export the data to a "JSON Session File" and import that into screen-scraper. This can be a helpful alternative for proxying SSL sites.

Choosing a Proxy Browser

Though any browser can be used to record transactions on the proxy server, we have found that some tend to experience fewer problems, complications, and issues than others. With that in mind, you might want to take some time to think about which browser you want to use when proxying a site.

If you are experiencing issues with transactions being recorded as errors, these can often be the result of browser plug-ins/add-ons. We have found that Internet Explorer is especially prone to them, whereas Opera tends to have the fewest issues.

Configure Chrome

Chrome Proxy Settings

Windows users:

  1. Open Options under the wrench icon.
  2. Click on the Under the Hood link in the left-hand navigation.
  3. Under Network click on the Change proxy settings... button.
  4. Click on LAN Settings.
  5. Click on the checkbox beginning with Use a proxy server for....
  6. Click on the Advanced button.
  7. In the HTTP and Secure fields type localhost under the Proxy address to use column, and 8777 under Port.

    If you have changed your proxy server settings to use a port other than 8777 then type your selected port in place of 8777.

  8. Hit the OK button a few times until you get back to your web browser.

If you're using a dial-up connection the setup will differ slightly. Instead of the LAN Settings button you'll want to find your dial-up connection under the Dial-up and Virtual Private Network settings dialog box, then configure it via the Settings button.

Depending on your operating system, instead of localhost you may need to use either 127.0.0.1 or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

Linux users:

  1. Open Preferences under the wrench icon.
  2. Click on the Under the Hood link in the left-hand navigation.
  3. Under Network click on the Change proxy settings... button.
  4. Under the Proxy Configuration tab select the Manual proxy configuration radio button.
  5. Check the box next to Use the same proxy for all protocols.
  6. Enter localhost in the HTTP proxy field.
  7. Enter 8777 in the Port field.
  8. Click on the Ignored Hosts tab.
  9. Enter the following in the Ignore Host List:
    • localhost
    • 127.0.0.0/8
    • *.local
  10. Click Close
  11. We recommend selecting Close when prompted Do you want to apply these settings system-wide...

If you're using a dial-up connection the setup will differ slightly. Instead of the LAN Settings button you'll want to find your dial-up connection under the Dial-up and Virtual Private Network settings dialog box, then configure it via the Settings button.

Depending on your operating system, instead of localhost you may need to use either 127.0.0.1 or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

Mac OS X users:

  1. Open Preferences under the wrench icon.
  2. Click on the Under the Hood link in the left-hand navigation.
  3. Under Network click on the Change proxy settings... button.
  4. Select Manually next to Configure Proxies.
  5. Check the box next to Web Proxy (HTTP).
  6. Enter localhost under Web Proxy Server and 8777 in the adjoining port field.
  7. Check the box next to Secure Web Proxy (HTTPS).
  8. Enter localhost under Web Proxy Server and 8777 in the adjoining port field.
  9. Under Bypass proxy settings for these Hosts & Domains enter localhost; 127.0.0.1.

If you're using a dial-up connection the setup will differ slightly. Instead of the LAN Settings button you'll want to find your dial-up connection under the Dial-up and Virtual Private Network settings dialog box, then configure it via the Settings button.

Depending on your operating system, instead of localhost you may need to use either 127.0.0.1 or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

Configuring Firefox

Firefox Proxy Settings

  1. Open Options/Preferences window.

    Windows users: click Options from the Tools menu.

    Linux users: click Preferences from the Edit menu.

    Mac OS X users: click Preferences from the Firefox menu.

  2. Click on the Advanced button at the top of the window (if not already selected).
  3. Click on the Settings... button at the top of the Network tab.
  4. Click the Manual proxy configuration radio button.
  5. In the HTTP Proxy field type localhost, and 8777 in Port.

    If you have changed your proxy server settings to use a port other than 8777 then type your selected port in place of 8777.

  6. Click on the Use this proxy server for all protocols check box.
  7. Hit the OK button to get back to your web browser.

If you're using a dial-up connection the setup will differ slightly. Instead of the LAN Settings button you'll want to find your dial-up connection under the Dial-up and Virtual Private Network settings dialog box, then configure it via the Settings button.

Depending on your operating system, instead of localhost you may need to use either 127.0.0.1 or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

For useful add-ons, visit the Browser Tools page.

Configuring Internet Explorer

Internet Explorer Proxy Settings

  1. Click Internet Options in the Tools menu.
  2. Go to the Connections tab.
  3. Click on LAN Settings.
  4. Click on the checkbox beginning with Use a proxy server for....
  5. Click on the Advanced... button.
  6. In the HTTP and Secure fields type localhost under the Proxy address to use column, and 8777 under Port.

    If you have changed your proxy server settings to use a port other than 8777 then type your selected port in place of 8777.

  7. Hit the OK button a few times until you get back to your web browser.

If you're using a dial-up connection the setup will differ slightly. Instead of the LAN Settings button you'll want to find your dial-up connection under the Dial-up and Virtual Private Network settings dialog box, then configure it via the Settings button.

Depending on your operating system, instead of localhost you may need to use either 127.0.0.1 or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

Configuring Opera

Opera Proxy Settings

  1. Click Preferences in the Tools menu.

    Windows 7 users will need to select Preferences from the Settings menu after clicking to open Opera's menu.

    Mac OS X users will need to select Preferences... from the Opera menu.

  2. Click on the Advanced tab.
  3. Click on the Network section listed near the bottom on the left.
  4. Click on the Proxy Servers... button.
  5. Enable the first HTTP checkbox, and enter localhost or 127.0.0.1 in the first field and 8777 in the port box.

    If you have changed your proxy server settings to use a port other than 8777 then type your selected port in place of 8777.

If you're using a dial-up connection the setup will differ slightly. Instead of the LAN Settings button you'll want to find your dial-up connection under the Dial-up and Virtual Private Network settings dialog box, then configure it via the Settings button.

Depending on your operating system, instead of localhost you may need to use either 127.0.0.1 or the IP address of the machine. If you have trouble connecting to screen-scraper's proxy with your web browser, please see this FAQ.

Add a Quick On/Off Button to Toolbar

To simplify the whole process of turning the proxy server on and off, you can add a proxy button to a toolbar. One handy location is on the far right of the tab bar. To add the button:

  1. Right-click just about any of Opera's interface toolbars and select Customize....

    You might need to select Appearance... explicitly before continuing.

  2. Click on the Buttons tab.
  3. Select Preferences to open its options.
  4. Drag and drop the Enable Proxy Servers button onto your toolbar.

Dragonfly

New to Opera 9.5.x is a feature called Dragonfly. You can access it by selecting Developer Tools from Tools > Advanced, or by pressing control-shift-I. If you're familiar with Firefox's Firebug add-on, you'll quickly recognize Dragonfly. It's built on the same ideas, allowing you to manipulate CSS in real time and see which properties are being overwritten by others. You can debug JavaScript running on the page, find HTML elements on the page by clicking on them so that Dragonfly shows you the corresponding page source, see all final properties of elements, and so on. It's a great tool for sifting through a website.

Settings

Overview

Most of the general settings for screen-scraper are available through the workbench settings window. A handful of rarely used settings are not adjustable in the workbench; these properties can be edited manually in the screen-scraper.properties file in screen-scraper's resource/conf directory, using your favorite text editor.

If you're running either the Basic Edition or the Professional Edition, note that you should only alter the file while screen-scraper is not running. The application won't pick up the new settings until it restarts, and if you edit the file while it's running it may overwrite your changes.

If running the Enterprise Edition you have two options for reloading the screen-scraper.properties file while screen-scraper is running in server mode.

Notable Settings not available in the Workbench

Example Properties File

Overview

The screen-scraper.properties file can be found in the resource/conf directory of screen-scraper's installation. Most available settings and some sample values are listed below.

For the sake of readability the settings are listed here in alphabetical order. In an actual settings file they are stored in serialized order, not alphabetical.

Example File Contents

#This file is manipulated by screen-scraper. Edit it manually at your own risk!
#Wed Mar 31 16:43:33 MDT 2017
AllowMultipleSimultaneousInstances=true
AllowProxyScripting=false
AnonymousProxyAllowedIPs=192.168,127.0,localhost,0\:0\:0\:0\:0\:0
AnonymousProxyPassword=
AnonymousProxyMaxRunning=5
AutoSaveTime=600
BreakpointFrame.LastHeight=545
BreakpointFrame.LastWidth=711
BreakpointFrame.LastX=1064
BreakpointFrame.LastY=289
CheckForUpdatesOnStartup=true
CheckLogAutoScroll=true
CommandLine.NumTimesRun=1
ConnectionTimeout=180
DatabaseHost=localhost
DatabasePort=9003
DataExtractorTimeout=30
DefaultCharacterSet=UTF-8
DefaultFont=ArialUnicodeMS
DefaultProxySession=
DefaultRepeatDays=
DefaultRepeatHours=
DefaultRepeatMinutes=
DefaultRepeatSeconds=
DefaultThresholdRecordCount=
DefaultThresholdTime=
DefaultTimeout=
DividerLocation=269
DontLogBinaryFiles=true
DownloadUnstableUpdates=true
Edition=Enterprise
EnableWebServer=true
EnableCachingAndFilteringDataSets=false
EnableCodeFoldingInLastResponse=true
ExternalNTProxyAuthentication=
ExternalNTProxyDomain=
ExternalNTProxyHost=
ExternalNTProxyPassword=
ExternalNTProxyUsername=
ExternalProxyAuthentication=foo\:bar
ExternalProxyHost=
ExternalProxyPassword=
ExternalProxyPort=
ExternalProxyUsername=
FilterHTTPTransactions=false
ForceOverwriteScripts=false
GenericCompletions=true
HelpBrowser=the built-in help browser
InstallDirectory=C\:\\Program Files\\screen-scraper Enterprise Edition\\
IPAddressesToAllow=
LastSelectedDirectory=C\:\\Program Files\\screen-scraper Enterprise Edition\\
LogDebugColor=\#000000
LogDebugBackgroundColor=\#ffffff
LogErrorColor=\#ff0000
LogErrorBackgroundColor=\#ffffff
LogHighMemoryUseInformation=true
LogInfoColor=\#0000ff
LogInfoBackgroundColor=\#ffffff
LogWarnColor=\#ffcc33
LogWarnBackgroundColor=\#000000
LookAndFeel=Native
MailServerHost=
MailServerPassword=
MailServerPort=
MailServerUsername=
MailServerUsesTLS=
MainFrame.LastHeight=847
MainFrame.LastWidth=1095
MainFrame.LastX=591
MainFrame.LastY=201
MaximumDisplayedLastResponseLength=1500
MaximumMemoryAllocation=314
MaxConcurrentScrapingSessions=100
MaxScrapeableSessionsToLoad=50
MaxScrapingSessionLogFileSize=10
Messages.DoesUserWantToViewTutorials=false
Nickname=My screen-scraper
OutputLogFiles=true
OverrideServerEncoding=true
ProxyForceAllHTTPRequestsToHTTPS=false
ProxyPort=8777
SaveLargeFields=true
Server.NumTimesRun=2186
ServerPort=8778
SettingsFrame.LastHeight=485
SettingsFrame.LastWidth=677
SettingsFrame.LastX=121
SettingsFrame.LastY=22
ShowVariableCompletionsAt=2
SOAPPort=8779
SpawnSeparateDatabaseProcess=true
TokenEditor.LastHeight=369
TokenEditor.LastWidth=418
TruncateWorkbenchRequestPOSTDataLength=100000
UseGlobalExternalProxyForAllScrapingSessions=false
Version=7.0
WebInterfaceUser=username
WebInterfacePassword=password
WebServerShutdownPort=8551
Workbench.MaxLogLines=1000
Workbench.NumTimesRun=9415
WrapScriptText=true

Proxy Server

Description

When running, the proxy server listens on a specified port for incoming HTTP requests from your web browser. Upon receiving a request from your browser the proxy server records it, then sends it along to the server for which it was intended. When that server responds, the proxy server records the response as well, then sends it along to your web browser.

Purpose

screen-scraper's proxy server allows you to view HTTP requests and responses as they pass between your web browser and remote servers. Scraping files from web sites involves a few more details than you typically worry about when surfing, such as HTTP headers and POST data. The proxy server makes all of these details visible to you.

Viewing HTTPS requests

Often one of the headaches of scraping information from sites that use HTTPS is that it's not always easy to tell what's getting passed back and forth in the way of cookies, POST data, etc. Even if you put a proxy server in the way that lets you view the requests and responses, the information is encrypted as it's leaving your browser and as it's leaving the web server that responds to the request. screen-scraper gets around this problem by using its own temporary certificate to encrypt traffic between itself and the browser, then encrypting each request before sending it up to the server. The result is that your browser will issue a warning about the certificate that screen-scraper returned. You can safely accept the certificate and be assured that all your traffic is encrypted.

We've also used other proxy software, such as Charles proxy, for handling SSL sites. They have additional features to allow the browser to trust the certificates so you don't see a warning in the browser. We've also added an import proxy session feature so you can import a JSON Session File from Charles and use it in screen-scraper to build a scraping session.

Running the proxy in server mode

This feature is only available to Professional and Enterprise editions of screen-scraper.

screen-scraper has the ability to act as a proxy while in server mode. Combined with the ability to execute scripts, this functionality opens up many possibilities for how you use screen-scraper. More information about how to go about using screen-scraper in this capacity is available on our using scripts with the proxy server page.

Using the Proxy Server

Configure the proxy server

First, create a proxy session to organize your interactions with specific web sites.

Configure your web browser

Configuring a web browser to use a proxy server is generally pretty straightforward, but varies somewhat for each browser. We have provided instructions on how to set up several common browsers:

Run the proxy server

Assuming you've configured everything and set up a proxy session, from here you should be able to start up the proxy server by selecting your proxy session in the objects tree and then clicking on the Start Proxy Server button in the general tab. Now just surf the pages that you want to record.

View requests and responses

After you've surfed a bit with your web browser click on the progress tab. From here you can view all of the HTTP and HTTPS requests and responses logged by the proxy server. Clicking on a transaction brings up its details in the lower pane.

Viewing encrypted transactions

If you are using Internet Explorer 7 you will have to adjust your security settings. To do this, open Internet Options in the Tools menu and, under the Security tab, change the security level to medium.

If security settings are not updated you will see an error page when accessing a site that uses HTTPS encryption.


(Screenshot: IE domain mismatch warning)

This warning occurs because screen-scraper is using a temporary certificate for encryption that will not match the URL you are accessing. In this case you can safely ignore the warning by clicking Continue to this website (as a general browsing practice, however, ignoring certificate warnings is not recommended).

Most browsers have recently started preventing you from accepting the certificate. As a work-around you can use another proxy service, such as Charles, to proxy SSL sites by installing its root authority certificate on your server (which prevents an error from showing up while using it). You can then import the proxy data into screen-scraper for building your scrape.

Using an external proxy server

If you normally use an external proxy server when connecting to the internet (on your local area network, for example), you'll need to specify this information in screen-scraper's external proxy settings before you can run the proxy server.

Using Scripts

This feature has been deprecated and by default is not available in the workbench interface. To enable proxy scripting please add AllowProxyScripting=true to your resource/conf/screen-scraper.properties file and restart screen-scraper.

Overview

screen-scraper has the ability to run custom-made scripts while the proxy server is running (more information on starting and stopping the server is available). This allows you to set up blacklists, filter web pages, or otherwise manipulate browser requests and server responses. It is recommended that you read about managing and using scripts before continuing.

Using the scripts

The scripts tab is used to associate scripts with a proxy session. Depending on when you decide to run your script, certain built-in objects unique to the proxy environment will be in scope.

Built-in objects

screen-scraper offers a few objects that you can work with in a script in the proxy environment. See the variable scope section and/or API documentation for more details.

  • proxySession: This object allows for interaction with the currently running proxy session.
  • request: This object allows for interaction with the currently received HTTP request.
  • response: This object allows for interaction with the currently received HTTP response.

Variable scope

Depending on when a script gets run different variables may be in scope (available). The table that follows specifies what variables will be in scope depending on when a given script is run.

When Script is Run           proxySession in scope   request in scope   response in scope
Beginning of proxy session             X
Before HTTP request                    X                     X
After HTTP request                     X                     X
Before HTTP response                   X                     X                  X
After HTTP response                    X                     X                  X
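As a minimal sketch, a script run "Before HTTP request" could log each request as it passes through the proxy. The method names used below (request.getURL() and proxySession.log()) are illustrative assumptions rather than confirmed API; consult the API documentation for the actual signatures.

// Hypothetical "Before HTTP request" proxy script.
// request.getURL() and proxySession.log() are assumed method names
// used for illustration only.
url = request.getURL();
proxySession.log( "Browser requested: " + url );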

Debugging scripts

One of the best ways to fix errors is to simply watch the proxy session log (under the log tab in the proxy session) and the error.log file (located in the log directory of screen-scraper's install directory) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.

Running screen-scraper in Server Mode

Overview

This feature is only available to Professional and Enterprise editions of screen-scraper.

It is strongly advised NOT to run both the server and workbench simultaneously.

There are two main reasons to run screen-scraper as a server:

  1. Interaction with screen-scraper in external scripts: Start/stop and otherwise manage scrapes from your own scripts (including PHP, Java, Ruby, and others). This can be helpful when doing application integration and other such development. More information regarding managing screen-scraper from scripts is available.
  2. Use screen-scraper as a proxy server: Use screen-scraper to set up blacklists or otherwise manipulate browser requests and server responses on an on-going basis.

Managing Server

Windows

If you're running Microsoft Windows, screen-scraper will run as a service. This allows it to run as a background process and doesn't require you to be logged in to the machine on which it's running. screen-scraper gets registered as a service upon installation, and may be run as a server using either the Start server and Stop server links from the Start menu, or the Services control panel applet.

In Windows XP, when the server is running an icon will appear in the system tray. You can right-click this icon to stop the server.

Windows Vista and Windows 7

The screen-scraper service can be started, stopped, and monitored via the Services control panel applet under Administrative Tools. The server can also be started and stopped via the Start server and Stop server shortcuts found under the Start menu.

When the server is running, the system tray icon will not appear as it does in other versions of Windows.

There are a few ways to determine whether or not the server is running in Vista:

  1. Look at the Services control panel applet.
  2. Open a web browser and type in the URL for the web interface (enterprise edition only).

Windows Prompt

As of 5.0 there are three batch files that have been added to screen-scraper to aid in managing the server from the command line.

These files are created when screen-scraper installs and are keyed specifically to that instance of screen-scraper.

Unix/Linux or Mac OS X

Under Unix/Linux or Mac OS X the server is controlled via the server script, which operates much like a typical Unix daemon. The server will run in a background process, allowing you to start and stop the server remotely, or log out of your session after starting it. You can issue the following commands to the server script:

Connection Restrictions

While screen-scraper is running as a server it will accept connections from any other machine unless you specify otherwise. It is important to consider the security of your machine when running any service of this type. screen-scraper allows you to specify the IP addresses of the machine(s) allowed to connect to it via the IP addresses to allow to connect field in the settings window.

This field expects a comma-delimited list of IP addresses from which screen-scraper should accept connections. You can also specify just the beginning portions of IP addresses. For example, if you enter 111.22.33, screen-scraper will accept connections from 111.22.33.1, 111.22.33.2, 111.22.33.3, and so on.

If nothing is entered into this text box screen-scraper will accept connections from any IP address. This is discouraged unless the computer running screen-scraper is protected by an external firewall.

If you need to alter this setting in a GUI-less environment, you can close screen-scraper and edit the resource/conf/screen-scraper.properties file. The setting to change is IPAddressesToAllow. When you start screen-scraper, it will make use of the new setting.
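For example, the relevant line in the properties file might look like this (the addresses shown are placeholders):

 IPAddressesToAllow=192.168.0.5,111.22.33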

Troubleshooting Server Mode

If you're having trouble starting screen-scraper in server mode or running scraping sessions in server mode please see our FAQ on troubleshooting server mode issues.

Scraping Engine

Description

The scraping engine requests files which it then parses, manipulates, and/or stores according to user-defined processes. It is the heart of screen-scraper and has been optimized throughout development to be as efficient as possible. It is made up of multiple parts which can be manipulated using the workbench:

The rest of this section contains information about using screen-scraper, through the workbench, to achieve different goals. These topics can be difficult to understand without some exposure to the software, so we encourage you to go through our first few tutorials before continuing.

Adding Java Libraries

Overview

screen-scraper allows Java libraries to be added to its classpath. Simply copy any jar files you'd like to reference into the lib/ext folder found in screen-scraper's installation directory. The next time you start screen-scraper it will automatically add the jar files to its classpath. Note that you'll still need to use the import statement within your scripts to refer to specific classes:

//import all classes in com.foo.bar
import com.foo.bar.*;

screen-scraper was built on a Java 1.5 platform. Your Java scripts must be able to run on at least a version 1.5 JRE in order to compile and run properly.
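As a quick sketch of using an added library in a script, assuming the com.foo.bar package from the example above provides a (hypothetical) Widget class:

// com.foo.bar.Widget is a made-up example class; substitute a class
// from the jar you actually copied into lib/ext.
import com.foo.bar.*;

widget = new Widget();
session.log( "Loaded: " + widget.getClass().getName() );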

Anonymization

Overview

Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site blocks IP addresses that make too many requests.

There are a few different ways to go about this using screen-scraper:

  1. Automatic
  2. Manual Proxy Pools

If you choose to control anonymized scraping sessions from an external script, it is valuable to read through the documentation on controlling anonymization externally.

Aside from the above methods, you might find our blog posting on how to surf and screen-scrape anonymously helpful. It's slightly dated, but still very relevant.

Automatic Anonymization

Overview

The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. The end effect of this is that the site you're scraping will see any request you make as coming from one of several different IP addresses, rather than your actual IP address. These HTTP proxy servers are actually virtual machines that get spawned and terminated as you need them. You'll use screen-scraper to either manually or automatically spawn and terminate the proxy servers.

Steps to take

Cost

  • $150 setup
  • 25 cents per proxy per hour

Note: When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.

Limitations

While the automatic anonymization service provides an excellent way to cloak your IP address it is still possible that the target web site will block enough of the anonymized IP addresses that the anonymization could fail. Unfortunately we can't make any guarantees that you won't get blocked; however, by using the automatic anonymization service the chances of getting blocked are reduced dramatically.

Miscellaneous

  • Anonymization REST Interface
  • Workbench Interface: Scraping Session: Anonymization tab
  • Automatic Anonymization: Setup

    Controlling your Account

    The anonymous proxy servers will be set up in such a way that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. This configuration is tied to your password. For more on restricting connections see documentation on managing the screen-scraper server.

    If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on and you are using the workbench, you can click the Get the IP address for this computer button to determine your current IP address.

    screen-scraper Setup

    Using Workbench

    Anonymization settings can be configured using screen-scraper's workbench. Settings are determined in the anonymous proxy settings of the settings dialog box.

    When you sign up for the anonymization service you'll be given the password that allows your instance of screen-scraper to manage anonymous proxies for you. You'll enter it into the Password textbox in the settings.

    As the proxy servers get spawned and terminated, it's a good idea to establish the maximum number of running proxy servers you'd like to allow. This is done via the Max running servers setting. Because you pay for proxy servers by the hour, if you don't have your scraping session set up to shut them down automatically at the end, you'll want to use the Terminate all running proxy servers button to do so.

    We find that between five and ten proxy servers are adequate for most situations.

    Using screen-scraper.properties File

    If you're setting these values in a GUI-less environment (i.e., a server with no graphical interface), you'll want to set them in the resource/conf/screen-scraper.properties file (if these properties are not already in the file you'll want to add them).

    • AnonymousProxyPassword: The password that you were sent.
    • AnonymousProxyAllowedIPs: The IP addresses permitted to access anonymous sessions.
    • AnonymousProxyMaxRunning: Maximum number of proxy servers used to do the scrape.
    • AnonymizationURLPrepend: Which server to use for anonymization. By default http://anon.screen-scraper.com will be used.

      Acceptable values are http://anon.screen-scraper.com and http://anon2.screen-scraper.com.

    Be sure to modify the resource/conf/screen-scraper.properties file only when screen-scraper is not running.
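    For example, the anonymization-related block of the properties file might look like this (all of the values shown are placeholders):

     AnonymousProxyPassword=yourPasswordHere
     AnonymousProxyAllowedIPs=192.0.2.10
     AnonymousProxyMaxRunning=5
     AnonymizationURLPrepend=http\://anon.screen-scraper.com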

    Scraping Session Setup

    Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the anonymization tab of your scraping session.

    Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.

    As your anonymous scraping session runs, you'll notice that screen-scraper will automatically regulate the pool of proxy servers. For example, if screen-scraper gets a timed out connection or a 403 response (authorization denied), it will terminate the current proxy server, and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script. When this method is called the current proxy server will be shut down and replaced by another.
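    As a simple sketch, a script run "After file is scraped" could report a blocked proxy based on the HTTP status code (this assumes the scrapeableFile.getStatusCode() method is available in your edition):

     // If the response indicates the request was denied, report the
     // current proxy server as bad so that screen-scraper terminates
     // it and spawns a replacement.
     if ( scrapeableFile.getStatusCode() == 403 )
     {
         session.log( "Proxy appears to be blocked; requesting a replacement." );
         session.currentProxyServerIsBad();
     }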

Anonymization via Manual Proxy Pools

Overview

If the automatic anonymization method isn't right for you, the next best alternative might be to work manually with screen-scraper's built-in ProxyServerPool object. The basic approach involves running a script at the beginning of your scraping session that sets up the pool, then calling session.currentProxyServerIsBad() as you find that proxy servers are getting blocked. In order to use a proxy pool you'll also need to get a list of anonymous proxy servers. Generally you can find these by Googling around a bit.

See available methods:
ProxyServerPool
Anonymization API

Example

import com.screenscraper.util.*;
 
// Create a new ProxyServerPool object. This object will
// control how screen-scraper interacts with proxy servers.
proxyServerPool = new ProxyServerPool();
 
// We give the current scraping session a reference to
// the proxy pool. This step should ideally be done right
// after the object is created (as in the previous step).
session.setProxyServerPool( proxyServerPool );
 
// This tells the pool to populate itself from a file
// containing a list of proxy servers. The format is very
// simple--you should have a proxy server on each line of
// the file, with the host separated from the port by a colon.
// For example:
// one.proxy.com:8888
// two.proxy.com:3128
// 203.0.113.10:8080
// (without the leading comment slashes, of course)
proxyServerPool.populateFromFile( "proxies.txt" );
 
// screen-scraper can iterate through all of the proxies to
// ensure they're responsive. This can be a time-consuming
// process unless it's done in a multi-threaded fashion.
// This method call tells screen-scraper to validate up to
// 25 proxies at a time.
proxyServerPool.setNumProxiesToValidateConcurrently( 25 );
 
// This method call tells screen-scraper to filter the list of
// proxy servers using 7 seconds as a timeout value. That is,
// if a server doesn't respond within 7 seconds, it's deemed
// to be invalid.
proxyServerPool.filter( 7 );
 
// Once filtering is done, it's often helpful to write the good
// set of proxies out to a file. That way you may not have to
// filter again the next time.
proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );
 
// You might also want to write out the list of proxy servers
// to screen-scraper's log.
proxyServerPool.outputProxyServersToLog();
 
// This is the switch that tells the scraping session to make
// use of the proxy servers. Note that this can be turned on
// and off during the course of the scrape. You may want to
// anonymize some pages, but not others.
session.setUseProxyFromPool( true );

// Check number of available proxies
if (proxyServerPool.getNumProxyServers() < 4)
{
   // As a scraping session runs, screen-scraper will filter out
   // proxies that become non-responsive. If the number of proxies
   // gets down to a specified level, screen-scraper can repopulate
   // itself. That's what this method call controls.
   proxyServerPool.setRepopulateThreshold( 5 );
}

That's about all there is to it. Aside from occasionally calling session.currentProxyServerIsBad(), you may also want to call session.setUseProxyFromPool to turn anonymization on and off within the scraping session.

Mapping Extracted Data

This feature is only available in the Enterprise edition of screen-scraper.

Overview

The mapping tab allows you to alter extracted values. Often once you extract data from a web page you need to put it into a consistent format. For example, you may want products with very similar names to have identical names.

screen-scraper makes use of mapping sets when determining how to map a given extracted value. A mapping set may contain any number of mappings, which screen-scraper will analyze in sequence until it finds a match, or runs out of mappings. As such, you'll often want to put more specific mappings higher in sequence than more general mappings.

Example

Consider the screenshot of the mapping tab: if the extracted value were Widget 123, screen-scraper would first try to match using the Widget 1 mapping. Because this is an equals match, the mapping wouldn't occur, so screen-scraper would proceed to the second mapping. The second mapping would match because a contains type was designated; that is, the text Widget 123 contains the text Widget. As such, the extracted data Widget 123 would become Product ABC, because that is the To value designated for the second mapping.

Using Regular Expressions

When using regular expressions in your mapping you can also make use of back references. Back references allow you to preserve values in the original text when mapped to the To value. For example, if you were mapping the value Widget 123 you could use the regular expression Widget (\d*). In the To column you could then enter the value Product \1, which, when mapped, would convert Widget 123 to Product 123. The value in parentheses in the From column gets inserted via the \1 marker found in the To column.
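The same substitution can be expressed in script form using plain Java string replacement. Note that Java's replaceAll uses $1 for back references, whereas the mapping tab's To column uses \1:

// Capture the digits with (\d*) and re-insert them into the
// replacement--the script equivalent of the mapping described above.
mapped = "Widget 123".replaceAll( "Widget (\\d*)", "Product $1" );
session.log( mapped ); // logs "Product 123"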

Running Scraping Sessions within Scraping Sessions

Overview

This feature is only available to Professional and Enterprise editions of screen-scraper.

It is possible to run a scraping session from within a scraping session that is already running. This is done with the RunnableScrapingSession class. Detailed documentation on the methods available in the RunnableScrapingSession class is found in our API documentation. Here's a specific example of how the RunnableScrapingSession might be used in a screen-scraper script:

import com.screenscraper.scraper.*;

// Generate a new RunnableScrapingSession object that will inherit
// from the current scraping session.  This object will be used
// to run the scraping session "My Scraping Session"
myRunnableScrapingSession = new RunnableScrapingSession( "My Scraping Session", session );

// Because we passed the "session" object to the RunnableScrapingSession
// it will have access to all of the session variables within the
// currently running session.  As such, there's no need to explicitly
// set any new session variables.  We simply tell it to scrape.
myRunnableScrapingSession.scrape();

// Once it's done scraping, because it inherited from our currently
// running scraping session, we have access to any session variables
// that were set when the RunnableScrapingSession ran in the context
// of our currently running scraping session.  For example, let's
// suppose that when the RunnableScrapingSession ran it set a new
// variable called "MY_VAR".  Because of the inheritance, we could
// do something like this to see the new value:
session.log( "MY_VAR: " + session.getVariable( "MY_VAR" ) );

Script Overwriting

Overview

Scripts attached to a scraping session are exported along with it. When you subsequently import that scraping session into another instance of screen-scraper it might overwrite existing scripts in that instance. In some cases, though, you might have a series of general scripts shared by many scraping sessions. In these cases you often want to ensure that the very latest versions of these general scripts get retained in a given instance.

Using the Workbench

In the main pane of the script there is an Overwrite this script on import checkbox. When it is checked, a name clash between an existing script and an imported version will prompt you to choose whether the script should be overwritten. If the local version's Overwrite this script on import checkbox is unchecked, the script will not be overwritten even if you click to have it overwritten.

In a GUI-less environment

Checking the Overwrite this script on import checkbox requires access to the screen-scraper workbench, which you may not have if screen-scraper is running in a GUI-less environment. In these cases you can make use of the ForceOverwriteScripts property in the resource/conf/screen-scraper.properties file to allow scripts that have this box un-checked to be overwritten. To overwrite such scripts in a GUI-less environment, follow these steps:

  1. Stop screen-scraper (or ensure that it isn't currently running).
  2. Open the resource/conf/screen-scraper.properties file.
  3. Add this line to the properties file: ForceOverwriteScripts=true (or edit the line, if it already exists).
  4. Save and close the properties file.
  5. Start up screen-scraper and import the scraping sessions or scripts.

Once you're finished importing you may want to stop screen-scraper, set the property back to false, then start up again. Note that when you import scripts with the ForceOverwriteScripts property set to true screen-scraper will import the scripts regardless of whether or not the Overwrite this script on import checkbox is checked.

Using Extractor Patterns

Overview

Extractor patterns allow you to pinpoint snippets of data that you want extracted from a web page. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match the pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~ (e.g. ~@NAME@~). The identifier between the delimiters can contain only alpha-numeric characters and underscores.

Extractor patterns are added to scrapeable files under the extractor patterns tab.

You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page. The tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals only the portions of the web page you'd like to extract.

Extractor tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:

<p>This is the <b>piece of text</b> I'm interested in.</p>

you would extract piece of text by creating an extractor pattern with a token positioned like so:

<p>This is the <b>~@EXTRACTED_TEXT@~</b> I'm interested in.</p>

The extracted text could then be accessed via the identifier EXTRACTED_TEXT.
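In a script run "After each pattern match", for example, the token's value can be read from the dataRecord object:

// Read the value matched by the ~@EXTRACTED_TEXT@~ token.
extractedText = dataRecord.get( "EXTRACTED_TEXT" );
session.log( "Extracted: " + extractedText );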

If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.

Tips/Suggestions

  • Test your patterns frequently. Extractor patterns take some practice. Especially when you're first trying them out you'll want to test them as you're working with them. It often helps to test it after every couple of tokens you insert.
  • Use regular expressions to make your extractor patterns more precise. One of the most common problems encountered occurs when an extractor pattern matches too much data, which usually includes a lot of HTML. There are a couple of ways to address this problem. One is to extend the pattern outward. That is, include HTML that falls before and after the block you're trying to match. The second approach, which is generally the easier of the two, is to include regular expressions. We've included a number of common regular expressions that you can select from the drop-down list. In general, if you can use more precise regular expressions you can reduce the amount of HTML in the extractor pattern. Doing so makes your patterns more resilient to changes that might be made to the web site you're scraping.

    If an extractor pattern takes too long to match a block of text it will timeout. The timeout setting may be adjusted from the general tab of the Settings located in the Options menu. If you find that your extractor pattern is timing out you might try adjusting it by using more precise regular expressions.

  • Ensure that the pattern extracts the number of data records you expect it to. Oftentimes your pattern might not be as flexible as you think it is. Test it out to make sure it matches as many times as you think it should.
  • Try tidying the HTML. This will ensure that white space is handled consistently and will often clean up extraneous characters. The setting that determines whether or not HTML gets tidied is adjusted under the advanced tab of the scrapeable file.

Using Scripts

Overview

screen-scraper's scraping engine allows you to associate custom scripts with various events in the scraping process. It is recommended that you read about managing and scripting in screen-scraper before continuing.

Using the scripts

Depending on which event triggers a script to run, different objects will be in scope. Scraping session triggers are added on the general tab of the scraping session; file request/response triggers are added on the properties tab of the scrapeable file; and extractor pattern triggers are added in the scripts section of the main tab of each extractor pattern.

Scripts can also run other scripts using the session.executeScript method.
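
For example (the script name here is hypothetical):

// Invoke another script by the name it has in the objects tree.
session.executeScript("Write record to database");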

Built-in objects

screen-scraper offers a few objects that you can work with in a script in the scraping engine. See the variable scope section and/or API documentation for more details.

  • session: The running scraping session.
  • scrapeableFile: The current file interaction, including the request and response; it also provides access to the file's extractor patterns.
  • dataSet: All of the matches from an extractor pattern's tokens.
  • dataRecord: A single match of an extractor pattern's tokens.

Variable scope

Depending on when a script gets run different variables may be in or out of scope. When associating a script with an object, such as a scraping session or scrapeable file, you're asked to specify when the script is to be run. The table that follows specifies what variables will be in scope depending on when a given script is run. Only variables that are in scope are accessible to the script.

When script is run               session   scrapeableFile   dataSet   dataRecord
Before scraping session begins      X
After scraping session ends         X
Before file is scraped              X           X
After file is scraped               X           X
Before pattern is applied           X           X
After pattern is applied            X           X              X
Once if pattern matches             X           X              X          X
Once if no matches                  X           X
After each pattern match            X           X              X          X
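
As an example of these scopes, a script set to run After pattern is applied can walk the entire data set. This is a minimal Interpreted Java sketch; the NAME token is hypothetical:

// Runs "After pattern is applied"; session, scrapeableFile, and dataSet are in scope.
int numRecords = dataSet.getNumDataRecords();
for (int i = 0; i < numRecords; i++)
{
    DataRecord record = dataSet.getDataRecord(i);
    session.log("Record " + i + " NAME: " + record.get("NAME"));
}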

Debugging scripts

One of the best ways to fix errors is to simply watch the scraping session log and the error.log file (located in the log directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.

When screen-scraper is running as a server it will automatically generate individual log files in the log directory for each running scraping session (this can be disabled in the settings window). An error.log file will also be generated in that same directory when internal screen-scraper errors occur.

The breakpoint window can also be invaluable in debugging scripts. You can invoke it by inserting the line session.breakpoint() into your script.
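
For example, a debugging script might log some state and then pause the scrape (the PAGE_NUM variable is hypothetical):

// Log a value, then open the breakpoint window to inspect the session.
session.log("Current page: " + session.getVariable("PAGE_NUM"));
session.breakpoint();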

Using Session Variables

Overview

Session variables allow you to store values that will persist across the life of a scraping session.

Setting session variables

There are a few different ways to set session variables.

  1. Within a script, using the setVariable method of the session object.
  2. Designate that the value matched by an extractor token should be saved in a session variable (check Save in session variable in the main tab of the extractor token).
  3. Using the RemoteScrapingSession object from external sources (such as a PHP or ASP script) via its setVariable method (see scripting in screen-scraper for more details).
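
The first and third approaches might look like this; the variable, value, and scraping session name are hypothetical, and RemoteScrapingSession is the driver class that ships with screen-scraper:

// 1. Inside a screen-scraper script:
session.setVariable("QUERY_PARAM", "hammers");

// 3. From external Java code (PHP and ASP are analogous):
RemoteScrapingSession remoteSession = new RemoteScrapingSession("My scraping session");
remoteSession.setVariable("QUERY_PARAM", "hammers");
remoteSession.scrape();
remoteSession.disconnect();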

Retrieving values from session variables

As with setting session variables, there is more than one way to retrieve values of session variables.

  1. Within a script using the getVariable method of the session object.
  2. Embed the identifier for the session variable, surrounded by ~# and #~ delimiters.

    If you have a session variable identified by QUERY_PARAM you might embed it into the URL field of a scrapeable file using http://www.mydomain.com/myscript.php?query=~#QUERY_PARAM#~. screen-scraper will automatically replace the ~#QUERY_PARAM#~ with the value of the session variable.
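
Within a script the same variable could be read back like so (getVariable returns an Object, so cast as needed):

String query = (String) session.getVariable("QUERY_PARAM");
session.log("Query parameter: " + query);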

Using Sub-Extractor Patterns

Overview

Sub-extractor patterns allow you to extract data in the context of an extractor pattern, providing significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns you would use a single pattern to extract the data from all columns for a single row. In many cases this works just fine; however, the process gets more complicated when each row differs significantly. For example, certain cell rows may be in different colors or their contents may be completely missing. With a normal extractor pattern it would be difficult to account for the variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells.

When using sub-extractor patterns only the first match will be used. That is, even if a sub-extractor pattern could match multiple times, only the data corresponding to the first match will be extracted. Because of this, sub-extractor patterns are not always the correct method for getting data within a larger context. To get multiple matches in a larger context, like all rows in a table, you would instead use manual extractor patterns (as sketched below).
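
For example, a script run After each pattern match of a row-level pattern could apply a second, manually-invoked pattern to the matched row. A minimal Interpreted Java sketch; the ROW_CELLS pattern identifier is hypothetical:

// DATARECORD holds the full text of the current row match.
String rowText = (String) dataRecord.get("DATARECORD");

// Apply the manually-invoked pattern "ROW_CELLS" to just that row.
DataSet cells = scrapeableFile.extractData(rowText, "ROW_CELLS");
session.log("Cells matched in this row: " + cells.getNumDataRecords());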

Example

Consider the following HTML table:

Name           Phone                              Address
Juan Ferrero   111-222-3333                       123 Elm St.
Joe Bloggs     No contact information available
Sherry Lloyd   234-5678 (needs area code)         456 Maple Rd.

Here is the corresponding HTML source:

 <table cellpadding="2" border="1">
     <tr>
        <th>Name</th>
        <th>Phone</th>
        <th>Address</th>
    </tr>
    <tr>
        <td class="Name">Juan Ferrero</td>
        <td class="Phone">111-222-3333</td>
        <td class="Address">123 Elm St.</td>
    </tr>
    <tr class="even">
        <td class="Name">Joe Bloggs</td>
        <td colspan="2">No contact information available</td>
    </tr>
     <tr>
        <td class="Name">Sherry Lloyd</td>
        <td class="Phone warning">234-5678 (needs area code)</td>
        <td class="Address">456 Maple Rd.</td>
    </tr>
</table>

It would be difficult to write a single extractor pattern that would extract the information for each row because the contents of the cells differ so significantly. The different colored cells and the cell spanning two columns make the data too inconsistent to be easily extracted using a single pattern (which would require lots of regular expressions and might still prove impossible or inconsistent).

Consider this extractor pattern:

<tr~@DATARECORD@~</tr>

The ~@DATARECORD@~ extractor pattern token is special in that it defines the block of data to which you wish to apply sub-extractor patterns. Sub-extractor patterns cannot be applied to a token with a name other than DATARECORD.

If applied to the HTML above the extractor pattern would produce the following three matches:

1.  ><td class="Name">Juan Ferrero</td><td class="Phone">111-222-3333</td><td class="Address">123 Elm St.</td>
2.  class="even"><td class="Name">Joe Bloggs</td><td colspan="2">No contact information available</td>
3.  ><td class="Name">Sherry Lloyd</td><td class="Phone warning">234-5678 (needs area code)</td><td class="Address">456 Maple Rd.</td>

Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:

<td class="Name">[email protected]@~</td>

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:

1.  Juan Ferrero
2.  Joe Bloggs
3.  Sherry Lloyd

This is a simple case. Now consider the extractor pattern for the phone number:

<td class="Phone">[email protected]@~</td>

If applied to each of the individual extracted rows above the following three pieces of information would be extracted:

1.  111-222-3333
2.
3.

In the case of Sherry Lloyd this presents a serious problem because she does have a phone number listed. It is not matched because of the additional class on the cell. Let's adjust the sub-extractor pattern slightly:

<td class="Phone~@JUNK@~">~@PHONE@~</td>

The ~@JUNK@~ token uses the Non-double quotes regular expression, [^"]*, which matches everything from where the token sits until it encounters a double quote. With this adjustment Sherry's phone number also gets extracted.

We now have the case of the cell in the second row that spans two columns, which would not get extracted by our current sub-extractor patterns. We may still want this information, however, so we create the following sub-extractor pattern, just in case the cell exists:

<td colspan="2">[email protected]@~<

If applied to our data we'd get the following results:

 1.
 2. No contact information available
 3.

When multiple sub-extractor patterns hold a token with the same name (in this case, PHONE), the last one to match determines the value of the token. In this example only one or the other will match. If both could match, we would want the pattern that extracts the actual phone number sequenced after the one that matches the no-data-available cell.

Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:

Data record   Name           Phone
#1            Juan Ferrero   111-222-3333
#2            Joe Bloggs     No contact information available
#3            Sherry Lloyd   234-5678 (needs area code)
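
A script associated with the extractor pattern and set to run After each pattern match is a typical place to write each aggregated record out. Here is a minimal Interpreted Java sketch, assuming a hypothetical contacts.csv output file:

import java.io.FileWriter;

// Append the current record's NAME and PHONE tokens to a CSV file.
FileWriter writer = new FileWriter("contacts.csv", true);
writer.write(dataRecord.get("NAME") + "," + dataRecord.get("PHONE") + "\n");
writer.close();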

Important Notes

  • When two sub-extractor patterns hold a token with the same name, the one that doesn't match anything will have no effect. Sub-extractor patterns are applied in sequence, and those that match something will take precedence over those that don't.
  • ~@DATARECORD@~ is the extractor token identifier that defines the block of data to which you wish to apply sub-extractor patterns. You cannot use sub-extractor patterns without using this token name in the main extractor pattern.
  • When using sub-extractor patterns only the first match will be used. That is, even if it could match multiple times, only the data corresponding to the first match will be extracted.

Workbench

To launch the Workbench, double-click the screen-scraper icon.

Overview

The workbench provides an intuitive and convenient way to interact with screen-scraper's scraping engine. This section of our documentation covers the interfaces provided in the workbench to develop and manage scrapes. If you're interested in learning to use screen-scraper, the best approach is to go through at least our first few tutorials.

Introduction

To keep things clear, the first thing you need to know about the workbench is the name of each region of the window. This will keep us oriented throughout the documentation of the workbench.

Workbench Layout

  • Menu Bar: Options available at the very top of the window (just below the title bar), starting with File on the far left.
  • Buttons Bar: Just below the menu bar, this area contains buttons for common tasks such as saving, creating new objects, cut, copy, and paste.
  • Objects Tree: Located on the left side of the window, this pane lists all of the screen-scraper objects currently available to this installation of screen-scraper.

    If this is your first time opening screen-scraper the only item listed will be the root folder.

  • Main Pane: This is the largest pane in the workbench and usually takes up at least two-thirds of the window. It changes to reflect the content of the object that is selected in the objects tree. Because of this, it is most commonly referred to by the name of the screen being displayed in it.
  • Status Bar: Strip at the bottom of the window that reports on the current memory usage of screen-scraper. Occasional messages will also show up, such as when screen-scraper has saved your work.

The size of the two panes can be adjusted to your liking by clicking on the vertical bar that divides the two panes and dragging it to the left or right.

Settings

Overview

This section contains a description of each of the screens found in the Settings window, which can be displayed by selecting Settings from the Options menu, or by clicking the wrench icon in the button bar.

General Settings

General Settings

  • Connection timeout: At times remote web servers will experience problems after screen-scraper has made a connection. When this happens the server will often hold on to the connection to screen-scraper, causing it to appear to freeze. Designating a connection timeout avoids this situation. Generally around 30 seconds is sufficient.
  • Data extractor timeout: In certain cases complex extractor patterns can take an abnormally long time when being applied. You'll likely want to designate a timeout so that screen-scraper doesn't get stuck while applying a pattern. Typically it should not take longer than 2 or 3 seconds to apply a pattern.
  • Maximum number of concurrent running scraping sessions (professional and enterprise editions only): When screen-scraper is running as a server you'll often want to limit the number of scraping sessions that can be run simultaneously, so as to avoid consuming too many resources on a machine. This setting controls how many will be allowed to run at a time. Note that this only applies when a lazy scrape is being performed.
  • Maximum application memory allocation in megabytes: This setting controls the amount of memory screen-scraper will be allowed to consume on your computer. In cases where you notice sluggish behavior or OutOfMemoryError messages appearing in the error.log file (found in the log directory for your screen-scraper installation folder), you'll likely want to increase this number.
  • Default proxy session to use when running in server mode (enterprise edition only): When screen-scraper is running as a server it can also run the proxy server. If you designate a proxy session in this drop-down box screen-scraper will make use of its scripts.
  • Installation directory: In virtually all cases this setting can be left untouched. If you move the screen-scraper installation directory you may need to manually set this.
  • Automatically check for updates on startup (professional and enterprise editions only): If this box is checked screen-scraper will automatically check for updates and notify you if one is available.
  • Allow upgrading to unstable versions (professional and enterprise editions only): If this box is checked when you select Check for updates from the Options menu screen-scraper will give you the option to download alpha/unstable versions of the software.
  • Default character set (professional and enterprise editions only): Indicates the character set that should be used when one is not designated by the remote server. When scraping sites that use a Roman character set you'll likely want to use ISO-8859-1; otherwise, UTF-8 is probably what you'll want to use. A comprehensive list of supported character sets can be found here. Your web browser will also generally be able to tell you what character set a particular site is using. Even so, when scraping international character sets it can sometimes require trial and error to isolate which character set is best to use.

Server Settings

Server Settings (professional and enterprise editions only)

Server (professional and enterprise editions only)

These settings apply when screen-scraper is running in server mode.

  • Port: Sets the port screen-scraper will listen on when running as a server.
  • Generate log files: If checked, a log file will be generated in the log folder each time a scraping session is run.
  • Hosts to allow to connect: Caution should be exercised whenever a network service is running on a computer, and screen-scraper is no exception. If this box is blank screen-scraper will allow any machine to connect to it; this is not recommended unless the machine on which screen-scraper is running is protected by external firewalls. Enter a comma-delimited list of host names and IP addresses that should be allowed to connect to screen-scraper. For example, if localhost is designated screen-scraper will only allow connections from the local machine. Portions of IP addresses can also be designated: if 192.168 were designated, IP addresses such as 192.168.2.4 and 192.168.4.93 would be allowed to connect. Note that this setting applies both to the proxy server and to screen-scraper running in server mode.

Proxy Server (professional and enterprise editions only)

These settings apply only to the proxy server portion of screen-scraper.

  • Port: Sets the port screen-scraper's proxy server should listen on.
  • Don't log binary files: If this box is checked screen-scraper will not log any binary files (e.g., images and Flash files) in the HTTP Transactions table for proxy sessions.

Mail Server (professional and enterprise editions only)

These settings are used with the sutil.sendMail method in screen-scraper scripts.

  • Host: The host the mail should be sent through.
  • Username: The username required to authenticate to the mail server in order to send mail through it. Note that this may not be required by the mail server.
  • Password: The password required to authenticate to the mail server in order to send mail through it. Note that this may not be required by the mail server.
  • Port: The port that should be used when connecting to the host (corresponding setting in the resource/conf/screen-scraper.properties file: MailServerPort=PortNumber).
  • Use TLS/SSL: Whether or not TLS/SSL encryption should be used when communicating with the host (corresponding setting in the resource/conf/screen-scraper.properties file: MailServerUsesTLS=true).

Web/SOAP Server (professional and enterprise editions only)

These settings apply only to the web interface and SOAP server features of screen-scraper.

  • Port: Sets the port screen-scraper's web/SOAP server should listen on. When accessing the web interface, this number will determine what goes after the colon in the URL. For example, if this number is left at the default value (8779), you would access screen-scraper's web interface with this URL: http://localhost:8779/.

External Proxy Settings

External Proxy Settings

Unless you normally connect to the Internet through an external proxy server, you don't need to modify these settings.

  • External proxy authentication: These text boxes are used in cases where you need to connect to the Internet via an external proxy server.
    • Username: Your username on the proxy server.
    • Password: Your password on the proxy server.
    • Host: The host/domain of the proxy server.
    • Port: The port that you use on the host server.
  • External NT proxy authentication: These text boxes are used in cases where you need to connect to the Internet via an external NT proxy server.

    If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard proxy as well as the NTLM one.

    • Username: Your username on the NT proxy server.
    • Password: Your password on the NT proxy server.
    • Domain: The domain/group name that the NT proxy server uses.
    • Host: The host of the proxy server.

Anonymous Proxy Settings

Anonymous Proxy Settings (professional and enterprise editions only)

  • Password: The password you received from screen-scraper when you set up your anonymization account.

    This setting is available in the screen-scraper.properties file as AnonymousProxyPassword

  • Allowed IP addresses: The IP addresses of the machine(s) you wish to allow to connect to your screen-scraper server.

    This field expects a comma-delimited list of IP addresses from which screen-scraper should accept connections. You can also specify just the beginning portion of an IP address. For example, if you enter 111.22.333 screen-scraper would accept connections from 111.22.333.1, 111.22.333.2, 111.22.333.3, etc.

    If nothing is entered into this text box screen-scraper will accept connections from any IP address. This is not generally encouraged.

    This setting is available in the screen-scraper.properties file as AnonymousProxyAllowedIPs

  • Get the IP address for this computer: Retrieves the IP address of the computer that screen-scraper is running on. This is provided to help you specify the correct IP address for the Allowed IP addresses field.
  • Max running servers: The maximum number of anonymous proxy servers that may run at once; proxy servers whose IP addresses get blocked are terminated and replaced, up to this maximum. A value greater than 5 and less than 10 is recommended.

    This setting is available in the screen-scraper.properties file as AnonymousProxyMaxRunning

  • Number of running instances: The total number of proxy servers running anonymous scrapes.
  • Refresh: Retrieves the current number of running proxy servers.
  • Terminate all running proxy servers: Shuts down all running proxy servers.

    Because you pay for proxy servers by the hour, if your scraping session isn't set up to shut them down automatically when it finishes you will need to use this button to terminate them.

Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site is blocking too many requests from a given IP address.

There are a few different ways to go about this using screen-scraper. We discuss how to set up anonymization later in the documentation.

Proxy Sessions

Overview

A proxy session in screen-scraper is a record of the requests and responses that go between a browser and a proxy server. It is useful in learning how to scrape a site and is used to configure screen-scraper's proxy server. For more information see our documentation about using the proxy server.

Managing Proxy Sessions

Adding

  • Select New Proxy Session from the File menu.
  • Click on the globe in the button bar.
  • Right click on a folder in the objects tree and select New Proxy Session.
  • Use the keyboard shortcut Ctrl-J.

Removing

  • Press the Delete key when the proxy session is selected in the objects tree.
  • Right-click on the proxy session in the objects tree and select Delete.
  • Click the Delete button in the proxy session general tab.

Proxy Session: General Tab

General Tab

  • Start Proxy Server: Starts the proxy server and records the requests and responses that go through it.
  • Delete: Removes the proxy session from screen-scraper.
  • Name: The name used to refer to the proxy session.
  • Port: The port the proxy server will listen on for this proxy session (i.e., the port your browser should connect to).

Proxy Session: Progress Tab

Progress Tab

  • Clear All Transactions: Remove all of the transaction records currently in the list.
  • Find (professional and enterprise editions only): Search transactions for a text string.
  • Detect JS Cookies: Show cookies that were not set by the server.

    For the button to work correctly you will want to clear your browser cookies before having the proxy session record all transactions. This makes it so that cookies already in existence are not considered to be javascript cookies.

  • Filter out less useful transactions (professional and enterprise editions only): When checked, files that are unlikely to contain desired information do not show up in the transactions list. This includes such things as JavaScript and CSS files.
  • Don't record binary files: When checked, screen-scraper will not add files such as images or other media files to the list of transactions under the progress tab. This makes it easier to find the files you want without having to look through everything that goes through the server.

    Transactions not included in the list are still recorded to the proxy session log.

  • HTTP Transactions: A log of each of the transactions that has taken place (except for binary files if you have selected not to log them).
    • #: The order in which the requests were initiated.
    • Note: Editable field to help keep track of the transactions; when transactions are turned into scrapeable files the note becomes the initial name of the scrapeable file.
    • URL: The requested URL of the transaction.
    • Status: Indication of the current state of the transaction.

When a transaction is selected more information regarding the request and response is displayed.

Request Sub-tab

  • Display Raw Request: Displays the whole request as it was sent to the server.
  • Generate scrapeable file in: Creates scrapeable files in the specified scraping session for each of the selected transactions. The names of the scrapeable files are the text specified in the note section of each transaction.
  • Request Line: The first line of the request.
  • Headers: Any additional headers specified in the request.
  • POST Data: All POST data that was sent along with the request.

Response Sub-tab

  • Display Raw Response: Displays the whole response as it came from the server.
  • Display Response in Browser: Opens your system's default browser and displays the contents of the response as they would appear when passed through a browser.
  • Status Line: HTTP status of the transaction.
  • Headers: Headers sent along with the response from the server.
  • Content: The content of the response with headers and such removed.

Detect JS Cookie

Overview

screen-scraper has always kept track of server-set cookies and handles them for you automatically; however, when cookies are set by JavaScript screen-scraper does not catch them. This saves the time that would be lost scraping every JavaScript file when most of the time there is nothing there that matters.

This means that you have to set any JavaScript-added cookies using the setCookie method. To help find where JavaScript cookies are being set we have added a Detect JS Cookies button in the proxy session progress tab.
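
Setting such a cookie from a script is a single call to setCookie; the domain, cookie name, and value below are hypothetical:

// Recreate a cookie that the site's JavaScript would normally set.
session.setCookie("www.example.com", "js_check", "true");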

For the button to work correctly you will want to clear your browser cookies before having the proxy session record all transactions. This makes it so that cookies already in existence are not considered to be javascript cookies.

Proxy Session: Scripts Tab

This feature has been deprecated and by default is not available in the workbench interface. To enable proxy scripting please add AllowProxyScripting=true to your resource/conf/screen-scraper.properties file and restart screen-scraper.

You are unlikely to use this tab unless you are running screen-scraper as a proxy in server mode.

Scripts Tab (enterprise edition only)

  • Add Script: Adds a script association to filter requests and/or responses on the Proxy Server.
    • Script Name: Specifies which script should be run.
    • Sequence: The order in which the scripts should be run.
    • When to Run: When the proxy server should request to run the script.
    • Enabled: A flag to determine which scripts should be run and which shouldn't be.

Proxy Session: Log Tab

Log Tab

  • Clear Log: Erase the current contents of the proxy session log.

If you are trying to troubleshoot problems with scripts not working the way you expected, the log can give you clues as to where problems might exist. Likewise, you can have your scripts write to the log to help identify what they are doing. If you have selected to filter out binary files and/or less useful transactions, a log of those transactions will be available here.

The proxy session log is not saved in the workbench; if you close screen-scraper you will lose the current contents of the proxy session log.

Proxy Session: Advanced Tab

Advanced Tab

  • Key store file path: The path to a JKS file that contains the certificates required for this scrape.
  • Key store password: The password used when generating the JKS file.

    Some web sites require that you supply a client certificate, which you would have previously been given, in order to access them. This feature allows you to access this type of site using screen-scraper.

For more info see our blog entry on the topic.

Scraping Sessions

Overview

A scraping session is simply a way to collect together files that you want scraped. Typically you'll create a scraping session for each site from which you want to scrape information.

Managing Scraping Sessions

Adding

  • Select New Scraping Session from the File menu.
  • Click the gear in the button bar.
  • Right click on a folder in the objects tree and select New Scraping Session.
  • Use the keyboard shortcut Ctrl-K.

Removing

  • Press the Delete key when the scraping session is selected in the objects tree.
  • Right-click on the scraping session in the objects tree and select Delete.
  • Click the Delete button in the general tab of the scraping session.

Importing

  • Right-click on the folder in the objects tree that you want to import the files into (other than the root folder) and select Import Into. In the window that opens, navigate to and select the scraping session you want to import.
  • Select Import from the File menu. In the window that opens, navigate to and select the scraping session you want to import.
  • Add the scraping session to the import folder in screen-scraper's install directory.

    screen-scraper should not be running when you add the file into the folder. All files will be imported into the root folder the next time screen-scraper starts.

Exporting

When a scraping session is exported it will use the character set indicated under the advanced tab. If a value isn't indicated there it will use the character set indicated in the general settings.

  • Right-click on the scraping session in the objects tree and select Export.
  • Click the Export button in the general tab of the scraping session.

Scraping Session: General tab

General Tab

  • Run Scraping Session: Starts the scraping session. Once the scraping session begins running you can watch its progress under the Log tab.
  • Delete: Deletes the scraping session.
  • Add Scrapeable File: Adds a new scrapeable file to this scraping session.
  • Export: Allows you to export the scraping session to an XML file. This might be useful for backing up your work or transferring information to a different screen-scraper installation.
  • Name: Used to identify the scraping session. The name should be unique relative to other scraping sessions.
  • Notes: Useful for keeping notes specific to the scraping session.
  • Scripts: All of the scripts associated with the scraping session.
    • Add Script: Adds a script association to direct and manipulate the flow of the scraping session.
    • Script Name: Specifies which script should be run.
    • Sequence: The order in which the scripts should be run.
    • When to Run: When the scraping session should run the script.
    • Enabled: A flag to determine which scripts should be run and which shouldn't be.

Each script can be designated to run either before or after the scraping session runs. This can be useful for functions like initializing session variables and performing clean-up after the scraping session is finished. It's often helpful to create debugging scripts in your scraping session, then disable them once you're ready to run your scraping session in a production environment.
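
For example, an initialization script set to run Before scraping session begins might seed the session variables the scrape will rely on (the variable names here are hypothetical):

// Runs "Before scraping session begins".
session.setVariable("PAGE_NUM", Integer.valueOf(1));
session.setVariable("SEARCH_TERM", "hammers");
session.log("Initialized scraping session.");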

Scraping Session: Log tab

Log Tab

  • Clear Log: Erase the current contents of the log.
  • Find: Search the log for the specified text.
  • Run Scraping Session / Stop Scraping Session: Start/Stop the scraping session.
  • Breakpoint (professional and enterprise editions only): Pause the scrape and open a breakpoint window.
  • Logging Level (professional and enterprise editions only): Determines what types of messages appear in the log. This is often referred to as the verbosity of the log. This affects the file system logs as well as the workbench log.
  • Show only the following number of lines: The number of lines that the log should maintain as it runs. When it is left blank it will keep everything.
  • Auto-scroll: When checked, the log will make sure that you can always see the most recent entries into the log on the screen.

If you are trying to troubleshoot problems with scripts not working the way you expected, the log can give you clues as to where problems might exist. Likewise, you can have your scripts write to the log to help identify what they are doing.

This tab displays messages as the scraping session is running. This is one of the most valuable tools in working with and debugging scraping sessions. As you're creating your scraping session you'll want to run it frequently and check the log to ensure that it's doing what you expect it to.

Scraping Session: Advanced tab

Advanced tab

  • Max retries per file (professional and enterprise editions only): The number of times that screen-scraper should attempt to request a page, in the case that a request fails. In some cases web sites may not be completely reliable, which could necessitate making the request for a given page more than once.
  • Cookie policy (professional and enterprise editions only): The way screen-scraper works with cookies. In most cases you won't need to modify this setting.

    There may be instances where you find yourself unable to log in to a web site or advance through pages as you're expecting. If you've checked other settings, such as POST and GET parameters, you may need to adjust the cookie policy. Some web sites issue cookies in uncommon ways, and adjusting this setting will allow screen-scraper to work correctly with them.

  • Character set (professional and enterprise editions only): Set the character set for the scraping session.

    If pages are rendering with strange characters then you likely have the wrong character set. You should also try turning off tidying if international characters aren't being rendered properly.

  • Key store file path: The path to a JKS file that contains the certificates required for this scrape
  • Key store password: The password used when generating the JKS file

    Some web sites require that you supply a client certificate, which you would have previously been given, in order to access them. This feature allows you to access this type of site using screen-scraper.

  • External proxy authentication: These text boxes are used in cases where you need to connect to the Internet via an external proxy server.
    • Username: Your username on the proxy server.
    • Password: Your password on the proxy server.
    • Host: The host/domain of the proxy server.
    • Port: The port that you use on the host server.
  • External NT proxy authentication: These text boxes are used in cases where you need to connect to the Internet via an external NT proxy server.

    If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard proxy as well as the NTLM one.

    • Username: Your username on the NT proxy server.
    • Password: Your password on the NT proxy server.
    • Domain: The domain/group name that the NT proxy server uses.
    • Host: The host of the proxy server.

Scraping Session: Anonymization tab

Anonymization Tab (professional and enterprise editions only)

  • Anonymize this scraping session (professional and enterprise editions only): Specifies that this scraping session should make use of the anonymization settings of screen-scraper.
  • Terminate proxies when scraping session is completed (professional and enterprise editions only): Determines whether the scraping session should terminate proxies or leave them open.
  • This scrape requires at least the following number of proxies to run (professional and enterprise editions only): The required number of proxies for the scrape to run.

Should proxy servers fail to spawn, screen-scraper will proceed with the scraping session once at least 80% of the minimum required proxy servers are available.

This tab is specific to automatic anonymization. For more information, see our page on how to set up anonymization in screen-scraper.

Scrapeable Files

Overview

A scrapeable file is a URL-accessible file that you want to have retrieved as part of a scraping session. These files are the core of screen-scraping as they determine what files will be available to extract data from.

In addition to working with files on remote servers, screen-scraper can also handle files on local file systems. For example, the following is a valid path to designate in the URL field: C:\wwwroot\myweb\my_file.htm.

Managing Scrapeable Files

Adding

  • Click the Add Scrapeable File button on the general tab of the desired scraping session.
  • Right click on the desired scraping session in the objects tree and select Add Scrapeable File.

Removing

  • Press the Delete key when the scrapeable file is selected in the objects tree.
  • Right-click on the desired scrapeable file and select Delete.
  • Click the Delete button in the properties tab of the scrapeable file.

Scrapeable File: Properties tab

Properties Tab

  • Delete: Deletes the scrapeable file.
  • Copy (professional and enterprise editions only): Copies the scrapeable file.
  • Name: Identifies the scrapeable file.
  • URL: The URL of the file to be scraped. This is likely something like http://www.mysite.com/, but can also contain embedded session variables, like this: http://www.mysite.com/cgi-bin/test.cgi?param1=~#TEST#~. In the latter case the text ~#TEST#~ would get replaced with the value of the session variable TEST.
  • Sequence: Indicates the order in which the scraping session will request this file.
  • This scrapeable file will be invoked manually from a script: Indicates that this scrapeable file will be invoked within a script, so it should not be scraped in sequence. If this box is checked the Sequence text box becomes grayed out (see the sketch following this list).

    You can tell what files are being scraped manually and which are in sequence using the objects tree. Sequenced scrapeable files are displayed with a pound sign (#) on them.

  • Scripts: All of the scripts associated with the scrapeable file.
    • Add Script: Adds a script association to direct and manipulate the flow of the scrapeable file.
    • Script Name: Specifies which script should be run.
    • Sequence: The order in which the scripts should be run.
    • When to Run: When the scrapeable file should run the script.
    • Enabled: A flag to determine which scripts should be run and which shouldn't be.
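
As referenced above, invoking a manually-run scrapeable file from a script is a single call; the scrapeable file name here is hypothetical:

// Request the "Details page" scrapeable file on demand (e.g., once its URL is known).
session.scrapeFile("Details page");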

Scrapeable File: Parameters tab

Parameters Tab

  • Add Parameter: Adds a parameter to the scrapeable file request.
    • Key: The name of the parameter.
    • Value: The value to be associated with the parameter.
    • Sequence: Order in which the parameters appear on the request strings.
    • Type: Indicates if the parameter should be sent using a GET or POST method when the file is requested.

      GET parameters can also be embedded in the URL field under the Properties tab.

Parameters can be deleted by selecting them and either hitting the Delete key on the keyboard, or by right-clicking and selecting Delete.

Using Session Variables

Session variables can be used in the Key and Value fields. For example, if you have a POST parameter, username, you might embed a USERNAME session variable in the Value field with the token ~#USERNAME#~. This would cause the value of the USERNAME session variable to be substituted in place of the token at run time.

Upload a File (enterprise edition only)

In the enterprise edition of screen-scraper you can also designate files to be uploaded. This is done by designating FILE as the parameter type. The Key column would contain the name of the parameter (as found in the corresponding HTML form), and the value would be the local path to the file you'd like to upload (e.g., C:\myfiles\this_file.txt).

Scrapeable File: Extractor Patterns tab

Extractor Patterns Tab

  • Add Extractor Pattern: Add a blank extractor pattern to the scrapeable file.
  • Paste Extractor Pattern (professional and enterprise editions only): Creates a new extractor pattern from a previously copied one.

    This button is grayed out if there is not an extractor pattern currently copied.

This tab holds the various extractor patterns that will be applied to the HTML of this scrapeable file. The inner frame is discussed in more detail in the extractor patterns section of this documentation.

Scrapeable File: Last Request tab

Last Request Tab

  • Refresh: Requests the newest version of the last requested file.
  • Compare with proxy transaction (professional and enterprise editions only): Open a Compare Last Request and Proxy Transaction window, allowing you to compare the last request of the scrapeable file with a proxy session HTTP Transaction request.

    This can be very helpful for pages that are particular about request settings or where you are getting unexpected results from the page. It is the best place to start when you experience this type of issue.

This tab will display the raw HTTP request for the last time this file was retrieved. This tab can be useful for debugging and looking at POST and GET parameters that were sent to the server.

Scrapeable File: Last Response tab

Last Response Tab

  • Display Response in Browser: Displays the web page in your default web browser.
  • Find: Search the source code for a string of text.
  • Refresh: Reload display with the most recent response.
  • Load Response from Clipboard: Loads an HTML response from the clipboard.

The contents shown under this tab might appear differently from the original HTML of the page. screen-scraper has the ability to tidy the HTML, which is done to facilitate data extraction. See using extractor patterns for more details.

Creating Extractor Patterns from Last Response

The most common use for this tab is in generating and testing extractor patterns. You can generate an extractor pattern by highlighting a block of text or HTML, right-clicking, and selecting Generate extractor pattern from selected text.

Scrapeable File: Advanced tab

Advanced Tab (professional and enterprise editions only)

  • Username and Password (professional and enterprise editions only): These two text fields are used with sites that make use of Basic, Digest, or NTLM authentication.

    You can generally recognize when a web site requires this type of authentication because, after requesting the page, a small box will pop up requesting a username and password.

  • Tidy HTML (professional and enterprise editions only): Which tidier screen-scraper should use to tidy the HTML after requesting the file. This cleans up the HTML, which facilitates extracting data from it.

    A minor performance hit is incurred, however, when tidying. In cases where performance is critical Don't Tidy HTML should be selected.

Extractor Patterns

Overview

Extractor patterns allow you to pinpoint snippets of data that you want extracted from a web page. They are made up of text (usually HTML), extractor tokens, and possibly even session variables. The text and session variables give context to the tokens that represent the data that you want to extract from the page.

Extractor patterns can be difficult to understand at first. We recommend that you read about using extractor patterns or go through our first tutorial before continuing.

Managing Extractor Patterns

When creating extractor patterns you should use the HTML that will be found under the last response tab associated with a scrapeable file. By default, screen-scraper will tidy the HTML once it's been scraped, meaning that it will format it in a consistent way that makes it easier to work with. If you use the HTML by viewing the source for a page in your web browser it will likely be different from the HTML that screen-scraper generates.

Adding

  • Click the Add Extractor Pattern button in the extractor patterns tab of the scrapeable file.
  • Select desired text in the last response tab of the scrapeable file, right click and select Generate extractor pattern from selected text.

Removing

  • Click the Delete Extractor Pattern button in the main tab of the desired extractor pattern.

Extractor Pattern: Main tab

Main Tab

  • Test Pattern: Opens a DataSet window with the results of the extractor pattern matches applied to the HTML that appears in the last response tab.
  • Highlight Extracted Data (professional and enterprise editions only): Opens the last response tab and places a colored background on all text that matches to the extractor tokens.
  • Delete Extractor Pattern: Deletes the current extractor pattern.
  • Copy Pattern (professional and enterprise editions only): Copies the extractor pattern so that it can be pasted into a different scrapeable file.
  • Identifier: A name used to identify the pattern. You'll use this when invoking the extractData and extractOneValue methods.
  • Sequence: Determines the order in which the extractor pattern will be applied to the HTML.
  • Pattern text: Used to hold the text for the extractor pattern. This will also include the extractor pattern tokens that are analogous to the holes in the stencil.
  • Scripts: This table allows you to indicate scripts that should be run in relationship to the extractor pattern's match results. Much like other programming languages, screen-scraper can invoke code based on specified events. In this case, you can invoke scripts before the pattern is applied, after each match it finds, after all matches have been made, once if a pattern matches, or once if a pattern doesn't match. For example, if your pattern finds 10 matches, and you designate a script to be run After each pattern match, that script will get invoked 10 separate times.
    • Add Script: Adds a script association to the extractor pattern.
    • Script Name: Specifies which script should be run.
    • Sequence: The order in which the script should be run.
    • When to Run: When the extractor pattern should run the script.
    • Enabled: A flag to determine which scripts should be run and which shouldn't be.

Extractor Pattern: Sub-Extractor Patterns tab

Sub-Extractor Patterns Tab

  • Add Sub-Extractor Pattern: Adds a sub-extractor pattern.
  • Paste Sub-Extractor Pattern (professional and enterprise editions only): Paste a previously copied sub-extractor pattern.

The buttons specific to the sub-extractor pattern are discussed in more detail later in this documentation.

Extractor Pattern: Advanced tab

Advanced tab (professional and enterprise editions only)

  • Automatically save the data set generated by this extractor pattern in a session variable (professional and enterprise editions only): If this box is checked screen-scraper will place the dataSet object generated when this extractor pattern is applied into a session variable using the identifier as the key (i.e. session variable name). For example, if your extractor pattern were named PRODUCTS, and you checked this box, screen-scraper would apply the pattern and place the resulting dataSet into a session variable named PRODUCTS.

    It is recommended that you generally avoid checking this box unless it's absolutely needed, because of the memory issues it may cause. If this box is checked, screen-scraper will continue to append data to the dataSet, and all of that data will be kept in memory. The preferred method is to save data as it's being extracted, generally by invoking a script with a script association After each pattern match that pulls the data from dataRecord objects or session variables.

    • If a data set by the same name has already been saved in a session variable do the following: The action that should be taken when conflicts occur. If this page is part of an iterator you might want to append so that you don't lose previous data, though this can make the variable very large.
  • Filter duplicate records (enterprise edition only): When this box and the Cache the data set box are checked screen-scraper will filter duplicates from extracted records. See the Filtering duplicate records section for more details.
  • Cache the data set (enterprise edition only): In some cases you'll want to store extracted data in a session variable, but the dataSet will potentially grow to be very large. The Cache the data set checkbox will cause the extracted data to be written out to the file system as it's being extracted so that it doesn't consume RAM. When you attempt to access the data set from a script or external code it will be read from the disk into RAM temporarily so that it can be used. You'll also need to check this box if you want to filter duplicates.
  • This extractor pattern will be invoked manually from a script (professional and enterprise editions only): If you check this box the extractor pattern will not be invoked automatically by screen-scraper. Instead, you'll invoke it in a script using the extractData and extractOneValue methods.
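
A manual invocation might look like the following sketch, run from a script after the file has been scraped; the PRODUCTS and PAGE_TITLE pattern identifiers are hypothetical:

// Apply manually-invoked patterns to the content of the last response.
DataSet products = scrapeableFile.extractData(scrapeableFile.getContentAsString(), "PRODUCTS");
String title = scrapeableFile.extractOneValue(scrapeableFile.getContentAsString(), "PAGE_TITLE");
session.log("Matched " + products.getNumDataRecords() + " products on " + title);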

Sub-Extractor Patterns

Overview

Sub-extractor patterns allow you to extract data within the context of an extractor pattern, providing significantly more flexibility in pinpointing the specific pieces you're after. Please read our documentation on using sub-extractor patterns before deciding to use them.

Sub-extractor patterns only match the first element they can. To get multiple matches, you would use manual extractor patterns instead.

Managing Sub-Extractor Patterns

Adding

  • Click the Add Sub-Extractor Pattern button under the sub-extractor patterns tab of the extractor pattern.

Removing

Sub-Extractor Pattern: Main Pane

  • Test Pattern: Opens a dataSet window with the information extracted from the last scrape of the file.
  • Highlight Extracted Data (professional and enterprise editions only): Opens the last response tab and places a colored background on all text that matches to the extractor tokens.
  • Delete: Removes the sub-extractor pattern from the extractor pattern.
  • Copy (professional and enterprise editions only): Copies the sub-extractor pattern so that it can be pasted into a different extractor pattern.
  • Sequence: Order in which the sub-extractor patterns should be applied.

Extractor Tokens

Overview

Extractor tokens select the information from a file that you want to be able to access. The purpose of an extractor pattern is to give context to the extractor token(s) it contains, which helps the tokens return only the information you're after. Without extractor tokens you will not gather any information from the site.

Extractor tokens become available to dataRecord, dataSet, and session objects depending on their settings and the scope of the scripts invoked. All extractor tokens are surrounded by the delimiters ~@ and @~ (one for each side of the token). Between the two delimiters is where the name/identifier of the token is specified.

Managing Extractor Tokens

Adding

  • Type ~@TOKEN_NAME@~ in the appropriate location in the Pattern text of the extractor or sub-extractor pattern.

    Make appropriate changes to the TOKEN_NAME text to reflect the desired name of the token.

  • Select a portion of the extractor pattern, right click, and select Generate extractor pattern token from selected text.

Removing

  • Remove the token and delimiters from the Pattern text of the extractor pattern as you would with any text editor.

Editing

  • Double-click on the desired extractor token's name.
  • Select the extractor token's name, right click, and choose Edit token.

Extractor Token: General tab

General Tab

  • Identifier: This is a string that will be used to identify the piece of data that gets extracted as a result of this token. You can use only alphanumeric characters and underscores here.
  • Save in session variable: Checking this box causes the value extracted by the token to be saved in a session variable using the token's identifier.
  • Null session variable if no match (enterprise edition only): When checked, if a session variable was matched previously but not this time, the value will be set to null. If unchecked the unmatched token would do nothing to the session variable so that the old session variable persists.
  • Regular Expression: Here you can designate a regular expression that will be used to match the text covered by this token. In most cases you should designate a regular expression for tokens. This makes the extraction more efficient and helps to guard against future changes that might be made to the target web site.
    • Enter: Type in your own regular expression.
    • Select: Select a predefined regular expression by name.

      The regular expressions that appear in the drop-down list can be edited by selecting Edit regular expressions from the Options menu.

Extractor Token: Mapping tab

Mapping Tab (enterprise edition only)

We would encourage you to read our documentation on mapping extracted data before you start using mappings.

  • Set (enterprise edition only): Name of the mapping group.

    To create a new set, select the text in the Set textbox and start typing the name of the new set.

  • Delete Set (enterprise edition only): Deletes the currently selected set.
  • Add Mapping (enterprise edition only): Adds a mapping to the currently selected set.
    • From: The value screen-scraper should match.
    • To: Once a match is found, indicates the new value the extracted data will assume.
    • Type: Determines the type of match that should be made in working with the value in the From field. The Equals option will match if an exact match is found, the Contains value will match if the value contains the text in the From field, and the regular expression type uses the From value as a regular expression to attempt to find a match (see regular expression help for more information on regular expressions).
    • Case Sensitive: Indicates whether or not the match should be case sensitive.
    • Sequence: Determines the sequence in which the particular mapping should be analyzed.

Mappings can be deleted by pressing the Delete key on your keyboard after selecting them.

Extractor Token: Advanced tab

Advanced Tab (enterprise edition only)

  • Strip HTML (enterprise edition only): Check this box if you'd like screen-scraper to pull out HTML tags from the extracted value.
  • Resolve relative URL to absolute URL (enterprise edition only): If checked, this will resolve a relative URL (e.g., /myimage.gif) into an absolute URL (e.g., http://www.mysite.com/myimage.gif).
  • Convert HTML entities (enterprise edition only): This will cause any HTML entities to be converted into plain text (e.g., it will convert &amp; into &).
  • Trim white space (enterprise edition only): This will cause any white space characters (e.g., space, tab, return) to be removed from the start and end of the matched string.
  • Exclude from DataSet/DataRecord (enterprise edition only): This will cause this token to not be saved in the DataRecord from each match of the extractor pattern.

Scripts

Overview

screen-scraper has a built-in scripting engine to facilitate dynamically scraping sites and working with data once it's been extracted. Scripts can be helpful for such things as interacting with databases and dynamically determining which files get scraped and when.

Invoking scripts in screen-scraper is similar to other programming languages in that they're tied to events. Just as you might designate a block of code to be run when a button is clicked in Visual Basic, in screen-scraper you might run a script after an HTML file has been downloaded or data has been extracted from a page. For more information see our documentation on scripting triggers.

Depending on your preferences, there are a number of languages that scripts can be written in. You can learn more in the scripting in screen-scraper section of the documentation.

If you haven't done so already, we'd highly recommend taking some time to go through our tutorials in order to get more familiar with how scripts are used.

Managing Scripts

Adding

  • Select New Script from the File menu.
  • Click on the pencil and paper icon in the button bar.
  • Right click on a folder in the objects tree and select New Script.
  • Use the keyboard shortcut Ctrl-L.

Removing

  • Press the Delete key when the script is selected in the objects tree.
  • Right-click on the script in the objects tree and select Delete.
  • Click the Delete button in the main pane of the script.

Importing

  • Right-click on the folder in the objects tree that you want to import the script into (other than the root folder) and select Import Into. In the window that opens, navigate to and select the script you want to import.
  • Select Import from the File menu. In the window that opens, navigate to and select the script you want to import.
  • Add the script to the import folder in screen-scraper's install directory.

    If screen-scraper is running when you copy the files into the import folder, they will be imported and hot-swapped in the next time a scraping session is invoked. They will also be imported when you start or stop screen-scraper.

Exporting

  • Right-click on the script in the objects tree and select Export.
  • Click the Export button in the main pane of the script.

Scripts: Main Pane

  • Export: Export the script to a file so that it can be backed up or transferred to other instances of screen-scraper.
  • Delete: Delete the script.
  • Show Script Instances: Display any locations where this script is invoked in the format scraping session: scrapeable file: extractor pattern (opens in a new window).
  • Name: A unique name so that you can easily indicate when it should be invoked.
  • Language: Select the language in which the script is written.
  • Overwrite this script on import (professional and enterprise editions only): Determines whether or not the current script can be overwritten by another that gets imported.

    For example, scripts attached to a scraping session are exported along with it. When you subsequently import that scraping session into another instance of screen-scraper it might overwrite existing scripts in that instance. For more information read our documentation on script overwriting.

  • Script Text: A text box in which to write your script.
  • Find: Opens a search window to help locate text in your script.
  • Wrap text: Determines whether single lines of code should be displayed on multiple lines when they are wider than the Script Text area.

Script Triggers

Overview

You designate a script to be executed by associating it with some event. For example, if you click on a scraping session, you'll notice that you can designate scripts to be invoked either before a scraping session begins or after it completes. Other events that can be used to invoke scripts relate to scrapeable files and extractor patterns.

Available associations (based on object location) are listed below with a brief description of how each can be useful.

  • Scraping Session
    • Before scraping session begins - Scripts that initialize variables or aid debugging work well here.
    • After scraping session ends - This association is good for closing any open resources or finishing data processing.
    • Always at the end - Forces scripts to run at the end of a scraping session, even if the scraping session is stopped prematurely.
  • Scrapeable File
    • Before file is scraped - Helpful for files used with iterators to get product lists and such.
    • After file is scraped - Good for processing the information scraped in the file.
  • Extractor Pattern
    • Before pattern is applied - Good for giving default values to variables, in case they don't match.
    • After pattern is applied - Good if you want to work with the data set as a whole and its methods.
    • Once if pattern matches - Simplifies the issue of matching the same link multiple times but only wanting to follow it once.
    • Once if no matches - Helpful in catching and reporting possible errors.
    • After each pattern match - Gives access to data records and their associated methods.

Managing Associations

Adding

All objects that can have scripts associated with them have buttons to add the script association, with the exception of scripts themselves. To create an association between scripts you would use the executeScript method of the session object, as in the sketch below.
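
A minimal sketch in Interpreted Java, assuming a script named "Initialize Variables" exists in your instance:

 // Invoke another script by name from within a script.
 session.executeScript( "Initialize Variables" );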

Locations to specify script associations are listed below.

Removing

  • Press the Delete key when the association is selected.
  • Right-click the association and select Delete.

Ordering

Script associations are ordered automatically in a natural order based on their relation to the object they are connected to: scripts called after a file is scraped cannot be ordered before associations that are called before the file is scraped. Beyond this natural ordering you can specify the order of the scripts using the Sequence number.

Enable/Disable

You can selectively enable and disable scripts using the Enabled checkbox in the rightmost column. It's often a good practice to create scripts used for debugging that you'll disable once you run scraping sessions in a production environment.

Other Windows

Overview

So far we have explained each of the windows in the workbench of screen-scraper. Here we would like to make you aware of a few other windows that you will likely come across in your work with screen-scraper.

Breakpoint Window

Overview

The breakpoint window opens when the scraping session encounters a session.breakpoint method call in a script. It is a very effective tool when troubleshooting your scrapes.
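
To trigger it, place a call like the following anywhere in a script (a minimal sketch in Interpreted Java):

 // Pause the scraping session here and open the breakpoint window.
 session.breakpoint();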

Breakpoint Window

  • (run): Instructs the scrape to continue from the stop point.
  • (stop): Ends the scrape (as soon as it can).
  • Session variables: Lists all of the session variables that are currently available.

    The value of any variable can be edited here by double clicking on it, changing it, and deselecting or hitting enter.

  • Current script: The script that initiated the breakpoint as well as a count of currently active scripts.
  • Current scrapeable file: The scrapeable file that called the script that initiated the breakpoint.
  • Current data set: Opens a dataset window with the contents of the active data set.
  • Current data record: Lists all of the data record variables that are currently available.

    The value of any variable can be edited here by double clicking on it, changing it, and deselecting or hitting enter.

Compare Last Request and Proxy Transaction

Overview

This feature is only available to Professional and Enterprise editions of screen-scraper.

At times in developing a scraping session a particular scrapeable file may not give you the results you're expecting. Even if you generated it from a proxy session, parameters or cookies may differ enough that the response from the server is very different from what you were anticipating, possibly even an error. Generally in cases like this the best approach is to compare the request produced by the scrapeable file in the running scraping session with the request produced by your browser in the proxy session. That is, ideally your scraping session mimics as closely as possible what your web browser does.

The Compare Last Request and Proxy Transaction window facilitates just such a comparison. It can be accessed from the Last Request tab of the scrapeable file. After clicking the Compare Last Request and Proxy Transaction button, you will be prompted to select the proxy transaction to which the request should be compared. Navigate to the proxy session it is connected to, select the desired transaction, and the window will open.

The screen has four tabs to aid in comparing the transaction and the request: URL, POST data, Cookies, and Headers. Parameters in any of these areas can be controlled using the scrapeableFile object and its methods.
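
For example, if the comparison reveals that your request is missing a header your browser sent, you can add it in a script that runs before the file is scraped. A minimal sketch in Interpreted Java (the header name and value are illustrative):

 // Add the missing request header before the file is scraped.
 scrapeableFile.addHTTPHeader( "Referer", "http://www.mysite.com/search" );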

DataSet Window

Overview

The DataSet window displays the values matched by the extractor tokens. It can be viewed in two basic ways:

  1. Clicking the Apply Pattern to Last Scraped Data button on an extractor pattern or sub-extractor pattern.
  2. Selecting a DataSet or clicking the Current data set button in a breakpoint window.

The DataSet window has two rendering styles. The default is grid view, but you can switch between views using the button at the top of the screen (after view as:).

Grid View

The names of the columns correspond to the tokens that matched data in the most recent scrapeable file's response. The one addition is the Sequence column that is used by screen-scraper to identify the order in which the matches occurred on the page.

If a column is not showing up for an extractor token it is because that token does not match anything in any of the data records.

List View

This view can be a little easier for viewing the matched data in data record groups.

Regular Expressions Editor

Overview

The regular expressions that you can select for extractor tokens are stored in screen-scraper and can be edited in the Regular Expressions Editor window. The window is accessed by selecting Edit Regular Expressions from the Options menu.

This can be helpful if you have a regular expression that you use frequently. You can also edit the provided regular expressions, though we encourage you not to do so without good reason. These regular expressions have been tested over time and updated when required; they are very stable expressions.

  • Add Regular Expression: Adds a new regular expression to the list.
  • (list of regular expressions)
    • Identifier: Name for the pattern. This is what will be selected when adding a regular expression to the extractor token.
    • Expression: The regular expression.
    • Description: A brief description of the regular expression. This is primarily to help you remember when you come back to it later.

Listed regular expressions can be edited by double clicking in the field that you would like to edit.

Scripting in screen-scraper

Overview

One of the most powerful features of screen-scraper is its built-in scripting engine. Through scripting web sites can be crawled in a very dynamic way. Scripting also allows you to insert business logic, clean and normalize data, and write data out to external repositories, such as files and databases. This section of the documentation will familiarize you with scripting in screen-scraper, as well as cover specifics on the various scripting languages that screen-scraper supports.

Before reading through this section you might find it helpful to first read the section on using scripts with the scraping engine. Also, if you haven't done so already, we'd highly recommend going through our first few tutorials, which provide several examples of scripting in screen-scraper.

File Path Delimiters

Overview

Because screen-scraper internally uses Java, it is important that file paths follow Java's requirements; that is, file paths follow the Unix/Linux structure (e.g., /usr/local/file.txt). If you are working on a machine that follows these conventions then nothing will look any different to you; however, if you are working on a Windows machine this is an important difference to keep in mind.

Windows uses the backslash (\) as a file delimiter but Java uses it as an escape character. That means that on a Windows machine you need to pay closer attention to file paths as they will look a little different.

Windows file paths should use either a forward slash (/) or two backslashes (\\) to delimit file paths.

Windows Example

 try
 {
     // Set the file path.
     outputFilePath = "C:/tmp/output/output.txt";
     // "C:\\tmp\\output\\output.txt"; would also work.

     // Create a FileWriter object.
     FileWriter out = new FileWriter( outputFilePath );

     // Write to the file.
     out.write( "I love scripting in screen-scraper" );

     // Close the stream.
     out.close();
 }
 catch( Exception e )
 {
     // Log the error if one occurs.
     session.log( e.getMessage() );
 }

Scripting in Interpreted Java

Overview

screen-scraper uses the BeanShell library to allow for scripting in Java. If you've done some programming in C or JavaScript you'll probably find BeanShell's syntax familiar. Documentation for BeanShell is excellent, and we'd recommend referring to it as you program.

Interpreted Java is simply a phrase meaning Java that is executed without first being compiled.

See the using scripts and API pages for details on objects and methods that you can make use of in a script. We also use Interpreted Java in all of our tutorials, which should get you familiar with how it's used in screen-scraper.

It is possible to access Java libraries in screen-scraper. See adding Java libraries for more details.

Example

// This particular example will only work with professional and enterprise editions of screen-scraper
// RunnableScrapingSession is reserved for these editions

// Import the RunnableScrapingSession class.
import com.screenscraper.scraper.*;

// Generate a new "Weather" scraping session.
runnableScrapingSession = new RunnableScrapingSession( "Weather" );

// Put the zip code in a session variable so we can reference it later.
runnableScrapingSession.setVariable( "ZIP_CODE", "90001" );

// Tell the scraping session to scrape.
runnableScrapingSession.scrape();

Java Tutorials

We use Java in the screen-scraper tutorials, but if you would like to learn more about Java there are many good tutorials and references available online.

Scripting in JScript

This scripting language is no longer available by default. To use it you will need to edit the AllowUnstableWindowsFeatures property in the screen-scraper.properties file.

Overview

Writing scripts in JScript gives you the familiarity of a widely used language, while still providing access to commonly used Windows libraries. Using JScript within screen-scraper can only be done on a Windows platform, and requires that the JScript runtime be installed. The chances are good that you've already got the JScript runtime on your system.

screen-scraper will automatically detect if the JScript runtime is installed, which you can see by selecting a script from the objects tree in the workbench and clicking on the Language drop-down list. If you don't see JScript in the list then the runtime needs to be installed.

If you do not have JScript runtime on your system you can download it from Microsoft's script downloads page.

Please be aware that because of a bug in the third-party library that allows screen-scraper to integrate with the Microsoft Scripting Engine problems can occur if multiple JScript scripts are run simultaneously. If you're using the professional edition of screen-scraper and plan on running multiple scraping sessions simultaneously you should use Interpreted Java, JavaScript, or Python as a scripting language.

Available Objects

Because screen-scraper uses the native JScript engine, all Active X objects installed on the computer (such as ADO or the FileSystemObject) can be accessed. Additionally, all of the objects mentioned on the Using scripts and API pages are also available.

Example

Java classes can also be instantiated within a script using the CreateBean function. For example, the following script will instantiate a RunnableScrapingSession and run it:

// This particular example will only work with professional and enterprise editions of screen-scraper
// RunnableScrapingSession is reserved for these editions

// Generate a new "Weather" scraping session.
var runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Weather" );

// Put the zip code in a session variable so we can reference it later.
runnableScrapingSession.setVariable( "ZIP_CODE", "90001" );

// Tell the scraping session to scrape.
runnableScrapingSession.scrape();

Scripting in JavaScript

Overview

Mozilla's Rhino scripting engine is used by screen-scraper to allow scripts to be written in JavaScript. Documentation for Rhino is sparse, but the interpreter adheres strictly to the established ECMAScript standard, so just about any reference on JavaScript can be used. If you try writing scripts in JavaScript and run into difficulties (because of the lack of documentation), you may want to consider using Interpreted Java instead, which has very similar syntax and significantly better documentation. If you've worked with client-side JavaScript in web programming, you'll probably be comfortable using JavaScript in screen-scraper.

Examples

Classes in standard Java Library

// Declare an ArrayList.
var myArrayList = new java.util.ArrayList();

// Add two elements.
myArrayList.add( "one" );
myArrayList.add( "two" );

// Log the size.
session.log( "Size: " + myArrayList.size() );

Packages outside standard Java Library

These must be prefaced with the Packages keyword.

// Declare a new DataRecord object.
var myDR = new Packages.com.screenscraper.common.DataRecord();

// Give it a key/value pair.
myDR.put( "foo", "bar" );

// Log the value of the key.
session.log( "foo: " + myDR.get( "foo" ) );

Scripting in Perl

This scripting language is no longer available by default. To use it you will need to edit the AllowUnstableWindowsFeatures property in the screen-scraper.properties file.

Overview

screen-scraper uses ActiveState's ActivePerl library for scripts written in Perl. Using Perl within screen-scraper can only be done on a Windows platform, and requires that the ActivePerl runtime be installed.

screen-scraper will automatically detect if the ActivePerl runtime is installed, which you can see by selecting a script from the objects tree in the workbench and clicking on the Language drop-down. If you don't see Perl in the list then the runtime needs to be installed.

The ActivePerl runtime can be downloaded from ActiveState's download page for free.

Example

Java classes can be instantiated within a script using the CreateBean function. For example, the following script will instantiate a RunnableScrapingSession for the "Weather" scraping session (which is found in the default screen-scraper installation) and run it:

# This particular example will only work with professional and enterprise editions of screen-scraper
# RunnableScrapingSession is reserved for these editions

# Generate a new "Weather" scraping session.
$runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Weather" );

# Put the zip code in a session variable so we can reference it later.
$runnableScrapingSession->setVariable( "ZIP_CODE", "90001" );

# Tell the scraping session to scrape.
$runnableScrapingSession->scrape();

Scripting in Python

Overview

The Jython interpreter is used by screen-scraper for scripting in Python. Jython is a very fast interpreter, and we'd recommend using it if you're familiar with the Python programming language.

Importing External Java Classes

Importing your externally-compiled classes is as easy as placing them in the ./lib/ext folder of your installation. The Jython interpreter will automatically include that folder on your PythonPath.

Generator Objects

The generator objects are implemented in Jython, and the folders lib/ext, lib/jython-lib, and lib/jython-lib/site-packages are included in Python's system path.

Example

When scripting in Python all of the standard Java classes can be used. Classes must be imported using the Java package hierarchy of screen-scraper, which is also required if you'd like to create one of screen-scraper's RunnableScrapingSession objects. Here's an example that will run a scraping session called "Weather":

# This particular example will only work with professional and enterprise editions of screen-scraper
# RunnableScrapingSession is reserved for these editions

# Import the RunnableScrapingSession class.
from com.screenscraper.scraper import RunnableScrapingSession

# Generate a new "Weather" scraping session.
runnableScrapingSession = RunnableScrapingSession( "Weather" )

# Put the zip code in a session variable so we can reference it later.
runnableScrapingSession.setVariable( "ZIP_CODE", "90001" )

# Tell the scraping session to scrape.
runnableScrapingSession.scrape()

Notice that before the RunnableScrapingSession class can be used it first must be imported.

Scripting in VBScript

This scripting language is no longer available by default. To use it you will need to edit the AllowUnstableWindowsFeatures property in the screen-scraper.properties file.

Overview

If you've programmed in Visual Basic or Active Server Pages you should find scripting in screen-scraper to be similar. Using VBScript within screen-scraper can only be done on a Windows platform, and requires that the VBScript runtime be installed. The chances are good that you've already got the VBScript runtime on your system.

screen-scraper will automatically detect if the VBScript runtime is installed, which you can see by selecting a script from the objects tree in the workbench and clicking on the Language drop-down list. If you don't see VBScript in the list then the runtime needs to be installed.

If you do not have VBScript runtime on your system you can download it from Microsoft's script downloads page.

Please be aware that because of a bug in the third-party library that allows screen-scraper to integrate with the Microsoft Scripting Engine problems can occur if multiple VBScript scripts are run simultaneously. If you're using the professional edition of screen-scraper and plan on running multiple scraping sessions simultaneously you should use Interpreted Java, JavaScript, or Python as a scripting language.

Available Objects

Because screen-scraper uses the native VBScript engine, all Active X objects installed on the computer (such as ADO or the FileSystemObject) can be accessed. Additionally, all of the objects mentioned on the using scripts and API pages are also available.

Example

Java classes can also be instantiated within a script using the CreateBean function. For example, the following script will instantiate a RunnableScrapingSession and run it:

' This particular example will only work with professional and enterprise editions of screen-scraper
' RunnableScrapingSession is reserved for these editions

' Generate a new "Weather" scraping session.
Set runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Weather" )

' Put the zip code in a session variable so we can reference it later.
runnableScrapingSession.SetVariable "ZIP_CODE", "90001"

' Tell the scraping session to scrape.
runnableScrapingSession.Scrape

Web Interface

The web interface is only available for enterprise edition users of screen-scraper.

Overview

The screen-scraper web interface allows you to administer aspects of the scraping process. This includes monitoring running scraping sessions, importing and exporting scraping sessions, and scheduling scraping sessions to be run on a periodic basis.

When screen-scraper is running in server mode, you can access the web interface on your local machine at the following URL: http://localhost:8779/.

If you've changed the Web/SOAP Server port in the workbench or the SOAPPort in the screen-scraper.properties file, you'll need to use the port you designated.
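
For reference, the relevant line in the screen-scraper.properties file looks something like the following (8779 is the default):

 SOAPPort=8779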

Depending on the operating system you're running, instead of localhost, you may need to use 127.0.0.1 or the IP address of the machine.

Managing Scraping Sessions

Importing

Exporting

Web Interface: Settings

Overview

The web interface settings can be opened by clicking on the settings button in the upper-right corner of the screen.

Settings

  • Timeout: The default number of minutes the scraping session is allowed to run before a request to stop is inserted.

    If this value is blank, 0, or negative, the scraping session will not time out.

  • Time: The default percentage by which the running times of two runs of a scraping session may differ without being flagged as a possible error.
  • Record Count: The default percentage by which the number of records scraped in two runs of a scraping session may differ without being flagged as a possible error.
  • Repeat Every: How often the scrape should be rerun by default.
  • Reload From File: If you directly edit the screen-scraper.properties file this causes the new settings to be reloaded.
  • Save: Save settings and close dialog.
  • Cancel: Close without saving changes to settings.

Flagged scrapes are highlighted in red in the run/running tab.

Web Interface: Runnable tab

Overview

This tab displays all scraping sessions loaded into the current instance of screen-scraper. It will display basic information on scraping sessions that are currently running, as well as scraping sessions that have run in the past. It also allows you to start and schedule scraping sessions.

The runnable tab will display all of the scraping sessions listed alphabetically by name, and the messages from the most recently started instance of the scrape.

  • View as: Change from list to folders view using this drop-down menu.
  • Refresh: Update the contents of the scraping session list.
  • (List of available scraping sessions):
    • Name: The name of the scraping session.
    • Start Time: The date and time the scraping session was last started.
    • Running Time: The amount of time the scraping session has been running (the number will update each time you click the Refresh button at the top right of the table).

      If the scraping session is not currently running, it shows how long it took the last time it ran.

    • Previous Running Time: The amount of time the scraping session took the last time it ran.

      If the scraping session is not currently running, it shows the amount of time the run before last took.

    • Num Records: The number of records the scraping session has extracted, as recorded by the session.addToNumRecordsScraped method (see the example after this list). If the method is never called then this number will always be zero.
    • Previous Num Records: The number of records the scraping session extracted the last time it ran.
    • Status: Indicates the current status of the scraping session. Possibilities include "In Process", "Completed", "Interrupted", and "Error".
    • Export: Exports the scraping session, just as you would from the workbench.
    • Run Now: Runs the scraping session.
    • Schedule: Allows you to schedule the scraping session to be run. See schedule scraping sessions for more information.
    • Remove: Deletes the scraping session from screen-scraper.
    • Notes: Allows you to view the notes specified in the scraping session.
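
The Num Records column only reflects what your scripts report. A minimal sketch in Interpreted Java, typically placed in a script that runs after each pattern match:

 // Increment the record count shown in the web interface.
 session.addToNumRecordsScraped( 1 );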

Web Interface: Run/Running tab

Overview

This tab displays information on scraping sessions that are either currently running or have run in the past. You can use this table to compare run times and the number of records scraped, and to monitor scraping session logs. If a scraping session has timed out (see settings) the Stop button will be grayed out and the status will change to Interrupted. If a script has flagged a fatal error (see setFatalErrorOccurred) then the error cell will display in red for that scrape.

Scrapes can be ordered in ascending or descending order using any of the fields. This is done by clicking on the column header that you want to sort by.
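
A script can flag such an error itself. A minimal sketch in Interpreted Java (the message text is illustrative):

 // Flag a fatal error so this run is highlighted in the web interface.
 session.setFatalErrorOccurred( true );
 session.setErrorMessage( "Login failed; aborting scrape." );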

Run/Running tab

  • Stop Marked Scraping Sessions: Stops the scraping sessions whose rows are checked on the far left.
  • Remove Completed Scraping Sessions: Removes the scraping sessions which have a status of complete.

    Removing records for scraping sessions that have run doesn't remove the scraping sessions themselves, just the records related to the time when they were run.

  • Remove Marked Scraping Sessions: Removes the scraping sessions whose rows are checked on the far left from the run/running tab.

    Removing records for scraping sessions that have run doesn't remove the scraping sessions themselves, just the records related to the time when they were run.

  • Auto-refresh: Refreshes the table of running files regularly.
  • Refresh: Refreshes the table of running files.
  • (List of running and completed scraping session runs):
    • Name: The name of the scraping session.
    • Start Time: The date and time the scraping session was last started.
    • Running Time: The amount of time the scraping session has been running (the number will update each time you click the Refresh button at the top right of the table).

      If the scraping session is not currently running, it shows how long it took the last time it ran.

    • Previous Running Time: The amount of time the scraping session took the last time it ran.

      If the scraping session is not currently running, it shows the amount of time the run before last took.

    • Num Records: The number of records the scraping session has extracted as recorded by the session.addToNumRecordsScraped method. If the method is never called then this number will always be zero.
    • Previous Num Records: The number of records the scraping session extracted the last time it ran.
    • Status: Indicates the current status of the scraping session. Possibilities include "In Process", "Completed", "Interrupted", and "Error".
    • Error: Indicates whether or not a fatal error has been flagged in the scraping session (see setFatalErrorOccurred).
    • Error Message: In the event of a flagged error, displays the provided message (see setErrorMessage).
    • Peek: Pops up a box that allows you to view the most recent section of the log.
    • Stop: Stops the scraping session.

Web Interface: Scheduled tab

Overview

On this tab you can manage scraping sessions that have been scheduled to be run. The columns can be sorted by clicking on the column headers.

Scheduled Tab

  • Refresh: Reloads the table of scheduled scraping sessions.
  • (List of scheduled runs for scraping sessions):
    • Scraping Session: The name of the scheduled scraping session.
    • Timeout: The amount of time in minutes the scraping session should be allowed to run.

      If this value is 0 or a negative number, the scraping session will not time out.

    • Date/Time: The date and time the scraping session is next scheduled to be run.
    • Session Variables: Any session variables that are to be passed to the scraping session when it runs.
    • Disable/Enable: Allows you to temporarily enable or disable the scheduled run of the scraping session.

      If the run of the scraping session is disabled, it will not run even if it's scheduled to do so.

    • Edit: Pops up a dialog box that allows you to manage the scheduled run of the scraping session.
    • Remove: Removes the scheduled run of the scraping session.

Web Interface: Schedule Scraping Session

Overview

It can be very helpful to have scraping sessions run automatically or on an ongoing basis. The web interface makes this simple, allowing you to schedule and manage multiple scrapes in a single location.

Managing Scheduled Scrapes

Scheduling Run

Editing Scheduled Run

  • You can alter the settings for an already scheduled scraping session by clicking on the Edit button on the scheduled tab.

Removing Scheduled Run

  • You can remove an already scheduled scraping session by clicking on the Remove button on the scheduled tab.

Schedule Scraping Session: General tab

General Tab

  • Scraping Session: The name of the scheduled scraping session.
  • Timeout: The number of minutes the scraping session is allowed to run before a request to stop is inserted.

    If this value is blank, 0, or negative, the scraping session will not time out.

  • Session Variables: This is a list of session variables that will be passed to the scraping session when it is run.

Schedule Scraping Session: Schedule tab

Schedule Tab

  • Date: The calendar date when the scraping session is to run next. Click the box to bring up a graphical calendar from which you can select the desired date.
  • Time: The time of day when the scraping session is to run next. This should be a 24-hour (military) time.
  • Repeat Every: Use this to set the frequency with which the scraping session is to run. For example, if you enter 2 into the Hours box, the scraping session will run when it is scheduled, then be re-scheduled to run once again two hours from the time it started.

    If these boxes are left blank, the scraping session will run once and not be re-scheduled.

Schedule Scraping Session: Thresholds tab

Thresholds Tab

  • Time: The percentage by which the running times of two runs of a scraping session may differ without being flagged as a possible error.
  • Record Count: The percentage by which the number of records scraped in two runs of a scraping session may differ without being flagged as a possible error.

Flagged scrapes are highlighted in red in the run/running tab.

API

Overview

When writing scripts within screen-scraper, there are a number of objects and methods available to you. The Using Scripts page provides an overview of working with scripts, where this page provides details on specific objects and methods you'll use when scripting within screen-scraper.

The API documentation emphasizes Interpreted Java, as Java is the language in which screen-scraper itself is written. That should not deter you from using whatever language you desire; all of the methods are available in whatever language you choose.

The examples given here assume you're using Interpreted Java as the scripting language, but there should be very little difference in syntax if you decide to use another language. For example, if you're scripting in VBScript, you would simply omit the semi-colon at the end of each line, and for methods that don't return a value you would precede them with the VBScript keyword Call (either that, or omit the parentheses around the method parameters).

screen-scraper Object APIs

The screen-scraper internal API has been divided into three groups for convenience.

  1. Scraping Engine: Request, parse, manipulate, and store data according to user defined processes.
  2. Proxy Server: Manipulate browser-server interactions to filter, track, or otherwise control the experience of the user.
  3. Utilities: Helpful screen-scraper objects for processing and storing data.

The two main groups are the scraping engine and the proxy server. The various objects available in these sections are exclusive to running screen-scraper in one of those two ways. The one exception is the RunnableScrapingSession, which has been grouped with the scraping engine simply because it is unlikely to be needed or used with the proxy server.

The utilities are available to scripts run in either the scraping engine or the proxy server, and so have been separated from both groups. These represent classes that we have written to simplify some common tasks that are performed with retrieved data.

Java Libraries/Classes of Note

There are many additional classes available through standard Java libraries that we did not create or modify but that are especially worthy of note. Regardless of the language you are using to program in screen-scraper, you have access to these.

Other screen-scraper APIs

There are a few other APIs to be aware of. They are particular to dealing with screen-scraper in certain ways or certain versions. Make sure that you understand the implications of using these APIs before you start playing with them.

  1. REST Interface: Issue commands to screen-scraper through GET requests to the server.
  2. Anonymization REST Interface: Configure and run anonymous scrapes through the REST Interface.
  3. Alpha Version: Methods and objects that have been introduced to screen-scraper since the last stable release.

Scraping Engine API

Overview

The scraping engine is the backbone of screen-scraper and provides four built-in objects. These objects are: session, scrapeableFile, dataSet, and dataRecord. We have also included the RunnableScrapingSession class as it best pertains to the engine.

For details on which objects are available to scripts in the context of a scrape see the variable scope section of the documentation.

Objects

  • dataRecord: This gives access to the most recently extracted data record. This will most likely only be used in scripts that get accessed after each time an extractor pattern is applied. This object simply extends Hashtable, and documentation on the Hashtable's methods can be found in Java's documentation.

    The dataRecord object is populated using the names of tokens from extractor patterns.

  • dataSet: The dataSet object holds all data records extracted by an extractor pattern after it has been applied as many times as possible to the HTML retrieved by a scrapeable file. A data set is analogous to a result or record set that would be returned from a database query. A data set contains any number of data records, which are analogous to rows in a database.
  • log: Methods used for logging information.
  • RunnableScrapingSession (com.screenscraper.scraper.RunnableScrapingSession): This is a class that can be instantiated within a script in order to run a scraping session. The Maximum number of concurrent running scraping sessions in the settings dialog box will control how many scraping sessions can be run simultaneously.
  • scrapeableFile: This refers to the scrapeable file that is currently being requested and analyzed.
  • session: This variable refers to the currently running scraping session.
  • sutil: General methods for checking and manipulating data.

dataRecord

Overview

This object gives access to the most recently extracted data record. This will most likely only be used in scripts that get accessed after each time an extractor pattern is applied. This object simply extends Hashtable (documentation on its methods can be found in Java's documentation).

The dataRecord is populated using the token names in the extractor patterns. You'll find a few of the most commonly used methods below. DataRecord objects can also be created from scratch, and subsequently added to DataSet objects using the addDataRecord method.

See example usage: Iterate over DataSets & DataRecords.

DataRecord

DataRecord DataRecord ( )

Description

Create a new DataRecord object.

Parameters

This method does not receive any parameters.

Return Values

Returns DataRecord object.

Change Log

Version Description
4.5 Available for all editions.

Class Location

com.screenscraper.common.DataRecord

Examples

Create New DataRecord

 // Create a new DataRecord object.
 myDataRecord = new DataRecord();

 // Populate it with a few fields.
 myDataRecord.put( "CITY", "Los Angeles" );
 myDataRecord.put( "ZIP", "90001" );
 myDataRecord.put( "STATE", "CA" );

 // Add it to an existing dataSet object.
 dataSet.addDataRecord( myDataRecord );

See additional example usage: Iterate over DataSets & DataRecords.

get

Object dataRecord.get ( Object key )

Description

Get the value of a DataRecord field.

Parameters

  • key The name of the associative key, usually a string but, if you have manually added fields, it can be an integer, etc.

Return Values

Returns the value associated with the specified key. Usually it will be a string but, if you have manually added fields, it can be an integer, boolean, long, or other object.

Change Log

Version Description
4.5 Available for all editions.

Examples

Retrieve DataRecord Information

 // Gets the value of the "CITY" field
 // and outputs it to the log.

 city = dataRecord.get( "CITY" );
 session.log( "City: " + city );

put

Object dataRecord.put ( Object key, Object value )

Description

Add a new field to the DataRecord or update the value of an existing field.

Parameters

  • key The name of the associative key, usually a string but, if you have manually added fields, it can be an integer, etc.
  • value The new value to be associated with the key.

Return Values

Returns the value previously associated with the specified key. If the key did not exist then it will return null.

Change Log

Version Description
4.5 Available for all editions.

Examples

Add/Change DataRecord Field

 // Adds a field called "CITY" with
 // the value "Los Angeles".

 dataRecord.put( "CITY", "Los Angeles" );

See additional example usage: Iterate over DataSets & DataRecords.

remove

Object dataRecord.remove ( Object key )

Description

Remove a field from the DataRecord.

Parameters

  • key The name of the associative key, usually a string but, if you have manually added fields, it can be an integer, etc.

Return Values

Returns the value previously associated with the specified key. If the key did not exist then it will return null.

Change Log

Version Description
4.5 Available for all editions.

Examples

Remove DataRecord Field

 // Removes the "CITY" field from the dataRecord.
 dataRecord.remove( "CITY" );

dataSet

Overview

The dataSet object holds all data records extracted by an extractor pattern after it has been applied as many times as possible to the HTML retrieved by a scrapeable file. A data set is analogous to a result or record set that would be returned from a database query. A data set contains any number of data records, which are analogous to rows in a database.

The dataSet object provides methods to aid in getting at the information that has been gathered.

See example usage: Iterate over DataSets & DataRecords.

DataSet

DataSet DataSet ( void )
DataSet DataSet ( ArrayList dataRecords )

Description

Manually create a DataSet.

Parameters

  • dataRecords (optional) Java ArrayList of DataRecord elements.

Return Values

Returns DataSet object.

Change Log

Version Description
4.5 Available for all editions.

Class Location

com.screenscraper.common.DataSet

Examples

Manually Create DataSet

 // Create DataSet
 myDataSet = new DataSet();

 // Create DataRecord
 myDataRecord = new DataRecord();
 myDataRecord.put( "STATE", "AZ");

 // Add DataRecord to DataSet
 myDataSet.addDataRecord( myDataRecord );

Create DataSet from Array List

 // Create Array List
 ArrayList dataRecords = new ArrayList();

 // Create DataRecord
 myDataRecord = new DataRecord();
 myDataRecord.put( "STATE", "AZ");

 // Add DataRecord to the "dataRecords" ArrayList
 dataRecords.add( myDataRecord );

 // Create DataSet From ArrayList.
 myDataSet = new DataSet( dataRecords );

See additional example usage: Iterate over DataSets & DataRecords.

addDataRecord

void dataSet.addDataRecord ( DataRecord dataRecord )

Description

Add a DataRecord to a DataSet.

Parameters

  • dataRecord The DataRecord to be added to the DataSet.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Add Data Record to DataSet

 // Create DataSet
 myDataSet = new DataSet();

 // Create DataRecord
 myDataRecord = new DataRecord();
 myDataRecord.put( "STATE", "AZ");

 // Add DataRecord to DataSet
 myDataSet.addDataRecord( myDataRecord );

See Also

See additional example usage: Iterate over DataSets & DataRecords.

clearDataRecords

void dataSet.clearDataRecords ( )

Description

Remove all DataRecord objects from the DataSet.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Remove DataRecords from DataSet

 // Removes all DataRecord objects from the dataSet object.
 dataSet.clearDataRecords();

See additional example usage: Iterate over DataSets & DataRecords.

deleteDataRecord

void dataSet.deleteDataRecord ( int dataRecordNumber )

Description

Remove a DataRecord from the DataSet.

Parameters

  • dataRecordNumber Index of the DataRecord in the DataSet, as an integer. Remember that the DataRecords set is zero based and so the first DataRecord would be at the index of zero.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Remove One DataRecord from DataSet

 // Deletes the third data record in the set. Remember that data sets
 // are zero-based.

 dataSet.deleteDataRecord( 2 );

findValue

Object dataSet.findValue ( String valueToFind, String columnToMatch, String columnToReturn )

Description

Retrieve a field's value in a data set based on another field.

Parameters

  • valueToFind Value being looked for, as a string.
  • columnToMatch Column/token name where the value is being searched for, as a string.
  • columnToReturn Column/token name whose value should be returned, as a string.

Return Values

Returns the value in the returned column, usually a string (unless records have been manually added). If no match is found, null is returned.

Change Log

Version Description
5.0 Added for all editions.

Examples

Get Value of Token based on Another Token

 // Create new DataSet
 DataSet myDataSet = new DataSet();

 // Create DataRecords
 DataRecord john = new DataRecord();
 john.put("FIRST_NAME", "John");
 john.put("LAST_NAME", "Doe");

 DataRecord jill = new DataRecord();
 jill.put("FIRST_NAME", "Jill");
 jill.put("LAST_NAME", "Smith");

 // Add dataRecords to dataSet
 myDataSet.addDataRecord(john);
 myDataSet.addDataRecord(jill);

 // Search dataSet for "John" in the "FIRST_NAME"
 // field. Return the value of the "LAST_NAME" in
 // the same record
 String result = myDataSet.findValue("John", "FIRST_NAME", "LAST_NAME");

 // Write result to log
 session.log(result); // Logs "Doe"

See Also

  • get() [dataSet] - Get a single piece of data held by a DataRecord in the DataSet.

get

Object dataSet.get ( int dataRecordNumber, String identifier )

Description

Get a single piece of data held by a DataRecord in the DataSet.

Parameters

  • dataRecordNumber Index of the DataRecord in the DataSet, as an integer. Remember that the DataRecords set is zero based and so the first DataRecord would be at the index of zero.
  • identifier The name of the element to retrieve from the DataRecord, as a string.

Return Values

Returns the value associated with the DataRecord identifier. It will be a string unless you have added values to the DataRecord whose values are not strings.

Change Log

Version Description
4.5 Available for all editions.

Examples

Get Token Value From DataRecord

 // Gets the value "CITY_CODE" from the first data record in the
 // data set.

 firstCityCode = dataSet.get( 0, "CITY_CODE" );

getAllDataRecords

ArrayList dataSet.getAllDataRecords ( )

Description

Get all DataRecords in the DataSet.

Parameters

This method does not receive any parameters.

Return Values

Returns an ArrayList of DataRecord objects.

Change Log

Version Description
4.5 Available for all editions.

This method is provided as a convenience; the recommended way to iterate over data records in a data set is to use getNumDataRecords and getDataRecord.

Examples

Loop Through DataRecords

 // Stores all of the data records in the variable allData.
 allData = dataSet.getAllDataRecords();

 // Loop through each of the data records.
 for( i = 0; i < allData.size(); i++ )
 {
     // Store the current data record in the variable myDataRecord.
     myDataRecord = allData.get( i );

     // Output the "PRODUCT_NAME" value from the data record to the log.
     session.log( "Product name: " + myDataRecord.get( "PRODUCT_NAME" ) );
 }

getCharacterSet

String dataSet.getCharacterSet ( )

Description

Get the character set being applied to the scraped data.

Parameters

This method does not receive any parameters.

Return Values

Returns the character set applied to the scraped data, as a string. If a character set has not been specified then it will default to the character set specified in the settings dialog box.

Change Log

Version Description
5.0 Added for all editions.

Examples

Get Character Set

 // Get the character set of the dataSet
 charSetValue = dataSet.getCharacterSet();

getDataRecord

DataRecord dataSet.getDataRecord ( int dataRecordNumber )

Description

Get one DataRecord in the DataSet.

Parameters

  • dataRecordNumber Index of the DataRecord in the DataSet, as an integer. Remember that the DataRecords set is zero based and so the first DataRecord would be at the index of zero.

Return Values

Returns a DataRecord (Hashtable object). If there is not a DataRecord at the specified index an error will be thrown.

Change Log

Version Description
4.5 Available for all editions.

Examples

Get DataRecords in a Loop

 // Loop through each of the data records.
 for( i = 0; i < dataSet.getNumDataRecords(); i++ )
 {
     // Store the current data record in the variable myDataRecord.
     myDataRecord = dataSet.getDataRecord( i );

     // Output the "PRODUCT_NAME" value from the data record to the log.
     session.log( "Product name: " + myDataRecord.get( "PRODUCT_NAME" ) );
 }

getFirstValueForKey

Object dataSet.getFirstValueForKey (String key )

Description

Get the first non-null value, in a data set, for a given token.

Parameters

  • key Name of the column whose value is returned, as a string.

Return Values

Returns the first non-null value in the column, usually a string (unless records have been manually added). If none is found, null is returned.

Change Log

Version Description
5.0 Added for all editions.

Examples

Get First Non-null Token Value

 // Gets the value of the first "CITY_CODE" in the
 // data set.

 fieldValue = dataSet.getFirstValueForKey("CITY_CODE");

See Also

  • get() [dataSet] - Get a single piece of data held by a DataRecord in the DataSet.
  • findValue() [dataSet] - Retrieve a field's value in a data set based on another field.

getNumDataRecords

int dataSet.getNumDataRecords ( )

Description

Get the number of DataRecords in the DataSet.

Parameters

This method does not receive any parameters.

Return Values

Returns the number of DataRecords in the DataSet, as an integer.

Change Log

Version Description
4.5 Available for all editions.

Examples

Get the Number of DataRecords in the DataSet

 // Loop through each of the data records.
 for( i = 0; i < dataSet.getNumDataRecords(); i++ )
 {
     // Store the current data record in the variable myDataRecord.
     myDataRecord = dataSet.getDataRecord( i );

     // Output the "PRODUCT_NAME" value from the data record to the log.
     session.log( "Product name: " + myDataRecord.get( "PRODUCT_NAME" ) );
 }

See Also

  • size() [dataSet] - Return the number of dataRecords in the dataSet.

join

void dataSet.join ( DataSet dataSet )

Description

Merge data records from two data sets.

Parameters

  • dataSet Data set whose records are to be merged.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

Examples

Merge DataRecords from DataSets

 // Create dataSet
 DataSet dataSet = new DataSet();

 // Load dataSet with information
 for (i = 0; i < 3; ++i)
 {
     DataRecord record = new DataRecord();
     record.put("DATA_SET_ONE", i);
     dataSet.addDataRecord(record);
 }

 // Create another dataSet
 DataSet anotherDataSet = new DataSet();

 // Load dataSet with information
 for (i = 0; i < 2; ++i)
 {
     DataRecord record = new DataRecord();
     record.put("DATA_SET_TWO", i);
     anotherDataSet.addDataRecord(record);
 }

 // Join DataSets
 dataSet.join(anotherDataSet);

 // Write merged DataSet to Log (in dataRecords)
 for (i = 0; i < dataSet.getNumDataRecords(); ++i)
 {
     DataRecord record = dataSet.getDataRecord(i);
     session.log("DataRecord " + i + ": " + record.toString());
 }

 // Log Output:
 // DataRecord 0: {DATA_SET_TWO=0, DATA_SET_ONE=0}
 // DataRecord 1: {DATA_SET_TWO=1, DATA_SET_ONE=1}
 // DataRecord 2: {DATA_SET_ONE=2}

setCharacterSet

void dataSet.setCharacterSet ( String characterSet )

Description

Set the character set to be used for rendering dataSet values.

Parameters

  • characterSet Java recognized character set, as a string. Java provides a list of supported character sets in its documentation.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

This will only change the character set on the current data set. If you want it to be changed for all data sets, you would need to change it in the settings dialog box or screen-scraper.properties file.

Examples

Set Character Set

 // Set the character set of the dataSet
 dataSet.setCharacterSet("UTF-8");

size

int dataSet.size ( )

Description

Get the number of DataRecords in the DataSet.

Parameters

This method does not receive any parameters.

Return Values

Returns the number of DataRecords in the DataSet, as an integer.

Change Log

Version Description
6.0.3a Available for all editions.

Examples

Get the Number of DataRecords in the DataSet

 // Loop through each of the data records.
 for( i = 0; i < dataSet.size(); i++ )
 {
     // Store the current data record in the variable myDataRecord.
     myDataRecord = dataSet.getDataRecord( i );

     // Output the "PRODUCT_NAME" value from the data record to the log.
     log.info( "Product name: " + myDataRecord.get( "PRODUCT_NAME" ) );
 }

writeToFile

void dataSet.writeToFile ( String fileName ) (professional and enterprise editions only)

Description

Write DataSet string and integer contents to a file. The fields will be tab-delimited and records hard-return delimited.

Parameters

  • fileName File path where the contents of the DataSet should be written. If the file already exists the contents will be appended to the file.

Return Values

Returns void. If the file cannot be written to then an error will be thrown.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Write DataSet Contents to a File

 // Writes the data found in the current data set to the file
 // "extracted_data.txt".

 dataSet.writeToFile( "C:/site_data/extracted_data.txt" );

log

Overview

This object contains various methods used to log information about a running scraping session to log files, the workbench "Log" tab, and the web interface.
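
For example, a script might write routine messages to the session log and send higher-level status messages to the web interface. A minimal sketch in Interpreted Java (the message text is illustrative):

 // Write a message to the scraping session log.
 log.info( "Extracted the search results page." );

 // Output a message that also appears in the web interface.
 log.webInfo( "Completed the search results page." );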

addAutoProgressBar

void log.addAutoProgressBar ( String name, String ... values ) (enterprise edition only)
void log.addAutoProgressBar ( String name, String[][] values ) (enterprise edition only)
void log.addAutoProgressBar ( String name, Collection<String> values ) (enterprise edition only)
void log.addAutoProgressBar ( String name, String[][] values, int keyIndex ) (enterprise edition only)
void log.addAutoProgressBar ( String name, DataSet values, String key ) (enterprise edition only)

Description

Creates an automatic progress bar and adds it to the progress bars. These progress bars match their progress to a value from a session variable and a list of values. When web messages are output with the webDebug, webInfo, webWarn, or webError methods, a progress bar will be drawn to give a visual representation of the current progress of the scrape.

Note that when using auto progress bars, it is advised not to use any manually monitored ones, as this can cause conflicts. Any time an auto progress bar has no session variable set for its monitored key, it deletes itself and all child progress bars (including manual ones). As long as you keep that in mind, it should be safe to use both types together.

Parameters

  • name The name of the progress bar, which should match the session variable where the value for updating this bar will be stored.
  • values The values this progress bar can have, in the order they will be queried. For example, if the session variable can be "1" or "2", the values should also be "1" and "2".
  • keyIndex (optional) The index in each inner array of the value that will be set in the session variable. This is only applicable when a 2D array is given. When a 2D array is given but no index is given, 0 is used.
  • key (optional) The key in the DataRecords that will be used for the session variable matching name. Used only with the DataSet method option.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.31a Available in enterprise edition.
5.5.43a Moved from session to log class.

Examples

Create an auto progress bar to track the search

 // Searching over each letter of the alphabet
 String[] letters = new String[26];
 for(char c = 'a'; c <= 'z'; c++)
   letters[ c - 'a' ] = "" + c;

 // Using this approach is more convenient when values will get changed in various scripts
 log.addAutoProgressBar("SEARCH_LETTER", letters);

 for(int i = 0; i < letters.length; i++)
 {
   session.setVariable("SEARCH_LETTER", letters[ i ]);
   session.scrapeFile("Search");
   log.logMonitoredValues("Completed Letter");
 }

addMonitoredPostfix

void log.addMonitoredPostfix ( String postfix ) (enterprise edition only)

Description

Watches for all session variables whose keys end with the postfix specified, and will output their values when monitored variables are logged.

Parameters

  • postfix The postfix to monitor

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.42a Moved from session to log class.

Examples

Watch all variables ending with _PARAM and log their values

 log.addMonitoredPostfix("_PARAM");

 // Log the current value of all session variables whose names end with _PARAM
 log.logMonitoredValues();

addMonitoredPrefix

void log.addMonitoredPrefix ( String prefix ) (enterprise edition only)

Description

Watches for all session variables whose keys begin with the prefix specified, and will output their values when monitored variables are logged.

Parameters

  • prefix The prefix to monitor

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.42a Moved from session to log class.

Examples

Watch all variables starting with SEARCH_ and log their values

 log.addMonitoredPrefix("SEARCH_");

 // Log the current value of all session variables whose names start with SEARCH_
 log.logMonitoredValues();

addMonitoredValue

Object log.addMonitoredValue ( String name, Object value ) (enterprise edition only)

Description

Adds a specific name and value to be logged with the web message methods or the logMonitoredValues method.

Parameters

  • name The name for the value being monitored
  • value The value to associate with the given name

Return Value

The previous value associated with the name, or null if there wasn't one

Change Log

Version Description
5.5.29a Available in all editions.
5.5.42a Moved from session to log class.

Examples

Add and log a value

 // Setting a value this way will persist it across scripts.
 // That way a future script can log the data set, along with any other monitored values.
 log.addMonitoredValue("The dataSet", dataSet);

 // Each time this method is called, it will log the above dataSet
 log.logMonitoredValues();

addMonitoredVariable

void log.addMonitoredVariable ( String key ) (enterprise edition only)

Description

Watches the value of a session variable, and will output it each time monitored variables are output

Parameters

  • key The key in the session corresponding to a value

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.42a Moved from session to log class.

Examples

Watch a variable and log its value

 log.addMonitoredVariable("NAME");

 // Log the current value of NAME, as well as any other currently monitored values
 log.logMonitoredValues();

addProgressBar

ProgressBar log.addProgressBar ( String title ) (enterprise edition only)
ProgressBar log.addProgressBar ( String title, String total ) (enterprise edition only)
ProgressBar log.addProgressBar ( String title, double total ) (enterprise edition only)
ProgressBar log.addProgressBarIfNotStopped ( String title ) (enterprise edition only)
ProgressBar log.addProgressBarIfNotStopped ( String title, String total ) (enterprise edition only)
ProgressBar log.addProgressBarIfNotStopped ( String title, double total ) (enterprise edition only)

Description

Adds a new progress bar. If no progress bar exists, this will be set as the root; otherwise it will be the child of the lowest progress bar. When web messages are output with the webDebug, webInfo, webWarn, or webError methods, a progress bar will be drawn to give a visual representation of the current progress of the scrape. The addProgressBarIfNotStopped versions add the progress bar only if the scrape has not been stopped, which avoids creating new progress bars once a scrape has been stopped.

Parameters

  • title The title for the new progress bar
  • total (optional) The total for the new progress bar. This should be the total number of things this is tracking the progress of. For example, if used when iterating over each letter of the alphabet for a search, the total would be 26 (one for each letter).

Return Value

This method returns a reference to the new progress bar, which can be used to update the current progress

Change Log

Version Description
5.5.29a Available in all editions.
5.5.31a Available in enterprise edition.
5.5.43a Moved from session to log class.

Examples

Track the progress of a search over the alphabet

 import com.screenscraper.util.ProgressBar;

 ProgressBar bar = log.addProgressBar("Letter", 26);
 for(char c = 'a'; c <= 'z'; c++)
 {
   session.setVariable("SEARCH_LETTER", c);
   session.scrapeFile("Search");
   bar.add(1);
   
   // For Professional and Enterprise Editions
   log.webInfo("Completed Search on: " + c);
   
   // For Basic Edition (note that this method is also available in Professional and Enterprise editions)
   log.logMonitoredValues();
 }

 // Now that we have completed the search, remove the progress bar
 log.removeProgressBar(bar);

appendStatusMessage

boolean log.appendStatusMessage ( String message ) (enterprise edition only)

Description

Appends a status message to be displayed in the web interface.

Parameters

  • message The message to be appended.

Return Values

None

Change Log

Version Description
5.5.32a Available in Enterprise edition.
5.5.43a Moved from session to log class.

Examples

Append a status message

 if( scrapeableFile.getExtractorPatternTimedOut() )
 {
   log.appendStatusMessage( "Extractor pattern timed out." );
 }

cacheFile

File log.cacheFile ( String outputFilenameAndPath, File fileToCache ) (professional and enterprise editions only)

Description

Adds a file to the cache. This can be used to add anything to the cache, from a text file to an image that was downloaded, or any other file that would be useful.

Parameters

  • outputFilenameAndPath The name of the file in the cache, including any directory it should be placed in
  • fileToCache The file that should be cached. This cannot be a directory

Return Value

A File that represents the cached file.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Cache a file

 // Set the path in the first parameter so it will show up in a subdirectory in the final output
 log.cacheFile("images/products/" + dataRecord.get("PRODUCT_NAME") + ".jpg", new File("output/downloadedImage.jpg"));

cacheScrapeableFile

File log.cacheScrapeableFile ( ScrapeableFile scrapeableFile ) (professional and enterprise editions only)

Description

Caches the HTML and headers of the scrapeable file. This will include both the request and response headers.

Parameters

  • scrapeableFile The scrapeable file to cache.

Return Value

A File that represents the cached file.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Cache the current file

 // Note that this will create a duplicate file, since with caching enabled this happens automatically.
 // It may be useful in some cases if file manipulation is going to be performed on the returned File
 log.cacheScrapeableFile(scrapeableFile);

cacheText

File log.cacheText ( String name, String content, String encoding ) (professional and enterprise editions only)
File log.cacheText ( String name, String content ) (professional and enterprise editions only)

Description

Adds text to the cache. This will create a new text file in the cache and store the given content in it.

Parameters

  • name The name of the file in the cache, including any directory it should be placed in
  • content The content to place in the cache
  • encoding The encoding to use for the text, or null to use the default encoding for the session

Return Value

A File that represents the cached file.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Cache the extracted section for a DataRecord

 log.cacheText("Datarecord.html", dataRecord.get("DATARECORD"), "UTF-8");

debug

void log.debug ( Object message )

Description

Write message to the log.

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for all editions.

When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.

Examples

Write to Log

 // Sends the message to the log.
 log.debug( "Inserting extracted data into the database." );

See Also

  • info() [log] - Sends a message to the log as an informative message
  • warn() [log] - Sends a message to the log as a warning message
  • error() [log] - Sends a message to the log as an error message
  • logDebug() [session] - Write message to the log as a debug message
  • log() [log] - Write message to the log
  • log() [session] - Write message to the log

enableCaching

void log.enableCaching ( String description, boolean saveLogs, boolean zipCachedFiles ) (professional and enterprise editions only)

Description

Enables caching for this scrape. When caching is enabled, each time a scrapeable file is downloaded it will be saved to the file system. Once the session is completed the cache will be either zipped or the directory renamed, depending on the conditions that were specified when the cache was enabled. Optionally this will save the log files to the cached location, and will save everything from the error.log file that was added while the cache was enabled.

Parameters

  • description A description to use in the cached file name
  • saveLogs True if logs should be included in the cache
  • zipCachedFiles True if the cached files should be zipped once the scrape ends.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.32a Renamed from enableCache to enableCaching
5.5.43a Moved from session to log class.

Examples

Cache the pages requested by the scrape

 // No special description is needed, but we want logs to be saved, and the output to be a zipped file
 log.enableCaching("", true, true);

endCaching

void log.endCaching ( ) (professional and enterprise editions only)

Description

Ends the caching for the scrape. This method is called automatically once all the scripts and files have been run/scraped, but it can be called in a script to end the caching early (thereby caching only a portion of the scrape). This only deals with saving downloaded content to the file system, not with reading it back in during a scrape.

Parameters

This method takes no parameters

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.32a Renamed from endCache to endCaching.
5.5.43a Moved from session to log class.

Examples

Cache the pages requested by the scrape

 // End the cache manually before the scrape ends
 log.endCaching();
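
For instance, caching could be limited to just the login portion of a scrape by enabling it at the start and ending it once the login pages have been requested. A minimal sketch; the description string and script placement are illustrative:

 // In a script run "Before scraping session begins"
 log.enableCaching("login-only", true, true);

 // In a script run after the login pages have been scraped; stops saving
 // downloaded content so the remainder of the scrape is not cached.
 log.endCaching();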

error

void log.error ( Object message )

Description

Write message to the log.

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for all editions.

When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.

Examples

Write to Log

 // Sends the message to the log.
 log.error( "Failed to insert extracted data into the database." );

See Also

  • info() [log] - Sends a message to the log as an informative message
  • debug() [log] - Sends a message to the log as a debugging message
  • warn() [log] - Sends a message to the log as a warning message
  • logError() [session] - Sends a message to the log as an error message
  • log() [log] - Write message to the log
  • log() [session] - Sends a message to the log

getCachingEnabled

boolean log.getCachingEnabled ( ) (professional and enterprise editions only)

Description

Returns whether or not the cache is enabled for the scrape. When enabled, it simply means that each ScrapeableFile will save the content it downloads from the server to the file system so it can be viewed later, generally for debugging purposes.

Parameters

This method takes no parameters

Return Value

Returns true if caching is currently enabled for this session

Change Log

Version Description
5.5.29a Available in all editions.
5.5.32a Available in professional and enterprise editions (returns false in basic edition, but doesn't throw an Exception). Renamed from getCacheEnabled to getCachingEnabled.
5.5.43a Moved from session to log class.

Examples

Log the cache state

 if(log.getCachingEnabled())
 {
   session.log("Currently caching the session.");
 }

getProgressBar

ProgressBar log.getProgressBar ( int index ) (enterprise edition only)
ProgressBar log.getProgressBar ( String title ) (enterprise edition only)

Description

Returns the progress bar specified. If the index is given, returns the progress bar at that index (0 is the root, 1 is the first child, etc...). If the title is given, returns the most recently added progress bar with the given title.

Parameters

  • index (optional) The desired ProgressBar's index
  • title (optional) The title to search for

Return Value

The ProgressBar indicated, or null if none was found matching the required criteria

Change Log

Version Description
5.5.29a Available in all editions.
5.5.31a Available in enterprise edition.
5.5.43a Moved from session to log class.

Examples

Track the progress of a search over the alphabet

 import com.screenscraper.util.ProgressBar;

 ProgressBar bar = log.addProgressBar("Letter", 26);
 for(char c = 'a'; c <= 'z'; c++)
 {
   session.setVariable("SEARCH_LETTER", c);
   session.scrapeFile("Search");
   bar.add(1);

   // For Professional and Enterprise Editions
   log.webInfo("Completed Search on: " + c);

   // For Basic Edition (note that this method is also available in Professional and Enterprise editions)
   log.logMonitoredValues();
 }

 // Now that we have completed the search, remove the progress bar
 log.removeProgressBar(bar);

 // Increment the value of the Category progress bar (created in a separate script).
 // It is generally recommended to save a reference as a session variable rather than using this method
 log.getProgressBar("Category").add(1);

info

void log.info ( Object message )

Description

Write message to the log.

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for all editions.

When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.

Examples

Write to Log

 // Sends the message to the log.
 log.info( "Inserting extracted data into the database." );

See Also

  • debug() [log] - Sends a message to the log as a debugging message
  • warn() [log] - Sends a message to the log as a warning message
  • error() [log] - Sends a message to the log as an error message
  • logInfo() [session] - Sends a message to the log as an info message
  • log() [log] - Write message to the log
  • log() [session] - Write message to the log

log

void log.log ( Object message )

Description

Write message to the log.

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for all editions.

When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.

Examples

Write to Log

 // Sends the message to the log.
 log.log( "Inserting extracted data into the database." );

See Also

  • debug() [log] - Sends a message to the log as a debugging message
  • info() [log] - Sends a message to the log as an informative message
  • warn() [log] - Sends a message to the log as a warning message
  • error() [log] - Sends a message to the log as an error message
  • log() [session] - Write message to the log

logDataRecord

void log.logDataRecord ( DataRecord record )
void log.logDataRecord ( DataRecord record, int logLevel )
void log.logDataRecordDebug ( DataRecord record ) (professional and enterprise editions only)
void log.logDataRecordInfo ( DataRecord record ) (professional and enterprise editions only)
void log.logDataRecordWarn ( DataRecord record ) (professional and enterprise editions only)
void log.logDataRecordError ( DataRecord record ) (professional and enterprise editions only)

Description

Logs all the values in a Data Record to the log, with one line per value. If a value in the record is a List, Set, Map, Data Set, Scrapeable File, or Exception, it will have detailed output as well.

Parameters

  • record The Data Record to output to the log
  • logLevel (optional) The level to log the data record at, as an int
    Values are 1-Debug, 2-Info, 3-Warn, 4-Error, or can be obtained from com.screenscraper.common.Notifiable.LEVEL_(DEBUG/INFO/WARN/ERROR)
    When omitted, the log level used is the session logging level.

Return Values

This method returns nothing

Change Log

Version Description
5.5.26a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Log a Data Record

 // Log a scraped data record before saving it to a database
 log.logDataRecord(dataRecord);

The output from the above call might look something like this:

DataRecord
--- A_FLOAT : 3.14159
--- A_LIST : List
------ Element 0 : Value 1
------ Element 1 : Value 2
------ Element 2 : Value 3
------ Element 3 : Set
--------- Element : A value
--------- Element : More value
--------- Element : Other stuff
--- A_MAP : Map
------ KEY_1 : 1
------ KEY_2 : 2
------ KEY_3 : 3
--- A_SET : Set Logged above as "------ Element 3 : "
--- A_STRING : Screen-Scraper
--- AN_INT : 5
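
For reference, a record similar to the one above could be built as follows. This is a minimal sketch assuming DataRecord supports the standard Map put method (get is used with it elsewhere in these docs):

 import com.screenscraper.common.DataRecord;
 import java.util.*;

 DataRecord record = new DataRecord();
 record.put( "A_FLOAT", 3.14159 );
 record.put( "A_STRING", "Screen-Scraper" );
 record.put( "AN_INT", 5 );

 List list = new ArrayList();
 list.add( "Value 1" );
 list.add( "Value 2" );
 list.add( "Value 3" );
 record.put( "A_LIST", list );

 Map map = new HashMap();
 map.put( "KEY_1", 1 );
 map.put( "KEY_2", 2 );
 map.put( "KEY_3", 3 );
 record.put( "A_MAP", map );

 // Writes one line per value, with nested output for the list and map
 log.logDataRecord( record );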

logException

void log.logException ( Exception exception )

Description

Logs an Exception, with a full stack trace, at the Error level

Parameters

  • exception The Exception to log

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Log an exception

 try
 {
   int result = Integer.parseInt(dataRecord.get("SCRAPED_VALUE"));
 }
 catch(Exception e)
 {
   log.logException(e);
 }

logMonitoredValues

void log.logMonitoredValues ( Object message )
void log.logMonitoredValues ( Object message, int logLevel )
void log.logMonitoredValuesDebug ( Object message ) (professional and enterprise editions only)
void log.logMonitoredValuesInfo ( Object message ) (professional and enterprise editions only)
void log.logMonitoredValuesWarn ( Object message ) (professional and enterprise editions only)
void log.logMonitoredValuesError ( Object message ) (professional and enterprise editions only)

Description

Logs the values of all the currently monitored variables and the progress of the scrape (if known), with the message placed at the top. Also logs any additional values being watched. Values are logged at the specified level.

Parameters

  • message A message to output as a header for this log entry
  • logLevel (optional) The level to log at

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Log the currently monitored values and progress bars

 log.logMonitoredValues("Record Saved");
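
The log level can also be given explicitly. A minimal sketch, assuming the same integer level constants noted under logDataRecord apply here:

 import com.screenscraper.common.Notifiable;

 // Log the monitored values as a warning rather than at the default level
 log.logMonitoredValues("Retrying Request", Notifiable.LEVEL_WARN);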

logMonitoredValuesClose

void log.logMonitoredValuesClose ( Object message ) (professional and enterprise editions only)

Description

Logs closing values to indicate that the scrape is complete and what the monitored values were when everything finished. It will log at the highest level that was logged during the scrape. For instance, if a webWarn had been logged during the scrape, this will log at the warning level.

Parameters

  • message A message to output as a header for this log entry

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Log the currently monitored values at the end of the scrape

 log.logMonitoredValuesClose("Scrape Completed");

logObjectByType

void log.logObjectByType ( Object object )
void log.logObjectByType ( Object object, int logLevel )
void log.logObjectByTypeDebug ( Object object ) (professional and enterprise editions only)
void log.logObjectByTypeInfo ( Object object ) (professional and enterprise editions only)
void log.logObjectByTypeWarn ( Object object ) (professional and enterprise editions only)
void log.logObjectByTypeError ( Object object ) (professional and enterprise editions only)

Description

Logs the Object in a semi-intelligent way. For example, Maps are logged as key-value pairs, Lists are logged with one element per line, all elements of a Set are logged, and so on. Objects that aren't a standard type of data set or list will simply have their value logged using String.valueOf().

Parameters

  • object The Object to write to the log
  • logLevel (optional) The level to log the data record at, as an int

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Log the dataSet

 log.logObjectByType(dataSet);
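
As an illustration of the type-aware output, a Map logs one key-value pair per line. A minimal sketch with illustrative keys:

 import java.util.HashMap;
 import java.util.Map;

 Map counts = new HashMap();
 counts.put("PAGES_SCRAPED", 42);
 counts.put("RECORDS_SAVED", 397);

 // Each entry is written to the log on its own line
 log.logObjectByType(counts);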

logScreenScraperInformation

void log.logScreenScraperInformation ( )

Description

Logs useful information about the current instance of Screen-Scraper, as well as the Java VM and the General Utility version being used. Information will be logged as an info message in the web interface (when running in server mode) and the log.

Parameters

This method takes no parameters

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Log Current Info

 log.logScreenScraperInformation();

removeMonitoredPostfix

void log.removeMonitoredPostfix ( String postfix ) (enterprise edition only)

Description

Stops watching for a postfix in session variables

Parameters

  • postfix Postfix to remove from monitoring

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Stop watching and logging the value of all session variables ending with _PARAM

 log.removeMonitoredPostfix("_PARAM");

removeMonitoredPrefix

void log.removeMonitoredPrefix ( String prefix ) (enterprise edition only)

Description

Stops watching for a prefix in session variables

Parameters

  • prefix Prefix to remove from monitoring

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Stop watching and logging the value of all session variables starting with SEARCH_

 log.removeMonitoredPrefix("SEARCH_");

removeMonitoredValue

Object log.removeMonitoredValue ( String name ) (enterprise edition only)

Description

Removes a specific name from the manually set values to be logged. Doesn't affect the value of session variables

Parameters

  • name The name for the value being monitored

Return Value

The previous value associated with the name, or null if there wasn't one

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Remove a value so it won't be logged by logMonitoredValues

 log.removeMonitoredValue("The dataSet");

removeMonitoredVariable

void log.removeMonitoredVariable ( String key ) (enterprise edition only)

Description

Stops watching the specified variable

Parameters

  • key Key for the variable to stop watching

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Moved from session to log class.

Examples

Stop watching and logging the value of the session variable NAME

 log.removeMonitoredVariable("NAME");

removeProgressBar

void log.removeProgressBar ( ProgressBar progressBar ) (enterprise edition only)
void log.removeProgressBarIfNotStopped ( ProgressBar progressBar ) (enterprise edition only)

Description

Removes the specified progress bar. The removeProgressBarIfNotStopped version removes the progress bar if the scrape has not been stopped, which is useful for determining when a scrape was stopped.

Parameters

  • progressBar The ProgressBar to remove

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.31a Available in enterprise edition.
5.5.43a Moved from session to log class.

Examples

Track the progress of a search over the alphabet

 import com.screenscraper.util.ProgressBar;

 ProgressBar bar = log.addProgressBar("Letter", 26);
 for(char c = 'a'; c <= 'z'; c++)
 {
   session.setVariable("SEARCH_LETTER", c);
   session.scrapeFile("Search");
   bar.add(1);

   // For Professional and Enterprise Editions
   log.webInfo("Completed Search on: " + c);

   // For Basic Edition (note that this method is also available in Professional and Enterprise editions)
   log.logMonitoredValues();
 }

 // Now that we have completed the search, remove the progress bar
 log.removeProgressBar(bar);

warn

void log.warn ( Object message )

Description

Write message to the log.

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for all editions.

When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.

Examples

Write to Log

 // Sends the message to the log.
 log.warn( "Inserting extracted data into the database." );

See Also

  • info() [log] - Sends a message to the log as an informative message
  • debug() [log] - Sends a message to the log as a debugging message
  • error() [log] - Sends a message to the log as an error message
  • logWarn() [session] - Write message to the log as a warning message
  • log() [log] - Write message to the log
  • log() [session] - Write message to the log

webClose

void log.webClose ( Object object ) (professional and enterprise editions only)

Description

Logs closing values to indicate that the scrape is complete and what the monitored values were when everything finished. It will log at the highest level that was logged during the scrape. For instance, if a webWarn had been logged during the scrape, this will log at the warning level. When running in Professional edition, this simply outputs to the log.

Using this method is preferred over logMonitoredValuesClose (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.

Parameters

  • object The message to display as a header

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Log monitored variables at the end of the scrape

 log.webClose("Scrape Completed");

webDebug

void log.webDebug ( Object object ) (professional and enterprise editions only)
void log.webDebug ( Object object, boolean saveMessage ) (professional and enterprise editions only)
void log.webDebug ( Object object, Object loggable ) (professional and enterprise editions only)
void log.webDebug ( Object object, boolean saveMessage, Object loggable ) (professional and enterprise editions only)

Description

Logs a debug message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.

Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.

Parameters

  • object The message to display as a header
  • saveMessage (optional) Whether or not to save this message and continue to display it below future web messages. By default debug messages are not saved.
  • loggable (optional) An additional object to log, most likely a DataRecord. This will only be logged with this message, and not 'monitored' like other values

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Log monitored variables and progress

 log.webDebug("Record Saved");
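
The optional parameters allow a message to persist below later messages or to carry a one-off object. A minimal sketch:

 // Keep this message visible below future web messages
 log.webDebug("Entered Results Loop", true);

 // Attach the current data record to this message only
 log.webDebug("Record Parsed", dataRecord);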

webError

void log.webError ( Object object ) (professional and enterprise editions only)
void log.webError ( Object object, boolean saveMessage ) (professional and enterprise editions only)
void log.webError ( Object object, Object loggable ) (professional and enterprise editions only)
void log.webError ( Object object, boolean saveMessage, Object loggable ) (professional and enterprise editions only)

Description

Logs an error message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.

Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.

Parameters

  • object The message to display as a header
  • saveMessage (optional) Whether or not to save this message and continue to display it below future web messages. By default error messages are saved.
  • loggable (optional) An additional object to log, most likely a DataRecord. This will only be logged with this message, and not 'monitored' like other values

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Log monitored variables and progress

 log.webError("Record Failed to Save");

webInfo

void log.webInfo ( Object object ) (professional and enterprise editions only)
void log.webInfo ( Object object, boolean saveMessage ) (professional and enterprise editions only)
void log.webInfo ( Object object, Object loggable ) (professional and enterprise editions only)
void log.webInfo ( Object object, boolean saveMessage, Object loggable ) (professional and enterprise editions only)

Description

Logs an info message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.

Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.

Parameters

  • object The message to display as a header
  • saveMessage (optional) Whether or not to save this message and continue to display it below future web messages. By default info messages are not saved.
  • loggable (optional) An additional object to log, most likely a DataRecord. This will only be logged with this message, and not 'monitored' like other values

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Log monitored variables and progress

 log.webInfo("Record Saved");

webWarn

void log.webWarn ( Object object ) (professional and enterprise editions only)
void log.webWarn ( Object object, boolean saveMessage ) (professional and enterprise editions only)
void log.webWarn ( Object object, Object loggable ) (professional and enterprise editions only)
void log.webWarn ( Object object, boolean saveMessage, Object loggable ) (professional and enterprise editions only)

Description

Logs a warning message to the web interface status message area. Uses the message header as the top of the message, and then logs all currently monitored session variables underneath as well as the current progress (if known) of the scrape. Also outputs the message to the log. When running in Professional edition, this simply outputs to the log.

Using this method is preferred over logMonitoredValues (which only logs to the log), because if at a later point the scrape is run in server mode for enterprise edition, a useful message is output in the web interface without needing to modify the scrape.

Parameters

  • object The message to display as a header
  • saveMessage (optional) Whether or not to save this message and continue to display it below future web messages. By default warn messages are saved.
  • loggable (optional) An additional object to log, most likely a DataRecord. This will only be logged with this message, and not 'monitored' like other values

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Moved from session to log class.

Examples

Log monitored variables and progress

 log.webWarn("Record Saved");

RunnableScrapingSession

Overview

This is a class that can be instantiated within a script in order to run a scraping session.

Note that the Maximum number of concurrent running scraping sessions setting in the settings dialog box controls how many scraping sessions can be run simultaneously.

RunnableScrapingSession

RunnableScrapingSession RunnableScrapingSession ( String name ) (professional and enterprise editions only)
RunnableScrapingSession RunnableScrapingSession ( String name, ScrapingSession inheritedScrapingSession ) (professional and enterprise editions only)
RunnableScrapingSession RunnableScrapingSession ( String name, ScrapingSession inheritedScrapingSession, boolean inheritHttpState ) (professional and enterprise editions only)

Description

Instantiates a RunnableScrapingSession object using the name of an existing scraping session.

Parameters

  • name The name of the scraping session to be run, as a string.
  • inheritedScrapingSession (optional) Scraping session whose session variables should be copied to the new scraping session. If it is left off no session variables will be passed to the new scrape.
  • inheritHttpState (optional) Whether HTTP state information, like cookies, should be inherited. This can be important if you have logged into a site and want the runnable scraping sessions to also be logged in.

Return Values

Returns a RunnableScrapingSession. On failure an error will be thrown.

Change Log

Version Description
5.0 inheritHttpState added as optional parameter.
4.5 Available for professional and enterprise editions.

Class Location

com.screenscraper.scraper

Examples

Creating RunnableScrapingSession

 // Creates a new runnable session for the scraping session "My Session".
 myScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Creates a new runnable session for the scraping session "My Session"
 // and passes it the current scraping session from which it will inherit
 // session variables and logging.
 myScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session", session );
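
 // Creates a new runnable session for the scraping session "My Session",
 // inherits session variables from the current session, and also inherits
 // HTTP state (e.g., cookies) so the new session stays logged in.
 myScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session", session, true );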

Catching Error

 // If you renamed a scrape and are worried about someone not having the new one
 // you can use the thrown error to identify a problem that can be solved using
 // the older name
 try {
     // Attempt to create scrape using the new name
     myScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session - New" );
 } catch ( Exception error ) {
     session.logWarn( error.toString() );
     session.logWarn( "Attempting to start scrape with old name." );
     myScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );
 }

getName

String runnableScrapingSession.getName ( ) (professional and enterprise editions only)

Description

Retrieve the name of the scraping session in the runnableScrapingSession.

Parameters

This method does not receive any parameters.

Return Values

Returns a string with the name of the scraping session.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Retrieve Scrape Name

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Stores the name of the scraping session in the variable sessionName.
 sessionName = runnableScrapingSession.getName();

getTimeout

int runnableScrapingSession.getTimeout ( ) (professional and enterprise editions only)

Description

Get the timeout of the session in the runnableScrapingSession.

Parameters

This method does not receive any parameters.

Return Values

Returns an integer representing the timeout length in minutes.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Write Timeout to Log

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Outputs the value of the timeout of the runnable scraping session
 // to the log.
 session.log( "Session timeout: " + runnableScrapingSession.getTimeout() );

See Also

  • setTimeout() [RunnableScrapingSession] - Sets the timeout for the session

getVariable

Object runnableScrapingSession.getVariable ( String variableName ) (professional and enterprise editions only)

Description

Retrieve the value of a session variable. This method should be called after the scrape method has returned.

Parameters

  • variableName Name of the variable, as a string.

Return Values

Returns the value of the session variable: object, boolean, int, string, etc. If the variable doesn't exist it returns null.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Write Variable to Log

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Ensure scrape will be run before the script continues
 runnableScrapingSession.setDoLazyScrape( false );

 // Start the scrape
 runnableScrapingSession.scrape();

 // Outputs the value of the variable FOO to the log.
 session.log( "FOO: " + runnableScrapingSession.getVariable( "FOO" ) );

See Also

  • setVariable() [RunnableScrapingSession] - Sets a variable for the scraping session

scrape

void runnableScrapingSession.scrape() (professional and enterprise editions only)

Description

Run the session scraping.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

The default is for the script to continue executing without waiting for the scraping session to finish. You can use setDoLazyScrape( false ) to force the script to wait until the scrape finishes before continuing.

Examples

Start Scrape in Separate Thread

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Tells the session to start scraping.
 runnableScrapingSession.scrape();

 // Script continues execution without waiting for end of scrape

Start Scrape in Same Thread

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Turn off LazyScrape
 runnableScrapingSession.setDoLazyScrape( false );

 // Tells the session to start scraping.
 runnableScrapingSession.scrape();

 // Script halts execution until the scrape is finished

setDoLazyScrape

void runnableScrapingSession.setDoLazyScrape ( boolean doLazyScrape ) (professional and enterprise editions only)

Description

Indicate whether or not the scraping session should run concurrently with (at the same time as) other scraping sessions. The default for doLazyScrape is true.

Parameters

  • doLazyScrape If lazy (concurrent) scraping should be used, as a boolean.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

We recommend not setting this value to false! When running scraping sessions in the workbench, it will cause the interface to freeze up until sessions have completed.

If you'd like to run multiple scraping sessions serially (one after another), the best option is to set the Maximum number of concurrent running scraping sessions to 1 in the settings window.

Examples

Turn off LazyScrape

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Indicates that the runnable scraping session should not be run
 // in a separate thread.
 runnableScrapingSession.setDoLazyScrape( false );

 // Start the scrape
 runnableScrapingSession.scrape();

setTimeout

void runnableScrapingSession.setTimeout ( int timeout ) (professional and enterprise editions only)

Description

Sets the timeout of the session. That is, after the given number of minutes have passed the session will automatically terminate.

Parameters

  • timeout An integer representing the timeout length in minutes.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method must be called before scrape.

Examples

Set Scrape Timeout

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Sets the timeout of the session to 60 minutes.
 runnableScrapingSession.setTimeout( 60 );

 runnableScrapingSession.scrape();

See Also

  • getTimeout() [RunnableScrapingSession] - Retrieves the timeout for the session

setVariable

void runnableScrapingSession.setVariable ( String identifier, Object value ) (professional and enterprise editions only)

Description

Set the value of a session variable.

Parameters

  • identifier Name of the variable, as a string.
  • value What to store in the variable: object, boolean, int, string, etc.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Set Session Variable

 runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );

 // Sets the session variable "LOGIN_USERNAME" with the value
 // "my_username".
 runnableScrapingSession.setVariable( "LOGIN_USERNAME", "my_username" );

 // Start the scrape
 runnableScrapingSession.scrape();

See Also

  • getVariable() [RunnableScrapingSession] - Retrieves the value of a session variable from the scraping session

scrapeableFile

Overview

The scrapeableFile object refers to the current file being requested from a given server. It houses both the request for a file and the response, and can be manipulated to meet any necessary requirements: GET and POST parameters, referer information, cookies, FILE parameters, HTTP headers, character set, and such.

addGETHTTPParameter

void scrapeableFile.addGETHTTPParameter ( String key, String value, int sequence ) (professional and enterprise editions only)

Description

Dynamically adds a GET parameter to the URL of the current scrapeable file. If a parameter with the given sequence already exists, it will be replaced by the one created from this method call. Calling this method is the equivalent in the workbench of adding a parameter under the "Parameters" tab, and designating the type as GET. Once the scraping session is completed the original HTTP parameters (those under the "Parameters" tab in the workbench) will be restored.

Parameters

  • key The key portion of the parameter. For example, if the parameter were foo=bar, the key portion would be "foo".
  • value The value portion of the parameter. For example, if the parameter were foo=bar, the value portion would be "bar".
  • sequence The sequence of the parameter (equivalent to the value under the "Sequence" column in the workbench).

Return Values

None

Change Log

Version Description
5.5.32a Available in Professional and Enterprise editions.

Examples

Add a GET HTTP parameter to a scrapeable file

scrapeableFile.addGETHTTPParameter( "searchTerm", "LP player", 3 );

addHTTPHeader

void scrapeableFile.addHTTPHeader ( String key, String value ) (professional and enterprise editions only)

Description

Add an HTTP header to be sent along with the request.

Parameters

  • key Name of the variable, as a string.
  • value Value of the variable, as a string

Return Values

Returns void. If you are not using professional or enterprise edition it will throw an error.

Change Log

Version Description
5.0 Available for professional and enterprise edition.
4.5 Available for enterprise edition.

In certain rare cases it may be necessary to explicitly set a custom header for the POST data of an HTTP request. This may be required in cases where a site is using AJAX, and the POST payload of a request is sent as XML (e.g., using the setRequestEntity method). This method must be invoked before the HTTP request is made (e.g., in a script run "Before file is scraped").

Examples

Add AJAX header

 // In a script called "Before file is scraped"

 // Add and set AJAX-Method header to true.
 scrapeableFile.addHTTPHeader( "AJAX-Method", "true" );
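
As a fuller sketch, an XML POST payload might be paired with a matching Content-Type header. setRequestEntity is mentioned above, though the payload and header values here are illustrative:

 // In a script called "Before file is scraped"

 // Send an XML payload instead of normal POST parameters
 scrapeableFile.setRequestEntity( "<request><term>LP player</term></request>" );

 // Tell the server how to interpret the payload
 scrapeableFile.addHTTPHeader( "Content-Type", "text/xml" );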

addHTTPParameter

void scrapeableFile.addHTTPParameter ( HTTPParameter parameter )

Description

Dynamically add an HTTPParameter to the current scrapeable file.

Parameters

  • parameter HTTPParameter object.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

The HTTPParameter constructor is as follows: HTTPParameter( String key, String value, int sequence, String type ). Valid types for the constructor are GET, POST, and FILE. Calling this method will have no effect unless it's invoked before the file is scraped.

Examples

Add GET HTTP Parameter

 // This would be in a script called "Before file is scraped"

 // Create HTTP parameter "page" with a value of "3" in the first location (GET is default)
 httpParameter = new com.screenscraper.common.HTTPParameter("page", "3", 1);

 // Adds a new GET HTTP parameter to the current file.
 scrapeableFile.addHTTPParameter( httpParameter );

Add POST HTTP Parameter

 // This would be in a script called "Before file is scraped"

 // Create HTTP parameter "page" with a value of "3" in the first location
 httpParameter = new com.screenscraper.common.HTTPParameter("page", "3", 1, "POST");

 // Adds a new POST HTTP parameter to the current file.
 scrapeableFile.addHTTPParameter( httpParameter );
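
Add FILE HTTP Parameter

The FILE type can be used for file uploads. A minimal sketch, assuming the value of a FILE parameter is the path of the file to send (the parameter name and path are illustrative):

 // This would be in a script called "Before file is scraped"

 // Create FILE HTTP parameter "attachment" pointing at the file to upload
 httpParameter = new com.screenscraper.common.HTTPParameter("attachment", "C:/site_data/upload.txt", 1, "FILE");

 // Adds a new FILE HTTP parameter to the current file.
 scrapeableFile.addHTTPParameter( httpParameter );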

See Also

  • removeHTTPParameter() [scrapeableFile] - Removes an HTTP Parameter from the request that will be made by the scrapeable file
  • removeAllHTTPParameters() [scrapeableFile] - Remove all the HTTP Parameters from the request that will be made by the scrapeable file

addPOSTHTTPParameter

void scrapeableFile.addPOSTHTTPParameter ( String key, String value ) (professional and enterprise editions only)
void scrapeableFile.addPOSTHTTPParameter ( String key, String value, int sequence )(professional and enterprise editions only)

Description

Dynamically adds a POST parameter to the existing set of POST parameters. If a parameter with the given sequence already exists, it will be replaced by the one created from this method call. If the method call is used that doesn't take a sequence, the new POST parameter will carry a sequence just higher than the highest existing sequence. Calling this method is the equivalent in the workbench of adding a parameter under the "Parameters" tab, and designating the type as POST. Once the scraping session is completed the original HTTP parameters (those under the "Parameters" tab in the workbench) will be restored.

Parameters

  • key The key portion of the parameter. For example, if the parameter were foo=bar, the key portion would be "foo".
  • value The value portion of the parameter. For example, if the parameter were foo=bar, the value portion would be "bar".
  • sequence The sequence of the parameter (equivalent to the value under the "Sequence" column in the workbench).

Return Values

None

Change Log

Version Description
5.5.32a Available in Professional and Enterprise editions.

Examples

Add a POST HTTP parameter to a scrapeable file

// Adds a POST parameter to the end of the existing set.
scrapeableFile.addPOSTHTTPParameter( "EVENTTARGET", session.getv( "EVENTTARGET" ) );

// Replaces the existing POST parameter with a sequence of 2 with a new one.
scrapeableFile.addPOSTHTTPParameter( "VIEWSTATE", session.getv( "VIEWSTATE" ), 2 );

extractData

DataSet scrapeableFile.extractData ( String text, String extractorPatternName ) (professional and enterprise editions only)

Description

Manually apply an extractor pattern to a string.

Parameters

  • text The string to which the extractor pattern will be applied.
  • extractorPatternName Name of extractor pattern in the scrapeable file, as a string. Optionally the scraping session and scrapeable file where the extractor pattern can be found can be specified in the form [scraping session:][scrapeable file:]extractor pattern.

Return Values

Returns DataSet on success. Failures will be written out to the log as errors.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

An example of how to manually extract data is available.

Examples

Extract DataSet

 // Applies the "PRODUCT" extractor pattern to the text found in the
 // productDescriptionText variable. The resulting DataSet from
 // extractData is stored in the variable productData.

 DataSet productData = scrapeableFile.extractData( productDescriptionText, "PRODUCT" );

Loop Through DataRecords

 // Expanded example using the "PRODUCT" extractor pattern to the text found in the
 // productDescriptionText variable. The resulting DataSet from
 // extractData is stored in the variable myDataSet, which has multiple dataRecords.
 // Each myDataRecord has a PRICE and a PRODUCT_ID.

 myDataSet = scrapeableFile.extractData( productDescriptionText, "PRODUCT" );
 for (i = 0; i < myDataSet.getNumDataRecords(); i++) {
     myDataRecord = myDataSet.getDataRecord(i);

     session.setVariable("PRICE", myDataRecord.get("PRICE"));
     session.setVariable("PRODUCT_ID", myDataRecord.get("PRODUCT_ID"));
 }

Extractor Pattern from another Scrapeable File

 // Apply extractor pattern "PRODUCT" from "Another scrapeable file"
 // to the variable productDescriptionText

 DataSet productData = scrapeableFile.extractData( productDescriptionText, "Another scrapeable file:PRODUCT" );

Extractor Pattern from another Scraping Session

 // Apply extractor pattern "PRODUCT" from "Another scrapeable file"
 // in "Other scraping session" to the variable productDescriptionText

 DataSet productData = scrapeableFile.extractData( productDescriptionText,
                        "Other scraping session:Another scrapeable file:PRODUCT" );

extractOneValue

String scrapeableFile.extractOneValue ( String text, String extractorPatternName ) (professional and enterprise editions only)
String scrapeableFile.extractOneValue ( String text, String extractorPatternName, String extractorTokenName ) (professional and enterprise editions only)

Description

Manually retrieve the value of a single extractor token.

Parameters

  • text The string to which the extractor pattern will be applied.
  • extractorPatternName Name of extractor pattern in the scrapeable file, as a string. Optionally the scraping session and scrapeable file where the extractor pattern can be found can be specified in the form [scraping session:][scrapeable file:]extractor pattern.
  • extractorTokenName (optional) Extractor token name, as a string, whose matched value should be returned. If left off the matched value for the first extractor token in the data set will be returned.

Return Values

Returns the match from the last data record, as a string, on success. On failure it returns null and writes an error to the log.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

If you want the match from the first data record instead, you could use extractData and getDataRecord.

Examples

Extract Value

 // Applies the extractor pattern "Product Name" to the data found in
 // the variable productDescriptionText. The extracted string is
 // stored in the productName variable.
 // Returns the value found in the first token found in the extractor pattern
 // or null if no token is found.

 productName = scrapeableFile.extractOneValue( productDescriptionText, "Product Name" );

Extract Value of Specified Token

 // Applies the extractor pattern "Product Name" to the data found in
 // the variable productDescriptionText. The extracted string is
 // stored in the productName variable.
 // Returns the value found in the token "NAME" found in the extractor pattern
 // or null if no token is found.

 productName = scrapeableFile.extractOneValue( productDescriptionText, "Product Name", "NAME" );

Extractor Pattern from another Scrapeable File

 // Apply extractor pattern "Product Name" from "Another scrapeable file"
 // to the variable productDescriptionText return the first "NAME"

 String productName = scrapeableFile.extractOneValue( productDescriptionText, "Another scrapeable file:Product Name", "NAME" );

Extractor Pattern from another Scraping Session

 // Apply extractor pattern "Product Name" from "Another scrapeable file"
 // in "Other scraping session" to the variable productDescriptionText
 // return the first "NAME"

 String productName = scrapeableFile.extractOneValue( productDescriptionText,
                        "Other scraping session:Another scrapeable file:Product Name",
                        "NAME" );

getASPXValues

DataRecord scrapeableFile.getASPXValues ( boolean onlyStandard ) (professional and enterprise editions only)

Description

Gets the ASPX .NET values from the string. The standard values are __VIEWSTATE, __EVENTTARGET, __EVENTVALIDATION, and __EVENTARGUMENT. Values will be stored in the returned DataRecord as ASPX_VIEWSTATE, ASPX_EVENTTARGET, etc...

Parameters

  • onlyStandard Whether to get only the four standard tags, or to look for any tags that begin with __

Return Values

A DataRecord object with each ASPX name as ASPX_[NAME] mapped to its value. Note that when onlyStandard is false, any parameter whose name starts with __ will be returned in this DataRecord

Change Log

Version Description
5.5.26a Available in all editions.

Examples

Get the .NET values for a page

 DataRecord aspx = scrapeableFile.getASPXValues(true);
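
The returned values are often posted back with the next request. A minimal sketch; the session variable names are illustrative:

 DataRecord aspx = scrapeableFile.getASPXValues(true);

 // Store the .NET state values so the next scrapeable file can send them
 // back as POST parameters (e.g., via its "Parameters" tab or addPOSTHTTPParameter).
 session.setVariable("VIEWSTATE", aspx.get("ASPX_VIEWSTATE"));
 session.setVariable("EVENTVALIDATION", aspx.get("ASPX_EVENTVALIDATION"));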

getAuthenticationPreemptive

boolean scrapeableFile.getAuthenticationPreemptive ( )

Description

Retrieve the authentication expectation of the request.

Parameters

This method does not receive any parameters.

Return Values

Returns whether the scrapeable file expects to have to authenticate and so will send the information initially instead of waiting for the request for it, as a boolean.

Change Log

Version Description
5.0 Available for all editions.

Examples

Write Expectation Status to Log

// Log expectation of authentication
if ( scrapeableFile.getAuthenticationPreemptive() )
{
    session.log( "Expecting Authentication" );
}

See Also

getCharacterSet

String scrapeableFile.getCharacterSet ( )

Description

Get the character set being used in the page response rendering.

Parameters

This method does not receive any parameters.

Return Values

Returns the character set applied to the scraped page, as a string. If a character set has not been specified then it will default to the character set specified in the settings dialog box.

Change Log

Version Description
4.5 Available for all editions.

If you are having trouble with characters displaying incorrectly, we encourage you to read one of our FAQs on how to find a solution.

Examples

Get Character Set

 // Get the character set of the scraped page
 charSetValue = scrapeableFile.getCharacterSet();

See Also

  • setCharacterSet() [scrapeableFile] - Set the character set used to render responses for a specific scrapeable file.
  • setCharacterSet() [session] - Set the character set used to render all responses.
  • getCharacterSet() [session] - Gets the character set used to render all responses.

getContentAsString

String scrapeableFile.getContentAsString ( )

Description

Retrieve contents of the response.

Parameters

This method does not receive any parameters.

Return Values

Returns contents of the last response, as a string. If the file has not been scraped it will return an empty string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Log Response

 // In a script run "After file is scraped"

 // Sends the HTML of the current file to the log.
 session.log( scrapeableFile.getContentAsString() );

getContentType

String scrapeableFile.getContentType ( )

Description

Retrieve the POST payload type being used to interpret the page. This can be important when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.

Parameters

This method does not receive any parameters.

Return Values

Returns the content type, as a string (e.g., text/html or text/xml).

Change Log

Version Description
5.0 Available for all editions.

Examples

Write Content Type to Log

// Write to log
session.log( "Content Type: " + scrapeableFile.getContentType( "text/xml" ) );

See Also

getCurrentPOSTData

String scrapeableFile.getCurrentPOSTData ( )

Description

Retrieve the POST data.

Parameters

This method does not receive any parameters.

Return Values

Returns the POST data for the scrapeable file, as a string. If called after the file has been scraped the session variable tokens will be resolved to their values; otherwise, the tokens will simply be removed from the string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Collect POST data

 // In script called "After file is scraped"

 // Stores the POST data from the scrapeable file in the
 // currentPOSTData variable.

 currentPOSTData = scrapeableFile.getCurrentPOSTData();

getCurrentURL

String scrapeableFile.getCurrentURL ( )

Description

Get the URL of the file.

Parameters

This method does not receive any parameters.

Return Values

Returns the URL of the scrapeable file, as a string. If called after the file has been scraped the session variable tokens will be resolved to their values; otherwise, the tokens will simply be removed from the string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Collect URL

 // In script called "After file is scraped"

 // Stores the current URL in the variable currentURL.
 currentURL = scrapeableFile.getCurrentURL();

getExtractorPatternTimedOut

boolean scrapeableFile.getExtractorPatternTimedOut () (professional and enterprise editions only)

Description

Indicates whether or not the most recent extractor pattern application timed out.

Parameters

None

Return Values

  • true or false

Change Log

Version Description
5.5.36a Available in all editions.

Examples

Find out about the last extractor pattern attempt

if( scrapeableFile.getExtractorPatternTimedOut() )
{
        session.log( "Most recent extractor pattern timed out." );
}

getForceNonBinary

boolean scrapeableFile.getForceNonBinary ( )

Description

Determine whether or not the contents of this response are being forced to be recognized as non-binary.

Parameters

This method does not receive any parameters.

Return Values

Returns true if the scrapeable file is being forced to be treated as non-binary; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

Examples

Check Binary Status of File

 // Determine if the file is being forced
 // to be recognized as non-binary

 forced = scrapeableFile.getForceNonBinary();

See Also

  • setForceNonBinary() [scrapeableFile] - Sets whether or not the contents of the file are forced to be interpreted as non-binary

getHTTPResponseHeader

String scrapeableFile.getHTTPResponseHeader ( String header ) (professional and enterprise editions only)

Description

Gets the value of the named header in the response of the scrapeable file, or null if it couldn't be found.

Parameters

  • header The header name (case-insensitive) to get

Return Value

The value of the header, or null if not found

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Log the Content-Type

 session.log(scrapeableFile.getHTTPResponseHeader("Content-Type"));

getHTTPResponseHeaderSection

String scrapeableFile.getHTTPResponseHeaderSection ( ) (professional and enterprise editions only)

Description

Gets the header section of the HTTP Response

Parameters

This method takes no parameters

Return Value

A String containing the HTTP Response Headers

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Log the headers

 // Split the headers into lines
 String[] headers = scrapeableFile.getHTTPResponseHeaderSection().split("[\\r\\n]");
 for(int i = 0; i < headers.length; i++)
 {
   session.log(headers[i]);
 }

getHTTPResponseHeaders

Map<String, String> scrapeableFile.getHTTPResponseHeaders ( ) (professional and enterprise editions only)

Description

Gets the headers of the HTTP Response as a map, and returns them.

Parameters

This method takes no parameters

Return Value

A Map from header name to header value for the response headers.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Get the Content-Type header

 Map headers = scrapeableFile.getHTTPResponseHeaders();
 Iterator it = headers.keySet().iterator();
 while(it.hasNext())
 {
   String next = (String) it.next();
   if(next.equalsIgnoreCase("Content-Type"))
     session.log("Content-Type was: " + headers.get(next));
 }

getLastTidyAttemptFailed

boolean scrapeableFile.getLastTidyAttemptFailed ()

Description

Indicates whether or not the most recent attempt to tidy the HTML failed.

Parameters

None

Return Values

  • true or false

Change Log

Version Description
5.5.36a Available in all editions.

Examples

Find out about the last HTML tidy attempt

if( scrapeableFile.getLastTidyAttemptFailed() )
{
        session.log( "Most recent tidy attempt failed." );
}

getMaxRequestAttemptsReached

boolean scrapeableFile.getMaxRequestAttemptsReached () (professional and enterprise editions only)

Description

Indicates whether or not the maximum attempts to request a given scrapeable file were reached.

Parameters

None

Return Values

  • true or false

Change Log

Version Description
5.5.36a Available in all editions.

Examples

Find out about the last request attempt

if( scrapeableFile.getMaxRequestAttemptsReached() )
{
        session.log( "Maximum request attempts were reached." );
}

getMaxResponseLength

int scrapeableFile.getMaxResponseLength ( )

Description

Retrieve the kilobyte limit for information retrieved by the scrapeable file; any content beyond the limit will not be retrieved.

Parameters

This method does not receive any parameters.

Return Values

Returns the current kilobyte limit on the response, as an integer.

Change Log

Version Description
5.0 Added for professional and enterprise editions.

Examples

Log Response Size Limit

 // Log Limit
 session.log( "Max Response Length: " + scrapeableFile.getMaxResponseLength() + " KB" );

See Also

  • setMaxResponseLength() [scrapeableFile] - Sets the maximum number of kilobytes that will be retrieved by the scrapeable file

getName

String scrapeableFile.getName ( )

Description

Get the name of the scrapeable file.

Parameters

This method does not receive any parameters.

Return Values

Returns the name of the scrapeable file, as a string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Write Scrapeable File Name to Log

 // Outputs the name of the scrapeable file to the log.

 session.log( "Current scrapeable file: " + scrapeableFile.getName() );

getNonTidiedHTML

String scrapeableFile.getNonTidiedHTML ( ) (enterprise edition only)

Description

Retrieve the non-tidied HTML of the scrapeable file.

Parameters

This method does not receive any parameters.

Return Values

Returns the non-tidied contents of the scrapeable file, as a string. On failure it returns null.

Change Log

Version Description
4.5 Available for enterprise edition.

By default non-tidied HTML is not retained. For this method to return anything other than null you must use setRetainNonTidiedHTML to force the non-tidied HTML to be retained.

Examples

Write Untidied HTML to Log if Retained

 // Outputs the non-tidied HTML from the scrapeable file
 // to the log based on whether it was retained or not.

 if (scrapeableFile.getRetainNonTidiedHTML())
 {
     session.log( "Non-tidied HTML: " + scrapeableFile.getNonTidiedHTML() );
 }
 else
 {
     session.log( "The non-tidied HTML was not retained or the file has not yet been scraped." );
 }

See Also

getRedirectURLs

String[] scrapeableFile.getRedirectURLs ( ) (professional and enterprise editions only)

Description

Gets an array of strings containing the redirect URLs for the current scrapeable file request attempt.

Parameters

This method does not receive any parameters.

Return Values

Returns the array of strings; may be empty.

Change Log

Version Description
6.0.24a Available in Professional and Enterprise editions.
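
Examples

Log Redirect URLs

A minimal sketch, run in a script called "After file is scraped", that logs each redirect URL encountered for the most recent request:

 String[] redirects = scrapeableFile.getRedirectURLs();
 for( int i = 0; i < redirects.length; i++ )
 {
     session.log( "Redirected through: " + redirects[i] );
 }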

getRetainNonTidiedHTML

boolean scrapeableFile.getRetainNonTidiedHTML ( ) (enterprise edition only)

Description

Determine if the scrapeable file is set to retain non-tidied HTML.

Parameters

This method does not receive any parameters.

Return Values

Returns a boolean flag indicating whether the non-tidied contents are being retained.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Write Untidied HTML to Log if Retained

 // Outputs the non-tidied HTML from the scrapeable file
 // to the log if it was retained otherwise just a message.

 if (scrapeableFile.getRetainNonTidiedHTML())
 {
     session.log( "Non-tidied HTML: " + scrapeableFile.getNonTidiedHTML() );
 }
 else
 {
     session.log( "The non-tidied HTML was not retained or the file has not yet been scraped." );
 }

See Also

getRetryPolicy

RetryPolicy scrapeableFile.getRetryPolicy ( ) (professional and enterprise editions only)

Description

Returns the retry policy. Note that in any 'After file is scraped' scripts this is null.

Parameters

This method takes no parameters.

Return Value

The Retry Policy that will be used by this scrapeable file

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Check for a retry policy

 if(scrapeableFile.getRetryPolicy() != null)
 {
   session.log(scrapeableFile.getName() + ": Retry policy has been set for this scrapeable file.");
 }

getStatusCode

int scrapeableFile.getStatusCode ( ) (professional and enterprise editions only)

Description

Determine the HTTP status code sent by the server.

Parameters

This method does not receive any parameters.

Return Values

Returns an integer corresponding to the HTTP status code of the response.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Write warning to log on 404 error

 // Check for a 404 response (file not found).
 if( scrapeableFile.getStatusCode() == 404 )
 {
     url = scrapeableFile.getCurrentURL();
     session.log( "Warning! The server returned a 404 response for the url ( " + url + ")." );
 }

getUserAgent

String scrapeableFile.getUserAgent ( )

Description

Retrieve the name of the user agent making the request.

Parameters

This method does not receive any parameters.

Return Values

Returns the user agent, as a string.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Write User Agent to Log

 // write to log
 session.log( scrapeableFile.getUserAgent( ) );

See Also

  • setUserAgent() [scrapeableFile] - Sets the name of the user agent that will make the request

inputOutputErrorOccurred

boolean scrapeableFile.inputOutputErrorOccurred ( )

Description

Determine if an input or output error occurred when requesting the file.

Parameters

This method does not receive any parameters.

Return Values

Returns true if an error has occurred; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

This method should be run after the scrapeable file has been scraped.

Examples

End scrape on Error

 // Check for error
 if (scrapeableFile.inputOutputErrorOccurred())
 {
     // Log error occurrence
     session.log("Input/output error occurred.");
     // End scrape
     session.stopScraping();
 }

noExtractorPatternsMatched

boolean scrapeableFile.noExtractorPatternsMatched ( )

Description

Determine whether any extractor patterns associated with the scrapeable file found a match.

Parameters

This method does not receive any parameters.

Return Values

Returns a boolean corresponding to whether any extractor pattern matched in the scrapeable file.

Change Log

Version Description
4.5 Available for all editions.

Examples

Warning if no Extractor Patterns matched

 // If no patterns matched, outputs a message indicating such
 // to the session log.

 if( scrapeableFile.noExtractorPatternsMatched() )
 {
     session.log( "Warning! No extractor patterns matched." );
 }

removeAllHTTPParameters

void scrapeableFile.removeAllHTTPParameters ( ) (professional and enterprise editions only)

Description

Remove all of the HTTP parameters from the current scrapeable file.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Delete HTTP Parameters

 // Removes all of the HTTP parameters from the current scrapeable file.
 scrapeableFile.removeAllHTTPParameters();

See Also

  • removeHTTPParameter() [scrapeableFile] - Removes an HTTP Parameter from the request that will be made by the scrapeable file
  • addHTTPParameter() [scrapeableFile] - Add an HTTP Parameter to the request that will be made by the scrapeable file

removeHTTPHeader

void scrapeableFile.removeHTTPHeader ( String key ) (enterprise edition only)
void scrapeableFile.removeHTTPHeader ( String key, String value ) (enterprise edition only)

Description

Remove an HTTP header from a scrapeable file.

Parameters

  • key The name of the HTTP header to be removed, as a string.
  • value (optional) The value of the HTTP header that is to be removed, as a string. If this is left off then all headers of the specified key will be removed.

Return Values

Returns void.

Change Log

Version Description
5.0.5a Introduced for enterprise edition.

Examples

Remove All Values of a Header

// delete all cookie headers for this scrapeableFile
// this can be done on a global scale
//    using session.clearCookies
scrapeableFile.removeHTTPHeader( "Cookie" );
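
To remove only a single value of a header, the optional second parameter can be supplied. A minimal sketch, assuming an "Accept" header was previously added with more than one value:

 // Remove only the "text/html" value of the "Accept" header,
 // leaving any other values of that header intact.
 scrapeableFile.removeHTTPHeader( "Accept", "text/html" );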

See Also

  • addHTTPHeader() [scrapeableFile] - Adds an HTTP Header to the scrapeable file

removeHTTPParameter

void scrapeableFile.removeHTTPParameter ( int sequence )
void scrapeableFile.removeHTTPParameter ( String key ) (professional and enterprise editions only)

Description

Dynamically removes an HTTP parameter. The order of the remaining parameters is adjusted immediately.

Parameters

  • sequence The ordered location of the parameter.
  • key The key identifying the HTTP parameter to be removed.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.
5.5.32a Added a method call that takes a String. Available for Professional and Enterprise editions.

If calling this method more than once in the same script, when used in conjunction with the addHTTPParameter method, it is important to keep track of how the list is reordered before calling either method again.

Calling this method will have no effect unless it's invoked before the file is scraped.

This method can be used for both GET and POST parameters.

Examples

Remove HTTP parameter

 // In a script called "Before file is scraped"

 // Removes the eighth HTTP parameter from the current file.
 scrapeableFile.removeHTTPParameter( 8 );
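
As of version 5.5.32a a parameter can also be removed by its key. A minimal sketch, assuming the request carries a parameter named "sessionid" (a hypothetical name used only for illustration):

 // Removes the HTTP parameter whose key is "sessionid".
 scrapeableFile.removeHTTPParameter( "sessionid" );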

See Also

  • addHTTPParameter() [scrapeableFile] - Adds an HTTP Parameter to the request that will be made by the scrapeable file
  • removeAllHTTPParameters() [scrapeableFile] - Remove all the HTTP Parameters from the request that will be made by the scrapeable file

resequenceHTTPParameter

void scrapeableFile.resequenceHTTPParameter ( String key, int sequence ) (professional and enterprise editions only)

Description

Resequences an HTTP parameter.

Parameters

  • key The key identifying the HTTP parameter to be resequenced.
  • sequence The new sequence the parameter should have.

Return Values

None

Change Log

Version Description
5.5.32a Available in Professional and Enterprise editions.

Examples

Resequence an HTTP parameter

// Give the "VIEWSTATE" HTTP parameter a sequence of 3.
scrapeableFile.resequenceHTTPParameter( "VIEWSTATE", 3 );

resolveRelativeURL

String scrapeableFile.resolveRelativeURL ( String urlToResolve ) (professional and enterprise editions only)

Description

Resolves a relative URL to an absolute URL based on the current URL of this scrapeable file.

Parameters

  • urlToResolve Relative file path, as a string.

Return Values

Returns a string containing the complete URL to the file. On failure it will return the relative path and an error will be written to the log.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Resolve relative URL into an absolute URL

 // Assuming the URL of the current scrapeable file is
 // "https://www.screen-scraper.com/path/to/file/"
 // the method call would result in the URL
 // "https://www.screen-scraper.com/path/to/file/thisfile.php"
 // being assigned to the "fullURL" variable.

 fullURL = scrapeableFile.resolveRelativeURL( "thisfile.php" );

saveFileBeforeTidying

void scrapeableFile.saveFileBeforeTidying ( String filePath ) (professional and enterprise editions only)

Description

Write non-tidied contents of the scrapeable file response to a text file.

Parameters

  • filePath File path, as a string, where the file should be saved.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method must be called before the file is scraped.

Because the response headers are also saved in the file, the saved file will not be valid if it is anything other than a text file (e.g., images, PDFs).

Examples

Save Untidied Request and Response

 // In script called "Before file is scraped"

 // Causes the non-tidied HTML from the scrapeable file
 // to be output to the file path.

 scrapeableFile.saveFileBeforeTidying( "C:/non-tidied.html" );

saveFileOnRequest

void scrapeableFile.saveFileOnRequest ( String filePath ) (enterprise edition only)

Description

Save the file returned from a scrapeable file request.

Parameters

  • filePath Location where the file should be saved, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

This method must be called from a scrapeable file before the file is scraped. Do not call this method from a script which is invoked by other means such as after an extractor pattern match or from within another script.

It is preferable to use downloadFile; however, at times you may have to send POST parameters in order to access a file. If that is the case, you would use this method.

This method cannot save local file requests to another location.

Examples

Save requested file

 // In script called "Before file is scraped"

 // When the current file is requested it will be saved to the
 // local file system as "sample.pdf".

 scrapeableFile.saveFileOnRequest( "C:/downloaded_files/sample.pdf" );

setAuthenticationPreemptive

void scrapeableFile.setAuthenticationPreemptive ( boolean preemptiveAuthentication )

Description

Set the authentication expectation of the request.

Parameters

  • preemptiveAuthentication Whether the scrapeable file expects to have to authenticate and so will send the information initially instead of waiting for the request for it, as a boolean.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for all editions.

Examples

Set Preemptive Authentication

// Set expectation of authentication
scrapeableFile.setAuthenticationPreemptive( false );

See Also

setCharacterSet

void scrapeableFile.setCharacterSet ( String characterSet ) (professional and enterprise editions only)

Description

Set the character set used in a specific scrapeable file's response renderings. This can be particularly helpful when the page renders characters incorrectly.

Parameters

  • characterSet Java recognized character set, as a string. Java provides a list of supported character sets in its documentation.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

This method must be called before the file is scraped.

If you are having trouble with characters displaying incorrectly, we encourage you to read one of our FAQs on how to find a solution.

Examples

Set Character Set of Scrapeable File

 // In script called "Before file is scraped"

 // Sets the character set to be applied to the last response.
 scrapeableFile.setCharacterSet( "ISO-8859-1" );

See Also

  • getCharacterSet() [scrapeableFile] - Gets the character set used to render responses for a specific scrapeable file.
  • setCharacterSet() [session] - Set the character set used to render all responses.
  • getCharacterSet() [session] - Gets the character set used to render all responses.

setContentType

void scrapeableFile.setContentType ( String contentType ) (professional and enterprise editions only)

Description

Set the POST payload type. This is particularly helpful when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.

Parameters

  • contentType Desired content type of the POST payload, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method must be called before the file is scraped.

This method is usually used in connection with setRequestEntity as that method specifies the content of the POST data.

Examples

Set Content Type for XML payload in AJAX

 // In script called "Before file is scraped"

 // Sets the type of the POST entity to XML.
 scrapeableFile.setContentType( "text/xml" );

 // Set content of POST data
 scrapeableFile.setRequestEntity( "<person><name>John Smith</name></person>" );

See Also

setForceMultiPart

void scrapeableFile.setForceMultiPart ( boolean forceMultiPart ) (professional and enterprise editions only)

Description

Set content type header to multipart/form-data.

Parameters

  • forceMultiPart Boolean representing whether the request contains multipart data (e.g. images, files) as opposed to plain text. The default is false.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method must be called before the file is scraped.

Occasionally a site will expect a multi-part request when a file is not being sent in the request.

If you include a file upload parameter under the parameters tab of the scrapeable file the request will automatically be multi-part.

Examples

Specify that Request contains Files

 // In script called "Before file is scraped"

 // Will cause the request to be made as a multi-part request.
 scrapeableFile.setForceMultiPart( true );

setForceNonBinary

void scrapeableFile.setForceNonBinary ( boolean forceNonBinary )

Description

Set whether or not the contents of this response should be forced to be treated as non-binary. The default forceNonBinary value is false.

Parameters

  • forceNonBinary Whether or not the scrapeable file should be forced to be non-binary.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

This is provided in the case where screen-scraper misidentifies a non-binary file as a binary file. It doesn't happen often but is possible.

Examples

Force File to be Treated as Non-binary

 // Force file to be recognized as non-binary
 scrapeableFile.setForceNonBinary( true );

See Also

  • getForceNonBinary() [scrapeableFile] - Returns whether or not this scrapeable file response will be forced to be treated as non-binary

setForcePOST

void scrapeableFile.setForcePOST ( Boolean forcePOST ) (professional and enterprise editions only)

Description

Sets whether or not a POST request should be forced.

Parameters

  • forcePOST Whether or not a POST request should be forced, as a boolean.

Return Values

Returns void.

Change Log

Version Description
6.0.14a Available in Professional and Enterprise editions.
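
Examples

Force a POST Request

A minimal sketch, run in a script called "Before file is scraped", that forces the request to be issued as a POST:

 // Force the request to be made as a POST.
 scrapeableFile.setForcePOST( true );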

setForcedRequestType

void scrapeableFile.setForcedRequestType ( ScrapeableFile.RequestType type ) (professional and enterprise editions only)

Description

Sets the request type to use.

Parameters

  • type The type of request to issue, or null to let screen-scraper decide.

    ScrapeableFile.RequestType is an enum with the following options as values

    • GET
    • POST
    • HEAD
    • DELETE
    • OPTIONS


    If the method sets the request to one of those types, all parameters set as GET in the parameters tab will be appended to the URL (as normal) and all parameters set as POST parameters will be used to build the request entity. If there are POST values on a type that doesn't support a request entity, an exception will be thrown when the request is issued.

Return Values

Returns void.

Change Log

Version Description
6.0.55a Available in Professional and Enterprise editions.

Examples

Sets the request type

    scrapeableFile.setForcedRequestType( ScrapeableFile.RequestType.PUT );

setLastScrapedData

void scrapeableFile.setLastScrapedData ( String content ) (enterprise edition only)

Description

Overwrite the content of the "last response".

Parameters

  • content Desired new content of the last response, as a string.

Return Values

Returns void.

This method must be called from an extractor pattern before the pattern is run.

Examples

Replace new line characters with a space

newLastResponse = scrapeableFile.getContentAsString().replaceAll("\\n"," ");
scrapeableFile.setLastScrapedData( newLastResponse );

setMaxResponseLength

void scrapeableFile.setMaxResponseLength ( int maxKBytes ) (professional and enterprise editions only)

Description

Limit the amount of information retrieved by the scrapeable file. This method can be useful in cases of very large responses where the desired information is found in the first portion of the response. It can also help to make the scraping process more efficient by only downloading the needed information.

Parameters

  • maxKBytes Kilobytes to be downloaded, as an integer.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for professional and enterprise editions.

This method must be called before the file is scraped.

Examples

Limit Response Size

 // In script called "Before file is scraped"

 // Only download the first 50 KB
 scrapeableFile.setMaxResponseLength(50);

See Also

  • getMaxResponseLength() [scrapeableFile] - Returns the maximum response length that is read by the scrapeable file

setReferer

void scrapeableFile.setReferer ( String url ) (professional and enterprise editions only)

Description

Set referer HTTP header.

Parameters

  • url URL of the referer, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method must be called before the file is scraped.

Examples

Set the Referer Header

 // In script called "Before file is scraped"

 // Sets the value of url as the HTTP header
 // referer for the current scrapeable file.

 scrapeableFile.setReferer( "http://www.foo.com/" );

setRequestEntity

void scrapeableFile.setRequestEntity ( String requestEntity ) (professional and enterprise editions only)

Description

Set the POST payload data. This is particularly helpful when scraping some sites' implementations of AJAX, where the payload is explicitly set as XML.

Parameters

  • requestEntity Desired content of the POST payload, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method must be called before the file is scraped.

This method is usually used in connection with setContentType, as that method specifies the type of the POST data.

Though you can set plain text POST data using this method it is preferable to use the addHTTPParameter method for this task.

Examples

Set POST data as XML

 // In script called "Before file is scraped"

 // Sets the type of the POST entity to XML.
 scrapeableFile.setContentType( "text/xml" );

 // Set content of POST data
 scrapeableFile.setRequestEntity( "<person><name>John Smith</name></person>" );

setRetainNonTidiedHTML

void scrapeableFile.setRetainNonTidiedHTML ( boolean retainNonTidiedHTML ) (enterprise edition only)

Description

Set whether or not non-tidied HTML is to be retained for the current scrapeable file.

Parameters

  • retainNonTidiedHTML Whether the non-tidied HTML should be retained, as a boolean. The default is false.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

If, after the file is scraped, you want to be able to use getNonTidiedHTML, this method has to be called before the file is scraped.

Examples

Retain Non-tidied HTML

 // In script called "Before file is scraped"

 // Tells screen-scraper to retain non-tidied HTML for the current
 // scrapeable file.

 scrapeableFile.setRetainNonTidiedHTML( true );

See Also

setRetryPolicy

void scrapeableFile.setRetryPolicy ( RetryPolicy policy ) (professional and enterprise editions only)

Description

Sets a Retry Policy that will be run to check if a page should be re-downloaded or not. The policy will be checked after all the extractors have run, and will check for an error on the page based on a set of conditions. If the policy shows an error on the page, it can run scripts or other code to attempt to remedy the situation, and then it will rescrape the file.

The file will be re-downloaded without rerunning any of the scripts that run before the file is downloaded, and before any of the scripts marked to run after the file is scraped are executed. If any changes need to be made to session variables, headers, etc., they should be made in the script or runnable that the policy will execute. Also, the policy can specify that session variables should be restored to their previous values before the file is rescraped. If it does, they will be reset after the error-checking portion of the policy but before the policy runs the code that makes changes for the retry.

The retry policy should be set in a script run 'Before file is scraped', but can also be set by a script on an extractor pattern. If it is set on an extractor pattern, session variables will not be restored if a retry is required.

Parameters

  • policy The policy that should be run. See the RetryPolicyFactory for standard policies, or create one by implementing the RetryPolicy interface.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Set a basic retry policy

 import com.screenscraper.util.retry.RetryPolicyFactory;

 // Use a policy that will retry up to 5 times, and on each failed attempt to load
 // the page, it will execute the "Get new Proxy" script

 scrapeableFile.setRetryPolicy(RetryPolicyFactory.getBasicPolicy(5, "Get new Proxy"));

setUserAgent

void scrapeableFile.setUserAgent ( String userAgent ) (professional and enterprise editions only)

Description

Explicitly state the user agent making the request.

Parameters

  • userAgent User agent name, as a string. There are many possible user agents; a list is maintained by User-Agents.org. The default is Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322).

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method must be called before the file is scraped.

Examples

Set User Agent

 // In script called "Before file is scraped"

 // Causes screen-scraper to identify itself as Firefox
 // running on Linux.

 scrapeableFile.setUserAgent( "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826" );

See Also

  • getUserAgent() [scrapeableFile] - Returns the name of the user agent that will make the request

wasErrorOnRequest

boolean scrapeableFile.wasErrorOnRequest ( )

Description

Determine if an error occurred with the request. Errors are considered to be server timeouts as well as any status code outside of the range 200-399.

Parameters

This method does not receive any parameters.

Return Values

Returns true for server timeouts as well as any status code outside of the range 200-399; otherwise, it returns false.

Change Log

Version Description
4.5 Available for all editions.

This method must be called after the file is scraped.

If you want to know what the status code was you can use getStatusCode.

Examples

Check for Request Errors

 // In script called "After file is scraped"

 // If an error occurred when the file was requested, an error
 // message indicating such gets output to the log.

 if( scrapeableFile.wasErrorOnRequest() )
 {
     session.log( "Connection error occurred." );
 }

session

Overview

This object refers to the current scraping session that is running. To make the methods a little easier to sort through, they have been grouped into related categories, and the groups have been named to make them easier to find when needed.

Anonymization

Overview

The following methods are provided to aid you in setting up an anonymous scraping session. If you are using your own proxy server pool you will use these methods to allow screen-scraper to interact with and manage the pool. If you are using automatic anonymization then the only method you will use is currentProxyServerIsBad, as screen-scraper will manage the servers using the anonymization settings from your setup.

See an example of Anonymization via Manual Proxy Pools.

currentProxyServerIsBad

void session.currentProxyServerIsBad ( ) (professional and enterprise editions only)

Description

Remove proxy server from proxy pool. This is only used with anonymization and indicates that one server in the pool is bad and should be removed.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

If you are using automatic anonymization or manual proxy pools, a new proxy server will be created as a result of the method call.

When checking whether a request you have made is invalid, it is best not to rely on the HTTP status code (e.g., 404) alone, as the status codes are not always accurate. It is recommended that you also scrape a known string (e.g., "Not found") from the response HTML that validates the status code.

Examples

Flag Proxy Server

 // Indicates that the current proxy server is bad.
 session.currentProxyServerIsBad();
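
A sketch combining the status code with a known-string check before flagging the proxy; the "Not found" marker is only an illustration and should be replaced with a string appropriate to the target site:

 // In a script run "After file is scraped"

 // Flag the proxy only when both the status code and the
 // page content indicate a bad response.
 if( scrapeableFile.getStatusCode() == 404
     && scrapeableFile.getContentAsString().indexOf( "Not found" ) != -1 )
 {
     session.currentProxyServerIsBad();
 }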

getCurrentProxyServerFromPool

ProxyServer session.getCurrentProxyServerFromPool ( )

Description

Get the current proxy server from the proxy server pool.

Parameters

This method does not receive any parameters.

Return Values

Returns the current proxy server being used.

Change Log

Version Description
4.5 Available for all editions.

Examples

Write Proxy Server Description to Log

 // Get Proxy Server
 proxyServer = session.getCurrentProxyServerFromPool();

 // Log Server Description
 session.log( "Proxy Server: " + proxyServer.getDescription() );

getProxyServerPool

ProxyServerPool session.getProxyServerPool ( )

Description

Get the proxy server pool object that allows proxies to be cycled through.

Parameters

This method does not receive any parameters.

Return Values

Returns the proxy server pool associated with the scraping session, or null if one has not been set.

Change Log

Version Description
4.5 Available for all editions.

Examples

Check if ProxyServerPool object exists.

 // If ProxyServerPool does not exist
 // Create a new ProxyServerPool object.
 if ( session.getProxyServerPool() == null )
 {
  // The ProxyServerPool object will
  // control how screen-scraper interacts with proxy servers.
 
  proxyServerPool = new ProxyServerPool();
 
  // We give the current scraping session a reference to
  // the proxy pool. This step should ideally be done right
  // after the object is created (as in the previous step).

  session.setProxyServerPool( proxyServerPool );
 }

getTerminateProxiesOnCompletion

boolean session.getTerminateProxiesOnCompletion ( )

Description

Determine whether proxies are set to be terminated when the scrape ends.

Parameters

This method does not receive any parameters.

Return Values

Returns true if a proxy will be terminated; otherwise, it returns false.

Change Log

Version Description
5.0 Available for all editions.

Examples

Check Termination Setting

// Log whether proxies are being terminated or not
if ( session.getTerminateProxiesOnCompletion() )
{
    session.log( "Anonymous Proxies are set to be terminated with the scrape." );
}
else
{
    session.log( "Anonymous Proxies are set to continue running after the scrape is finished." );
}

See Also

getUseProxyFromPool

boolean session.getUseProxyFromPool ( )

Description

Determine whether proxies are being used from proxy pool.

Parameters

This method does not receive any parameters.

Return Values

Returns true if a proxy pool is being used; otherwise, it returns false.

Change Log

Version Description
4.5 Available for all editions.

Examples

Turn On Proxy Pool Usage If Not Running

 // Are proxies being used from a pool
 if ( !session.getUseProxyFromPool() )
 {
     session.setUseProxyFromPool( true );
 }

See Also

  • setUseProxyFromPool() [session] - Sets whether a proxy from the proxy pool should be used when making a request

setProxyServerPool

void session.setProxyServerPool ( ProxyServerPool proxyServerPool )

Description

Associate a proxy pool with a scraping session.

Parameters

  • proxyServerPool A ProxyServerPool object.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Associate Proxy Pool with Scraping Session

 // Create a new ProxyServerPool object. This object will
 // control how screen-scraper interacts with proxy servers.

 proxyServerPool = new ProxyServerPool();

 // We give the current scraping session a reference to
 // the proxy pool. This step should ideally be done right
 // after the object is created (as in the previous step).

 session.setProxyServerPool( proxyServerPool );

setTerminateProxiesOnCompletion

void session.setTerminateProxiesOnCompletion ( boolean terminateProxies )

Description

Manually set whether proxies will be terminated when the scrape ends.

Parameters

  • terminateProxies Whether proxies should be terminated at the end of the session or not, as a boolean.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for all editions.

Examples

Make Sure Proxies are Deleted on Scrape Completion

// Test
if ( session.getTerminateProxiesOnCompletion() )
{
    session.log( "Anonymous Proxies are set to be terminated with the scrape." );
}
else
{
    // Set proxies to be terminated with the scrape
    session.setTerminateProxiesOnCompletion( true );
    session.log( "Anonymous Proxies updated to be terminated with the scrape." );
}

See Also

setUseProxyFromPool

void session.setUseProxyFromPool ( boolean useProxyFromPool )

Description

Set whether proxies from the proxy server pool should be used when making scrapeable file requests.

Parameters

  • useProxyFromPool Whether proxies in the proxyServerPool should be used, as a boolean.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Anonymize Scrapeable Files

 // Create a new ProxyServerPool object. This object will
 // control how screen-scraper interacts with proxy servers.

 proxyServerPool = new ProxyServerPool();

 // We give the current scraping session a reference to
 // the proxy pool. This step should ideally be done right
 // after the object is created (as in the previous step).

 session.setProxyServerPool( proxyServerPool );

 ... Proxy Server Pool Setup ...

 // This is the switch that tells the scraping session to make
 // use of the proxy servers. Note that this can be turned on
 // and off during the course of the scrape. You may want to
 // anonymize some pages, but not others.
 session.setUseProxyFromPool( true );

See Also

  • getUseProxyFromPool() [session] - Returns whether or not a proxy from the proxy pool will be used upon making a request

External Proxy Settings

Overview

If you are already going through a proxy server, screen-scraper must be told the credentials in order to get out to the internet. These methods are all provided to manually tell screen-scraper how to get through your external proxy.

If you always go through the same external proxy you would probably want to set the credentials in screen-scraper's proxy settings so that you don't have to specify them in all of your scrapes.

getExternalNTProxyDomain

string session.getExternalNTProxyDomain ( )

Description

Retrieve the external NT proxy domain.

Parameters

This method does not receive any parameters.

Return Values

Returns the external NT domain, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Log External NT Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalNTProxyUsername( ) );
session.log( "Password: " + session.getExternalNTProxyPassword( ) );
session.log( "Domain: " + session.getExternalNTProxyDomain( ) );
session.log( "Host: " + session.getExternalNTProxyHost( ) );

See Also

getExternalNTProxyHost

string session.getExternalNTProxyHost ( )

Description

Retrieve the external NT proxy host.

Parameters

This method does not receive any parameters.

Return Values

Returns the external NT host, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Log External NT Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalNTProxyUsername( ) );
session.log( "Password: " + session.getExternalNTProxyPassword( ) );
session.log( "Domain: " + session.getExternalNTProxyDomain( ) );
session.log( "Host: " + session.getExternalNTProxyHost( ) );

See Also

getExternalNTProxyPassword

string session.getExternalNTProxyPassword ( )

Description

Retrieve the external NT proxy password.

Parameters

This method does not receive any parameters.

Return Values

Returns the external NT password, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Log External NT Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalNTProxyUsername( ) );
session.log( "Password: " + session.getExternalNTProxyPassword( ) );
session.log( "Domain: " + session.getExternalNTProxyDomain( ) );
session.log( "Host: " + session.getExternalNTProxyHost( ) );

See Also

getExternalNTProxyUsername

string session.getExternalNTProxyUsername ( )

Description

Retrieve the external NT proxy username.

Parameters

This method does not receive any parameters.

Return Values

Returns the external NT username, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Log External NT Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalNTProxyUsername( ) );
session.log( "Password: " + session.getExternalNTProxyPassword( ) );
session.log( "Domain: " + session.getExternalNTProxyDomain( ) );
session.log( "Host: " + session.getExternalNTProxyHost( ) );

See Also

getExternalProxyHost

string session.getExternalProxyHost ( )

Description

Retrieve the external proxy host.

Parameters

This method does not receive any parameters.

Return Values

Returns the external host, as a string.

Change Log

Version Description
5.0 Available for all editions.

Examples

Log External Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalProxyUsername( ) );
session.log( "Password: " + session.getExternalProxyPassword( ) );
session.log( "Host: " + session.getExternalProxyHost( ) );
session.log( "Port: " + session.getExternalProxyPort( ) );

See Also

getExternalProxyPassword

string session.getExternalProxyPassword ( )

Description

Retrieve the external proxy password.

Parameters

This method does not receive any parameters.

Return Values

Returns the external password, as a string.

Change Log

Version Description
5.0 Available for all editions.

Examples

Log External Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalProxyUsername( ) );
session.log( "Password: " + session.getExternalProxyPassword( ) );
session.log( "Host: " + session.getExternalProxyHost( ) );
session.log( "Port: " + session.getExternalProxyPort( ) );

See Also

getExternalProxyPort

string session.getExternalProxyPort ( )

Description

Retrieve the external proxy port.

Parameters

This method does not receive any parameters.

Return Values

Returns the external port, as a string.

Change Log

Version Description
5.0 Available for all editions.

Examples

Log External Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalProxyUsername( ) );
session.log( "Password: " + session.getExternalProxyPassword( ) );
session.log( "Host: " + session.getExternalProxyHost( ) );
session.log( "Port: " + session.getExternalProxyPort( ) );

See Also

getExternalProxyUsername

string session.getExternalProxyUsername ( )

Description

Retrieve the external proxy username.

Parameters

This method does not receive any parameters.

Return Values

Returns the external username, as a string.

Change Log

Version Description
5.0 Available for all editions.

Examples

Log External Proxy Settings

// Log External Proxy Settings
session.log( "Username: " + session.getExternalProxyUsername( ) );
session.log( "Password: " + session.getExternalProxyPassword( ) );
session.log( "Host: " + session.getExternalProxyHost( ) );
session.log( "Port: " + session.getExternalProxyPort( ) );

See Also

setExternalNTProxyDomain

void session.setExternalNTProxyDomain ( String domain )

Description

Manually set external NT proxy domain.

Parameters

  • domain Domain for the external NT proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external NT proxy settings.

If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.

Examples

Manually Setup External NT Proxy

 // Setup External Proxy
 session.setExternalNTProxyUsername( "guest" );
 session.setExternalNTProxyPassword( "guestPassword" );
 session.setExternalNTProxyDomain( "Group" );
 session.setExternalNTProxyHost( "proxy.domain.com" );

See Also

setExternalNTProxyHost

void session.setExternalNTProxyHost ( String host )

Description

Manually set external NT proxy host/domain.

Parameters

  • host Host/domain for the external NT proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external NT proxy settings.

If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.

Examples

Manually Setup External NT Proxy

 // Setup External Proxy
 session.setExternalNTProxyUsername( "guest" );
 session.setExternalNTProxyPassword( "guestPassword" );
 session.setExternalNTProxyDomain( "Group" );
 session.setExternalNTProxyHost( "proxy.domain.com" );

See Also

setExternalNTProxyPassword

void session.setExternalNTProxyPassword ( String password )

Description

Manually set external NT proxy password.

Parameters

  • password Password for the external NT proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external NT proxy settings.

If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.

Examples

Manually Setup External NT Proxy

 // Setup External Proxy
 session.setExternalNTProxyUsername( "guest" );
 session.setExternalNTProxyPassword( "guestPassword" );
 session.setExternalNTProxyDomain( "Group" );
 session.setExternalNTProxyHost( "proxy.domain.com" );

See Also

setExternalNTProxyUsername

void session.setExternalNTProxyUsername ( String username )

Description

Manually set external NT proxy username.

Parameters

  • username Username for the external NT proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external NT proxy settings.

If you are using NTLM (Windows NT) authentication you'll need to designate settings for both the standard external proxy as well as the external NT proxy.

Examples

Manually Setup External NT Proxy

 // Setup External Proxy
 session.setExternalNTProxyUsername( "guest" );
 session.setExternalNTProxyPassword( "guestPassword" );
 session.setExternalNTProxyDomain( "Group" );
 session.setExternalNTProxyHost( "proxy.domain.com" );

See Also

setExternalProxyHost

void session.setExternalProxyHost ( String host )

Description

Manually set external proxy host/domain.

Parameters

  • host Host/domain for the external proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external proxy settings.

Examples

Manually Setup External Proxy

 // Setup External Proxy
 session.setExternalProxyUsername( "guest" );
 session.setExternalProxyPassword( "guestPassword" );
 session.setExternalProxyHost( "proxy.domain.com" );
 session.setExternalProxyPort( "80" );

See Also

setExternalProxyPassword

void session.setExternalProxyPassword ( String password )

Description

Manually set external proxy password.

Parameters

  • password Password for the external proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external proxy settings.

Examples

Manually Setup External Proxy

 // Setup External Proxy
 session.setExternalProxyUsername( "guest" );
 session.setExternalProxyPassword( "guestPassword" );
 session.setExternalProxyHost( "proxy.domain.com" );
 session.setExternalProxyPort( "80" );

See Also

setExternalProxyPort

void session.setExternalProxyPort ( String port )

Description

Manually set external proxy port.

Parameters

  • port Port for the external proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external proxy settings.

Examples

Manually Setup External Proxy

 // Setup External Proxy
 session.setExternalProxyUsername( "guest" );
 session.setExternalProxyPassword( "guestPassword" );
 session.setExternalProxyHost( "proxy.domain.com" );
 session.setExternalProxyPort( "80" );

See Also

setExternalProxyUsername

void session.setExternalProxyUsername ( String username )

Description

Manually set external proxy username.

Parameters

  • username Username for the external proxy, as a string.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

If you are using this method in all of your scrapes, you might want to set it in screen-scraper's external proxy settings.

Examples

Manually Setup External Proxy

 // Setup External Proxy
 session.setExternalProxyUsername( "guest" );
 session.setExternalProxyPassword( "guestPassword" );
 session.setExternalProxyHost( "proxy.domain.com" );
 session.setExternalProxyPort( "80" );

See Also

Logging

Overview

Logging is a great tool for ensuring that your scrapes are working correctly, as well as for troubleshooting problems that arise. Though logging large amounts of information may slow down a scrape, the best way around this is not to remove log-writing requests but rather to lower the verbosity of the logging when running the scrape in a production environment. If you do this, know that it will make some problems harder to troubleshoot should they arise.

The variety of methods provided simply lets you log information according to its importance, as sketched after the list below.
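
A short sketch of logging at different levels, so that a production run at low verbosity still records the important messages (the message strings are only illustrations):

 // Routine detail: only visible at the most verbose log level.
 session.logDebug( "Requesting detail page..." );

 // Problems worth seeing even at a low verbosity.
 session.logError( "Failed to parse the price field." );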

See Also

  • debug() [log] - Sends a message to the log as a debug message
  • info() [log] - Sends a message to the log as an info message
  • warn() [log] - Sends a message to the log as a warning message
  • error() [log] - Sends a message to the log as an error message

getLogFileName

String session.getLogFileName ( ) (professional and enterprise editions only)

Description

Get the name of the current log file.

Parameters

This method does not receive any parameters.

Return Values

Returns the name of the log file, as a string.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method can be very helpful when screen-scraper is running in server mode and you need to track which log contains the scrape of a given record, or to track the location of errors in larger scrapes.

Examples

Get Log's File Name

 // Output the name of the log file to the session log.
 logName = session.getLogFileName();
 session.log( "Log file: " + logName );

log

void session.log ( Object message )

Description

Write message to the log.

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for all editions.

When the workbench is running, this will be found under the log tab for the scraping session. When screen-scraper is running in server mode, the message will get sent to the corresponding .log file found in screen-scraper's log folder. When screen-scraper is invoked from the command line, the message will get sent to standard out.

Examples

Write to Log

 // Sends the message to the log.
 session.log( "Inserting extracted data into the database." );

See Also

  • logDebug() [session] - Sends a message to the log as a debugging message
  • logInfo() [session] - Sends a message to the log as an informative message
  • logWarn() [session] - Sends a message to the log as a warning
  • logError() [session] - Sends a message to the log as an error message
  • log() [log] - Write message to the log

logCurrentDateAndTime

void session.logCurrentDateAndTime ( ) (professional and enterprise editions only)

Description

Write the current date and time to the log (at the most verbose level). It is formatted to be human readable.

Parameters

This method does not receive any parameters.

Return Values

Returns void. If an error occurs, an error will be thrown.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Log Date and Time

 // Output the current date and time to the log.
 session.logCurrentDateAndTime();

logCurrentTime

void session.logCurrentTime ( ) (professional and enterprise editions only)

Description

Write the current time to the log (at the most verbose level). The time is formatted to be human readable.

Parameters

This method does not receive any parameters.

Return Values

Returns void. If an error occurs, an error will be thrown.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Log Formatted Time

 // Output the current time to the log.
 session.logCurrentTime();

logDebug

void session.logDebug ( Object message ) (professional and enterprise editions only)

Description

Write message to the log, at the debug level (most verbose).

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for professional and enterprise editions.

Examples

Write to Log at Debug level

 // Sends the message to the lowest level of logging.
 session.logDebug( "Index: " + session.getVariable( "INDEX" ) );

See Also

  • log() [session] - Write message to the log
  • logInfo() [session] - Sends a message to the log as an informative message
  • logWarn() [session] - Sends a message to the log as a warning
  • logError() [session] - Sends a message to the log as an error message
  • debug() [log] - Sends a message to the log as a debug message

logElapsedRunningTime

void session.logElapsedRunningTime ( ) (professional and enterprise editions only)

Description

Write the scrape's run time to the log (at the most verbose level). It is formatted to be human readable, broken into days, hours, minutes, and seconds.

Parameters

This method does not receive any parameters.

Return Values

Returns void. If an error occurs, an error will be thrown.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Log Time the Scrape has been Running

 // Output the running time to the log.
 session.logElapsedRunningTime();

See Also

logError

void session.logError ( Object message ) (professional and enterprise editions only)

Description

Write message to the log, at the error level (least verbose).

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void. If an error occurs, an error will be thrown.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for professional and enterprise editions.

Examples

Write to Log at Error level

 // Sends the message to the highest level of logging.
 session.logError( "Error parsing date: " + session.getVariable( "DATE" ) );

See Also

  • log() [session] - Write message to the log
  • logDebug() [session] - Sends a message to the log as a debugging message
  • logInfo() [session] - Sends a message to the log as an informative message
  • logWarn() [session] - Sends a message to the log as a warning
  • error() [log] - Sends a message to the log as an error message

logInfo

void session.logInfo ( Object message ) (professional and enterprise editions only)

Description

Write message to the log, at the info level (second most verbose).

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void. If an error occurs, an error will be thrown.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for professional and enterprise editions.

Examples

Write to Log at Info level

 // Sends the message to the second lowest level of logging.
 session.logInfo( "Traversing search results pages..." );

See Also

  • log() [session] - Write message to the log
  • logDebug() [session] - Sends a message to the log as a debugging message
  • logWarn() [session] - Sends a message to the log as a warning
  • logError() [session] - Sends a message to the log as an error message
  • info() [log] - Sends a message to the log as an info message

logVariables

void session.logVariables ( ) (professional and enterprise editions only)

Description

Write all session variables to log.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

Examples

Log All Session Variables

 // Write Variables to Log
 session.logVariables();

See Also

  • breakpoint() [session] - Pause scrape and display breakpoint window.

logWarn

void session.logWarn ( Object message ) (professional and enterprise editions only)

Description

Write message to the log, at the warn level (third most verbose).

Parameters

  • message Message to be written to the log after being converted to a String using String.valueOf( message ).

Return Values

Returns void. If an error occurs, an error will be thrown.

Change Log

Version Description
5.5 Now accepts any Object as a message
4.5 Available for professional and enterprise editions.

Examples

Write to Log at Warn level

 // Sends the message to the warn level of logging.
 session.logWarn( "Warning! Received a 404 response."  );

See Also

  • log() [session] - Write message to the log
  • logDebug() [session] - Sends a message to the log as a debugging message
  • logInfo() [session] - Sends a message to the log as an informative message
  • logError() [session] - Sends a message to the log as an error message
  • warn() [log] - Sends a message to the log as a warning message

Web Interface Interactions

Overview

These methods are used in connection with the web interface of screen-scraper. They provide the interface with more detailed information regarding the state of a running scrape. If you are not running scrapes using the web interface, these methods will not be particularly helpful to you.

As the web interface is an enterprise edition feature, these methods are only available to enterprise edition users.
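
For instance, a script run "After each pattern match" might classify each record so the web interface can show new and error counts. A minimal sketch, where the "TITLE" field is a hypothetical required datum:

 // Count records missing a required field as errors; count the rest as new.
 if ( sutil.isNullOrEmptyString( dataRecord.get( "TITLE" ) ) )
 {
     session.addToNumErrorRecordsScraped( 1 );
 }
 else
 {
     session.addToNumNewRecordsScraped( 1 );
 }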

addToNumDuplicateRecordsScraped

void session.addToNumDuplicateRecordsScraped ( Object value ) (enterprise edition only)

Description

Add to the count of duplicate records scraped (as opposed to new or error records).

Parameters

  • value Value to be added to the count. Usually an integer, but if given a string (e.g. "10") it will be transformed into an integer before adding.

Return Values

Returns void.

Change Log

Version Description
7.0 Available for enterprise edition.

Examples

Record Duplicate Records Scraped

 // Adds 10 to the count of duplicate records scraped.
 session.addToNumDuplicateRecordsScraped(10);

Have the session record each time a duplicate record is found in the database

// In script called "After each pattern match"
import java.sql.PreparedStatement;
import java.sql.ResultSet;

dm = session.getv("_DM");
con = dm.getConnection();

try
{
    String sql = "SELECT id FROM table WHERE did = ?";
    PreparedStatement pstmt = con.prepareStatement(sql);
    pstmt.setString(1, String.valueOf(dataRecord.get("ID")));
    ResultSet rs = pstmt.executeQuery();
    if (rs.next())
    {
        log.log("---Already in DB");
        session.addToNumDuplicateRecordsScraped(1);
    }
    else
    {
        session.scrapeFile("Results");
    }
}
catch (Exception e)
{
    session.logError(e);
    session.setFatalErrorOccurred(true);
    session.setErrorMessage(e.getMessage());
}
finally
{
    con.close();
}

addToNumErrorRecordsScraped

void session.addToNumErrorRecordsScraped ( Object value ) (enterprise edition only)

Description

Add to the count of error records scraped (as opposed to duplicate or new records).

Parameters

  • value Value to be added to the count. Usually an integer, but if given a string (e.g. "10") it will be transformed into an integer before adding.

Return Values

Returns void.

Change Log

Version Description
7.0 Available for enterprise edition.

Examples

Record Error Records Scraped

// Adds 10 to the count of error records scraped.
session.addToNumErrorRecordsScraped(10);

Have the session record each time a dataRecord is missing a vital datum

// In script called "After each pattern match"
if (sutil.isNullOrEmptyString(dataRecord.get("VITAL_DATUM")))
{
    log.logError("Missing VITAL_DATUM");
    session.addToNumErrorRecordsScraped(1);
}

addToNumNewRecordsScraped

void session.addToNumNewRecordsScraped ( Object value ) (enterprise edition only)

Description

Add to the count of new records scraped (as opposed to duplicate or error records).

Parameters

  • value Value to be added to the count. Usually an integer, but if given a string (e.g. "10") it will be transformed into an integer before adding.

Return Values

Returns void.

Change Log

Version Description
7.0 Available for enterprise edition.

Examples

Record New Records Scraped

 // Adds 10 to the value of new records scraped.
 session.addToNumNewRecordsScraped(10);

Have the session record each time a new record is saved to the database

// In script called "After each pattern match"
dm = session.getv("_DM");
dm.addData("db_table", dataRecord);
dm.commit("db_table");
if (dm.flush())
{
        session.addToNumNewRecordsScraped(1);
}

addToNumRecordsScraped

void session.addToNumRecordsScraped ( Object value ) (enterprise edition only)

Description

Add to the count of the total number of records scraped.

Parameters

  • value Value to be added to the count. Usually an integer, but if given a string (e.g. "10") it will be transformed into an integer before adding.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Record Number of Records Scraped

 // Adds 10 to the value of the number of records scraped.
 session.addToNumRecordsScraped( 10 );

Have the session record the number of DataRecords in each DataSet

 // In script called "After file is scraped"

 // Adds number of DataRecords in DataSet
 // to the value of the number of records scraped.

 session.addToNumRecordsScraped( dataSet.getNumDataRecords() );

See Also

appendErrorMessage

void session.appendErrorMessage ( String errorMessage ) (enterprise edition only)

Description

Append an error message to any existing error messages.

Parameters

  • errorMessage Error message that should be added, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

User Specified Error

 // First set the flag indicating that an error occurred.
 session.setFatalErrorOccurred( true );

 // Append an error message.
 session.appendErrorMessage( "An error occurred in the scraping session." );

See Also

getErrorMessage

String session.getErrorMessage ( ) (enterprise edition only)

Description

Get the current error message.

Parameters

This method does not receive any parameters.

Return Values

Returns current error message, as a string.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Write Error Message to the Log

 // Output the current error message to the log.
 session.log( "Error message: " + session.getErrorMessage() );

See Also

getFatalErrorOccurred

boolean session.getFatalErrorOccurred ( ) (enterprise edition only)

Description

Determine the fatal error status of the scrape.

Parameters

This method does not receive any parameters.

Return Values

Returns whether a fatal error has occurred, as a boolean.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Write Fatal Error State to Log

 // Output the "fatal error" state to the log.
 session.log( "Fatal error occurred: " + session.getFatalErrorOccurred() );

See Also

getNumRecordsScraped

int session.getNumRecordsScraped ( ) (enterprise edition only)

Description

Get the number of records that have been scraped.

Parameters

This method does not receive any parameters.

Return Values

Returns number of records scraped, as an integer.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Write Number of Records Scraped to Log

 // Outputs the number of records that have been scraped to the log.
 session.log( "Num records scraped so far: " + session.getNumRecordsScraped() );

See Also

resetNumRecordsScraped

void session.resetNumRecordsScraped ( ) (enterprise edition only)

Description

Reset the count on the number of scraped records.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for all editions.

Examples

Reset Count

// Clear number of records scraped
session.resetNumRecordsScraped();

See Also

setErrorMessage

void session.setErrorMessage ( String errorMessage ) (enterprise edition only)

Description

Set the current error message.

Parameters

  • errorMessage Desired error message, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Specify an Error Message

 // First set the flag indicating that an error occurred.
 session.setFatalErrorOccurred( true );

 // Set an error message.
 session.setErrorMessage( "An error occurred in the scraping session." );

Web Interface Feedback

 // Append an error message without flagging it as an error.
 // This hijacks the error message so it acts more like a
 // status message. Don't hijack it if there was a fatal error.

 if ( !session.getFatalErrorOccurred() )
 {
     session.appendErrorMessage( "Scraping Page: " + session.getv( "PAGE" ) );
 }

See Also

setFatalErrorOccurred

void session.setFatalErrorOccurred ( boolean fatalErrorOccurred ) (enterprise edition only)

Description

Set the fatal error status of the scrape.

Parameters

  • fatalErrorOccurred Desired fatal error status to set, as a boolean.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Set Fatal Error Flag

 // Set the flag indicating that an error occurred.
 session.setFatalErrorOccurred( true );

See Also

setNumRecordsScraped

void session.setNumRecordsScraped ( Object value ) (enterprise edition only)

Description

Set the number of records that have been scraped.

Parameters

  • value Value to set the count of the number of records scraped.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Set the Number of Records Scraped

 // Sets the value of the number of records scraped to 10.
 session.setNumRecordsScraped( 10 );

See Also

addEventCallback

void session.addEventCallback ( EventFireTime eventTime, EventHandler callback ) (professional and enterprise editions only)
void session.addEventCallbackWithPriority ( EventFireTime eventTime, EventHandler callback, int priority ) (professional and enterprise editions only)

Description

Add a runnable that will be executed at the given time.

Note: callbacks added with session.addEventCallback run at a priority of 0.

Parameters

  • eventTime The time to execute a callback.
  • callback The callback to execute.
  • priority The priority for this callback. Lower numbers are higher priority.

Return Values

Returns void.

Change Log

Version Description
6.0.55a Introduced for pro and enterprise editions.

Examples

Sets a handler to do something after the scripts set to run at the end of the session have run.

   // Using the default method registers the callback at priority 0.
   session.addEventCallback(SessionEventFireTime.AfterEndScripts, handler);
   
   // If you need the priority to be something else (or variable), use the second method;
   // in this case the priority could still be set to 0 if you wanted to.
   session.addEventCallbackWithPriority(SessionEventFireTime.AfterEndScripts, handler, 3);

EventFireTime

EventFireTime is an interface that defines the methods a fire time must have, which allows the addEventCallback method to take different types of fire times.

A number of classes based on this interface have been defined for you, calling out the various parts of a scrape that you can add event handlers to. They are defined below.

ExtractorPatternEventFireTime

ExtractorPatternEventFireTime

Enum

  • BeforeExtractorPattern Before an extractor is applied (including before any scripts on it run). The returned value should be a boolean and indicates whether the extractor should be run or not. Any non-boolean result is the same as true. Also note that regardless of whether the extractor will be run or not, the event for after extractor pattern will still be fired.
  • AfterExtractorPatternAppliedButBeforeScripts After an extractor is applied (but before any scripts on it run, including the "after each pattern match" scripts).
  • AfterEachExtractorMatch After each match of an extractor. This will be applied before any of the "After each pattern match" scripts are applied.
  • AfterExtractorPattern After an extractor is applied (including after any scripts on it have run).

Change Log

Version Description
6.0.55a Introduced for pro and enterprise editions.

Examples

How to use the EventFireTime with the session.addEventCallback method.

    session.addEventCallback(ExtractorPatternEventFireTime.AfterEachExtractorMatch, handler);
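
As a sketch of how the BeforeExtractorPattern return value can be used, the handler below skips applying the extractor when the response looks like an error page (the "Server Error" marker is an assumption; adjust it for the target site):

    EventHandler skipOnErrorPage = new EventHandler()
    {
        public String getHandlerName()
        {
            return "Skip extractor on error pages";
        }

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            ExtractorPatternEventData epData = (ExtractorPatternEventData)data;

            // Returning false tells screen-scraper not to apply the extractor;
            // any non-boolean return is treated the same as true.
            return !epData.getScrapeableFile().getContentAsString().contains( "Server Error" );
        }
    };

    session.addEventCallback(ExtractorPatternEventFireTime.BeforeExtractorPattern, skipOnErrorPage);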

ScrapeableFileEventFireTime

ScrapeableFileEventFireTime

Enum

  • BeforeScrapeableFile Before a scrapeable file is launched (including before any scripts on it run).
  • BeforeHttpRequest Fired right before the http request (after any "before scrapeable file" scripts, and will fire each time the request is retried). If it returns a non-null String, that will be used as the response instead of issuing a request. This response will still get passed into the AfterHttpRequest event, but it will not pass through any tidying.
  • AfterHttpRequest Fired right after the http response and running tidy, if set, but before anything else happens. Returns the data that should be used as the response data.
  • AfterScrapeableFile After a scrapeable file is completed (including after any scripts on it run).
  • OnHttpRedirect* Called when a redirect will occur, and returns true if a redirect should occur or false if it should not (any non-boolean result leads to no change).

*Note: When using the Async HTTP client you will have access to the request builder from ScrapeableFileEventData.getRedirectRequestBuilder() which can be used to modify and adjust the request before it is sent. If you use the Apache HTTP client the getRedirectRequestBuilder() method will always return null.

Change Log

Version Description
6.0.55a Introduced for pro and enterprise editions.

Examples

How to use the EventFireTime with the session.addEventCallback method.

    session.addEventCallback(ScrapeableFileEventFireTime.BeforeScrapeableFile, handler);
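
As a sketch of the BeforeHttpRequest behavior, the handler below short-circuits the request with a canned response when one is available (the CACHED_RESPONSE session variable is hypothetical):

    EventHandler useCachedResponse = new EventHandler()
    {
        public String getHandlerName()
        {
            return "Use a cached response when available";
        }

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            // A non-null String is used as the response without issuing a request;
            // returning null lets the request proceed normally.
            return session.getVariable( "CACHED_RESPONSE" );
        }
    };

    session.addEventCallback(ScrapeableFileEventFireTime.BeforeHttpRequest, useCachedResponse);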

getRedirectToURL

String scrapeableFileEventData.getRedirectToURL ( )

Description

Returns the RedirectToURL value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the RedirectToURL value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the redirect URL

    public Object handleEvent(EventFireTime fireTime, ScrapeableFileEventData data) {
        String url = data.getRedirectToURL();
       
        // do something
    }

ScriptEventFireTime

ScriptEventFireTime

Enum

  • AfterScript After a script is executed
  • BeforeScript Before a script is executed
  • OnScriptEnd Run when the script finishes executing. The difference between AfterScript and this is that AfterScript fires after the script is done running, and this runs after all the developer code has run but the script engine is still active. The return value is an injected string to execute, or null (or the empty string) to do nothing aside from execute the script code.
  • OnScriptError Executes when an error occurs in a script.
  • OnScriptStart Run when the script begins to execute. The difference between BeforeScript and this is that BeforeScript fires as preparation is made to launch a script, and this runs after all the default pre-script code is executed by the script engine, but before the developer code in the script. The return value is an injected string to execute, or null (or the empty string) to do nothing aside from execute the script code.

Change Log

Version Description
6.0.55a Introduced for pro and enterprise editions.

Examples

How to use the EventFireTime with the session.addEventCallback method.

    session.addEventCallback(ScriptEventFireTime.OnScriptEnd, handler);
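
As a sketch of the OnScriptStart return value, the handler below injects a line of code to be executed before the developer code in each script (the injected log statement is just an illustration):

    EventHandler injectOnStart = new EventHandler()
    {
        public String getHandlerName()
        {
            return "Inject a log statement at script start";
        }

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            // The returned string is executed before the script's own code;
            // return null (or the empty string) to inject nothing.
            return "session.logDebug( \"A script is starting...\" );";
        }
    };

    session.addEventCallback(ScriptEventFireTime.OnScriptStart, injectOnStart);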

SessionEventFireTime

SessionEventFireTime

Enum

  • AfterEndScripts After the scrape finishes and all the scripts set to run at the end of the session have run.
  • NumRecordsSavedModified When the ScrapingSession.addToNumRecordsScraped(Object) is called, this will also be called. The returned value will be the actual value to add.
  • StopScrapingCalled When the session is stopped, either by calling the stopScraping method or clicking the stop scraping button in the workbench.
  • SessionVariableSet* Called whenever a session variable is set. This is called before the value is actually set. The variable value passed in will be the new value to be set, and the return value of the handler will be the actual value set.
  • SessionVariableRetrieved* Called whenever a session variable is retrieved. This is called after the value is retrieved. The variable value passed in will be the current value, and the return value of the handler will be the actual value returned.

*Note: Calling a setVariable or getVariable method in here WILL trigger the events for those again. Avoid infinite recursion please!

Change Log

Version Description
6.0.55a Introduced for pro and enterprise editions.

Examples

How to use the EventFireTime with the session.addEventCallback method.

    session.addEventCallback(SessionEventFireTime.AfterEndScripts, handler);
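
As a sketch of the SessionVariableSet return value (the handler's return value is what actually gets stored), the handler below trims whitespace from string values. Note that it does not call setVariable itself, which would trigger the event again:

    EventHandler trimOnSet = new EventHandler()
    {
        public String getHandlerName()
        {
            return "Trim string session variables";
        }

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            SessionEventData sessionData = (SessionEventData)data;
            Object value = sessionData.getVariableValue();

            // The value returned here is the value that will actually be set.
            if (value instanceof String)
            {
                return ((String)value).trim();
            }

            return value;
        }
    };

    session.addEventCallback(SessionEventFireTime.SessionVariableSet, trimOnSet);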

StringOperationEventFireTime

StringOperationEventFireTime

Enum

  • HttpParameterEncodeKey Called when an http parameter key (GET or POST) is encoded. The input string will be the value that is already encoded, and the return value should be the value to actually use.
  • HttpParameterEncodeValue Called when an http parameter value (GET or POST) is encoded. The input string will be the value that is already encoded, and the return value should be the value to actually use.

Change Log

Version Description
6.0.55a Introduced for pro and enterprise editions.

Examples

How to use the EventFireTime with the session.addEventCallback method.

    session.addEventCallback(StringOperationEventFireTime.HttpParameterEncodeKey, handler);
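
As a sketch of how these fire times can adjust encoding, the handler below rewrites the already-encoded value so spaces become %20 rather than + (useful for servers that expect percent-encoded spaces):

    EventHandler encodeSpaces = new EventHandler()
    {
        public String getHandlerName()
        {
            return "Encode spaces as %20 rather than +";
        }

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            StringEventData stringData = (StringEventData)data;

            // The input is already encoded; the returned value is what gets used.
            return stringData.getInput().replace( "+", "%20" );
        }
    };

    session.addEventCallback(StringOperationEventFireTime.HttpParameterEncodeValue, encodeSpaces);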

EventHandler

EventHandler EventHandler ( ) (professional and enterprise editions only)

Description

Creates an EventHandler callback object which will be called when the event triggers.

Change Log

Version Description
6.0.55a Introduced for pro and enterprise editions.

Examples

Define a handler for the session.addEventCallback to use.

    // Create an EventHandler object which will be called when the event triggers
    EventHandler handler = new EventHandler()
    {
        /**
        * Returns the name of the handler.  This method doesn't need to be implemented
        * but helps with debugging (on error executing the callback it will output this)
        */

        public String getHandlerName()
        {
            return "A test event handler";
        }

        /**
        * Processes the event, and potentially returns a useful value modifying something
        * in the internal code
        *
        * @param fireTime The fire time of the event. This helps when using the same handler
        * for multiple event times, to determine which was called
        * @param data The actual data from the event. Based on the event time this
        * will be a different type. It could be SessionEventData, ScrapeableFileEventData,
        * ScriptEventData, StringEventData, etc...  It will match the fire time class name
        *
        * @return A value indicating how to proceed (or sometimes the value is ignored)
        */

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            // While you can specifically grab any data from the data object,
            // if this is a method that has a return value that matters,
            // it's best to get it as the last return value, so that multiple
            // events can be chained together.  The input data object
            // will always have the original values for all the other getters
            Object returnValue = data.getLastReturnValue();

            // Do stuff...

            // The EventFireTime values describe in the documentation what the return
            // value will do, or says nothing about it if the value is ignored
            // If you don't intend to modify the return, always return data.getLastReturnValue();
            return returnValue;
        }
    };

getHandlerName

String getHandlerName ( )

Description

Returns the name of the handler. This method doesn't need to be implemented but helps with debugging.

Parameters

This method does not receive any parameters.

Return Values

Returns the name of the handler. This method doesn't need to be implemented but helps with debugging.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

    // Create an EventHandler object which will be called when the event triggers
    EventHandler handler = new EventHandler()
    {
        /**
         * Returns the name of the handler.  This method doesn't need to be implemented
         * but helps with debugging (on error executing the callback it will output this)
         */

        public String getHandlerName()
        {
            return "A test event handler";
        }

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            // do something
        }
    };

See Also

handleEvent

Object handleEvent ( EventFireTime fireTime, AbstractEventData data )

Description

Processes the event, and potentially returns a useful value modifying something in the internal code as defined by the EventFireTime used to launch this event.

Parameters

  • fireTime Defines the methods that a fire time must have.
  • data Allows for the accessing of various data values found within screen-scraper, depending on the class used.

Return Values

Returns a value based on which AbstractEventData class is used.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

    EventHandler handler = new EventHandler()
    {  
        public String getHandlerName()
        {
            // return something
        }

        /**
         * Processes the event, and potentially returns a useful value modifying something
         * in the internal code
         *
         * @param fireTime The fire time of the event. This helps when using the same handler
         * for multiple event times, to determine which was called
         * @param data The actual data from the event. Based on the event time this
         * will be a different type. It could be SessionEventData, ScrapeableFileEventData,
         * ScriptEventData, StringEventData, etc...  It will match the fire time class name
         *
         * @return A value indicating how to proceed (or sometimes the value is ignored)
         */

        public Object handleEvent(EventFireTime fireTime, AbstractEventData data)
        {
            // While you can specifically grab any data from the data object,
            // if this is a method that has a return value that matters,
            // it's best to get it as the last return value, so that multiple
            // events can be chained together.  The input data object
            // will always have the original values for all the other getters
            Object returnValue = data.getLastReturnValue();

            // Do stuff...

            // The EventFireTime values describe in the documentation what the return
            // value will do, or says nothing about it if the value is ignored
            // If you don't intend to modify the return, always return data.getLastReturnValue();
            return returnValue;
        }
    };

See Also

AbstractEventData

The AbstractEventData class is an abstract class which allows for the accessing of various data values found within screen-scraper. It is extended by the classes below, and it is those classes that should be used in place of AbstractEventData.

getLastReturnValue

Object getLastReturnValue ( )

Description

Returns the LastReturnValue for the object. This is the value previously returned by another callback. This can be null if no callbacks have been fired yet for this event; null is also the default return value for the given event.

Parameters

This method does not receive any parameters.

Return Values

Returns the LastReturnValue for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the Last Return Value

   // In practice AbstractEventData is just the abstract class.
   // You must actually use one of the classes that extend it.
    public Object handleEvent(EventFireTime fireTime, AbstractEventData data) {
        // While you can specifically grab any data from the data object,
        // if this is a method that has a return value that matters,
        // it's best to get it as the last return value, so that multiple
        // events can be chained together.  The input data object
        // will always have the original values for all the other getters
        Object returnValue = data.getLastReturnValue();

        // do something

        // The EventFireTime values describe in the documentation what the return
        // value will do, or says nothing about it if the value is ignored
        // If you don't intend to modify the return, always return data.getLastReturnValue();
        return returnValue;
    }

setLastReturnValue

void setLastReturnValue ( Object lastReturnValue )

Description

Sets the LastReturnValue for the object.

Parameters

  • lastReturnValue The new value for the LastReturnValue

Return Values

Returns void.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Set the Last Return Value

   // In practice AbstractEventData is just the abstract class.
   // You must actually use one of the classes that extend it.
    public Object handleEvent(EventFireTime fireTime, AbstractEventData data) {

        Object foo = null; // replace null with the value to pass along
        data.setLastReturnValue(foo);

        // do something

        // The EventFireTime values describe in the documentation what the return
        // value will do, or says nothing about it if the value is ignored
        // If you don't intend to modify the return, always return data.getLastReturnValue();
        return data.getLastReturnValue();
    }

ExtractorPatternEventData

ExtractorPatternEventData extends AbstractEventData

This contains the data for various extractor pattern operations

Inherits the following methods from AbstractEventData

See Also

extractorPatternTimedOut

boolean extractorPatternEventData.extractorPatternTimedOut ( )

Description

Returns the status of the extractor pattern timeout. Returns true if and only if the extractor pattern was applied and timed out while doing so. Otherwise it will return false.

Parameters

This method does not receive any parameters.

Return Values

Returns a boolean value representing the status of the extractor pattern timeout.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Determine if an extractor pattern has timed out.

    public Object handleEvent(EventFireTime fireTime, ExtractorPatternEventData data) {
        if (data.extractorPatternTimedOut()) {
            // do something
        }
    }

getDataRecord

DataRecord extractorPatternEventData.getDataRecord ( )

Description

Returns the DataRecord value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the DataRecord value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current DataRecord.

    public Object handleEvent(EventFireTime fireTime, ExtractorPatternEventData data) {
        DataRecord dr = data.getDataRecord();
       
        // do something
    }

getDataSet

DataSet extractorPatternEventData.getDataSet ( )

Description

Returns the DataSet value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the DataSet value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current DataSet.

    public Object handleEvent(EventFireTime fireTime, ExtractorPatternEventData data) {
        DataSet ds = data.getDataSet();
       
        // do something
    }

getExtractorPattern

ExtractorPattern extractorPatternEventData.getExtractorPattern ( )

Description

Returns the ExtractorPattern value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the ExtractorPattern value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current ExtractorPattern.

    public Object handleEvent(EventFireTime fireTime, ExtractorPatternEventData data) {
        ExtractorPattern pattern = data.getExtractorPattern();
       
        // do something
    }

getScrapeableFile

ScrapeableFile extractorPatternEventData.getScrapeableFile ( )

Description

Returns the ScrapeableFile value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the ScrapeableFile value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current ScrapeableFile.

    public Object handleEvent(EventFireTime fireTime, ExtractorPatternEventData data) {
        ScrapeableFile sf = data.getScrapeableFile();
       
        // do something
    }

getSession

ScrapingSession extractorPatternEventData.getSession ( )

Description

Returns the Session value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the Session value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current Session.

    public Object handleEvent(EventFireTime fireTime, ExtractorPatternEventData data) {
        ScrapingSession _session = data.getSession();
       
        // do something
    }

ScrapeableFileEventData

ScrapeableFileEventData extends AbstractEventData

This contains the data for various scrapeable file operations

Inherits the following methods from AbstractEventData

See Also

getHttpResponseData

String scrapeableFileEventData.getHttpResponseData ( )

Description

Returns the HttpResponseData for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the HttpResponseData for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the HttpResponseData

    public Object handleEvent(EventFireTime fireTime, ScrapeableFileEventData data) {
        String responseData = data.getHttpResponseData();
       
        // do something
    }

getRedirectRequestBuilder

ScrapingRequest.Builder scrapeableFileEventData.getRedirectRequestBuilder ( )

Description

Returns the RedirectRequestBuilder for the object. Use this to add headers, etc... for the redirect. It can be null depending on the HTTP client being used, and whether or not it supports manually playing with the redirect.

Parameters

This method does not receive any parameters.

Return Values

Returns the RedirectRequestBuilder for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the Request Builder in order to modify it.

    public Object handleEvent(EventFireTime fireTime, ScrapeableFileEventData data) {
        ScrapingRequest.Builder builder = data.getRedirectRequestBuilder();
       
        // do something
    }

getScrapeableFile

ScrapeableFile scrapeableFileEventData.getScrapeableFile ( )

Description

Returns the ScrapeableFile value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the ScrapeableFile value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current ScrapeableFile.

    public Object handleEvent(EventFireTime fireTime, ScrapeableFileEventData data) {
        ScrapeableFile sf = data.getScrapeableFile();
       
        // do something
    }

getSession

ScrapingSession scrapeableFileEventData.getSession ( )

Description

Returns the Session value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the Session value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current Session.

    public Object handleEvent(EventFireTime fireTime, ScrapeableFileEventData data) {
        ScrapingSession _session = data.getSession();
       
        // do something
    }

ScriptEventData

ScriptEventData extends AbstractEventData

This contains the data for various script operations

Inherits the following methods from AbstractEventData

See Also

getDataRecord

DataRecord scriptEventData.getDataRecord ( )

Description

Returns the DataRecord value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the DataRecord value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current DataRecord.

    public Object handleEvent(EventFireTime fireTime, ScriptEventData data) {
        DataRecord dr = data.getDataRecord();
       
        // do something
    }

getDataSet

DataSet scriptEventData.getDataSet ( )

Description

Returns the DataSet value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the DataSet value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current DataSet.

    public Object handleEvent(EventFireTime fireTime, ScriptEventData data) {
        DataSet ds = data.getDataSet();
       
        // do something
    }

getScrapeableFile

ScrapeableFile scriptEventData.getScrapeableFile ( )

Description

Returns the ScrapeableFile value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the ScrapeableFile value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current ScrapeableFile.

    public Object handleEvent(EventFireTime fireTime, ScriptEventData data) {
        ScrapeableFile sf = data.getScrapeableFile();
       
        // do something
    }

getScriptException

java.lang.Exception scriptEventData.getScriptException ( )

Description

Returns the ScriptException for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the ScriptException for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the script exception

    public Object handleEvent(EventFireTime fireTime, ScriptEventData data) {
        java.lang.Exception e = data.getScriptException();
       
        // do something
    }

getScriptName

String scriptEventData.getScriptName ( )

Description

Returns the ScriptName value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the ScriptName value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the script name

    public Object handleEvent(EventFireTime fireTime, ScriptEventData data) {
         String name = data.getScriptName();
       
        // do something
    }

getSession

ScrapingSession scriptEventData.getSession ( )

Description

Returns the Session value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the Session value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current Session.

    public Object handleEvent(EventFireTime fireTime, ScriptEventData data) {
        ScrapingSession _session = data.getSession();
       
        // do something
    }

SessionEventData

SessionEventData extends AbstractEventData

This contains the data for various session operations

Inherits the following methods from AbstractEventData

See Also

getIncrementRecordsAmount

Object sessionEventData.getIncrementRecordsAmount ( )

Description

Returns the IncrementRecordsAmount value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the IncrementRecordsAmount value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current increment records amount.

    public Object handleEvent(EventFireTime fireTime, SessionEventData data) {
        Object recordsAmt = data.getIncrementRecordsAmount();
       
        // do something
    }

getSession

ScrapingSession sessionEventData.getSession ( )

Description

Returns the Session value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the Session value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the current Session.

    public Object handleEvent(EventFireTime fireTime, SessionEventData data) {
        ScrapingSession _session = data.getSession();
       
        // do something
    }

getVariableName

String sessionEventData.getVariableName ( )

Description

Returns the VariableName value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the VariableName value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the variable name.

    public Object handleEvent(EventFireTime fireTime, SessionEventData data) {
        String name = data.getVariableName();
       
        // do something
    }

getVariableValue

Object sessionEventData.getVariableValue ( )

Description

Returns the VariableValue value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the VariableValue value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the variable value.

    public Object handleEvent(EventFireTime fireTime, SessionEventData data) {
        Object value = data.getVariableValue();
       
        // do something
    }

StringEventData

StringEventData extends AbstractEventData

This contains the data for various string operations

Inherits the following methods from AbstractEventData

See Also

getInput

String stringEventData.getInput ( )

Description

Returns the Input value for the object.

Parameters

This method does not receive any parameters.

Return Values

Returns the Input value for the object.

Change Log

Version Description
6.0.55a Available for all editions.

Examples

Get the input string.

    public Object handleEvent(EventFireTime fireTime, StringEventData data) {
        String str = data.getInput();
       
        // do something
    }

addToVariable

void session.addToVariable ( String variable, int value ) (professional and enterprise editions only)

Description

Add to the value of a session variable.

Parameters

  • variable Key of the variable, as a string.
  • value Value to be added to the variable, as an integer.

Return Values

Returns void. If the variable doesn't exist, or is not a string or integer, a message will be added to the log. If it cannot add to the variable for any other reason it will write an error to the log.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Increment Variable

 // Increments the session variable "PAGE_NUM" by one.
 session.addToVariable( "PAGE_NUM", 1 );

See Also

  • getVariable() [session] - Returns the value of a session variable
  • getv() [session] - Returns the value of a session variable (alias of getVariable)
  • setVariable() [session] - Sets the value of a session variable
  • setv() [session] - Sets the value of a session variable (alias of setVariable)

breakpoint

void session.breakpoint ( ) (professional and enterprise editions only)

Description

Pause the scrape and display the breakpoint window. If the scrape is running in server mode, logVariables will be called in place of breakpoint to avoid the break.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

Examples

Open BreakPoint Window

 // Causes the breakpoint window to be displayed.
 session.breakpoint();

clearAllSessionVariables

void session.clearAllSessionVariables ( )

Description

Remove all session variables.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Clear Session Variables

 // Clear all session variables.
 session.clearAllSessionVariables();

See Also

  • setVariable() [session] - Sets the value of a session variable
  • setv() [session] - Sets the value of a session variable (alias of setVariable)

clearCookies

void session.clearCookies ( ) (enterprise edition only)

Description

Clear stored cookies.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Clear Cookies

 // Clear all current cookies.
 session.clearCookies();

See Also

  • getCookies() [session] - Gets all the cookies currently stored by this scraping session
  • setCookie() [session] - Sets the value of a cookie

clearVariables

void session.clearVariables ( Map variables ) (professional and enterprise editions only)
void session.clearVariables ( Collection variables ) (professional and enterprise editions only)

Description

Clears the value of all session variables that match the keys in the Map. This will ignore a key of DATARECORD.

This method is provided using a Map or Collection rather than a List or Set to work more easily with the setSessionVariables method.

Parameters

  • Map The map to use when clearing the session variables.
  • Collection The collection to use when clearing the session variables.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.
5.5.43a Changed from session.removeSessionVariablesInMap to session.clearVariables.

Examples

Clear the ASPX values for a .NET site after scraping the next page

 DataRecord aspx = scrapeableFile.getASPXValues();
 
 session.setSessionVariables(aspx);
 session.scrapeFile("Next Results");
 session.clearVariables(aspx);

convertHTMLEntitiesInVariable

void session.convertHTMLEntitiesInVariable ( String variable )

Description

Decode HTML entities in a session variable.

Parameters

  • variable Session variable whose HTML Entities will be converted to characters.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

Examples

Decode HTML Entities In Variable

// Set variable
session.setv( "LOCATION", "Angela&apos;s Room" );

// Convert HTML entities
session.convertHTMLEntitiesInVariable( "LOCATION" );

// Write to Log
session.log( session.getv( "LOCATION" ) ); //logs Angela's Room

See Also

downloadFile

boolean session.downloadFile ( String url, String fileName ) (professional and enterprise editions only)
boolean session.downloadFile ( String url, String fileName, int maxNumAttempts ) (professional and enterprise editions only)
boolean session.downloadFile ( String url, String fileName, int maxNumAttempts, boolean doLazy ) (enterprise edition only)

Description

Downloads the file to the local file system.

Parameters

  • url URL reference to the desired file, as a string.
  • fileName Local file path where the file should be saved, as a string.
  • maxNumAttempts (optional) Maximum number of times the file will be requested before giving up, as an integer. Defaults to 3.
  • doLazy (optional) Whether the file should be downloaded in a separate thread, as a boolean. Defaults to false.

Return Values

Returns true on successful download of the file; otherwise it returns false.

Change Log

Version Description
4.5 Available for professional and enterprise editions. Lazy scrape only available for enterprise edition.

If the file to download requires that POST data be sent in order to get the file, use saveFileOnRequest with a scrapeable file instead.

Using this method in a script takes the place of requesting the target URL as a scrapeable file.

Examples

Download File in a Separate Thread

 // Downloads the image pointed to by the URL to the local C: drive.
 // A maximum number of 5 attempts will be made to download the file,
 // and the file will be downloaded in its own thread.

 session.downloadFile( "http://www.foo.com/imgs/puppy_image.gif", "C:/images/puppy.gif", 5, true );

executeScript

void session.executeScript ( String scriptName ) (professional and enterprise editions only)

Description

Manually start the execution of a script.

Parameters

  • scriptName Name of the script to execute, as a string. The script has to be on the same instance of screen-scraper as the scraping session.

Return Values

Returns void. If the script doesn't exist a message will be written to the log. If the called script has an error in it a warning will be written to the log.

Change Log

Version Description
5.0 Scripts called using this method are now exported with the scraping session.
4.5 Available for professional and enterprise editions.

Examples

Execute Script

 // Executes the script "My Script".
 session.executeScript( "My Script" );

executeScriptWithContext

void session.executeScriptWithContext ( String scriptName ) (professional and enterprise editions only)

Description

Executes the named script, but preserves the current context (dataRecord, scrapeableFile, etc...)

Parameters

  • scriptName The name of the script to execute.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Execute a script, but preserve the context

 // Execute the 'Do more stuff' script, but give it access to the scrapeableFile this script has access to.
 session.executeScriptWithContext("Do more stuff");

getCharacterSet

String session.getCharacterSet ( )

Description

Get the general character set being used in page response renderings.

Parameters

This method does not receive any parameters.

Return Values

Returns the character set applied to the scraping session's files, as a string. If a character set has not been specified then it will default to the character set specified in the settings dialog box.

Change Log

Version Description
4.5 Available for all editions.

If you are having trouble with characters displaying incorrectly, we encourage you to read about how to go about finding a solution using one of our FAQs.

Examples

Get Character Set

 // Get the character set of the session
 charSetValue = session.getCharacterSet();

See Also

  • setCharacterSet() [session] - Set the character set used to render all responses.
  • getCharacterSet() [scrapeableFile] - Get the character set used to render responses for a specific scrapeable file.
  • setCharacterSet() [scrapeableFile] - Set the character set used to render responses for a specific scrapeable file.

getConnectionTimeout

int session.getConnectionTimeout ( )

Description

Retrieve the timeout value for scrapeable files in the session.

Parameters

This method does not receive any parameters.

Return Values

Returns the timeout value in milliseconds, as an integer.

Change Log

Version Description
5.0.1a Introduced for all editions.

Examples

Retrieve Connection Timeout

 // set variable to connection timeout
 timeout = session.getConnectionTimeout( );

See Also

getCookies

Cookie[] session.getCookies ( )

Description

Get the current cookies.

Parameters

This method does not receive any parameters.

Return Values

Returns an array of the cookies in the session.

Change Log

Version Description
5.0 Available for all editions.

Examples

Add Cookie If Missing

// Get cookies
cookies = session.getCookies();

// Cookie Information
cookieDomain = "mydomain.com";
cookieName = "cookie_test";
cookieValue = "please_accept_for_session";

// Exists Flag
cookieExists = false;

// Loop through cookies
for (i = 0; i < cookies.length; i++) {
    cookie = cookies[i];

    // Check if this is the cookie
    if (cookie.getName().equals(cookieName) && cookie.getValue().equals(cookieValue)&&cookie.getDomain().equals(cookieDomain)) {
        //if the cookie matches then it exists
        cookieExists = true;
        // Log search status
        session.log( "+++Cookie Exists" );
        // Stop searching
        break;
    }
}

// Add cookie, if it doesn't exist
if ( !cookieExists ) {
    session.log( "+++Cookie Does NOT Exists: Setting Cookie" );
    session.setCookie( cookieDomain, cookieName, cookieValue);
}

Write Cookies to Log

// Get cookies
cookies = session.getCookies();

// Loop through Cookies
for (i = 0; i < cookies.length; i++) {
    cookie = cookies[i];

    // Write Cookie information to the Log
    session.log( "COOKIE #" + i );
    session.log( "Name: " + cookie.getName() );
    session.log( "Value: " + cookie.getValue() );
    session.log( "Path: " + cookie.getPath() );
    session.log( "Domain: " + cookie.getDomain() );
    // Only log expiration if it is set
    if (cookie.getExpiryDate() != null) {
        session.log( "Expiration: " + cookie.getExpiryDate().toString() );
    }
}

See Also

  • clearCookies() [session] - Clears all the cookies from this scraping session
  • setCookie() [session] - Sets the value of a cookie

getDebugMode

boolean session.getDebugMode ( )

Description

Checks to see if the session is currently set to run in debug mode. This is useful for developing scrapes, as enabling debug mode logs a warning message, making it easier to notice a scrape with hard-coded values used for development. It also logs a warning in the web interface or log each time the logMonitoredValues or webMessage methods are called.

Parameters

This method takes no parameters.

Return Value

True if debug mode is enabled, false otherwise.

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Set some hardcoded values to use when the scrape is being developed

 // Comment out the line below for production
 session.setDebugMode(true);
 
 if(session.getDebugMode())
 {
   session.setVariable("SEARCH_TERM", "DVDs");
   session.setVariable("USERNAME", "some user");
   session.setVariable("PASSWORD", "the password");
 }

getDefaultRetryPolicy

RetryPolicy session.getDefaultRetryPolicy ( ) (professional and enterprise editions only)

Description

Gets the default retry policy to be used by each scrapeable file when one wasn't set for it.

Parameters

This method takes no parameters

Return Value

The default retry policy, or null if there isn't one

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Check for a default RetryPolicy

 if(session.getDefaultRetryPolicy() == null)
 {
   session.logWarn("No default retry policy specified");
 }

getElapsedRunningTime

long session.getElapsedRunningTime ( ) (professional and enterprise editions only)

Description

Get how long the current session has been running.

Parameters

This method does not receive any parameters.

Return Values

Returns number of milliseconds the scrape has been running, as a long (8-byte integer).

Change Log

Version Description
4.5 Available for professional and enterprise editions.

If you would like to log the running time of the scraping session you should use logElapsedRunningTime.

Examples

Generic Scrape Timeout

 // On pagination iterator

 // Setup length to run
 timeout = 1000*60*60*24; // 1 day

 // Check how long scrape has been running
 if (session.getElapsedRunningTime() >= timeout )
 {
     session.stopScraping();
 }
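
If you only need the elapsed time written to the log, the logElapsedRunningTime method referenced above does this directly; a minimal sketch:

 // Write the elapsed running time to the log.
 session.logElapsedRunningTime();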

See Also

getLoggingLevel

int session.getLoggingLevel ( )

Description

Get the logging level of the scrape.

Parameters

This method does not receive any parameters.

Return Values

Returns the logging level, as an integer. Currently there are four levels: 1 = Debug, 2 = Info, 3 = Warn, 4 = Error.

Change Log

Version Description
5.0.1a Introduced for all editions.

Examples

Set Logging Level If Low

// get logging level
logLevel = session.getLoggingLevel();

if (logLevel < Notifiable.LEVEL_WARN )
{
    session.setLoggingLevel( Notifiable.LEVEL_WARN );
}

See Also

getMaxConcurrentFileDownloads

int session.getMaxConcurrentFileDownloads ( ) (professional and enterprise editions only)

Description

Retrieve the maximum number of concurrent file downloads being allowed.

Parameters

This methods does not receive any parameters.

Return Values

Returns the max number of concurrent file downloads allowed, as an integer.

Change Log

Version Description
5.0 Added for professional and enterprise editions.

Examples

Check Max Concurrent File Downloads

 // How many concurrent downloads are permitted
 maxConcurrentDownloads = session.getMaxConcurrentFileDownloads();

See Also

getMaxHTTPRequests

int session.getMaxHTTPRequests ( ) (professional and enterprise editions only)

Description

Retrieve the number of attempts that scrapeable files should make to get the requested page.

Parameters

This method does not receive any parameters.

Return Values

Returns the number of attempts that will be made, as an integer.

Change Log

Version Description
5.0 Available for professional and enterprise editions.

Examples

Retrieve the Retry Value

// Write retries to log
session.log( "Retries per file: " + session.getMaxHTTPRequests() );

See Also

  • setMaxHTTPRequests() [session] - Sets the number of attempts a scrapeable file will make to get the requested page

getMaxScriptsOnStack

int session.getMaxScriptsOnStack ( )

Description

Get the total number of scripts allowed on the stack before the scraping session is forcibly stopped.

Parameters

This method does not receive any parameters.

Return Values

Returns max number of scripts that can be running at a time, as an integer.

Change Log

Version Description
5.0 Added for all editions.

Examples

Check If More Scripts Can Be Run

 import java.math.*;

 // Get number of scripts (running and max)
 BigDecimal numRunningScripts = new BigDecimal(session.getNumScriptsOnStack());
 BigDecimal maxAllowedScripts = new BigDecimal(session.getMaxScriptsOnStack());

 // Calculate the percentage used (the ratio times 100)
 BigDecimal percentageUsedBD = numRunningScripts.divide(maxAllowedScripts, 2, BigDecimal.ROUND_HALF_UP).multiply(new BigDecimal(100));

 double percentageUsed = percentageUsedBD.doubleValue();

 if (percentageUsed < 90)
 {
     session.log(percentageUsed + "% of max scripts used");
 }
 else
 {
     session.logWarn("90% max scripts threshold has been reached.");
 }

See Also

getName

String session.getName ( )

Description

Get the name of the current scraping session.

Parameters

This method does not receive any parameters.

Return Values

Returns the name of the scraping session, as a string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Write Scraping Session Name to Log

 // Outputs the name of the scraping session to the log.
 session.log( "Current scraping session: " + session.getName() );

getNumScriptsOnStack

int session.getNumScriptsOnStack ( )

Description

Get the number of scripts currently running.

Parameters

This method does not receive any parameters.

Return Values

Returns number of running scripts, as an integer.

Change Log

Version Description
5.0 Added for all editions.

Examples

Check If More Scripts Can Be Run

 import java.math.*;

 // Get number of scripts (running and max)
 BigDecimal numRunningScripts = new BigDecimal(session.getNumScriptsOnStack());
 BigDecimal maxAllowedScripts = new BigDecimal(session.getMaxScriptsOnStack());

 // Calculate the percentage used (the ratio times 100)
 BigDecimal percentageUsedBD = numRunningScripts.divide(maxAllowedScripts, 2, BigDecimal.ROUND_HALF_UP).multiply(new BigDecimal(100));

 double percentageUsed = percentageUsedBD.doubleValue();

 if (percentageUsed < 90)
 {
     session.log(percentageUsed + "% of max scripts used");
 }
 else
 {
     session.logWarn("90% max scripts threshold has been reached.");
 }

See Also

getRetainNonTidiedHTML

boolean session.getRetainNonTidiedHTML ( ) (enterprise edition only)

Description

Determine whether or not non-tidied HTML is to be retained for all scrapeable files in this scraping session.

Parameters

This method does not receive any parameters.

Return Values

Returns whether non-tidied HTML is to be retained for all scrapeable files, as a boolean.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Determine if Non-tidied HTML is Being Retained

 // Logs whether non-tidied HTML will be retained
 // for all scrapeable files.

 if (session.getRetainNonTidiedHTML())
 {
     session.log( "All scrapeable files will retain non-tidied HTML" );
 }
 else
 {
     session.log( "Non-tidied HTML will not be not retained." );
 }

See Also

getScrapeableSessionID

int session.getScrapeableSessionID ( ) (enterprise edition only)

Description

Get the unique identifier for the scraping session.

Parameters

This method does not receive any parameters.

Return Values

Returns unique session id for the scraping session, as an integer.

Change Log

Version Description
5.0 Added for enterprise edition.

Examples

Retrieve Unique ID

 // Get Unique ID
 int i = session.getScrapeableSessionID();

getStartTime

long session.getStartTime ( )

Description

Retrieve the time at which the scrape started.

Parameters

This method does not receive any parameters.

Return Values

Returns the start time of the scrape in milliseconds, as a long.

Change Log

Version Description
4.5 Available for all editions.

Examples

Get Session Start Time

// Retrieves the start time and places it
// in the variable "start".

start = session.getStartTime();
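
Assuming the returned value is standard epoch milliseconds, it can be wrapped in a java.util.Date for readable logging; a minimal sketch:

 import java.util.Date;

 // Convert the start time to a readable date.
 session.log( "Scrape started at: " + new Date( session.getStartTime() ) );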

getTimeZone

TimeZone session.getTimeZone ( )

Description

Gets the current time zone of the scraping session.

Parameters

This method takes no parameters.

Return Value

The time zone this scrape is set to.

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Get the current Time Zone in use

 TimeZone currentTimeZone = session.getTimeZone();
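
The returned object is a standard java.util.TimeZone, so you can, for example, log its ID:

 // Log the ID of the session's time zone.
 session.log( "Time zone: " + session.getTimeZone().getID() );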

getVariable

Object session.getVariable ( String identifier )

Description

Retrieve the value of a saved session variable.

Parameters

  • identifier The name of the variable whose value is to be retrieved, as a string.

Return Values

Returns the value of the session variable. This will be a string unless you have used setVariable to place something other than a string into a session variable.

Change Log

Version Description
4.5 Available for all editions.

Examples

Retrieve Session Variable

 // Places the session variable "CITY_CODE" in the local
 // variable "cityCode".

 cityCode = session.getVariable( "CITY_CODE" );

See Also

  • addToVariable() [session] - Adds an integer to the value of a session variable.
  • getv() [session] - Retrieve the value of a saved session variable (alias of getVariable).
  • setv() [session] - Set the value of a session variable (alias of setVariable).
  • setVariable() [session] - Set the value of a session variable.

getv

Object session.getv ( String identifier )

Description

Retrieve the value of a saved session variable (alias of getVariable).

Parameters

  • identifier The name of the variable whose value is to be retrieved, as a string.

Return Values

Returns the value of the session variable. This will be a string unless you have used setVariable to place something other than a string into a session variable.

Change Log

Version Description
4.5 Added for all editions.

Examples

Retrieve Session Variable

 // Places the session variable "CITY_CODE" in the local
 // variable "cityCode".

 cityCode = session.getv( "CITY_CODE" );

See Also

  • addToVariable() [session] - Adds an integer to the value of a session variable.
  • getVariable() [session] - Retrieve the value of a saved session variable.
  • setv() [session] - Set the value of a session variable (alias of setVariable).
  • setVariable() [session] - Set the value of a session variable.

isRunningFromCommandLine

boolean session.isRunningFromCommandLine ( )

Description

Returns whether or not the scrape is currently running from the command line. This is a convenience method for doing something different in a script when running from the command line as opposed to other modes.

Parameters

This method does not receive any parameters.

Return Values

Returns true if and only if the scrape is currently running from the command line.

Change Log

Version Description
6.0.37a Introduced for all editions.

Examples

Check if Running from the Command Line

 if (session.isRunningFromCommandLine()) {
    // do something only done in the command line
 }

isRunningInServer

boolean session.isRunningInServer ( )

Description

Returns whether or not the scrape is currently running in the server. This is a convenience method for doing something different in a script when running in the server as opposed to other modes.

Parameters

This method does not receive any parameters.

Return Values

Returns true if and only if the scrape is currently running in the server.

Change Log

Version Description
6.0.37a Introduced for all editions.

Examples

Check if Running in the Server

 if (session.isRunningInServer()) {
    // do something only done in the server
 }

isRunningInWorkbench

boolean session.isRunningInWorkbench ( )

Description

Returns whether or not the scrape is currently running in the workbench. This is a convenience method for doing something different in a script when running in the workbench as opposed to other modes.

Parameters

This method does not receive any parameters.

Return Values

Returns true if and only if the scrape is currently running in the workbench.

Change Log

Version Description
6.0.37a Introduced for all editions.

Examples

Check if Running in the Workbench

 if (session.isRunningInWorkbench()) {
    // do something only done in workbench
 }

loadStateFromString

boolean session.loadStateFromString ( String stateXML ) (professional and enterprise editions only)

Description

Loads the state that would have been previously saved by invoking the session.saveStateToString method.

Parameters

  • stateXML A string representing session state.

Return Values

None

Change Log

Version Description
5.5.30a Available in Professional and Enterprise editions.

Examples

Load state in from a file

import org.apache.commons.io.FileUtils;
import java.io.File;

File f = new File( "session_state.xml" );
sessionState = FileUtils.readFileToString( f, session.getCharacterSet() );

session.loadStateFromString( sessionState );

loadVariables

void session.loadVariables ( String fileToReadFrom ) (enterprise edition only)

Description

Load session variables from a file.

Parameters

  • fileToReadFrom File path of the file that contains the session variables, as a string.

Return Values

Returns void. If there is a problem retrieving the file contents an I/O error will be written to the log.

Change Log

Version Description
4.5 Available for enterprise edition.

See also: saveVariables.

If you want to create your own file of session variables, the format is one name=value pair per line. Both the key and value should be URL-encoded (see the sketch after the sample file below).

Examples

Load Session Variables from File

 // Reads in variables from the file located at "C:\myvars.txt".
 // Note that a forward slash is used instead of a back slash
 // as a folder delimiter. If back slashes were used, they
 // would need to be doubled so that they're properly escaped
 // out for the script interpreter.

 session.loadVariables( "C:/myvars.txt" );

Sample Variables File

BIRTHDAY=12%2F25
NAME=Santa
AGE=Unknown
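
If you generate such a file yourself, remember to URL-encode both keys and values. A minimal sketch using standard Java (the file path is hypothetical):

 import java.io.FileWriter;
 import java.net.URLEncoder;

 // Write one URL-encoded name=value pair per line.
 FileWriter writer = new FileWriter( "C:/myvars.txt" );
 writer.write( URLEncoder.encode( "BIRTHDAY", "UTF-8" ) + "=" + URLEncoder.encode( "12/25", "UTF-8" ) + "\n" );
 writer.close();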

See Also

saveStateToString

String session.saveStateToString ( boolean saveCookies, boolean saveVariables ) (professional and enterprise editions only)

Description

Saves the current state of the scraping session to a string. An example use case for this method would be a scraping session that logs in to a site, extracts some information, and then is stopped, saving its state out to a file. A second scraping session could then be run, loading the state back in from the file, which would keep the session logged in so that other information could be obtained without logging in once again. By default the scraping session will save out information such as the URL to use as a referer. More information can be saved using the boolean flags described below.

Parameters

  • saveCookies Whether or not cookies should be saved.
  • saveVariables Whether or not session variables should be saved.

Return Values

A string representing the session state, which can later be passed to loadStateFromString.

Change Log

Version Description
5.5.30a Available in Professional and Enterprise editions.

Examples

Save out state to a file

// Put the current state in a local variable.
sessionState = session.saveStateToString( true, true );

// Write the state out to a file.
sutil.writeValueToFile( sessionState, "session_state.xml", session.getCharacterSet() );

saveVariables

void session.saveVariables ( String fileToSaveTo ) (enterprise edition only)

Description

Saves all current string and integer variables to a file.

Parameters

  • fileToSaveTo File path where the file should be saved, as a string.

Return Values

Returns void. If there is a problem retrieving the file contents an I/O error will be written to the log.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Save Session Variables to File System

 // Saves the current session variables out to C:\myvars.txt.
 // Note that a forward slash is used instead of a back slash
 // as a folder delimiter. If back slashes were used, they
 // would need to be doubled so that they're properly escaped
 // out for the script interpreter.

 session.saveVariables( "C:/myvars.txt" );

See Also

scrapeFile

void session.scrapeFile ( String scrapeableFileIdentifier )

Description

Manually scrape a scrapeable file.

Parameters

  • scrapeableFileIdentifier Name of the scrapeable file, as a string.

Return Values

Returns void. If there is a problem accessing the scrapeable file, a message will be written to the log.

Change Log

Version Description
4.5 Available for all editions.

Examples

Scrape File Manually

 // Causes the scrapeable file "Login" to be requested.
 session.scrapeFile( "Login" );

scrapeString

boolean session.scrapeString ( String scrapeableFileName, String content ) (professional and enterprise editions only)

Description

Invokes a scrapeable file using a string of content instead of a web page or local file.

Parameters

  • scrapeableFileName The scrapeable file to be invoked.
  • content The content to load.

Return Values

None

Change Log

Version Description
5.5.13a Available in professional and enterprise editions.

Examples

Invoke a scrapeable file using a string

content = session.getv( "PARTIAL_PAGE_CONTENT" );
session.scrapeString( "My Scrapeable File", content );

sendDataToClient

void session.sendDataToClient ( String key, Object value ) (enterprise edition only)

Description

Send data to the external script that initiated the scrape. This isn't currently supported by all drivers (e.g., remote scraping sessions); check the documentation for the language of the external script for more information.

Parameters

  • key Name of the information being sent, as a string.
  • value Data to be processed by external script, supported types are Strings, Integers, DataRecords, and DataSets.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

Examples

Send dataRecord to Client

 // Causes the current DataRecord object to be sent to the client
 // for processing.

 session.sendDataToClient( "MyDataRecord", dataRecord );

setCharacterSet

void session.setCharacterSet ( String characterSet )

Description

Set the general character set used in page response renderings. This can be particularly helpful when the pages render characters incorrectly.

Parameters

  • characterSet Java recognized character set, as a string. Java provides a list of supported character sets in its documentation.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

This method must be invoked before the session starts.

If you are having trouble with characters displaying incorrectly, we encourage you to read about how to find a solution in one of our FAQs.

Examples

Set Character Set of All Scrapeable Files

 // In script called "Before scraping session begins"

 // Sets the character set to be applied to the last responses
 // of all scrapeable files in session.

 session.setCharacterSet( "ISO-8859-1" );

See Also

  • getCharacterSet() [session] - Gets the character set used to render all responses.
  • getCharacterSet() [scrapeableFile] - Get the character set used to render responses to a specific scrapeable file.
  • setCharacterSet() [scrapeableFile] - Set the character set used to render responses to a specific scrapeable file.

setConnectionTimeout

void session.setConnectionTimeout ( int timeout )

Description

Set the timeout value for scrapeable files in the session.

Parameters

  • timeout The length of the timeout in seconds, as an integer.

Return Values

Returns void.

Change Log

Version Description
5.0.1a Introduced for all editions.

Examples

Set Connection Timeout

 // set connection timeout to 15 seconds
 session.setConnectionTimeout( 15 );

See Also

setCookie

void session.setCookie ( String domain, String key, String value ) (professional and enterprise editions only)

Description

Manually set a cookie in the current session state.

Parameters

  • domain The domain to which the cookie pertains, as a string.
  • key The name of the cookie, as a string.
  • value The value of the cookie, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for professional and enterprise editions.

This method should rarely be needed, as screen-scraper automatically manages cookies. In cases where cookies are set via JavaScript, however, it might be necessary.

Examples

Manually Set Cookie

 // Sets a cookie associated with "mydomain.com", using the
 // key "user" and the value "John Smith".

 session.setCookie( "mydomain.com", "user", "John Smith" );

See Also

  • clearCookies() [session] - Clear all cookies from this scraping session
  • getCookies() [session] - Gets all the cookies currently stored by this scraping session

setDebugMode

void session.setDebugMode ( boolean debugMode )

Description

Sets the debug state for the scrape. Enabling debug mode simply outputs a warning periodically while the scrape runs, to help prevent running a production scrape in debug mode.

Parameters

  • debugMode True to enable debug mode, false to disable it.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Set some hardcoded values to use when the scrape is being developed

 // Comment out the line below for production
 session.setDebugMode(true);

 if(session.getDebugMode())
 {
   session.setVariable("SEARCH_TERM", "DVDs");
   session.setVariable("USERNAME", "some user");
   session.setVariable("PASSWORD", "the password");
 }

setDefaultRetryPolicy

void session.setDefaultRetryPolicy ( RetryPolicy retryPolicy ) (professional and enterprise editions only)

Description

Sets a retry policy that will affect all files in the scrape. This policy will be used by all scrapeable files that do not have a retry policy set for them. If a retry policy was manually set for them, this one will not be used.

Parameters

  • retryPolicy The retry policy to use by default, if no other retry policy is set.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Create a default RetryPolicy

 import com.screenscraper.util.retry.RetryPolicyFactory;

 // Use a retry policy that will rotate the proxy if there was an error on request
 session.setDefaultRetryPolicy(RetryPolicyFactory.getBasicPolicy(5, "Get new proxy"));

setKeyStoreFilePath

void session.setKeyStoreFilePath ( String filePath ) (professional and enterprise editions only)

Description

Sets the path to the keystore file. Some web sites require a special type of authentication that requires the use of a keystore file. See our blog entry on Using Client Certificates for more detail. Calling this method is the equivalent of setting the corresponding value under the "Advanced" tab for the scraping session in the workbench.

Parameters

  • filePath The path to the keystore file.

Return Values

None

Change Log

Version Description
5.5.10a Available in professional and enterprise editions.

Examples

Set the path to the keystore file

// Set the path.
session.setKeyStoreFilePath( "~/key_files/my_key.crt" );

// Output the current path.
session.log( "Keystore file path is: " + session.getKeyStoreFilePath() );

setKeyStorePassword

void session.setKeyStorePassword ( String password ) (professional and enterprise editions only)

Description

Sets the password for the keystore file. Some web sites require a special type of authentication that requires the use of a keystore file. See our blog entry on Using Client Certificates for more detail. Calling this method is the equivalent of setting the corresponding value under the "Advanced" tab for the scraping session in the workbench.

Parameters

  • password The password for the keystore file.

Return Values

None

Change Log

Version Description
5.5.10a Available in professional and enterprise editions.

Examples

Set the path to the keystore file

// Set the password.
session.setKeyStorePassword( "My_password" );

// Output the current password.
session.log( "Keystore password is: " + session.getKeyStorePassword() );

setLoggingLevel

void session.setLoggingLevel ( int loggingLevel )

Description

Set the logging level of the scrape.

Parameters

  • loggingLevel Level of logging that should be used, as an integer. It works best if you use the Notifiable interface in case levels are ever changed.

Return Values

Returns void.

Change Log

Version Description
5.0.1a Introduced for all editions.

Examples

Set Logging Level

// get logging level
logLevel = session.getLoggingLevel();

if (logLevel < Notifiable.LEVEL_WARN )
{
    session.setLoggingLevel( Notifiable.LEVEL_WARN );
}

See Also

setMaxConcurrentFileDownloads

void session.setMaxConcurrentFileDownloads ( int maxConcurrentFileDownloads ) (professional and enterprise editions only)

Description

Set the maximum number of concurrent file downloads to allow.

Parameters

  • maxConcurrentFileDownloads The maximum number of downloads to allow, as an integer.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for professional and enterprise editions.

Examples

Set Max for Concurrent File Downloads

 // Limit the number of concurrent file downloads to 10
 session.setMaxConcurrentFileDownloads( 10 );

See Also

setMaxHTTPRequests

void session.setMaxHTTPRequests ( int maxAttempts ) (professional and enterprise editions only)

Description

Set the number of attempts that scrapeable files should make to get the requested page.

Parameters

  • maxAttempts The number of attempts that will be made, as an integer.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for professional and enterprise editions.

Examples

Set the Retry Value

// Set retries for files
session.setMaxHTTPRequests( 3 );

See Also

  • getMaxHTTPRequests() [session] - Returns the maximum number of attempts a scrapeable file will make to retrieve the file

setMaxScriptsOnStack

void session.setMaxScriptsOnStack ( int maxScriptsOnStack ) (enterprise edition only)

Description

Set the total number of scripts that can be running concurrently. The default value for maxScriptsOnStack is 50.

Parameters

  • maxScriptsOnStack Number of scripts to be allowed to run concurrently, as an integer.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for enterprise edition.

Before raising the number of scripts that can be on the stack, you should make sure that your scrape is not consuming more resources than it should. One thing to consider is iterating instead of recursing (see the sketch after the example below). This is discussed in more detail on our blog and in the Tips, Tricks, and Samples section of this site.

Examples

Allocate More Resources to Scrape

 // Allow for 100 scripts (instead of 50)
 session.setMaxScriptsOnStack(100);
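
As noted above, iterating rather than recursing usually removes the need to raise this limit, since pages are handled in a loop within one script instead of stacking a new script per page. A minimal sketch (the "Next Page" scrapeable file and NEXT_PAGE_URL session variable are hypothetical):

 // Iterate over result pages in a single script rather than having
 // each page's script invoke the next page recursively.
 while ( session.getVariable( "NEXT_PAGE_URL" ) != null )
 {
     session.setVariable( "URL", session.getVariable( "NEXT_PAGE_URL" ) );

     // Clear the variable so the loop ends unless the next page sets it again.
     session.setVariable( "NEXT_PAGE_URL", null );
     session.scrapeFile( "Next Page" );
 }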

See Also

setRandomizeUserAgent

void session.setRandomizeUserAgent ( boolean randomizeUserAgent ) (professional and enterprise editions only)

Description

Causes the "User-Agent" header sent by screen-scraper to be randomized. The user agent strings from which screen-scraper will select are found in the "resource\conf\user_agents.txt" file.

Parameters

  • randomizeUserAgent true or false

Return Values

None

Change Log

Version Description
5.5.34a Available in Professional and Enterprise editions.

Examples

Randomize the user-agent header

session.setRandomizeUserAgent( true );

// You can also access the current value like so:
session.log( "Randomize user agent: " + session.getRandomizeUserAgent() );

setRetainNonTidiedHTML

void session.setRetainNonTidiedHTML ( boolean retainNonTidiedHTML ) (enterprise edition only)

Description

Set whether or not non-tidied HTML is to be retained for all scrapeable files.

Parameters

  • retainNonTidiedHTML Whether the non-tidied HTML should be retained, as a boolean. The default is false.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for enterprise edition.

If you want to be able to use getNonTidiedHTML after a file is scraped, this method has to be called before the file is scraped.

Examples

Retain Non-tidied HTML

 // Tell screen-scraper to retain non-tidied HTML for all
 // scrapeable files.

 session.setRetainNonTidiedHTML( true );

See Also

setSessionVariables

void session.setSessionVariables ( Map variables ) (professional and enterprise editions only)
void session.setSessionVariables ( Map variables, boolean ignoreLowerCaseKeys ) (professional and enterprise editions only)

Description

Sets the value of all session variables that match the keys in the Map to the values in the Map. This will ignore a key of DATARECORD.

Parameters

  • variables The map to use when setting the session variables.
  • ignoreLowerCaseKeys True if keys containing lowercase characters should be ignored (e.g., A_KEy would be ignored).

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.
5.5.43a Changed from session.setSessionVariablesFromMap to session.setSessionVariables.

Examples

Set the ASPX values for a .NET site before scraping the next page

 DataRecord aspx = scrapeableFile.getASPXValues();
 
 session.setSessionVariables(aspx);
 session.scrapeFile("Next Results");

setStatusMessage

void session.setStatusMessage ( String message ) (enterprise edition only)

Description

Sets a status message to be displayed in the web interface.

Parameters

  • message The message to be set.

Return Values

None

Change Log

Version Description
5.5.32a Available in Enterprise edition.

Examples

Append a status message

if( scrapeableFile.getMaxRequestAttemptsReached() )
{
    session.setStatusMessage( "Maximum requests reached for scrapeable file: " + scrapeableFile.getName() );

    // Output the current status message.
    session.log( "Current status message: " + session.getStatusMessage() );
}

setStopScrapingOnExtractorPatternTimeout

void session.setStopScrapingOnExtractorPatternTimeout ( boolean stopScrapingOnExtractorPatternTimeout ) (professional and enterprise editions only)

Description

If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if an extractor pattern timeout occurs.

Parameters

  • stopScrapingOnExtractorPatternTimeout true or false

Return Values

None

Change Log

Version Description
5.5.36a Available in Professional and Enterprise editions.

Examples

Indicate that the scraping session should be stopped when an extractor pattern timeout occurs

session.setStopScrapingOnExtractorPatternTimeout( true );

// You can also access the current value like so:
session.log( "Stop scraping on extractor pattern timeout: " + session.getStopScrapingOnExtractorPatternTimeout() );

setStopScrapingOnMaxRequestAttemptsReached

void session.setStopScrapingOnMaxRequestAttemptsReached ( boolean stopScrapingOnMaxRequestAttemptsReached ) (professional and enterprise editions only)

Description

If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if the maximum attempts to request a file is reached.

Parameters

  • stopScrapingOnMaxRequestAttemptsReached true or false

Return Values

None

Change Log

Version Description
5.5.36a Available in Professional and Enterprise editions.

Examples

Indicate that the scraping session should be stopped if the maximum attempts to request a file is reached

session.setStopScrapingOnMaxRequestAttemptsReached( true );

// You can also access the current value like so:
session.log( "Stop scraping on max attempts reached: " + session.getStopScrapingOnMaxRequestAttemptsReached() );

setStopScrapingOnScriptError

void session.setStopScrapingOnScriptError ( boolean stopScrapingOnScriptError ) (professional and enterprise editions only)

Description

If this method is passed the value of true, it will cause screen-scraper to stop the current scraping session if a script error occurs.

Parameters

  • stopScrapingOnScriptError true or false

Return Values

None

Change Log

Version Description
5.5.36a Available in Professional and Enterprise editions.

Examples

Indicate that the scraping session should be stopped if a script error occurs

session.setStopScrapingOnScriptError( true );

// You can also access the current value like so:
session.log( "Stop scraping on script error: " + session.getStopScrapingOnScriptError() );

setTimeZone

void session.setTimeZone ( String timeZone )
void session.setTimeZone ( TimeZone timeZone )

Description

Sets the time zone that will be used when using a method that returns a time formatted as a string.

Parameters

  • timeZone The new timezone to use. If null is given, the local timezone will be used.

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Set the time zone

 session.setTimeZone("America/Denver");

setUseServerCharacterSet

void session.setUseServerCharacterSet ( boolean useServerCharacterSet ) (professional and enterprise editions only)

Description

If this method is passed the value of true, it will cause screen-scraper to utilize whatever character set is specified by the server in its "Content-Type" response header. If no such header exists, screen-scraper will default to either the character set indicated for the scraping session or the global character set (in that order).

Parameters

  • useServerCharacterSet true or false

Return Values

None

Change Log

Version Description
5.5.11a Available in professional and enterprise editions.

Examples

Indicate that the server character set should be used

session.setUseServerCharacterSet( true );

// You can also access the current value like so:
session.log( "Use server character set: " + session.getUseServerCharacterSet() );

setUserAgent

void session.setUserAgent ( String userAgent ) (professional and enterprise editions only)

Description

Sets the user agent to be used for all requests.

Parameters

  • userAgent The user agent string to use for all requests.

Return Values

None

Change Log

Version Description
5.5.23a Available in Professional and Enterprise editions.

Examples

Set the user agent

session.setUserAgent( "Opera/9.64(Windows NT 5.1; U; en) Presto/2.1.1" );

// You can also access the current value like so:
session.log( "Session user agent: " + session.getUserAgent() );

setVariable

void session.setVariable ( String identifier, Object value )

Description

Set the value of a session variable.

Parameters

  • identifier Name of the session variable, as a string.
  • value Value of the session variable. This can be any Java object, including (but not limited to) a String, DataSet, or DataRecord.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Set Session Variable

 // Sets the session variable "CITY_CODE" with the value found
 // in the first dataRecord (at index 0) pointed to by the
 // identifier "CITY_CODE".

 session.setVariable( "CITY_CODE", dataSet.get( 0, "CITY_CODE" ) );

See Also

  • addToVariable() [session] - Adds an integer to the value of a session variable.
  • getv() [session] - Retrieve the value of a saved session variable (alias of getVariable).
  • getVariable() [session] - Retrieve the value of a saved session variable.
  • setv() [session] - Set the value of a session variable (alias of setVariable).

setv

void session.setv ( String identifier, Object value )

Description

Set the value of a session variable (alias of setVariable).

Parameters

  • identifier Name of the session variable, as a string.
  • value Value of the session variable. This can be any Java object, including (but not limited to) a String, DataSet, or DataRecord.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

Examples

Set Session Variable

 // Sets the session variable "CITY_CODE" with the value found
 // in the first dataRecord (at index 0) pointed to by the
 // identifier "CITY_CODE".

 session.setv( "CITY_CODE", dataSet.get( 0, "CITY_CODE" ) );

See Also

  • addToVariable() [session] - Adds an integer to the value of a session variable.
  • getv() [session] - Retrieve the value of a saved session variable (alias of getVariable).
  • getVariable() [session] - Retrieve the value of a saved session variable.
  • setVariable() [session] - Set the value of a session variable.

shouldStopScraping

boolean session.shouldStopScraping ( )

Description

Determine if the scrape has been stopped. This can be done using the stop button in the workbench or the stop scraping button on the web interface (for enterprise users).

Parameters

This method does not receive any parameters.

Return Values

Returns true if the scrape has been requested to stop; otherwise, it returns false.

Change Log

Version Description
5.0 Added for enterprise edition.

Examples

Stop Iterator if Scrape is Stopped

 for (int i = 0; i < dataSet.getNumDataRecords(); ++i)
 {
     // Check during every iteration to see if we should exit early.
     // Without this check, the iteration would continue even
     // if the stop scraping button were pressed.
     if ( session.shouldStopScraping() )
     {
         break;
     }

     session.setVariable( "URL", dataSet.get( i, "NEXT_PAGE_URL" ) );
     session.scrapeFile( "NEXT_PAGE" );
 }

stopScraping

void session.stopScraping ( )

Description

Stop the current scraping session.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Stop Scrape on Scrapeable File Request Error

 // Stops scraping if an error response was received
 // from the server.
 if( scrapeableFile.wasErrorOnRequest() )
 {
     session.stopScraping();
 }

waitForFileDownloadsToComplete

void session.waitForFileDownloadsToComplete() (enterprise edition only)

Description

Waits for any file downloads to complete before returning. This should be used in tandem with the session.downloadFile method call that takes the "doLazy" parameter.

Parameters

None

Return Values

None

Change Log

Version Description
5.5.43a Available in Enterprise edition.

Examples

Download files concurrently and wait for them to finish

// Download five image files concurrently.
for( i = 0; i < 5; i++ )
{
        session.downloadFile( "http://www.mysite.com/images/image" + i + ".jpg", "output/image" + i + ".jpg", 5, true );
}

// Wait for all of the images to finish downloading before continuing.
session.waitForFileDownloadsToComplete();

sutil

Overview

The sutil class provides general functions used to manipulate and work with extracted data. It also allows you to get information regarding screen-scraper such as its memory usage or version.

Images

Overview

In the course of a scrape you might want to gather images associated with the other information being gathered. These methods are provided not only to download the images but also to gather size information and resize them to your desired size.

These methods are only available to enterprise edition users.
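
A minimal sketch combining these methods, downloading an image, logging its dimensions, and producing a thumbnail (the URL and file paths are hypothetical):

 // Download an image alongside the other scraped data.
 session.downloadFile( "http://www.mysite.com/images/photo.jpg", "images/photo.jpg" );

 // Log its dimensions.
 session.log( "Size: " + sutil.getImageWidth( "images/photo.jpg" ) + "x" + sutil.getImageHeight( "images/photo.jpg" ) );

 // Create a 100-pixel-wide thumbnail, keeping the original file.
 sutil.resizeImageFixWidth( "images/photo.jpg", "images/photo_thumbnail.jpg", 100, false );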

getImageHeight

int sutil.getImageHeight ( String imagePath ) (enterprise edition only)

Description

Get the height of an image.

Parameters

  • imagePath File path to the image, as a string.

Return Values

Returns the height in pixels of the image file, as an integer. If the file doesn't exist or is not an image, an error will be logged and -1 will be returned.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for enterprise edition.

Examples

Write Image Height to Log

 // Output the height of the image to the log.
 session.log( "Image height: " + sutil.getImageHeight( "C:/my_image.jpg" ) );

getImageWidth

int sutil.getImageWidth ( String imagePath ) (enterprise edition only)

Description

Get the width of an image.

Parameters

  • imagePath File path to the image, as a string.

Return Values

Returns the width in pixels of the image file, as an integer. If the file doesn't exist or is not an image, an error will be logged and -1 will be returned.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for enterprise edition.

Examples

Write Image Width to Log

 // Output the width of the image to the log.
 session.log( "Image height: " + sutil.getImageWidth( "C:/my_image.jpg" ) );

resizeImage

Overview

Internally, only one function is used to resize all images; however, to facilitate the resizing of images, we have provided you with three methods. Each method will help you specify what measurement is most important (width or height) and whether the image should retain its aspect ratio.

  1. resizeImageFixHeight() [sutil] - Resize image, retaining aspect ratio, based on specified height.
  2. resizeImageFixWidth() [sutil] - Resize image, retaining aspect ratio, based on specified width.
  3. resizeImageFixWidthAndHeight() [sutil] - Resize image to a specified size (will not check aspect ratio).

resizeImageFixHeight

void sutil.resizeImageFixHeight ( String originalFile, String newFile, int newHeightSize, boolean deleteOriginalFile ) (enterprise edition only)

Description

Resize image, retaining aspect ratio, based on specified height.

Parameters

  • originalFile File path of the image to be resized, as a string.
  • newFile File path where the new image should be created, as a string.
  • newHeightSize The height of the resized image in pixels, as an integer.
  • deleteOriginalFile Whether the original file should be deleted after resizing, as a boolean.

Return Values

Returns void. If an error is encountered it will be thrown.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for enterprise edition.

Examples

Resize Image to Specified Height

 // Resizes a JPG to 100 pixels high, maintaining the
 // aspect ratio. After the image is resized, the original
 // will be deleted.

 sutil.resizeImageFixHeight( "C:/my_image.jpg", "C:/my_image_thumbnail.jpg", 100, true );

resizeImageFixWidth

void sutil.resizeImageFixWidth ( String originalFile, String newFile, int newWidthSize, boolean deleteOriginalFile ) (enterprise edition only)

Description

Resize image, retaining aspect ratio, based on specified width.

Parameters

  • originalFile File path of the image to be resized, as a string.
  • newFile File path where the new image should be created, as a string.
  • newWidthSize The width of the resized image in pixels, as an integer.
  • deleteOriginalFile Whether the original file should be deleted after resizing, as a boolean.

Return Values

Returns void. If an error is encountered it will be thrown.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for enterprise edition.

Examples

Resize Image to Specified Width

 // Resizes a JPG to 100 pixels wide, maintaining the
 // aspect ratio. After the image is resized, the original
 // will be deleted.

 sutil.resizeImageFixWidth( "C:/my_image.jpg", "C:/my_image_thumbnail.jpg", 100, true );

resizeImageFixWidthAndHeight

void sutil.resizeImageFixWidthAndHeight ( String originalFile, String newFile, int newWidthSize, int newHeightSize, boolean deleteOriginalFile ) (enterprise edition only)

Description

Resize image to a specified size.

Parameters

  • originalFile File path of the image to be resized, as a string.
  • newFile File path where the new image should be created, as a string.
  • newWidthSize The width of the resized image in pixels, as an integer.
  • newHeightSize The height of the resized image in pixels, as an integer.
  • deleteOriginalFile Whether the original file should be deleted after resizing, as a boolean.

Return Values

Returns void. If an error is encountered it will be thrown.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for enterprise edition.

This method can distort the image if the aspect ratios of the original and target sizes differ.

Examples

Resize Image to Specified Size

 // Resizes a JPG to 100x100 pixels.
 // After the image is resized, the original
 // will be deleted.

 sutil.resizeImageFixWidthAndHeight( "C:/my_image.jpg", "C:/my_image_thumbnail.jpg", 100, 100, true );

DecodedImage

Overview

To be used in conjunction with the ImageDecoder class.

This class represents decoded images. The objects can be queried for the text that was in the image, as well as any error that occurred while the image was being decoded. When the returned text is incorrect, there is a method that can be used to report it as bad. This can be used for sites like decaptcher.com, where refunds are given for incorrectly interpreted images.

getError

String getError ( )

Description

Gets any error message, or returns null if there was no error

Parameters

This method takes no parameters

Return Value

The error message returned

Error messages

  • OK Nothing went wrong
  • BALANCE_ERROR Insufficient funds with paid service
  • NETWORK_ERROR General network error (timeout, lost connection, server busy, etc...)
  • INVALID_LOGIN Credentials are invalid
  • GENERAL_ERROR General error, possibly image was bad or the site couldn't resolve it. See the error message for details
  • UNKNOWN Unknown error

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Convert an image to text

 import com.screenscraper.util.images.*;

 // Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
 ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
 DecodedImage result = decoder.decodeFile("someFile.jpg");

 if(result.wasError())
 {
   session.logWarn("Error converting image to text: " + result.getError());
 }
 else
 {
   session.log("Decoded Text: " + result.getResult());
 }

 // If the result was bad
 result.reportAsBad();

getResult

Object getResult ( )

Description

Gets the result from decoding the image. Most likely this will be a String, but each implementation could return a specific object type.

Parameters

This method takes no parameters

Return Value

The text extracted from the image, or null if there was an error

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Convert an image to text

 import com.screenscraper.util.images.*;
 
 // Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
 ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
 DecodedImage result = decoder.decodeFile("someFile.jpg");
 
 if(result.wasError())
 {
   session.logWarn("Error converting image to text: " + result.getError());
 }
 else
 {
   session.log("Decoded Text: " + result.getResult());
 }

 // If the result was bad
 result.reportAsBad();

reportAsBad

void reportAsBad ( )

Description

Handles an incorrectly resolved image. Some types of decoders won't have anything here

Parameters

This method takes no parameters

Return Value

This method returns void.

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Convert an image to text

 import com.screenscraper.util.images.*;

 // Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
 ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
 DecodedImage result = decoder.decodeFile("someFile.jpg");

 if(result.wasError())
 {
   session.logWarn("Error converting image to text: " + result.getError());
 }
 else
 {
   session.log("Decoded Text: " + result.getResult());
 }

 // If the result was bad
 result.reportAsBad();

wasError

boolean wasError ( )

Description

Returns true if there was an error, false otherwise. Also returns false if the image has not been resolved yet.

Parameters

This method takes no parameters

Return Value

True if there was an error, false otherwise

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Convert an image to text

 import com.screenscraper.util.images.*;

 // Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
 ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
 DecodedImage result = decoder.decodeFile("someFile.jpg");

 if(result.wasError())
 {
   session.logWarn("Error converting image to text: " + result.getError());
 }
 else
 {
   session.log("Decoded Text: " + result.getResult());
 }

 // If the result was bad
 result.reportAsBad();

ImageDecoder

Overview

Class to convert images to text for interacting with CAPTCHA challenges. There are currently two implementations:

  • ManualDecoder: Creates a pop-up window for a user to enter in the text they read from the image
  • DecaptcherDecoder: Interface for the paid service decaptcher.com

When a reference to an image is passed to an instance of this class, it returns a DecodedImage object that can be queried for the resulting text, errors, and can report an image as poorly converted.

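A minimal end-to-end sketch (using the ManualDecoder described below; the file name is hypothetical):

import com.screenscraper.util.images.*;

// Create a decoder that prompts the user for the text in the image.
ImageDecoder decoder = new ManualDecoder( session );

// Decode an image file and use the result if there was no error.
DecodedImage image = decoder.decodeFile( "captcha.jpg" );
if ( !image.wasError() )
{
    session.setVariable( "CAPTCHA_TEXT", image.getResult() );
}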

DecaptcherDecoder

DecaptcherDecoder (ScrapingSession session, String username, String password, int port)
DecaptcherDecoder (ScrapingSession session, String username, String password, String port)
DecaptcherDecoder (ScrapingSession session, String username, String password, String port, String apiUrl)
DecaptcherDecoder (ScrapingSession session, String username, String password, int port, String apiUrl)

Description

Requires an account with decaptcher.com.

Type of ImageDecoder in the com.screenscraper.util.images package that uses the decaptcher.com service to convert images to text. The available constructors are listed above.

Parameters

  • session Name of currently running scraping session.
  • username Username used to log in to decaptcher.com service.
  • password Password used to log in to decaptcher.com service.
  • port The port given by De-captcher.com to access your account on their site.
  • apiUrl (optional) URL used to access decaptcher.com service. This setting will override the default URL.

Return Values

Returns void. If it runs into any problems accessing the decaptcher.com service an error will be thrown.

Change Log

Version Description
5.5.29a Available in all editions
5.5.40a Added the port parameter. The service now requires the correct port in order to authenticate.

Examples

Initialization script

import com.screenscraper.util.images.*;

ImageDecoder decoder;

decoder = new DecaptcherDecoder(session, "username", "password", 12345, "http://api.de-captcher.com");

session.setVariable("IMAGE_DECODER", decoder);

ManualDecoder

ManualDecoder (ScrapingSession session)

Description

Type of ImageDecoder in the com.screenscraper.util.images package that uses a popup window prompting the user to enter the text read from an image. Useful for debugging purposes, as the input text should always be correct (so long as it is typed correctly). Helpful during testing to avoid costs associated with paid-for CAPTCHA decoding services such as decaptcher.com.

Parameters

  • session Name of currently running scraping session.

Return Values

Returns void. If it runs into any problems decoding an image an error will be thrown.

Change Log

Version Description
5.5.29a Available in all editions

Examples

Initialize script

import com.screenscraper.util.images.*;

ImageDecoder decoder;

decoder = new ManualDecoder(session);

session.setVariable("IMAGE_DECODER", decoder);

decodeFile

DecodedImage decodeFile ( String file )
DecodedImage decodeFile ( File file )

Description

Converts the image given to a DecodedImage that will handle it. Does not delete the file.

Parameters

  • file The image file

Return Value

A DecodedImage used to get the text, errors, and possibly report a result as bad.

Change Log

Version Description
5.5.29a Available in all editions.

Examples

image = decoder.decodeFile("path to the image file");

decodeURL

DecodedImage decodeURL ( String url )

Description

Converts the image at the given URL to a DecodedImage that will handle it. Temporarily saves the file in the screen-scraper root folder, but deletes it once it has been decoded. By default, this will use the scraping session's HttpClient to request the URL.

Parameters

  • url The url to the image

Return Value

A DecodedImage used to get the text, errors, and possibly report a result as bad.

Change Log

Version Description
5.5.29a Available in all editions.

Examples

DecodedImage image = decoder.decodeURL(dataRecord.get("IMAGE_URL"));

applyXPathExpression

convertDateToString

String sutil.convertDateToString ( Date date ) (professional and enterprise editions only)
String sutil.convertDateToString ( Date date, String format ) (professional and enterprise editions only)

Description

Converts the given Date to a string in the specified format, or in the "MM/dd/yyyy HH:mm:ss.SS zzz" format if no format is given.

Parameters

  • date The date to convert
  • format (optional) A String representation (as a SimpleDateFormat) for the output

Return Values

A String representing the date given

Change Log

Version Description
5.5.26a Available in all editions.

Examples

// Log the current time
Date now = new Date();
session.log(sutil.convertDateToString(now, "MM/dd/yyyy HH:mm:ss zzz"));

convertHTMLEntities

String sutil.convertHTMLEntities ( String value )

Description

Decode HTML Entities.

Parameters

  • value String whose HTML Entities will be converted to characters.

Return Values

Returns string with decoded HTML entities.

Change Log

Version Description
5.0 Added for all editions.

Examples

Decode HTML Entities

 // Returns Angela's Room
 sutil.convertHTMLEntities( "Angela&apos;s Room" );

See Also

convertStringToDate

Date sutil.convertStringToDate ( String dateString, String format ) (professional and enterprise editions only)

Description

Converts a String to a Date object using the given format. If null is given as a format, "MM/dd/yyyy HH:mm:ss.SS zzz" is used

Parameters

  • dateString The date string
  • format The format of the date, following SimpleDateFormat formatting.

Return Values

The Date object matching the date given in the String, or null if it couldn't be parsed with the given format

Change Log

Version Description
5.5.26a Available in all editions.

Examples

// Convert an input value to a date for later use
Date lastUpdate = sutil.convertStringToDate(session.getVariable("LAST_RUN_DATE"), "yyyy-MM-dd");

if(lastUpdate == null)
{
  session.logError("No last run specified, stopping scrape");
  session.stopScraping();
}

convertUTFWhitespace

String sutil.convertUTFWhitespace (String input ) (enterprise edition only)

Description

Replaces UTF variants of whitespace with a regular space character.

Parameters

  • input The input string.

Return Values

Returns the converted string.

Change Log

Version Description
6.0.55a Available in enterprise edition.

Examples

Tidying a string from a site that has non-uniform ways of returning strings.

    // useful when tidying a string
    String cleanedInput = sutil.convertUTFWhitespace(input);
    cleanedInput = cleanedInput.replaceAll("\\s{2,}", " ").trim();

dateIsWithinDays

boolean sutil.dateIsWithinDays ( Date date1, Date date2, int days ) (professional and enterprise editions only)

Description

Checks to see if one date is within a certain number of days of another.

Parameters

  • date1 The first date.
  • date2 The second date.
  • days The maximum number of days that can be between the two dates.

Return Values

  • True if the dates are within the given number of days of each other, false otherwise.

Change Log

Version Description
5.5.13a Available in all editions.

Examples

Check the proximity of one date to another

date1 = sutil.convertStringToDate( "2012-02-15", "yyyy-MM-dd" );
date2 = sutil.convertStringToDate( "2012-02-24", "yyyy-MM-dd" );

days = 5;
session.log( "First date is within 5 days of second date: " + sutil.dateIsWithinDays( date1, date2, days ) );

days = 15;
session.log( "First date is within 15 days of second date: " + sutil.dateIsWithinDays( date1, date2, days ) );(

equalsIgnoreCase

boolean sutil.equalsIgnoreCase ( String stringOne, String stringTwo )

Description

Compare two strings ignoring case.

Parameters

  • stringOne First string.
  • stringTwo Second string.

Return Values

Returns true if the values of the two strings are equal when case is not considered; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

Examples

Compare Two Strings (Case Insensitive)

 // Compare strings without regard to case
 sutil.equalsIgnoreCase( "aBc123","ABc123" );
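
The result is typically used in a condition; a minimal sketch (the STATUS session variable is hypothetical):

 // Branch on a case-insensitive match.
 if ( sutil.equalsIgnoreCase( session.getVariable( "STATUS" ), "sold out" ) )
 {
     session.log( "Item is sold out." );
 }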

formatNumber

String sutil.formatNumber ( String number ) (professional and enterprise editions only)
String sutil.formatNumber ( String number, int decimals, boolean padDecimals ) (professional and enterprise editions only)

Description

Returns a number formatted in such a way that it could be parsed as a Float, such as xxxxxxxxx.xxxx. It attempts to determine whether the number is formatted in the European or American style; if it cannot tell which, it defaults to American. If the number ends with a k, the k will be converted to thousands (as 000); m is likewise converted to million and b to billion. It assumes you won't have a number like 3.123k or 3.765m (3.54m is fine); if you wanted all of those digits, you would have specified 3765k or 3,765k.

Parameters

  • number String containing the number.
  • decimals (optional) The maximum number of decimal places to include in the result. When this value is omitted, any existing decimals are retained, but none are added.
  • padDecimals (optional) Sets whether or not to pad the decimals (convert 5.1 to 5.10 if 2 decimals are specified)

Return Values

Returns the formatted number as a String, such as 3750.00, or null if the input was null.

Change Log

Version Description
5.5.26a Available in all editions.

Examples

Format a scraped abbreviated number as a dollar amount

 // Format a number to two decimal places
 String dollars = sutil.formatNumber("3.75k", 2, true);
 // This would set dollars to the String "3750.00"

 // Format the amount without cents.
 String dollarsNoCents = sutil.formatNumber("3.75m");
 // This would set dollars to the String "3750000"

Format a European number to be inserted in a MySQL statement

 String number = sutil.formatNumber("3.275,10", 2, false);
 // number would now be "3275.1"

formatUSPhoneNumber

String sutil.formatUSPhoneNumber ( String number ) (professional and enterprise editions only)

Description

Converts a String to a US formatted phone number, as +1 (123) 456-7890x2. Expects a 7 digit or 10+ digit phone number. The extension is optional, and will be any digits found after an x. This allows for extensions listed as ext, x, or extension.

Parameters

  • number String containing the phone number. The only digits in this String should be the digits of the phone number.

Return Values

Returns a String formatted as a phone number, such as +1 (123) 456-7890x2, or null if the input was null

Change Log

Version Description
5.5.26a Available in all editions.

Examples

Format a scraped phone number

 // Formats the phone number extracted
 String phone = sutil.formatUSPhoneNumber(dataRecord.get("PHONE_NUMBER"));
 
 // If the extracted value had been "13334445678 ext. 23" the returned value would be "+1 (333) 444-5678x23"

formatUSZipCode

String sutil.formatUSZipCode ( String zip ) (professional and enterprise editions only)

Description

Formats and returns a US style zip code, such as 12345-6789. If the given zip code isn't 5 or 9 digits this method will log a warning, but it will still put the first 5 digits before the - and anything else (if any) after it.

Parameters

  • zip String to format as a zip code, either 5 or 9 digits

Return Values

Zip code formatted String, such as 12345-6789 or 12345

Change Log

Version Description
5.5.26a Available in all editions.

Examples

 // Format a number to a nicer looking zip code
 String zip = sutil.formatUSZipCode(" 001011458");
 
 // zip would be "00101-1458"

getCurrentDate

String sutil.getCurrentDate ( String format )

Description

Returns the current date in the specified format, or uses "MM/dd/yyyy HH:mm:ss.SS zzz" if null is given. Uses the session's timezone.

Parameters

  • format The format for the output string

Return Values

A String representing the date and time this method was invoked

Change Log

Version Description
5.5.26a Available in all editions.

Examples

 // Log the current time
 session.log(sutil.getCurrentDate(null));
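
A format string can also be supplied, using Sun's SimpleDateFormat patterns; for example:

 // Log the current time in a custom format
 session.log(sutil.getCurrentDate("yyyy-MM-dd HH:mm:ss"));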

getInstallDir

String sutil.getInstallDir ( )

Description

Retrieve the file path of the screen-scraper installation.

Parameters

This method does not receive parameters.

Return Values

Returns the installation directory file path, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Download to screen-scraper Directory

 url = "http://www.foo.com/imgs/puppy_image.gif";

 // Get installation file path
 path = sutil.getInstallDir() + "images/puppy.gif";

 // Download to screen-scraper directory
 session.downloadFile( url, path );

getMemoryUsage

int sutil.getMemoryUsage ( ) (enterprise edition only)

Description

Get memory usage of screen-scraper.

Parameters

This method does not receive any parameters.

Return Values

Returns the average percentage of memory used by screen-scraper over the past 30 seconds, as an integer.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for enterprise edition.

For tips on optimizing screen-scraper's memory usage so that it can run faster, see our FAQ on optimization.

Examples

Stop Scrape on Memory Leak

 // Stop scrape if memory is low
 if( sutil.getMemoryUsage() > 98 )
 {
     session.log( "Memory is critically low. Stopping the scraping session." );
     session.stopScraping();
 }

getMimeType

String sutil.getMimeType ( String path )

Description

Get the mime-type of a local file.

Parameters

  • path File path to the local file, as a string.

Return Values

Returns the mime-type of the file, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Get File Mime Type

 // Get mime-type
 sutil.getMimeType( "c:/image/puppy.gif" );

getNumRunnableScrapingSessions

int sutil.getNumRunnableScrapingSessions ( ) (enterprise edition only)

Description

Get the number of runnable scraping sessions.

Parameters

This method does not receive any parameters.

Return Values

Returns the number of runnable scraping sessions in this instance of screen-scraper, as an integer.

Change Log

Version Description
5.0 Added for all editions.

Examples

Get the Number of Runnable Scrapes

 // Write the number of running scrapes to the log
 session.log( "Number of Runnable Scrapes: " + sutil.getNumRunnableScrapingSessions() );

getNumRunningScrapingSessions

int sutil.getNumRunningScrapingSessions ()
int sutil.getNumRunningScrapingSessions ( String scrapingSessionName )

Description

Gets the number of scraping sessions that are currently being run.

Parameters

  • scrapingSessionName Narrows the scope to a given scraping session, if this parameter is passed in.

Return Values

An int representing the number of running scraping sessions.

Change Log

Version Description
5.5.42a Available in Enterprise edition.

Examples

session.log( "Num running scraping sessions: " + sutil.getNumRunningScrapingSessions( session.getName() ) );
if( sutil.getNumRunningScrapingSessions( session.getName() ) > 1 )
{
        session.log( "SESSION ALREADY RUNNING." );
        session.stopScraping();
        return;
}

getOptionSet

DataSet sutil.getOptionSet ( String options ) (professional and enterprise editions only)
DataSet sutil.getOptionSet ( String options, String ignoreLabel, boolean tidyRecords ) (professional and enterprise editions only)
DataSet sutil.getOptionSet ( String options, String[] ignoreLabels, boolean tidyRecords ) (professional and enterprise editions only)
DataSet sutil.getOptionSet ( String options, Collection<String> ignoreLabels, boolean tidyRecords ) (professional and enterprise editions only)

Description

Gets a DataSet containing each of the elements of a <select> tag. The returned DataRecords will contain a key for the text found between the tags (possibly with HTML tags removed), a value indicating whether it was the selected option, and the value that would be submitted for the specific option. Note that this only looks for option tags, so passing in text containing more than a single select tag will produce incorrect output.

Parameters

  • options The text containing the options HTML from the select tag
  • ignoreLabels (or ignoreLabel) (optional) Text value(s) to ignore in the output set. Usually this would include strings like "Please select a category"
  • tidyRecords (optional) Should the TEXT be tidied before being stored in the resulting DataRecords

Return Values

A DataSet with one record per option. Values extracted will be stored in
VALUE : The value the browser would submit for this option
TEXT : The text that was between the tags
SELECTED : A boolean that is true if this option was selected by default

Change Log

Version Description
5.5.26a Available in all editions.

Examples

Search each option from a dropdown menu

 String options = dataRecord.get("ITEM_OPTIONS");
 
 // We don't want the value for "Select an option" because that doesn't go to a search results page
 DataSet items = sutil.getOptionSet(options, "Select an option", true);
 
 for(int i = 0; i < items.getNumDataRecords(); i++)
 {
   DataRecord next = items.getDataRecord(i);
   session.setVariable("ITEM_VALUE", next.get("VALUE"));
   session.log("Now scraping results for " + next.get("TEXT"));
   session.scrapeFile("Search Results");
 }

getRadioButtonSet

DataSet sutil.getRadioButtonSet ( String buttons, String buttonName ) (professional and enterprise editions only)
DataSet sutil.getRadioButtonSet ( String buttons, String buttonName, String ignoreLabel ) (professional and enterprise editions only)
DataSet sutil.getRadioButtonSet ( String buttons, String buttonName, Collection<String> ignoreLabels ) (professional and enterprise editions only)
DataSet sutil.getRadioButtonSet ( String buttons, String buttonName, Collection<String> ignoreLabels, boolean tidyRecords ) (professional and enterprise editions only)

Description

Gets all the options from a radio button group. Each button's values are returned in a DataRecord within the resulting DataSet. Any labels that are to be ignored will not be included in the returned set. Not all buttons will have a label, as radio buttons do not require one, and it would be difficult to know with a regular expression exactly what to extract as the label unless there is a label tag.

Parameters

  • buttons The text containing the buttons
  • buttonName The name of the buttons that should be grabbed, as a Regex pattern
  • ignoreLabels (or ignoreLabel) (optional) Any labels that should be excluded from the resulting set
  • tidyRecords (optional) Should the TEXT be tidied before being stored in the resulting DataRecords

Return Value

DataSet containing one record for each of the extracted radio buttons. Values will be stored in
VALUE : The value the browser would submit for this radio button
TEXT : The text that represents this button, or null if no label could be found for it
SELECTED : A boolean that is true if this button was selected by default
ID : The ID of the radio button, or null if no ID was found

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Search each radio button from a radio button group

 String options = dataRecord.get("RADIO_BUTTONS");
 
 // Get all the radio buttons with the name attribute "selection"
 DataSet items = sutil.getRadioButtonSet(options, "selection");
 
 for(int i = 0; i < items.getNumDataRecords(); i++)
 {
   DataRecord next = items.getDataRecord(i);
   session.setVariable("BUTTON_VALUE", next.get("VALUE"));
   session.log("Now scraping results for " + next.get("TEXT"));
   session.scrapeFile("Search Results");
 }

getRandomReferrer

String sutil.getRandomReferrer ( )

Description

Gets a random referrer page from a list of many different search engine web sites and a few other pages.

Parameters

This method does not receive any parameters.

Return Values

Returns a random referrer URL.

Change Log

Version Description
6.0.1a Introduced for all editions.
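
Examples

Log a Random Referrer

Since this method simply returns a URL as a string, one possible use (sketched here) is to store the value for later use as a referrer:

 // Fetch a random referrer URL and keep it in a session variable
 referrer = sutil.getRandomReferrer();
 session.log( "Using referrer: " + referrer );
 session.setVariable( "REFERRER", referrer );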

getRandomUserAgent

String sutil.getRandomUserAgent ( )

Description

Returns a random User Agent. The list isn't closely monitored, so it may not include newer user agents, and may include extremely old ones as well.

Parameters

This method does not receive any parameters.

Return Values

Returns a random user agent.

Change Log

Version Description
6.0.1a Introduced for all editions.
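
Examples

Use a Random User Agent

A minimal sketch: in a script run before a file is scraped, the value could be attached to the current request as a User-Agent header (this assumes the scrapeableFile object is in scope):

 // Send a randomly chosen user agent with the current request
 scrapeableFile.addHTTPHeader( "User-Agent", sutil.getRandomUserAgent() );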

getScreenScraperEdition

String sutil.getScreenScraperEdition ( )

Description

Get edition of screen-scraper instance.

Parameters

This method does not receive any parameters.

Return Values

Returns the edition name, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Write Version to Log

 // Write the current version to log.
 session.log("Current edition: " + sutil.getScreenScraperEdition());

getScreenScraperVersion

String sutil.getScreenScraperVersion ( )

Description

Get version of screen-scraper instance.

Parameters

This method does not receive any parameters.

Return Values

Returns the version number, as a string.

Change Log

Version Description
5.0 Added for all editions.

Examples

Write Version to Log

 // Write the current version to log.
 session.log("Current version: " + sutil.getScreenScraperVersion());

isInt

boolean sutil.isInt ( String string )

Description

Determine if the value of a string is an integer.

Parameters

  • string String to be tested for containing an integer.

Return Values

Returns true if the string is an integer; otherwise, it returns false. If it is passed an object that is not a string (even an Integer), an error will be thrown.

Change Log

Version Description
5.0 Added for all editions.

Examples

Check String Value

 // Does the GUESTS variable contain an integer
 if ( !sutil.isInt( session.getv( "GUESTS" ) ) )
 {
     session.logWarn( "Could not get the number of guests!" );
 }

isNullOrEmptyString

boolean sutil.isNullOrEmptyString ( Object object )

Description

Determine if an object's value is null or empty.

Parameters

  • object The object whose value will be tested.

Return Values

Returns true if the value of the object is null or an empty string; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

Examples

Warning for Empty Variable

 // Give warning and stop scrape if variable is empty
 if ( sutil.isNullOrEmptyString( session.getv( "NAME" ) ) )
 {
     session.log( "The NAME variable was blank." );
     session.stopScraping();
 }

isPlatformLinux

boolean sutil.isPlatformLinux ( )

Description

Determine if operating system is a Linux platform.

Parameters

This method does not receive parameters.

Return Values

Returns true if the operating system is Linux; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

Examples

Check Linux Platform

 url = "http://www.foo.com/imgs/puppy_image.gif";

 // Determine download location based on platform
 if ( sutil.isPlatformLinux() )
 {
     session.downloadFile( url, "/home/user/images/puppy.gif" );
 }
 else if ( sutil.isPlatformMac() )
 {
     session.downloadFile( url, "/Volumes/Documents/images/puppy.gif" );
 }
 else if ( sutil.isPlatformWindows() )
 {
     session.downloadFile( url, "c:/images/puppy.gif" );
 }

isPlatformMac

boolean sutil.isPlatformMac ( )

Description

Determine if operating system is a Mac platform.

Parameters

This method does not receive parameters.

Return Values

Returns true if the operating system is Mac; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

Examples

Check Mac Platform

 url = "http://www.foo.com/imgs/puppy_image.gif";

 // Determine download location based on platform
 if ( sutil.isPlatformLinux() )
 {
     session.downloadFile( url, "/home/user/images/puppy.gif" );
 }
 else if ( sutil.isPlatformMac() )
 {
     session.downloadFile( url, "/Volumes/Documents/images/puppy.gif" );
 }
 else if ( sutil.isPlatformWindows() )
 {
     session.downloadFile( url, "c:/images/puppy.gif" );
 }

isPlatformWindows

boolean sutil.isPlatformWindows ( )

Description

Determine if operating system is a Windows platform.

Parameters

This method does not receive parameters.

Return Values

Returns true if the operating system is Windows; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

Examples

Check Windows Platform

 url = "http://www.foo.com/imgs/puppy_image.gif";

 // Determine download location based on platform
 if ( sutil.isPlatformLinux() )
 {
     session.downloadFile( url, "/home/user/images/puppy.gif" );
 }
 else if ( sutil.isPlatformMac() )
 {
     session.downloadFile( url, "/Volumes/Documents/images/puppy.gif" );
 }
 else if ( sutil.isPlatformWindows() )
 {
     session.downloadFile( url, "c:/images/puppy.gif" );
 }

makeGETRequest

String sutil.makeGETRequest ( String url )

Description

Retrieve the response contents of a GET request.

Parameters

  • url URL encoded version of page request, as a string. Java provides a URLEncoder to aid in URL encoding of a string.

Return Values

Returns contents of the response, as a string.

Change Log

Version Description
5.0 Added for all editions.

This method will use any proxy settings that have been specified in the Settings dialog box.

Examples

Retrieve Page Contents

 // Returns contents resulting from
 // request to "http://www.screen-scraper.com"

 pageContents = sutil.makeGETRequest("http://www.screen-scraper.com/tutorial/basic_form.php?text_string=Hello+World");

makeGETRequestNoSessionProxy

String sutil.makeGETRequestNoSessionProxy ( String urlString )

Description

Makes a GET request and returns the result as a string. This method will use the proxy settings indicated in the "Settings" dialog box, if any.

Parameters

  • urlString The URL to request, as a string.

Return Values

Returns the result of the request, as a string.

Throws

  • java.lang.Exception If anything naughty happens.

Change Log

Version Description
6.0.6a Introduced for all editions.
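
Examples

Make a GET Request Ignoring the Session Proxy

A minimal sketch of calling the method and logging the response:

 // Request the page without the scraping session's proxy settings
 contents = sutil.makeGETRequestNoSessionProxy( "http://www.screen-scraper.com/" );
 session.log( contents );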

makeGETRequestUseSessionProxy

String sutil.makeGETRequestUseSessionProxy ( String urlString )

Description

Makes a GET request and returns the result as a string. This method will use the proxy settings attached to the current scraping session.

Parameters

  • urlString The URL to request, as a string.

Return Values

Returns the result of the request, as a string.

Throws

  • java.lang.Exception If anything naughty happens.

Change Log

Version Description
6.0.6a Introduced for all editions.
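
Examples

Make a GET Request Through the Session Proxy

A minimal sketch, mirroring the previous method but routed through the proxy settings attached to the current scraping session:

 // Request the page using the session's proxy settings
 contents = sutil.makeGETRequestUseSessionProxy( "http://www.screen-scraper.com/" );
 session.log( contents );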

makeHEADRequest

String[][] sutil.makeHEADRequest ( String url )

Description

Retrieve the response header contents.

Parameters

  • url URL encoded version of page request, as a string. Java provides a URLEncoder to aid in URL encoding of a string.

Return Values

Returns contents of the response, as a two-dimensional array.

Change Log

Version Description
5.0 Added for all editions.

This method will use any proxy settings that have been specified in the Settings dialog box.

Examples

Retrieve Page Contents

 // Log HEAD contents

 // Get head contents
 headerArray = sutil.makeHEADRequest("http://www.screen-scraper.com/tutorial/basic_form.php?text_string=Hello+World");

 // Loop through HEAD contents
 for (int i=0; i<headerArray.length; i++)
 {
     // Write header to log
     session.log(headerArray[i][0] + ": " + headerArray[i][1]);
 }

 /* Example Log:
 Date: Fri, 04 Jun 2010 17:18:11 GMT
 Server: Apache/2.2.3 (CentOS)
 X-Powered-By: PHP/5.1.6
 Connection: close
 Content-Type: text/html; charset=UTF-8
 */

See Also

  • makeGETRequest() [sutil] - Retrieves the response contents of a GET request

mergeDataRecords

DataRecord sutil.mergeDataRecords ( DataRecord first, DataRecord second, boolean saveNonEmptyString ) (professional and enterprise editions only)

Description

Merges two data records by copying all values from the second record over the values of the first record, and returning a new DataRecord with these values. Neither original record is modified.

Parameters

  • first The first DataRecord, into which the values from the second record will be copied
  • second The second DataRecord, whose values will be copied into the first
  • saveNonEmptyString True if blank values should not overwrite non-blank values, whether the non-blank value is in the first or second record. If both records contain a value that is not blank for the same key, the value in the first record is kept and the value in the second record discarded. If false, all values in the second record will overwrite any values in the first record.

Return Values

A new DataRecord with the merged values

Change Log

Version Description
5.5.26a Available in all editions.

Examples

Combine values from the current dataRecord with a previous one

 DataRecord previous = (DataRecord) session.getVariable("_DATARECORD");
 
 // Keep non-blank values when both records hold a value for the same key
 session.setVariable("_DATARECORD", sutil.mergeDataRecords(previous, dataRecord, true));

nullToEmptyString

String sutil.nullToEmptyString ( Object object )

Description

Get an object in string format.

Parameters

  • object Object to be returned in string format.

Return Values

Returns an empty string if the value of the object is null; otherwise, returns the value of the toString method of the object.

Change Log

Version Description
5.0 Added for all editions.

Examples

Get String Value of Variable

 // Always Specify Suffix even if not selected
 suffix = sutil.nullToEmptyString( session.getv( "SUFFIX" ) );

parseName

Name sutil.parseName ( String name ) (pro and enterprise editions only)

Description

Attempts to parse a string to a name. The parser is not perfect and works best on English-formatted names (for example, "John Smith Jr." or "Guerrero, Antonio K"). This uses standard settings for the parser. To get more control over how the name is parsed, use the EnglishNameParser class.

Parameters

  • name The name to be parsed.

Return Values

Returns the parsed name, as a Name object.

Change Log

Version Description
6.0.59a Available for professional and enterprise editions.

Examples

How to use the name parser

    String nameRaw = "John Fred Doe";
    DataRecord dr = new DataRecord();

    log.debug( "Name raw: " + nameRaw );
    if( nameRaw!=null )
    {
        try
        {
            Name name = sutil.parseName( nameRaw );
            log.debug( "First name: " + name.getFirstName() );
            log.debug( "Middle name: " + name.getMiddleName() );
            log.debug( "Last name: " + name.getLastName() );
            //log.debug( "Suffix: " + name.getSuffix() );

            dr.put( "FIRST_NAME", name.getFirstName() );
            dr.put( "MIDDLE_NAME", name.getMiddleName() );
            dr.put( "LAST_NAME", name.getLastName() );
            //dr.put( "SUFFIX", name.getAllSuffixString() );
        }
        catch( Exception e )
        {
            // The parser may throw an exception if it can't
            // parse the name.  If this occurs we want to know about it.
            log.warn( "Error parsing name: " + e.getMessage() );
        }
    }

See Also

  • parseUSAddress() [sutil] - Attempts to parse a string to an address

parseUSAddress

Address sutil.parseUSAddress ( String address ) (pro and enterprise editions only)

Description

Attempts to parse a string to an address. The parser is not perfect and works best on US addresses. Most likely other address formats can be parsed with the USAddressParser class by providing different constraints in the builder. This method is here for convenience in working with US addresses.

Parameters

  • address The address to be parsed.

Return Values

Returns the parsed address, as an Address object.

Change Log

Version Description
6.0.59a Available for professional and enterprise editions.

Examples

How to use the address parser

    import com.screenscraper.util.parsing.address.Address;
   
    String addressRaw = "123 Main St Apt 4, Springfield, IL 62704"; // some address (example value)

    DataRecord dr = new DataRecord();

    try
    {
        Address address = sutil.parseUSAddress( addressRaw );
        log.debug( "Street: " + address.getStreet() );
        log.debug( "Suite or Apartment: " + address.getSuiteOrApartment() );
        log.debug( "City: " + address.getCity() );
        log.debug( "State: " + address.getState() );
        log.debug( "Zip: " + address.getZipCode() );

        // if all of these four are blank then save only the raw address
        // else save what we can
        if(
            sutil.isNullOrEmptyString( address.getStreet() )
            &&
            sutil.isNullOrEmptyString( address.getState() )
            &&
            sutil.isNullOrEmptyString( address.getCity() )
            &&
            sutil.isNullOrEmptyString( address.getZipCode() )
        )
        {
            dr.put( "ADDRESS", addressRaw );
        }
        else
        {
            dr.put( "ADDRESS", address.getStreet() );
            dr.put( "ADDRESS2", address.getSuiteOrApartment() );
            dr.put( "STATE", address.getState() );
            dr.put( "CITY", address.getCity() );
            dr.put( "ZIP", address.getZipCode() );
        }
        session.setv( "DR_ADDRESS", dr );
    }
    catch( Exception e )
    {
        // If there was a parsing error, notify so it can be dealt with
        log.warn( "Exception parsing address: " + e.getMessage() );
    }

See Also

pause

void sutil.pause ( long milliseconds ) (professional and enterprise editions only)

Description

Pause scraping session.

Parameters

  • milliseconds Length of the pause, in milliseconds.

Return Values

Returns void.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for professional and enterprise editions.

Pausing the scraping session also pauses the execution of the scripts, including the one that initiates the pause.

Examples

Pause Scrape on Server Overload

 // It should be noted that a status code of 503 is not
 // always a temporary overloading of a server.

 // Check status code
 if (scrapeableFile.statusCode() == 503)
 {
     // Pause Scraping for 5 seconds
     sutil.pause( 5000 );

     // Continue/Rescrape file
     ...
 }

randomPause

void sutil.randomPause ( long min, long max ) (professional and enterprise editions only)

Description

Pauses for a random amount of time. This is also set up to stop immediately if the stop scrape button is clicked, and to allow breakpoints to be triggered while it is pausing.

Parameters

  • min The minimum duration of the pause, in milliseconds
  • max The maximum duration of the pause, in milliseconds

Return Value

Returns void.

Change Log

Version Description
5.5.29a Available in professional and enterprise editions.

Examples

Wait for between 2 and 4 seconds

 sutil.randomPause(2000, 4000);

reformatDate

String sutil.reformatDate ( String date, String dateFormatFrom, String dateFormatTo ) (professional and enterprise editions only)
String sutil.reformatDate ( String date, String dateFormatTo ) (enterprise edition only)

Description

Change a date format.

Parameters

  • date Date that is being reformatted, as a string.
  • dateFormatFrom (optional) The format of the date that is being reformatted. The date format follows Sun's SimpleDateFormat.
  • dateFormatTo The format that the date is being changed to. If dateFormatFrom is being used this should also follow Sun's SimpleDateFormat. If dateFormatFrom is left off then the date format should follow PHP's date format. In the latter method you can also use timestamp as the value of this parameter and it will return the timestamp corresponding to the date. Note also how PHP treats dashes and dots: "Dates in the m/d/y or d-m-y formats are disambiguated by looking at the separator between the various components: if the separator is a slash (/), then the American m/d/y is assumed; whereas if the separator is a dash (-) or a dot (.), then the European d-m-y format is assumed."

Return Values

Returns formatted date according to the specified format, as a string.

Change Log

Version Description
5.0 Moved from session to sutil.
4.5 Available for professional and enterprise editions. Unspecified source format available for enterprise edition.

The date formats are not the same for the two methods. Read carefully.

Examples

Reformat Date from Specified Format

 // Reformats the date shown to the format "2010-01-01".
 // This uses Sun's Date Formats

 sutil.reformatDate( "01/01/2010", "dd/MM/yyyy", "yyyy-MM-dd" );

Reformat Date from Unspecified Format

 // Reformats the date shown to the format "2010-01-01".
 // This uses PHP's Date Formats

 sutil.reformatDate( "01/01/2010", "Y-m-d" );

sendMail

void sutil.sendMail ( String subject, String body, String recipients ) (enterprise edition only)
void sutil.sendMail ( String subject, String body, String recipients, String attachments, String headers ) (enterprise edition only)
void sutil.sendMail ( String subject, String body, String recipients, String contentType, String attachments, String headers ) (enterprise edition only)

Description

Send an email using SMTP mail server specified in the settings.

Parameters

  • subject Subject line of the email, as a string.
  • body The content of the email, as a string.
  • recipients Comma-delimited list of email addresses to which the email will be sent, as a string.
  • contentType The content type as a valid MIME type.
  • attachments Comma-delimited list of local file paths to files that should be attached, as a string.
    If you do not have any attachments the value of null should be used.
  • headers Tab-delimited SMTP headers to be used when sending the email, as a string. If you don't have
    any headers to send use the value null.

Return Values

Returns void. If it runs into any problems while attempting to send the email an error will be thrown.

Change Log

Version Description
6.0.35a Now supports alternate content types.
5.0 Moved from session to sutil.
4.5 Available for enterprise edition.

Examples

Send Email at End of Scrape

 // In script called "After scraping session ends"

 // Sends an email message with the parameters shown.
 String message = "The '" + session.getName() + "' scrape is now finished.";
 sutil.sendMail( "Status Report: Scrape Finished", message, "[email protected]", null, null );

sortSet

List sutil.sortSet ( Set set )
List sutil.sortSet ( Set set, boolean ignoreCase )
List sutil.sortSet ( Set set, Comparator comparator )

Description

Sorts the elements in a set into an ordered list.

Parameters

  • set The set whose elements should be sorted
  • ignoreCase (optional) True if case is irrelevant when sorting strings
  • comparator (optional) The Comparator used to compare objects in the set to determine order

Return Values

This method returns a sorted list of elements that are in the set.

Change Log

Version Description
5.5.26a Available in all editions.

Examples

Output all the values in a DataRecord in alphabetical order

 // Generally when a sorted set or map is needed, a data structure should be chosen that stores the values
 // in a sorted way, such as TreeSet or TreeMap.  However, sometimes the set or map is returned by a library
 // and may not have sorted values, although sorted values are needed.
 
 List keys = sutil.sortSet(dataRecord.keySet(), true);
 
 for(int i = 0; i < keys.size(); i++)
 {
   key = keys.get(i);
   session.log(key + " : " + dataRecord.get(key));
 }

startsWithUpperCase

boolean sutil.startsWithUpperCase ( String start, String string )

Description

Determine if one string is the start of another, without regard for case.

Parameters

  • start Value to be checked as the start, as a string.
  • string Value to be searched in, as a string.

Return Values

Returns true if string starts with start when case is not considered; otherwise, it returns false.

Change Log

Version Description
5.0 Added for all editions.

Examples

Does String Start With Another String (Case Insensitive)

 // Check for RTMP URLs
 sutil.startsWithUpperCase( "rtmp", session.getv( "URL" ) );

stringToFloat

float sutil.stringToFloat ( String str ) (professional and enterprise editions only)

Description

Parse string into a floating point number.

Parameters

  • str String to be transformed into a float.

Return Values

Returns the string's value as a floating point number.

Change Log

Version Description
5.0.1a Introduced for professional and enterprise editions.

Examples

Parse a String into a Float

 // Parse Float from String
 gpa = sutil.stringToFloat( session.getv( "GPA" ) );

stripHTML

String sutil.stripHTML ( String content ) (enterprise edition only)

Description

Strips HTML from a string, replacing some tags with corresponding text-only equivalents.

Parameters

  • content The content to be stripped.

Return Values

Returns the stripped content.

Change Log

Version Description
6.0.20a Available in only the Enterprise edition.

Examples

Strip HTML from a string

    String cleanedInput = sutil.stripHTML(input);

tidyDataRecord

DataRecord sutil.tidyDataRecord ( DataRecord record ) (professional and enterprise editions only)
DataRecord sutil.tidyDataRecord ( DataRecord record, boolean ignoreLowerCaseKeys ) (professional and enterprise editions only)
DataRecord sutil.tidyDataRecord ( DataRecord record, Map<String, Boolean> settings ) (professional and enterprise editions only)
DataRecord sutil.tidyDataRecord ( DataRecord record, Map<String, Boolean> settings, boolean ignoreLowerCaseKeys ) (professional and enterprise editions only)
DataRecord sutil.tidyDataRecord ( ScrapeableFile scrapeableFile, DataRecord record ) (professional and enterprise editions only)
DataRecord sutil.tidyDataRecord ( ScrapeableFile scrapeableFile, DataRecord record, boolean ignoreLowerCaseKeys ) (professional and enterprise editions only)
DataRecord sutil.tidyDataRecord ( ScrapeableFile scrapeableFile, DataRecord record, Map<String, Boolean> settings ) (professional and enterprise editions only)
DataRecord sutil.tidyDataRecord ( ScrapeableFile scrapeableFile, DataRecord record, Map<String, Boolean> settings, boolean ignoreLowerCaseKeys ) (professional and enterprise editions only)

Description

Tidies the DataRecord by performing actions based on the values of the settings map given, or the values obtained from sutil.getDefaultTidySettings() if none is given. Each value in the record that is a string will be tidied. Keys are not modified. The record given will not be modified; a new record with the tidied values will be returned.

Parameters

  • record The DataRecord to tidy (values in the record will not be overwritten with the tidied values)
  • scrapeableFile (optional) The current ScrapeableFile, used for resolving relative URLs when tidying links
  • settings (optional) The operations to perform when tidying, using a Map<String, Boolean>

    The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.

    • trim (default: true) Trims whitespace from values
    • convertNullStringToLiteral (default: true) Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null")
    • convertLinks (default: true) Preserves links by converting <a href="link">text</a> to text (link); will try to resolve URLs if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing
    • removeTags (default: true) Removes HTML tags, and attempts to convert line break HTML tags such as <br> to a new line in the result
    • removeSurroundingQuotes (default: true) Removes quotes from values surrounded by them -- "value" becomes value
    • convertEntities (professional and enterprise editions only) (default: true) Converts HTML entities
    • removeNewLines (default: false) Removes all new lines from the text, replacing them with a space
    • removeMultipleSpaces (default: true) Converts multiple spaces to a single space, and preserves new lines
    • convertBlankToNull (default: false) Converts blank strings to the null literal

  • ignoreLowerCaseKeys (optional) True if values with keys containing lowercase characters should be ignored

Return Values

A new DataRecord containing all the tidied values and any values that were not Strings in the original record. Values that were Strings but were not tidied, as well as the DATARECORD value, will not be in the returned record.

Change Log

Version Description
5.5.26a Available in all editions.
5.5.28a Now uses a Map for the settings, rather than bit flags.

Examples

Tidy all values in an extracted DataRecord

 DataRecord tidied = sutil.tidyDataRecord(dataRecord);
 
 // Run code here to save the tidied record
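
Tidy with custom settings

A sketch passing a settings map built from the keys listed above, so that only the named operations run:

 Map settings = new HashMap();
 settings.put("trim", true);
 settings.put("removeTags", true);
 DataRecord tidied = sutil.tidyDataRecord(dataRecord, settings);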

tidyString

String sutil.tidyString ( String value ) (professional and enterprise editions only)
String sutil.tidyString ( String value, Map<String, Boolean> settings ) (professional and enterprise editions only)
String sutil.tidyString ( ScrapeableFile scrapeableFile, String value ) (professional and enterprise editions only)
String sutil.tidyString ( ScrapeableFile scrapeableFile, String value, Map<String, Boolean> settings ) (professional and enterprise editions only)

Description

Tidies the string by performing actions based on the values of the settings map.

Parameters

  • value The String to tidy
  • settings (optional) The operations to perform when tidying, using a Map<String, Boolean>

    The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.

    • trim (default: true) Trims whitespace from values
    • convertNullStringToLiteral (default: true) Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null")
    • convertLinks (default: true) Preserves links by converting <a href="link">text</a> to text (link); will try to resolve URLs if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing
    • removeTags (default: true) Removes HTML tags, and attempts to convert line break HTML tags such as <br> to a new line in the result
    • removeSurroundingQuotes (default: true) Removes quotes from values surrounded by them -- "value" becomes value
    • convertEntities (professional and enterprise editions only) (default: true) Converts HTML entities
    • removeNewLines (default: false) Removes all new lines from the text, replacing them with a space
    • removeMultipleSpaces (default: true) Converts multiple spaces to a single space, and preserves new lines
    • convertBlankToNull (default: false) Converts blank strings to the null literal

  • scrapeableFile (optional) The current ScrapeableFile, used for resolving relative URLs when tidying links

Return Values

The tidied string

Change Log

Version Description
5.5.26a Available in all editions.
5.5.28a Now uses a Map for the settings, rather than bit flags.

Examples

Tidy a comment extracted from a website

Assuming the extracted text's HTML code was:
&nbsp;&nbsp;<a href="http://www.somelink.com">This</a> was great because of these reasons:<br />
1 - Some reason<br />
2 - Another reason<br />
3 - Final reason

 String comment = sutil.tidyString(scrapeableFile, dataRecord.get("COMMENT"));

The output text would be:

This (http://www.somelink.com) was great because of these reasons:
1 - Some reason
2 - Another reason
3 - Final reason

Run only specific operations

 Map settings = new HashMap();
 settings.put("convertEntities", true);
 settings.put("trim", true);
 String text = sutil.tidyString("&nbsp;A String to tidy", settings);

unzipFile

void sutil.unzipFile ( String zippedFile )

Description

Unzip a zipped file. Contents will appear in the same directory as the zipped file.

Parameters

  • zippedFile File path to the zipped file, as a string.

Return Values

Returns void. If a file input/output error is experienced it will be thrown.

Change Log

Version Description
5.0 Added for all editions.

Examples

Unzip File

 // Unzips contents of "c:/mydir/myzip.zip"
 // to "c:/mydir/"

 sutil.unzipFile( "c:/mydir/myzip.zip" );

writeValueToFile

void sutil.writeValueToFile ( Object value, String file, String charSet )

Description

Write to a file.

Parameters

  • value The value to be written to the file.
  • file File path where the value should be created/written, as a string. If the file already exists it will be overwritten.
  • charSet Character set of the file, as a string. Java provides a list of supported character sets in its documentation.

Return Values

Returns void.

Change Log

Version Description
5.0 Added for all editions.

Examples

Write To File

 // Writes "abc",123 to file myfile.csv using character set UTF-8
 sutil.writeValueToFile( "\"abc\",123", "myfile.csv", "UTF-8" );

Write To File Using Default Character Set

 // Writes "abc",123 to file myfile.csv
 // using screen-scraper's character set

 sutil.writeValueToFile("\"abc\",123","myfile.csv", null);

Proxy Server API

Overview

screen-scraper provides three built-in objects for proxy sessions. These objects are: proxySession, request, and response. See the Variable scope section for details on which objects are available based on when scripts are run.

proxySession

Overview

This object gives you the ability to control interactions with the proxy session. It is only for use in scripts that are associated with proxy sessions.

getVariable

Object proxySession.getVariable ( String identifier )

Description

Retrieve the value of the proxy session variable.

Parameters

  • identifier Name of the session variable, as a string.

Return Values

Returns the value of the session variable.

Change Log

Version Description
4.5 Available for all editions.

Examples

Retrieve Session Variable

 // Places the proxy variable "CITY_CODE" in
 // the local variable "cityCode"

 cityCode = proxySession.getVariable( "CITY_CODE" );

See Also

  • setVariable() [proxySession] - Sets the value of a proxy session variable

log

void proxySession.log ( String message )

Description

Write to the log.

Parameters

  • message Message to be written to the log, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Write to Log

 // Writes "Inserting request parameters into the database."
 // to the proxy session log

 proxySession.log( "Inserting request parameters into the database." );

setVariable

void proxySession.setVariable ( String identifier, Object value )

Description

Set the value of a proxy session variable.

Parameters

  • identifier Name of the session variable, as a string.
  • value The value to be assigned to the session variable.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Set Session Variable

 // Sets the variable "CITY_CODE" in the proxySession
 // to be equal to the value of the get method of the dataSet
 proxySession.setVariable( "CITY_CODE", dataSet.get( 0, "CITY_CODE" ) );

See Also

  • getVariable() [proxySession] - Gets the value of a proxy session variable

request

A request object references a proxySession page request. Through this object you can control various aspects of the request.

Scripts run in the scraping engine use the scrapeable file to manipulate server requests.

addHTTPHeader

void request.addHTTPHeader ( String key, String value )

Description

Manually add an HTTP header.

Parameters

  • key Name of the HTTP header, as a string.
  • value Value to be associated with the header, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Add HTTP Header

 // Set Cookie header to someCookieValue
 request.addHTTPHeader( "Cookie" , "someCookieValue");

See Also

  • removeHTTPHeader() [request] - Manually removes an HTTP header

addPOSTParameter

void request.addPOSTParameter ( String key, String value )

Description

Add POST parameter to HTTP request.

Parameters

  • key Name of the POST parameter, as a string.
  • value Value of the POST parameter, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Add POST Parameter

 // Add selectedState parameter to the POST variables
 // with a value of AL

 request.addPOSTParameter( "selectedState" , "AL");

See Also

  • removePOSTParameter() [request] - Removes a POST parameter from the HTTP request

getURLAsString

String request.getURLAsString ( )

Description

Retrieve the URL of the request.

Parameters

This method does not receive any parameters.

Return Values

Returns the URL of the request, as a string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Retrieve Request URL

 // Retrieve the URL String
 url = request.getURLAsString();

removeHTTPHeader

void request.removeHTTPHeader ( String key, String value )

Description

Manually remove an HTTP header. Both the key and value have to be specified as HTTP headers allow for multiple headers with the same key.

Parameters

  • key Name of the HTTP header, as a string.
  • value Value to be associated with the header, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Remove HTTP Header

 // Remove the Cookie header with the value someCookieValue
 request.removeHTTPHeader( "Cookie" , "someCookieValue");

See Also

  • addHTTPHeader() [request] - Manually adds an HTTP header

removePOSTParameter

void request.removePOSTParameter ( String key )

Description

Remove POST parameter from HTTP request.

Parameters

  • key Name of the POST parameter, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Remove POST Parameter

 // Removes the POST parameter selectedState
 request.removePOSTParameter( "selectedState" );

See Also

  • addPOSTParameter() [request] - Adds a POST parameter to the HTTP request

setRequestLine

void request.setRequestLine ( String requestMethod, String url, String httpVersion )

Description

Manually set the request line.

Parameters

  • requestMethod HTTP request type, as a string.
  • url Valid uri, as a string.
  • httpVersion HTTP version, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Set Request Line

 // Sets the request line on the request
 request.setRequestLine( "GET" , "http://somesite.com/somepage.html", "HTTP/1.1");

response

The response class provides you with a means for editing the responses received by the proxy server.

Scripts run in the scraping engine use the scrapeable file to manipulate server responses.

addHTTPHeader

void response.addHTTPHeader ( String key, String value )

Description

Add HTTP header to response.

Parameters

  • key Name of the header, as a string.
  • value Value associated with the header, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Add HTTP Header

 // Adds the HTTP Header Set-Cookie with a value
 // of someCookieValue

 response.addHTTPHeader( "Set-Cookie" , "someCookieValue");

See Also

  • removeHTTPHeader() [response] - Removes an HTTP header from the response

getContentAsString

String response.getContentAsString ( )

Description

Retrieve the content of the response.

Parameters

This method does not receive any parameters.

Return Values

Returns the content of the response, as a string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Get the Response Text

 // Retrieve the contents of the response
 content = response.getContentAsString();

See Also

  • setContentAsString() [response] - Manually sets the response content

getStatusLine

String response.getStatusLine ( )

Description

Retrieve the status line of the response.

Parameters

This method does not receive any parameters.

Return Values

Returns the status line of the response, as a string.

Change Log

Version Description
4.5 Available for all editions.

Examples

Get the Status Line Text

 // Retrieve the status line of the response
 statusLine = response.getStatusLine();

See Also

  • setStatusLine() [response] - Manually sets the status line

removeHTTPHeader

void response.removeHTTPHeader ( String key, String value )

Description

Remove HTTP header from response.

Parameters

  • key Name of the header, as a string.
  • value Value associated with the header, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Remove HTTP Header

 // Remove the HTTP Header Set-Cookie that has a
 // value of someCookieValue

 response.removeHTTPHeader( "Set-Cookie" , "someCookieValue");

See Also

  • addHTTPHeader() [response] - Adds an HTTP header to the response

setContentAsString

void response.setContentAsString ( String content )

Description

Manually set the response content.

Parameters

  • content Response text, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Change the Response Text

 // Supply your own content to the response
 response.setContentAsString( "<html> ... </html>");

See Also

  • getContentAsString() [response] - Retrieves the content of the response

setStatusLine

void response.setStatusLine ( String statusLine )

Description

Manually set the status line.

Parameters

  • statusLine New status line declaration, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Set Status Line

 // Set the status line to HTTP/1.1 200 OK
 response.setStatusLine( "HTTP/1.1 200 OK" );

See Also

  • getStatusLine() [response] - Retrieves the status line of the response

Utilities API

Overview

There are many classes that can be very helpful in getting your scripts to run correctly. Many of these were initially developed in-house to speed up coding time and, once they proved stable, were offered to the public. For all classes you will need to import their packages; they are not automatically imported like the built-in screen-scraper objects.

Classes

  • CsvWriter (com.screenscraper.csv): For recording data into a CSV file (helpful for Excel).
  • DataManagerFactory (com.screenscraper.datamanager): Facilitates the creation of an SqlDataManager.
  • ProxyServerPool (com.screenscraper.util): For setting up anonymization using your own proxies.
  • RetryPolicy and RetryPolicyFactory (com.screenscraper.util.retry): Objects that tell a scrapeable file how to check for errors, and optionally what to do before retrying to download them.
  • SqlDataManager (com.screenscraper.datamanager): Facilitates writing of data into a SQL database.
  • XmlWriter (com.screenscraper.xml): Oftentimes you want to write extracted data directly to an XML file. This class facilitates doing that.

Apache Lang Library

Overview

The Apache Lang library provides enhancements to the standard java.lang package and can be particularly useful for common string and object manipulation tasks. As it is not a library that we maintain, we will not document its methods in case they change without our notice, but we invite you to look over how to use it in their API.
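
Using the Apache Lang Library

As with the other utility classes, simply import the package in your script. A small sketch (assuming the bundled commons-lang jar uses the org.apache.commons.lang package):

// Import the Apache Commons Lang classes
import org.apache.commons.lang.StringUtils;

// Pad a scraped ID out to five characters
session.log( StringUtils.leftPad( "42", 5, "0" ) );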

CSVReader

Overview

The CSVReader is not a class that is part of screen-scraper, but it is very useful and well put together; we have used it extensively. It is part of the opencsv package, which actually holds the underpinnings of our own CsvWriter. As it is not a class that we maintain, we will not document the methods in case they change without our notice, but we invite you to look over how to use it in their API or brief documentation.

Using CSVReader

To use the CSVReader, simply import it in your script the same as you would any other utility class. The opencsv.jar file is already included in the default installation of the Professional and Enterprise Editions of screen-scraper.

//import opencsv class
import au.com.bytecode.opencsv.*;

// read file
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
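
Rows can then be read one at a time; a brief sketch using opencsv's readNext method:

// Iterate over the rows; readNext returns null at end of file
String[] nextLine;
while( ( nextLine = reader.readNext() ) != null )
{
    // Each row is an array of column values
    session.log( "First column: " + nextLine[0] );
}

// Close the reader when finished
reader.close();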

CsvWriter

Overview

This CsvWriter has been created to work particularly well with the screen-scraper objects. It is simple to use and provided to ease the task of keeping track of everything when creating a csv file.

The most used methods are documented here but if you would like more information you can read the JavaDoc for the CsvWriter.

CsvWriter

CsvWriter CsvWriter ( String filePath ) (professional and enterprise editions only)
CsvWriter CsvWriter ( String filePath, boolean addTimeStamp ) (professional and enterprise editions only)
CsvWriter CsvWriter ( String filePath, char separator ) (professional and enterprise editions only)
CsvWriter CsvWriter ( String filePath, char separator, boolean addTimeStamp ) (professional and enterprise editions only)
CsvWriter CsvWriter ( String filePath, char separator, char quotechar ) (professional and enterprise editions only)
CsvWriter CsvWriter ( String filePath, char separator, char quotechar, char escapechar ) (professional and enterprise editions only)
CsvWriter CsvWriter ( String filePath, char separator, char quotechar, String lineEnd ) (professional and enterprise editions only)
CsvWriter CsvWriter ( String filePath, char separator, char quotechar, char escapechar, String lineEnd ) (professional and enterprise editions only)

Description

Create a csv file writer.

Parameters

  • filePath File path to where the csv file should be created/saved, as a string.
  • addTimeStamp (optional) If true a time stamp will be added to the filename; otherwise, the filePath will remain unchanged.
  • separator (optional) The character that should be used to separate the fields in the csv file, the default is char 44 (comma).
  • quotechar (optional) The character that should be used to quote fields, the default is char 34 (straight double-quotes).
  • escapechar (optional) The escape character for quotes, the default is char 34 (straight double-quotes).
  • lineEnd (optional) The end of line character, as a string. The default is the new line character ("\n").

Return Values

Returns a CsvWriter object. If it encounters an error it will be thrown.

Change Log

Version Description
5.0 Available for Professional and Enterprise editions.
4.5.18a Introduced in alpha version.

Class Location

com.screenscraper.csv.CsvWriter

Examples

Create CsvWriter

 // Import class
 import com.screenscraper.csv.*;

 // Create CsvWriter with timestamp
 CsvWriter writer = new CsvWriter("output.csv", true);

 // Save in session variable for general access
 session.setVariable( "WRITER", writer);

close

void csvWriter.close ( )

Description

Clear the buffer contents and close the file.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for all editions.
4.5.18a Introduced in alpha version.

Examples

Close CsvWriter

 // Retrieve CsvWriter from session variable
 writer = session.getv( "WRITER" );

 // Write buffer and close file
 writer.close();

flush

void csvWriter.flush ( )

Description

Write the buffer contents to the file.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for all editions.
4.5.18a Introduced in alpha version.

Examples

Write Data Record to CSV

 // Retrieve CsvWriter from session variable
 writer = session.getv( "WRITER" );

 // Write dataRecord to the file (headers already set)
 writer.write(dataRecord);

 // Flush record to file (write it now)
 writer.flush();

setHeader

void csvWriter.setHeader ( String[ ] header )

Description

Set the header row of the csv document. If the document already exists the headers will not be written. Also creates a data record mapping to ease writing to file.

Parameters

  • header Headers of csv file, as a one-dimensional array of strings.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for all editions.
4.5.18a Introduced in alpha version.

If you want to use the data record mapping then the extractor token names should be all caps and all spaces should be replaced with underscores.

Examples

Add Headers to CSV File

 // Create CsvWriter with timestamp
 CsvWriter writer = new CsvWriter("output.csv", true);

 // Create Headers Array
 String[] header = {"Brand Name", "Product Title"};

 // Set Headers
 writer.setHeader(header);

 // Write out to file
 writer.flush();

 // Save in session variable for general access
 session.setVariable( "WRITER", writer);

write

void csvWriter.write ( DataRecord dataRecord )

Description

Write to the CsvWriter object.

Parameters

  • dataRecord The data record containing the mapped token matches (see setHeader). Note that the token names in the data record should be in all caps, and spaces should be replaced with underscores. For example, if one of your headers is "Product ID", the corresponding data record token should be "PRODUCT_ID". This is in keeping with the recommended naming convention for extractor pattern tokens.

Return Values

Returns void.

Change Log

Version Description
5.0 Available for all editions.
4.5.18a Introduced in alpha version.

Examples

Write Data Record to CSV

 // Retrieve CsvWriter from session variable
 writer = session.getv( "WRITER" );

 // Write dataRecord to the file (headers already set)
 writer.write(dataRecord);

 // Flush record to file (write it now)
 writer.flush();

DataManagerFactory

Overview

This class is used to instantiate a data manager object. This is done to simplify the process of creating a data manager of a given type. Currently it only creates SqlDataManagers. A SQL data manager can be created without the use of this class, but it is simplified greatly through its use.

This class should no longer be used. Use an org.apache.commons.dbcp.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.

This class is only available for Professional and Enterprise editions of screen-scraper.

getMsSqlDataManager

This method is no longer supported. Use an org.apache.commons.dbcp.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.

SqlDataManager dataManagerFactory.getMsSqlDataManager ( ScrapingSession session, String host, String database, String username, String password, String queryString ) (professional and enterprise editions only)

Description

Create a MsSQL data manager object.

Parameters

  • session The scraping session that the data manager should be attached to.
  • host The database host (hostname and optional port), as a string.
  • database The name of the database, as a string.
  • username Username used to access the database, as a string.
  • password The username's associated password, as a string.
  • parameters URL-encoded query string of connection parameters, as a string.

Return Values

Returns a SqlDataManager object. Any error encountered will be thrown as an exception.

Change Log

Version Description
5.0 Available for professional and enterprise editions.

To create the MsSQL data manager you will need the appropriate JDBC driver installed. Download the MsSQL JDBC driver and place it in the lib/ext folder of the screen-scraper installation directory.

Examples

Create MsSQL Data Manager

 // Import classes
 import com.screenscraper.datamanager.*;
 import org.apache.commons.dbcp.BasicDataSource;

 // Set Variables
 host = "127.0.0.1";
 database = "mydb";
 username = "user";
 password = "pwrd";
 parameters = null;

 // Get MsSQL datamanager
 dm = DataManagerFactory.getMsSqlDataManager( session, host, database, username, password, parameters);

getMySqlDataManager

This method is no longer supported. Use an org.apache.commons.dbcp.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.

SqlDataManager dataManagerFactory.getMySqlDataManager ( ScrapingSession session, String host, String database, String username, String password, String parameters ) (professional and enterprise editions only)

Description

Create a MySQL data manager object.

Parameters

  • session The scraping session that the data manager should be attached to.
  • host The database host (hostname and optional port), as a string.
  • database The name of the database, as a string.
  • username Username used to access the database, as a string.
  • password The username's associated password, as a string.
  • parameters URL-encoded query string of connection parameters, as a string.

Return Values

Returns a SqlDataManager object. Any error encountered will be thrown as an exception.

Change Log

Version Description
5.0 Available for professional and enterprise editions.

To create the MySQL data manager you will need the appropriate JDBC driver installed. Download the MySQL JDBC driver and place it in the lib/ext folder of the screen-scraper installation directory.

Examples

Create MySQL Data Manager

 // Import classes
 import com.screenscraper.datamanager.*;
 import org.apache.commons.dbcp.BasicDataSource;

 // Set Variables
 host = "127.0.0.1:3306";
 database = "mydb";
 username = "user";
 password = "pwrd";
 parameters = null;

 // Get MySQL datamanager
 dm = DataManagerFactory.getMySqlDataManager( session, host, database, username, password, parameters);

getOracleDataManager

This method is no longer supported. Use an org.apache.commons.dbcp.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.

SqlDataManager dataManagerFactory.getOracleDataManager ( ScrapingSession session, String host, String database, String username, String password, String parameters ) (professional and enterprise editions only)

Description

Create an Oracle data manager object.

Parameters

  • session The scraping session that the data manager should be attached to.
  • host The database host (hostname and optional port), as a string.
  • database The name of the database, as a string.
  • username Username used to access the database, as a string.
  • password The username's associated password, as a string.
  • parameters URL-encoded query string of connection parameters, as a string.

Return Values

Returns a SqlDataManager object. Any error encountered will be thrown as an exception.

Change Log

Version Description
5.0 Available for professional and enterprise editions.

To create the Oracle data manager you will need the appropriate JDBC driver installed. Download the Oracle JDBC driver and place it in the lib/ext folder of the screen-scraper installation directory.

Examples

Create an Oracle Data Manager

 // Import classes
 import com.screenscraper.datamanager.*;
 import org.apache.commons.dbcp.BasicDataSource;

 // Set Variables
 host = "127.0.0.1:3306";
 database = "mydb";
 username = "user";
 password = "pwrd";
 parameters = null;

 // Get Oracle datamanager
 dm = DataManagerFactory.getOracleDataManager( session, host, database, username, password, parameters);

getPostreSqlDataManager

This method is no longer supported. Use an org.apache.commons.dbcp.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.

SqlDataManager dataManagerFactory.getPostreSqlDataManager ( ScrapingSession session, String host, String database, String username, String password, String parameters ) (professional and enterprise editions only)

Description

Create a PostgreSQL data manager object.

Parameters

  • session The scraping session that the data manager should be attached to.
  • host The database host (hostname and optional port), as a string.
  • database The name of the database, as a string.
  • username Username used to access the database, as a string.
  • password The username's associated password, as a string.
  • parameters URL-encoded query string of connection parameters, as a string.

Return Values

Returns a SqlDataManager object. Any error encountered will be thrown as an exception.

Change Log

Version Description
5.0 Available for professional and enterprise editions.

To create the PostgreSQL data manager you will need the appropriate JDBC driver installed. Download the PostgreSQL JDBC driver and place it in the lib/ext folder of the screen-scraper installation directory.

Examples

Create a PostgreSQL Data Manager

 // Import classes
 import com.screenscraper.datamanager.*;
 import org.apache.commons.dbcp.BasicDataSource;

 // Set Variables
 host = "127.0.0.1:3306";
 database = "mydb";
 username = "user";
 password = "pwrd";
 parameters = null;

 // Get PostgreSQL datamanager
 dm = DataManagerFactory.getPostreSqlDataManager( session, host, database, username, password, parameters);

getSqliteDataManager

This method is no longer supported. Use an org.apache.commons.dbcp.BasicDataSource or com.screenscraper.datamanager.SshDataSource instead. See the SqlDataManager.buildSchemas page for examples.

SqlDataManager dataManagerFactory.getSqliteDataManager ( ScrapingSession session, String file, String username, String password ) (professional and enterprise editions only)

Description

Create a SQLite data manager object.

Parameters

  • session The scraping session that the data manager should be attached to.
  • file The file path of the SQLite file, as a string.
  • username Username used to access the database, as a string.
  • password The username's associated password, as a string.

Return Values

Returns a SqlDataManager object. Any error encountered will be thrown as an exception.

Change Log

Version Description
5.0 Available for professional and enterprise editions.

To create the SQLite data manager you will need the appropriate JDBC driver installed. Download the SQLite JDBC driver and place it in the lib/ext folder of the screen-scraper installation directory.

Examples

Create a SQLite Data Manager

 // Import classes
 import com.screenscraper.datamanager.*;
 import org.apache.commons.dbcp.BasicDataSource;

 // Set Variables
 file = "c:/db/mydb.sqlite";
 username = "user";
 password = "pwrd";

 // Get Sqlite datamanager
 dm = DataManagerFactory.getSqliteDataManager( session, file, username, password);

ProxyServerPool

Overview

The proxy server pool object is used to aid with manual anonymization of scrapes. An example of how to set up manual proxy pools is available in the documentation. You will likely want to read that page first if you are new to the process.

Additionally, you should reference the methods available in the Anonymous API.

ProxyServerPool

ProxyServerPool ProxyServerPool ( )

Description

Instantiate a ProxyServerPool object.

Parameters

This method does not receive any parameters.

Return Values

Returns a ProxyServerPool.

Change Log

Version Description
4.5 Available for all editions.

Class Location

com.screenscraper.util.ProxyServerPool

Examples

Creating ProxyServerPool

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object. This object will
 // control how screen-scraper interacts with proxy servers.

 proxyServerPool = new ProxyServerPool();

filter

void proxyServerPool.filter ( int timeout )

Description

Filter the pool of proxy servers, removing any proxy that fails to respond within the given timeout.

Parameters

  • timeout Number of seconds before timeout, as an integer.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Set Timeout for Bad Proxies

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object.
 proxyServerPool = new ProxyServerPool();

 // Must be set on the session before other calls are made
 session.setProxyServerPool(proxyServerPool);

 // This tells the pool to populate itself from a file
 proxyServerPool.populateFromFile( "proxies.txt" );

 // Validate proxies up to 25 proxies at a time.
 proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

 // This method call tells screen-scraper to filter the list of
 // proxy servers using 7 seconds as a timeout value. That is,
 // if a server doesn't respond within 7 seconds, it's deemed
 // to be invalid.

 proxyServerPool.filter( 7 );

getNumProxyServers

int proxyServerPool.getNumProxyServers ( )

Description

Retrieve the number of available proxy servers.

Parameters

This method does not receive any parameters.

Return Values

Returns the number of available proxy servers, as an integer.

Change Log

Version Description
4.5 Available for all editions.

Examples

Check Number of Available Proxies

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object.
 proxyServerPool = new ProxyServerPool();

 // Must be set on the session before other calls are made
 session.setProxyServerPool(proxyServerPool);

 // This tells the pool to populate itself from a file
 proxyServerPool.populateFromFile( "proxies.txt" );

 // Validate proxies up to 25 proxies at a time.
 proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

 // Set timeout interval
 proxyServerPool.filter( 7 );

 // Check number of available proxies
 if (proxyServerPool.getNumProxyServers() < 4)
 {
    // Tell the pool to repopulate itself when it drops below 5 proxies
    proxyServerPool.setRepopulateThreshold( 5 );
 }

outputProxyServersToLog

void proxyServerPool.outputProxyServersToLog ( )

Description

Write the list of proxy servers to the log.

Parameters

This method does not receive any parameters.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Write Proxy Servers to Log

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object.
 proxyServerPool = new ProxyServerPool();

 // Must be set on the session before other calls are made
 session.setProxyServerPool(proxyServerPool);

 // This tells the pool to populate itself from a file
 proxyServerPool.populateFromFile( "proxies.txt" );

 // Validate proxies up to 25 proxies at a time.
 proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

 // Set timeout interval
 proxyServerPool.filter( 7 );

 // Write good proxies to file
 proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

 // You might also want to write out the list of proxy servers
 // to screen-scraper's log.

 proxyServerPool.outputProxyServersToLog();

populateFromFile

void proxyServerPool.populateFromFile ( String filePath )

Description

Add proxy servers to pool using a text file.

Parameters

  • filePath Path to the file containing proxy settings, as a string. The file format is a newline-delimited list of host:port entries.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Creating ProxyServerPool

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object. This object will
 // control how screen-scraper interacts with proxy servers.

 proxyServerPool = new ProxyServerPool();

 // Must be set on the session before other calls are made
 session.setProxyServerPool(proxyServerPool);

 // This tells the pool to populate itself from a file
 // containing a list of proxy servers. The format is very
 // simple--you should have a proxy server on each line of
 // the file, with the host separated from the port by a colon.
 // For example:
 // one.proxy.com:8888
 // two.proxy.com:3128
 // 29.283.928.10:8080
 // But obviously without the slashes at the beginning.

 proxyServerPool.populateFromFile( "proxies.txt" );

setAutomaticProxyCycling

void proxyServerPool.setAutomaticProxyCycling ( boolean cycleProxies ) (professional and enterprise editions only)

Description

Enables or disables automatic proxy cycling (the default is true). When set to false, the proxy most recently selected from the pool will be reused each time the next proxy is requested. When set to true, each call to the getNextProxy method cycles as normal between all available proxies.

Parameters

  • cycleProxies Whether proxies should be cycled automatically, as a boolean.

Return Value

None

Change Log

Version Description
5.5.17a Available in Professional and Enterprise editions.

Example

// Assuming a ProxyServerPool object was created previously, and
// stored in the PROXY_SERVER_POOL session variable.
pool = session.getv( "PROXY_SERVER_POOL" );

// This will cause the current proxy server to be reused until the
// value is set back to true.
pool.setAutomaticProxyCycling( false );

// The corresponding getter will indicate what the current value is.
session.log( "Automatically cycling proxies: " + pool.getAutomaticProxyCycling() );

setNumProxiesToValidateConcurrently

void proxyServerPool.setNumProxiesToValidateConcurrently ( int numProxies )

Description

Set the number of proxies that can be tested concurrently.

Parameters

  • numProxies Number of proxies to be validated concurrently, as an integer.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Test Proxies in Pool in Multiple Threads

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object.
 proxyServerPool = new ProxyServerPool();

 // Must be set on the session before other calls are made
 session.setProxyServerPool(proxyServerPool);

 // This tells the pool to populate itself from a file
 proxyServerPool.populateFromFile( "proxies.txt" );

 // screen-scraper can iterate through all of the proxies to
 // ensure they're responsive. This can be a time-consuming
 // process unless it's done in a multi-threaded fashion.
 // This method call tells screen-scraper to validate up to
 // 25 proxies at a time.

 proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

setRepopulateThreshold

void proxyServerPool.setRepopulateThreshold ( int repopulateThreshold )

Description

Set the threshold at which more proxy servers will be requested.

Parameters

  • repopulateThreshold The lowest number of proxies before more proxies are requested, as an integer.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Set the Repopulate Threshold

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object.
 proxyServerPool = new ProxyServerPool();

 // Must be set on the session before other calls are made
 session.setProxyServerPool(proxyServerPool);

 // This tells the pool to populate itself from a file
 proxyServerPool.populateFromFile( "proxies.txt" );

 // Validate proxies up to 25 proxies at a time.
 proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

 // Set timeout interval
 proxyServerPool.filter( 7 );

 // Write good proxies to file
 proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

 // Write Proxy Servers to log
 proxyServerPool.outputProxyServersToLog();

 // As a scraping session runs, screen-scraper will filter out
 // proxies that become non-responsive. If the number of proxies
 // gets down to a specified level, screen-scraper can repopulate
 // itself. That's what this method call controls.

 proxyServerPool.setRepopulateThreshold( 5 );

writeProxyPoolToFile

void proxyServerPool.writeProxyPoolToFile ( String path )

Description

Write the list of proxies to a file after invalid proxies have been removed.

Parameters

  • path File path to where the file should be written, as a string.

Return Values

Returns void.

Change Log

Version Description
4.5 Available for all editions.

Examples

Write Good Proxies to File

 import com.screenscraper.util.*;

 // Create a new ProxyServerPool object.
 proxyServerPool = new ProxyServerPool();

 // Must be set on the session before other calls are made
 session.setProxyServerPool(proxyServerPool);

 // This tells the pool to populate itself from a file
 proxyServerPool.populateFromFile( "proxies.txt" );

 // Validate proxies up to 25 proxies at a time.
 proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

 // Set timeout interval
 proxyServerPool.filter( 7 );

 // Once filtering is done, it's often helpful to write the good
 // set of proxies out to a file. That way you may not have to
 // filter again the next time.

 proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

RetryPolicy

Overview

Retry policies are objects that tell a scrapeable file how to check for errors, and optionally what to do before retrying to download the file. For example, a policy might execute scripts when a page loads incorrectly, or run Runnables; these would usually request a new proxy, output some helpful information, or simply stop the scrape. RetryPolicy is an interface that can be implemented to create a custom retry policy, or the RetryPolicyFactory class can be used to create some standard policies.

This policy is checked AFTER all the extractors have been run. This allows checks on whether extractor patterns matched, and also allows a page to base its 'error status' on another page (since extractor patterns can execute scripts that scrape other files, and those files can set a variable that acts as a flag for a previous retry policy). It can also cause problems, however, if the scrape isn't built to handle a page whose extractors shouldn't be run before the error checking occurs.
This interface is in the com.screenscraper.util.retry package.

Interface Implementation

If you need a custom retry policy, you can implement your own version of the interface. Be aware that you will need to ensure that any references the policy holds to a scrapeableFile point to the correct scrapeableFile; this can be tricky if you use the session.setDefaultRetryPolicy method. When using the scrapeableFile.setRetryPolicy method, the scrapeableFile will be the correct object. The interface is given below.

To help ensure you can create custom retry policies that have access to the scraping session and the scrapeable file currently being checked, there is an AbstractRetryPolicy class in the same package as the interface. This class defines some default behavior and adds protected fields for the session and scrapeable file that get set before the policy is run. If you extend this abstract class you can access the session and scrapeable file through this.scrapingSession and this.theScrapeableFile. Due to some oddities with the interpreter, it is best to reference these variables with 'this.' to avoid problems that arise in a few specific cases.
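
As a minimal sketch of that approach (assumed behavior, not taken verbatim from this manual), the following extends AbstractRetryPolicy and references its protected fields through 'this.', as recommended above. It assumes the abstract class supplies workable defaults for the interface methods not overridden here, and the "Rotate Proxy" script name is hypothetical.

 import com.screenscraper.util.retry.RetryPolicy;
 import com.screenscraper.util.retry.AbstractRetryPolicy;

 RetryPolicy policy = new AbstractRetryPolicy()
 {
   boolean isError() throws Exception
   {
     // this.theScrapeableFile is set by screen-scraper before the policy runs
     return this.theScrapeableFile.wasErrorOnRequest();
   }

   void runOnError() throws Exception
   {
     // this.scrapingSession gives access to the current session;
     // "Rotate Proxy" is a hypothetical script name
     this.scrapingSession.executeScript( "Rotate Proxy" );
   }
 };

 scrapeableFile.setRetryPolicy(policy);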

public interface RetryPolicy
{
        /**
         * Checks to see if the page loaded incorrectly
         *
         * @return True on errors, false otherwise
         * @throws Exception If something goes wrong while executing this method
         */

        public boolean isError() throws Exception;

        /**
         * Runs this code when the page had an error.  This could include things such as rotating the proxy.
         *
         * @throws Exception If something goes wrong while executing this method
         */

        public void runOnError() throws Exception;

        /**
         * Returns a map that can be used to output an error message to indicate what checks failed.  For instance,
         * you could set a key to the value "Status Code" and the value '200', or a key with "Valid Page" and value 'false'
         *
         * @return Map of keys, or null if no values are indicated
         *
         * @throws Exception If something goes wrong while executing this method
         */

        public Map getErrorChecksMap() throws Exception;

        /**
         * Returns true if the session variables should be reset before attempting to rescrape the file, if there was an error.
         * This can be useful especially if extractors null session variables when they don't match, but the value is needed
         * to rescrape the file.
         *
         * @return True if session variables should be reset if there was an error, false otherwise.
         */

        public boolean resetSessionVariablesBeforeRescrape();

        /**
         * Returns true if the referrer should be reset before attempting to rescrape the file,
         * if there was an error. This can be useful to reset so the referrer
         * doesn't show the page you just requested.
         *
         * @return True if the referrer should be reset if there was an error, false otherwise.
         */

        public boolean resetReferrerBeforeRescrape();

        /**
         * Returns true if errors should be logged to the log/web interface when they occur
         *
         * @return True if errors should be logged to the log/web interface when they occur
         */

        public boolean shouldLogErrors();

        /**
         * Return the maximum number of times this policy allows for a retry before terminating in an error
         *
         * @return The maximum number of times to allow the ScrapeableFile to be rescraped before resulting in an error
         */

        public int getMaxRetryAttempts();

        /**
         * This will be called if all the retry attempts for the scrapeable file failed.
         * In other words, if the policy said to retry 25 times, after 25 failures this
         * method will be called.  Note that {@link #runOnError()} will be called just before this,
         * as it is called after each time the scrapeable file fails to load
         * correctly, including the last time it fails to load.
         * <p/>
         * This should only contain code that handles the final error.  Any proxy rotating, cookie
         * clearing, etc... should generally be done in the {@link #runOnError()}
         * method, especially since it will still be called after the final error.
         */

        public void runOnAllAttemptsFailed();
}

getErrorChecksMap

Map getErrorChecksMap ( )

Description

Returns a map that can be used to output an error message indicating which checks failed. For instance, you could map the key "Status Code" to the value '200', or the key "Valid Page" to the value 'false'.

Parameters

This method takes no parameters

Return Value

Map of keys, or null if no values are indicated

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Create a custom RetryPolicy

 import com.screenscraper.util.retry.RetryPolicy;
 
 _log = log;
 _session = session;
 
 RetryPolicy policy = new RetryPolicy()
 {
   Map errorMap = new HashMap();

   boolean isError() throws Exception
   {
     errorMap.put("Was Error On Request", scrapeableFile.wasErrorOnRequest());
     return scrapeableFile.wasErrorOnRequest();
   }

   void runOnError() throws Exception
   {
     session.executeScript("Rotate Proxy");
   }

   Map getErrorChecksMap() throws Exception
   {
     return errorMap;
   }

   boolean resetSessionVariablesBeforeRescrape()
   {
     return true;
   }

   boolean shouldLogErrors()
   {
     return true;
   }

   int getMaxRetryAttempts()
   {
     return 5;
   }
   
   boolean resetReferrerBeforeRescrape()
   {
      return false;
   }
   
   void runOnAllAttemptsFailed()
   {
      _log.logError("Failed to fix errors with the retry policy, stopping scrape");
      _session.stopScraping();
   }
 };

 scrapeableFile.setRetryPolicy(policy);

getMaxRetryAttempts

int getMaxRetryAttempts ( )

Description

Return the maximum number of times this policy allows for a retry before terminating in an error

Parameters

This method takes no parameters

Return Value

The maximum number of times to allow the ScrapeableFile to be rescraped before resulting in an error

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Create a custom RetryPolicy

 import com.screenscraper.util.retry.RetryPolicy;
 
 _log = log;
 _session = session;
 
 RetryPolicy policy = new RetryPolicy()
 {
   Map errorMap = new HashMap();

   boolean isError() throws Exception
   {
     errorMap.put("Was Error On Request", scrapeableFile.wasErrorOnRequest());
     return scrapeableFile.wasErrorOnRequest();
   }

   void runOnError() throws Exception
   {
     session.executeScript("Rotate Proxy");
   }

   Map getErrorChecksMap() throws Exception
   {
     return errorMap;
   }

   boolean resetSessionVariablesBeforeRescrape()
   {
     return true;
   }

   boolean shouldLogErrors()
   {
     return true;
   }

   int getMaxRetryAttempts()
   {
     return 5;
   }
   
   boolean resetReferrerBeforeRescrape()
   {
      return false;
   }
   
   void runOnAllAttemptsFailed()
   {
      _log.logError("Failed to fix errors with the retry policy, stopping scrape");
      _session.stopScraping();
   }
 };

 scrapeableFile.setRetryPolicy(policy);

isError

boolean isError ( )

Description

Checks to see if the page loaded incorrectly

Parameters

This method takes no parameters

Return Value

True on errors, false otherwise

Change Log

Version Description
5.5.29a Available in all editions.

Examples

Create a custom RetryPolicy

 import com.screenscraper.util.retry.RetryPolicy;
 
 _log = log;
 _session = session;
 
 RetryPolicy policy = new RetryPolicy()
 {
   Map errorMap = new HashMap();

   boolean isError() throws Exception
   {
     errorMap.put("Was Error On Request", scrapeableFile.wasErrorOnRequest());
     return scrapeableFile.wasErrorOnRequest();
   }

   void runOnError