Miscellaneous

Overview

This section is provided to give additional information about the software, how it works, and the technologies behind screen-scraper. Many of these pages contain links to sites that are not under our control. We chose them for their quality at the time of writing. If a link breaks or its content changes, please contact us so that we can keep the links relevant.

Code Completion

Overview

As of screen-scraper 5.0, simple code completion is available when editing scripts. It is meant to make method names and their parameters easier to remember. It shows you this information along with a link back to the documentation on each method.

Using Code Completion

To activate the dialog, simply type the name of a built-in object followed by a period (just as you would when coding). If you pause after the period, the dialog will pop up and allow you to click through the methods of the object. As you type, the list narrows until it reaches the method you are looking for. Double-clicking a method, or hitting the tab key while it is selected, inserts the remaining code into your script with placeholders for the parameters. Type in the value of a parameter and hit tab to jump to the next one. When you are finished, the last tab jumps to the end of the method call.

Macros

In addition to code completion, there are a number of built-in macros for common tasks. To activate a macro, simply type its code and then hit the spacebar while holding down the Ctrl key.

Macro Codes

  • sv - Set a session variable.
  • gv - Get a session variable.
  • dp - Put a value into the dataRecord.
  • dg - Get a value from the dataRecord.
  • es - Manually initiate a script.
  • sf - Manually initiate a file scrape.
  • log - Log a debug message.
  • err - Log an error message.
  • warn - Log a warning message.
  • info - Log an information message.
  • extract - Manually invoke an extractor pattern.
  • for - Set up a for loop.

Generating RSS and Atom Feeds

Overview

This feature is only available to the Enterprise edition of screen-scraper.

screen-scraper has the ability to automatically generate RSS and Atom feeds from extracted data. If you're unfamiliar with RSS and Atom feeds you might take a minute to read up on the topic first.

The documentation on this page is a bit abstract. If you're interested in building RSS/Atom feeds with screen-scraper it would probably be a good idea for you to go through our Sixth Tutorial, which will walk you through the process in detail.

How it Works

A small web server runs within screen-scraper and interacts with the scraping engine. As such, you can access a URL from a browser or RSS/Atom reader that causes screen-scraper to invoke a scraping session and then return an RSS or Atom feed.

The basic syntax for the URL you'll use to generate a feed looks like this:

 http://(host:port)/ss/xmlfeed?scraping_session=(scraping session name)[&key1=value1&key2=value2...]

For example, if you were running screen-scraper on your local machine and wanted to generate a feed for the "Shopping Site" example used in our tutorials with the search term "bug", the URL would look like this:

 http://localhost/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug

As with any other URL, each of the parameters must be properly URL-encoded. Key/value pairs can also be passed in as POST parameters.

The only required parameter is "scraping_session". screen-scraper will create session variables out of any other parameters that get passed in.
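As a minimal sketch of the URL syntax above, the following Python snippet assembles a feed URL and URL-encodes each parameter. The helper name build_feed_url is hypothetical and not part of screen-scraper; it only builds the string and does not contact the server.

```python
from urllib.parse import urlencode


def build_feed_url(host, scraping_session, **session_variables):
    """Return an xmlfeed URL for the given host and scraping session.

    `host` is "hostname" or "hostname:port". Extra keyword arguments are
    sent as key/value pairs, which screen-scraper turns into session
    variables.
    """
    # scraping_session is the only required parameter, so it goes first.
    params = {"scraping_session": scraping_session}
    params.update(session_variables)
    # urlencode escapes every key and value (a space becomes "+").
    return "http://%s/ss/xmlfeed?%s" % (host, urlencode(params))


url = build_feed_url("localhost", "Shopping Site", SEARCH="bug")
print(url)
```

For the "Shopping Site" example this reproduces the URL shown earlier. The same encoded key/value string could also be sent as the body of a POST request instead of being appended to the URL.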

Setting Up the Scraping Session

The scraping session uses the following named elements to generate the feed. Only XML_FEED is required:

  • XML_FEED_TITLE (optional): A String session variable containing the name that will be used for the entire feed. (e.g., "CNN Headlines")
  • XML_FEED_LINK (optional): A String session variable containing the link associated with the feed. (e.g., "http://www.cnn.com/")
  • XML_FEED_DESCRIPTION (optional): A String session variable containing the description of the feed. (e.g., "The latest news headlines from CNN.com")
  • XML_FEED_FORMAT (optional): A String session variable indicating the format of the feed. Valid values are atom_0.3, rss_0.9, rss_0.91N, rss_0.91U, rss_0.92, rss_0.93, rss_0.94, rss_1.0, and rss_2.0. If omitted, the default value is rss_1.0.
  • XML_FEED (required): A session variable holding a DataSet consisting of DataRecords that will make up the various feed items (e.g., each news headline). Each DataRecord should contain values using the following names:
      • TITLE: The title of the feed item.
      • LINK: The link of the feed item.
      • DESCRIPTION: The description of the feed item.
      • PUBLISHED_DATE: The published date of the feed item. This should be a Java Date object.

When the XML feed is requested through your browser or reader, screen-scraper invokes the scraping session named by the "scraping_session" parameter. Once the scraping session completes, screen-scraper looks for a DataSet called "XML_FEED" and iterates over its constituent DataRecord objects, building the feed from them.
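To make the naming contract above concrete, here is a small stand-alone Python sketch that checks a set of feed variables before a scrape is run. It does not use the screen-scraper API: a plain dict stands in for the session variables, a list of dicts stands in for the DataSet of DataRecords, and a Python datetime stands in for the Java Date. The function name check_feed_variables and the sample headline are made up for illustration.

```python
from datetime import datetime

# Feed formats listed in the documentation above.
VALID_FORMATS = {
    "atom_0.3", "rss_0.9", "rss_0.91N", "rss_0.91U", "rss_0.92",
    "rss_0.93", "rss_0.94", "rss_1.0", "rss_2.0",
}
DEFAULT_FORMAT = "rss_1.0"
ITEM_KEYS = ("TITLE", "LINK", "DESCRIPTION", "PUBLISHED_DATE")


def check_feed_variables(variables):
    """Return a list of problems found in a dict of feed variables."""
    problems = []
    # XML_FEED is the only required element.
    if "XML_FEED" not in variables:
        problems.append("XML_FEED is required")
    # A missing format falls back to the documented default.
    fmt = variables.get("XML_FEED_FORMAT", DEFAULT_FORMAT)
    if fmt not in VALID_FORMATS:
        problems.append("unknown XML_FEED_FORMAT: %s" % fmt)
    # Each feed item should carry all four named values.
    for index, record in enumerate(variables.get("XML_FEED", [])):
        for key in ITEM_KEYS:
            if key not in record:
                problems.append("item %d is missing %s" % (index, key))
    return problems


variables = {
    "XML_FEED_TITLE": "CNN Headlines",
    "XML_FEED_LINK": "http://www.cnn.com/",
    "XML_FEED_DESCRIPTION": "The latest news headlines from CNN.com",
    "XML_FEED": [
        {
            "TITLE": "Placeholder headline",
            "LINK": "http://www.cnn.com/",
            "DESCRIPTION": "Placeholder description",
            "PUBLISHED_DATE": datetime(2020, 1, 1),
        },
    ],
}
print(check_feed_variables(variables))
```

An empty list means every name the feed generator looks for is present. Inside a real scraping session the same names would be set as session variables and DataRecord values rather than dict keys.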

How HTTP Works

Overview

Hypertext transfer protocol provides a way for clients such as web browsers to communicate with web servers. There's quite a bit on the web that's written on the topic, so for the time being we'll just provide some good links for you:

Importing and Exporting Objects

Overview

Scraping sessions and scripts can be exported from screen-scraper to external files. You might do this to back up your work, or to commit the files to a version control system such as CVS or Subversion.

Exporting Objects from screen-scraper

To export a scraping session or script to an external file, select the object you wish to export, then click the corresponding Export button (Export Session or Export Script). You'll be asked to save the file to a location of your choice. You're free to name the file what you wish, though we recommend you leave the (scraping session) or (script) portion of the name intact so that you can identify the type of the object later on. When you export a scraping session from screen-scraper, all scripts directly associated with that scraping session are exported within the same file.

When a scraping session is exported the time of export is also included in the resulting file. This date can be useful to track versions of the scraping session. To view the date, open the .sss file in a text editor and search for the XML node.

Importing Objects into screen-scraper

To import a scraping session or script into screen-scraper select the Import... option from the File menu. Locate the ".sss" file corresponding to the object you wish to import, and select Open. If you've selected a valid file the objects contained within that file will be imported into the application.

Import Directory

You can also import exported scraping sessions and scripts into screen-scraper by copying them into the import folder you'll find in the directory where screen-scraper was installed. This can be especially useful while screen-scraper is running as a server, which allows the objects to be imported on the fly (that is, without stopping the server). screen-scraper will check this directory just before executing a scraping session, and import any files found in it. Note that imported files will be removed from the import folder once they are imported by screen-scraper.
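The copy step described above can be scripted. The following Python sketch copies an exported file into the import folder; the function name drop_into_import is hypothetical, and the install directory is whatever path screen-scraper was installed to on your machine.

```python
import shutil
from pathlib import Path


def drop_into_import(export_file, install_dir):
    """Copy an exported .sss file into <install_dir>/import.

    Returns the path of the copied file.
    """
    import_dir = Path(install_dir) / "import"
    # Fail loudly rather than create a folder screen-scraper is not watching.
    if not import_dir.is_dir():
        raise FileNotFoundError("no import folder at %s" % import_dir)
    # Copying into a directory keeps the original file name.
    return Path(shutil.copy(export_file, import_dir))
```

Because the original is copied rather than moved, your own copy of the export survives after screen-scraper removes the file from the import folder.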

Update Directory

In cases where you want to pack up scraping sessions and scripts along with other files needed to run a scrape, you can compress them all into an update.zip file. This file should replicate the directory structure of screen-scraper. For example, you might have a folder called import that contains a scraping session. You might also have a CSV file in the root of the zip file that contains parameters needed to run the scraping session. You can zip all of these up into an update.zip file, then place that file inside an update folder found in screen-scraper's install directory. When screen-scraper starts up it will unzip the file, copy all of its contents to the corresponding locations, then delete the update.zip file.
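One way to assemble such an archive is sketched below in Python. The function name build_update_zip and every file name in the usage note are hypothetical; the only assumption taken from the text above is that paths inside the zip mirror paths relative to the screen-scraper install directory.

```python
import zipfile
from pathlib import Path


def build_update_zip(zip_path, files):
    """Write a zip archive and return its path.

    `files` maps each path inside the archive (relative to the
    screen-scraper install directory) to a source file on disk.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for archive_name, source in files.items():
            # arcname sets where the file lands when the archive is unpacked.
            archive.write(source, arcname=archive_name)
    return Path(zip_path)
```

For instance, mapping "import/my_session.sss" to an exported scraping session and "parameters.csv" to a CSV file would produce an archive with a scraping session in its import folder and the CSV in its root. Write the result as update.zip inside the update folder of the install directory.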

Overwriting Import

If you've un-checked the Overwrite on import checkbox for a script, and would like to import that script into an instance of screen-scraper that is running in a GUI-less environment, follow the instructions on script overwriting.

Memory usage indicator

Overview

The memory usage indicator was introduced in screen-scraper 4.5 and shows how much of the memory currently allocated to screen-scraper is in use. As screen-scraper requires more, the underlying Java Virtual Machine may allocate it additional memory, up to the amount specified in the settings dialog box.

In the workbench, the indicator is on the far right of the main window's status bar. In the Enterprise Edition's web interface, the indicator is at the top of the page under the Import button.

The current memory usage can also be queried in a script via the getMemoryUsage method.

Updating screen-scraper in a GUI-less environment

Overview

Often screen-scraper will be running on a server that has no graphical interface. Updating to the latest version in such an environment previously required multiple steps, but can now be done with a simple Python script.

You can download the script from our site.

Any Unix-based computer worth its salt will already have Python installed. To use the updater, open a terminal and navigate to the screen-scraper install directory. Ensure that screen-scraper is not currently running (via ./server status). After that, issue this command to update to the latest version:

python ss_updater.py

If you want to force screen-scraper to upgrade to the latest unstable version, use this command:

python ss_updater.py -u