General Utility
Some tasks come up on a regular basis. While you can write a separate script for each of them, it is sometimes more useful to create an object that stores information for use across scripts, much like an object in Java. Below is a general utility script that contains many useful functions. The first few hundred lines list the methods and what they are used for. The script is rather large (over 6500 lines), so please download it to view it.
The script is set up to create a Utility object when run and store it in the session variable "_GENERAL_UTILITY". In general, this script should run before anything else in the scrape. To use the object later in the scrape, retrieve it from the session.
Example basic usage
util = session.getVariable("_GENERAL_UTILITY");
// Remove anything that isn't a digit or decimal
// A number such as 5,678.77 would be returned as 5678.77
dataRecord.put("PRICE", util.formatNumber(dataRecord.get("PRICE")));
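The comments in the example describe the cleanup as removing anything that is not a digit or a decimal point. As a rough illustration of that idea only, here is a minimal plain-Java sketch. The class NumberCleaner and the method stripToNumber are hypothetical names chosen for this example; this is not the utility's actual formatNumber implementation, which may handle more cases.

```java
// Hypothetical stand-in that mimics the behavior described in the comments above.
// It is NOT the utility's real formatNumber method.
final class NumberCleaner {

    // Removes every character that is not an ASCII digit or a period.
    // Returns null for null input so callers can pass values straight from a record.
    static String stripToNumber(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.replaceAll("[^0-9.]", "");
    }

    public static void main(String[] args) {
        // Currency symbols and thousands separators are dropped.
        System.out.println(stripToNumber("$5,678.77")); // prints 5678.77
    }
}
```

Note that this simple pattern also strips a leading minus sign, so a sketch like this would only suit values that are never negative.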
Example of advanced scrape monitoring
// Get a reference to the utility
util = session.getVariable("_GENERAL_UTILITY");
// Set it to log the contents of all session variables that start with SEARCH_ each time it writes to the log
util.addMonitoredPrefix("SEARCH_");
// Also watch a specific session variable named WATCH_ME
util.addMonitoredVariable("WATCH_ME");
session.setVariable("DATASET", dataSet);
util.addMonitoredVariable("DATASET");
// Iterate over letters of the alphabet for a search on the site we are scraping
// and track the progress in the log
letterProgress = util.createProgressBar();
letterProgress.setTitle("Letters");
letterProgress.setTotal(26); // 26 letters to search
for(char c = 'a'; c <= 'z'; c++)
{
    session.setVariable("SEARCH_LETTER", c);
    session.scrapeFile("Search page");
    // Increment the progress for the current letter search
    letterProgress.add(1);
    // Output a message to the log with the value of all currently monitored session variables
    // and the progress (and estimated remaining scrape time).
    // ** Note that when running in server mode and with enterprise edition, this will also output
    // an easy-to-read message and progress bar in the web interface.
    util.webMessage("Completed letter: " + c);
}
// Now that this loop is completed, remove the corresponding progress bar
util.removeProgressBar(letterProgress.getIndex());
// I like to end all my scrapes with a webClose() so the log ends with a snapshot of the values
// at the end of the scrape. This is just personal preference.
util.webClose("Scrape completed");
The output in the log from the above example would be something like the following, depending on the value of other variables that had been set.
Running in Workbench/Command Line Mode or on Professional Edition, message sent to log instead of web interface.
=================== Log Variables with Message ===============
Completed letter: i
=================== Current Scrape Progress ===================
=== Letters: 34.61538461538461% (9.0 of 26.0) ===
5 minutes, 10 seconds, 201 ms, 543 ps, 902 ns
=================== Variables being monitored ===============
DATASET : DataSet
--- Record 0 : DataRecord
------ A_DATARECORD : DataRecord
--------- KEY : value in key
--------- KEY2 : value in key2
------ SOME_KEY : text
------ SOME_OTHER_KEY : other text
--- Record 1 : DataRecord
------ A_DATARECORD : DataRecord
--------- KEY : extracted data
--------- KEY2 : other data
------ SOME_KEY : 1
------ SOME_OTHER_KEY : other text
SEARCH_LETTER : i
WATCH_ME : null
================ End variables being monitored ==============
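The progress line in a log like the one above comes down to simple arithmetic: the percentage is the completed count divided by the total, and a remaining-time estimate can be derived from the time spent so far. The sketch below shows one way to compute both in plain Java. The class ProgressEstimate and its methods are hypothetical, and the linear extrapolation of remaining time is an assumption; the utility may estimate time differently.

```java
// Hypothetical sketch of progress arithmetic; not the utility's actual code.
final class ProgressEstimate {

    // Percentage of work completed, or 0 when the total is not positive.
    static double percentComplete(double done, double total) {
        if (total <= 0) {
            return 0.0;
        }
        return done / total * 100.0;
    }

    // Estimates remaining time by assuming each remaining item takes
    // the average time of the items completed so far (linear extrapolation).
    // Returns -1 when nothing has completed yet, since there is no basis for an estimate.
    static long estimateRemainingMillis(long elapsedMillis, long done, long total) {
        if (done <= 0) {
            return -1;
        }
        return elapsedMillis * (total - done) / done;
    }

    public static void main(String[] args) {
        // 9 of 26 letters searched after 9 seconds of scraping.
        System.out.println(percentComplete(9, 26) + "%");
        System.out.println(estimateRemainingMillis(9000, 9, 26) + " ms remaining");
    }
}
```

A linear estimate is only as good as the assumption that each search takes about the same time; a letter with many more results would skew it.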
The monitored variables section tries to output common data types in a readable form. For instance, DataSet and DataRecord objects are output as shown above for the DATASET variable. Other classes with similar output are List, Set, Map, and Exception. In Enterprise Edition, a monitored ScrapeableFile is also output in the web interface as a clickable link that opens the URL with the same POST request the file used. The link does not set cookies, so the page may or may not display as expected.
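To make the nested layout concrete, here is a minimal plain-Java sketch that prints nested Map and List values with one "---" group per nesting level, similar in spirit to the log format shown above. The class VariableDump and its dump method are hypothetical and use only standard collections in place of DataSet and DataRecord; the utility's real output logic is not shown in this document and may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a nested variable dump; not the utility's actual code.
final class VariableDump {

    // Appends "name : value" to out, recursing into Maps and Lists.
    // Each nesting level is prefixed with one more "---" group.
    static void dump(String name, Object value, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("---");
        }
        if (depth > 0) {
            out.append(' ');
        }
        out.append(name).append(" : ");
        if (value instanceof Map) {
            out.append("Map\n");
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                dump(String.valueOf(entry.getKey()), entry.getValue(), depth + 1, out);
            }
        } else if (value instanceof List) {
            out.append("List\n");
            List<?> items = (List<?>) value;
            for (int i = 0; i < items.size(); i++) {
                dump("Record " + i, items.get(i), depth + 1, out);
            }
        } else {
            // Plain values (including null) are printed on a single line.
            out.append(value).append('\n');
        }
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("SOME_KEY", "text");
        record.put("SOME_OTHER_KEY", "other text");

        StringBuilder out = new StringBuilder();
        dump("DATASET", List.of(record), 0, out);
        dump("WATCH_ME", null, 0, out);
        System.out.print(out);
    }
}
```

A LinkedHashMap is used in the usage example so that keys print in insertion order; a plain HashMap would give no ordering guarantee.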
Update
This script will periodically be updated with new functionality. Recently it was converted to a .jar file to increase the speed during execution. Because of this, if the jar version is not in the lib/ext directory of Screen-Scraper, an error will be logged when the script is run, but everything should still work. The error simply informs you that the script version is being run, and so it will not be as fast and may be missing a few features that could not be put in script form.