The sutil class provides general functions used to manipulate and work with extracted data. It also allows you to get information regarding screen-scraper such as its memory usage or version.
Images
Overview
In the course of a scrape you might want to gather images associated with the other information being gathered. These methods not only download the images but also report their dimensions and resize them to your desired size.
These methods are only available to enterprise edition users.
getImageHeight
int sutil.getImageHeight ( String imagePath ) (enterprise edition only)
Description
Get the height of an image.
Parameters
imagePath File path to the image, as a string.
Return Values
Returns the height in pixels of the image file, as an integer. If the file doesn't exist or is not an image an error will be thrown and -1 will be returned.
Change Log
Version
Description
5.0
Moved from session to sutil.
4.5
Available for enterprise edition.
Examples
Write Image Height to Log
// Output the height of the image to the log.
session.log("Image height: "+ sutil.getImageHeight("C:/my_image.jpg"));
getImageWidth
int sutil.getImageWidth ( String imagePath ) (enterprise edition only)
Description
Get the width of an image.
Parameters
imagePath File path to the image, as a string.
Return Values
Returns the width in pixels of the image file, as an integer. If the file doesn't exist or is not an image an error will be thrown and -1 will be returned.
Change Log
Version
Description
5.0
Moved from session to sutil.
4.5
Available for enterprise edition.
Examples
Write Image Width to Log
// Output the width of the image to the log.
session.log("Image width: "+ sutil.getImageWidth("C:/my_image.jpg"));
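For readers who want to see what such a size lookup involves, here is a minimal sketch in plain Java using the standard javax.imageio API. It is not how sutil is implemented; the class name ImageSizeSketch and the method sizeOf are hypothetical, and the -1 fallback simply mirrors the return value documented above.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class ImageSizeSketch {
    // Returns {width, height} in pixels, or {-1, -1} when the file
    // is missing or cannot be decoded as an image.
    static int[] sizeOf(String imagePath) {
        try {
            BufferedImage image = ImageIO.read(new File(imagePath));
            if (image == null) {
                // ImageIO returns null when no reader understands the file
                return new int[] {-1, -1};
            }
            return new int[] {image.getWidth(), image.getHeight()};
        } catch (IOException e) {
            // Thrown, for example, when the file does not exist
            return new int[] {-1, -1};
        }
    }

    public static void main(String[] args) {
        int[] size = sizeOf("C:/my_image.jpg");
        System.out.println("Image width: " + size[0] + ", height: " + size[1]);
    }
}
```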
resizeImage
Overview
Internally, only one function is used to resize all images; however, to facilitate the resizing of images, we have provided you with three methods. Each method will help you specify what measurement is most important (width or height) and whether the image should retain its aspect ratio.
resizeImageFixHeight() [sutil] - Resize image, retaining aspect ratio, based on specified height.
resizeImageFixWidth() [sutil] - Resize image, retaining aspect ratio, based on specified width.
To be used in conjunction with the ImageDecoder class.
This class represents decoded images. The objects can be queried for the text that was in the image, as well as any error that occurred while the image was being decoded. When the returned text is incorrect, there is a method that can be used to report it as bad. This can be used for sites like decaptcher.com, where refunds are given for incorrectly interpreted images.
getError
String getError ( )
Description
Gets any error message, or returns null if there was no error
Parameters
This method takes no parameters
Return Value
The error message returned
Error messages
OK Nothing went wrong
BALANCE_ERROR Insufficient funds with paid service
NETWORK_ERROR General network error (timeout, lost connection, server busy, etc...)
INVALID_LOGIN Credentials are invalid
GENERAL_ERROR General error, possibly image was bad or the site couldn't resolve it. See the error message for details
UNKNOWN Unknown error
Change Log
Version
Description
5.5.29a
Available in all editions.
Examples
Convert an image to text
import com.screenscraper.util.images.*;
// Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
DecodedImage result = decoder.decodeFile("someFile.jpg");
Gets the result from decoding the image. Most likely this will be a String, but each implementation could return a specific object type.
Parameters
This method takes no parameters
Return Value
The text extracted from the image, or null if there was an error
Change Log
Version
Description
5.5.29a
Available in all editions.
Examples
Convert an image to text
import com.screenscraper.util.images.*;
// Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
DecodedImage result = decoder.decodeFile("someFile.jpg");
Handles an incorrectly resolved image. Some types of decoders won't have anything here
Parameters
This method takes no parameters
Return Value
This method returns void.
Change Log
Version
Description
5.5.29a
Available in all editions.
Examples
Convert an image to text
import com.screenscraper.util.images.*;
// Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
DecodedImage result = decoder.decodeFile("someFile.jpg");
Returns true if there was an error, false otherwise. Also returns false if the image has not been resolved yet
Parameters
This method takes no parameters
Return Value
True if there was an error, false otherwise
Change Log
Version
Description
5.5.29a
Available in all editions.
Examples
Convert an image to text
import com.screenscraper.util.images.*;
// Assuming an ImageDecoder was created in a different location and saved in "IMAGE_DECODER"
ImageDecoder decoder = session.getVariable("IMAGE_DECODER");
DecodedImage result = decoder.decodeFile("someFile.jpg");
Class to convert images to text for interacting with CAPTCHA challenges. There are currently two implementations:
ManualDecoder: Creates a pop-up window for a user to enter in the text they read from the image
DecaptcherDecoder: Interface for the paid service decaptcher.com
When a reference to an image is passed to an instance of this class, it returns a DecodedImage object that can be queried for the resulting text, errors, and can report an image as poorly converted.
Type of ImageDecoder in the com.screenscraper.util.images package that uses the decaptcher.com service to convert images to text. The constructor is DecaptcherDecoder(ScrapingSession session, String username, String password) or DecaptcherDecoder(ScrapingSession session, String username, String password, String apiUrl).
Parameters
session The currently running scraping session (a ScrapingSession object).
username Username used to log in to decaptcher.com service.
password Password used to log in to decaptcher.com service.
port The port given by De-captcher.com to access your account on their site.
apiUrl (optional) URL used to access decaptcher.com service. This setting will override the default URL.
Return Values
Returns void. If it runs into any problems accessing the decaptcher.com service an error will be thrown.
Change Log
Version
Description
5.5.29a
Available in all editions
5.5.40a
Added the port parameter. The service now requires the correct port in order to authenticate.
Type of ImageDecoder in the com.screenscraper.util.images package that uses a popup window prompting the user to enter the text read from an image. Useful for debugging purposes, as the input text should always be correct (so long as it is typed correctly). Helpful during testing to avoid costs associated with paid-for CAPTCHA decoding services such as decaptcher.com.
Parameters
session The currently running scraping session (a ScrapingSession object).
Return Values
Returns void. If it runs into any problems decoding an image an error will be thrown.
Converts the image given to a DecodedImage that will handle it. Does not delete the file.
Parameters
file The image file
Return Value
A DecodedImage used to get the text, errors, and possibly report a result as bad.
Change Log
Version
Description
5.5.29a
Available in all editions.
Examples
image = decoder.decodeFile("path to the image file");
decodeURL
DecodedImage decodeURL ( String url )
Description
Converts the image at the given URL to a DecodedImage that will handle it. Temporarily saves the file in the screen-scraper root folder, but deletes it once it has been decoded. By default, this will use the scraping session's HttpClient to request the URL.
Parameters
url The url to the image
Return Value
A DecodedImage used to get the text, errors, and possibly report a result as bad.
String sutil.convertDateToString ( Date date ) (professional and enterprise editions only)
String sutil.convertDateToString ( Date date, String format ) (professional and enterprise editions only)
Description
Converts the Date given to a string in a specified format, or in the format "MM/dd/yyyy HH:mm:ss.SS zzz" if no format is given.
Parameters
date The date to convert
format (optional) A String representation (as a SimpleDateFormat) for the output
Return Values
A String representing the date given
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
// Log the current time
Date now = new Date();
session.log(sutil.convertDateToString(now, "MM/dd/yyyy HH:mm:ss zzz"));
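Because the format parameter is a SimpleDateFormat pattern, the conversion can be approximated with the standard library. The sketch below is only an illustration under that assumption; the class DateToStringSketch, its convert method, and the explicit time zone argument are hypothetical and not part of sutil.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateToStringSketch {
    // The default pattern documented for convertDateToString
    static final String DEFAULT_FORMAT = "MM/dd/yyyy HH:mm:ss.SS zzz";

    // Formats the date with the given pattern, or the default when the pattern is null
    static String convert(Date date, String format, TimeZone zone) {
        String pattern = (format == null) ? DEFAULT_FORMAT : format;
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.US);
        formatter.setTimeZone(zone);
        return formatter.format(date);
    }

    public static void main(String[] args) {
        System.out.println(convert(new Date(), null, TimeZone.getDefault()));
        System.out.println(convert(new Date(), "yyyy-MM-dd", TimeZone.getDefault()));
    }
}
```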
convertHTMLEntities
void sutil.convertHTMLEntities ( String value )
Description
Decode HTML Entities.
Parameters
value String whose HTML Entities will be converted to characters.
Returns true if the values of the two strings are equal when case is not considered; otherwise, it returns false.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Compare Two Strings (Case Insensitive)
// Compare strings without regard to case
sutil.equalsIgnoreCase("aBc123","ABc123");
formatNumber
String sutil.formatNumber ( String number ) (professional and enterprise editions only)
String sutil.formatNumber ( String number, int decimals, boolean padDecimals ) (professional and enterprise editions only)
Description
Returns a number formatted so that it can be parsed as a Float, such as xxxxxxxxx.xxxx. The method tries to determine whether the number is formatted European or American style and defaults to American when it cannot tell. A trailing k is converted to thousands (as 000), m to millions, and b to billions. It assumes that you won't have a number like 3.123k or 3.765m, although 3.54m is fine: if you wanted all three of those digits, you would presumably have written the number as 3765k or 3,765k.
Parameters
number String containing the number.
decimals (optional) The maximum number of decimal places to include in the result. When this value is omitted, any decimals are retained, but none are added.
padDecimals (optional) Sets whether or not to pad the decimals (convert 5.1 to 5.10 if 2 decimals are specified)
Return Values
Returns the number as a String formatted so that it could be parsed as a Float, or null if the input was null
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
Format a scraped abbreviated number as a dollar amount
// Format a number to two decimal places
String dollars = sutil.formatNumber("3.75k", 2, true);
// This would set dollars to the String "3750.00"
// Format the amount without cents.
String dollarsNoCents = sutil.formatNumber("3.75m");
// This would set dollarsNoCents to the String "3750000"
Format a European number to be inserted in a MySQL statement
String number = sutil.formatNumber("3.275,10", 2, false);
// number would now be "3275.1"
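The suffix handling and decimal padding described above can be sketched with BigDecimal. This is a simplified illustration, not the sutil implementation: it handles only American-style input, and the class NumberSuffixSketch and its expand method are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class NumberSuffixSketch {
    // Expands a trailing k/m/b and limits the result to `decimals` places.
    // Assumes American style: comma for thousands, dot for decimals.
    static String expand(String raw, int decimals, boolean padDecimals) {
        String text = raw.trim().toLowerCase().replace(",", "");
        if (text.isEmpty()) {
            throw new IllegalArgumentException("No number given");
        }
        BigDecimal multiplier = BigDecimal.ONE;
        char last = text.charAt(text.length() - 1);
        if (last == 'k' || last == 'm' || last == 'b') {
            multiplier = new BigDecimal(
                    last == 'k' ? "1000" : last == 'm' ? "1000000" : "1000000000");
            text = text.substring(0, text.length() - 1).trim();
        }
        BigDecimal value = new BigDecimal(text).multiply(multiplier)
                .setScale(decimals, RoundingMode.HALF_UP);
        // Padding keeps trailing zeros; otherwise they are stripped
        return padDecimals ? value.toPlainString()
                           : value.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(expand("3.75k", 2, true));
        System.out.println(expand("3,765k", 0, false));
    }
}
```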
formatUSPhoneNumber
String sutil.formatUSPhoneNumber ( String number ) (professional and enterprise editions only)
Description
Converts a String to a US formatted phone number, as +1 (123) 456-7890x2. Expects a 7 digit or 10+ digit phone number. The extension is optional, and will be any digits found after an x. This allows for extensions listed as ext, x, or extension.
Parameters
number String containing the phone number. The only digits in this String should be the digits of the phone number.
Return Values
Returns a String formatted as a phone number, such as +1 (123) 456-7890x2, or null if the input was null
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
Format a scraped phone number
// Formats the phone number extracted
String phone = sutil.formatUSPhoneNumber(dataRecord.get("PHONE_NUMBER"));
// If the extracted value had been "13334445678 ext. 23", the returned value would be "+1 (333) 444-5678x23"
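To make the extension rule concrete (any digits found after an x), here is a small sketch in plain Java. It is an assumption-laden illustration rather than the sutil code: the class UsPhoneSketch is hypothetical, and unlike the documented method it accepts only 10-digit numbers, optionally preceded by a leading 1.

```java
class UsPhoneSketch {
    // Formats a phone number as +1 (123) 456-7890x2; returns null for null input
    static String format(String number) {
        if (number == null) {
            return null;
        }
        String lower = number.toLowerCase();
        int x = lower.indexOf('x');
        // Digits before the first x are the number, digits after it the extension
        String digits = (x >= 0 ? lower.substring(0, x) : lower).replaceAll("\\D", "");
        String extension = (x >= 0) ? lower.substring(x + 1).replaceAll("\\D", "") : "";
        if (digits.length() == 11 && digits.startsWith("1")) {
            digits = digits.substring(1); // drop the country code
        }
        if (digits.length() != 10) {
            throw new IllegalArgumentException("Expected 10 digits, got: " + digits);
        }
        String formatted = "+1 (" + digits.substring(0, 3) + ") "
                + digits.substring(3, 6) + "-" + digits.substring(6);
        return extension.isEmpty() ? formatted : formatted + "x" + extension;
    }

    public static void main(String[] args) {
        System.out.println(format("13334445678 ext. 23"));
    }
}
```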
formatUSZipCode
String sutil.formatUSZipCode ( String zip ) (professional and enterprise editions only)
Description
Formats and returns a US-style zip code as 12345-6789. If the given zip code isn't 5 or 9 digits, a warning is logged, but the method still puts the first 5 digits before the - and anything else (if any) after the -
Parameters
zip String to format as a zip code, either 5 or 9 digits
Return Values
Zip code formatted String, such as 12345-6789 or 12345
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
// Format a number to a nicer looking zip code
String zip = sutil.formatUSZipCode(" 001011458");
// zip would be "00101-1458"
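The split described above (5 digits before the dash, anything else after it) might look like the following in plain Java. The class UsZipSketch is hypothetical, and the warning that sutil logs for unexpected lengths is left out of this sketch.

```java
class UsZipSketch {
    // Keeps only the digits, then places a dash after the fifth one if more follow
    static String format(String zip) {
        String digits = zip.replaceAll("\\D", "");
        if (digits.length() <= 5) {
            return digits;
        }
        return digits.substring(0, 5) + "-" + digits.substring(5);
    }

    public static void main(String[] args) {
        System.out.println(format(" 001011458"));
    }
}
```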
getCurrentDate
String sutil.getCurrentDate ( String format )
Description
Returns the current date in a specified format, or in the format "MM/dd/yyyy HH:mm:ss.SS zzz" if null is given. Uses the session's timezone.
Parameters
format The format for the output string
Return Values
A String representing the date and time this method was invoked
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
// Log the current time
session.log(sutil.getCurrentDate(null));
getInstallDir
String sutil.getInstallDir ( )
Description
Retrieve the file path of the screen-scraper installation.
Parameters
This method does not receive parameters.
Return Values
Returns the installation directory file path, as a string.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Download to screen-scraper Directory
url ="http://www.foo.com/imgs/puppy_image.gif";
// Get installation file path
path = sutil.getInstallDir()+"images/puppy.gif";
// Download to screen-scraper directory
session.downloadFile( url, path );
getMemoryUsage
int sutil.getMemoryUsage ( ) (enterprise edition only)
Description
Get memory usage of screen-scraper.
Parameters
This method does not receive any parameters.
Return Values
Returns the average percentage of memory used by screen-scraper over the past 30 seconds, as an integer.
Change Log
Version
Description
5.0
Moved from session to sutil.
4.5
Available for enterprise edition.
For tips on optimizing screen-scraper's memory usage so that it can run faster, see our FAQ on optimization.
Examples
Stop Scrape on Memory Leak
// Stop scrape if memory is low
if( sutil.getMemoryUsage()>98) {
    session.log("Memory is critically low. Stopping the scraping session.");
    session.stopScraping();
}
getMimeType
String sutil.getMimeType ( String path )
Description
Get the mime-type of a local file.
Parameters
path File path to the local file, as a string.
Return Values
Returns the mime-type of the file, as a string.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Get File Mime Type
// Get mime-type
sutil.getMimeType("c:/image/puppy.gif");
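Outside of screen-scraper, a similar lookup based on the file name alone is available in the Java standard library. The sketch below uses URLConnection.guessContentTypeFromName; whether sutil relies on the same mechanism is not stated here, and the class name MimeTypeSketch is hypothetical.

```java
import java.net.URLConnection;

class MimeTypeSketch {
    // Guesses the mime-type from the file extension; returns null when it is unknown
    static String guess(String path) {
        return URLConnection.guessContentTypeFromName(path);
    }

    public static void main(String[] args) {
        System.out.println(guess("c:/image/puppy.gif"));
    }
}
```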
getNumRunnableScrapingSessions
int sutil.getNumRunnableScrapingSessions ( ) (enterprise edition only)
Description
Get the number of runnable scraping sessions.
Parameters
This method does not receive any parameters.
Return Values
Returns the number of scraping sessions in this instance of screen-scraper, as an integer.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Get the Number of Runnable Scrapes
// Write the number of runnable scrapes to the log
session.log("Number of Runnable Scrapes: "+ sutil.getNumRunnableScrapingSessions());
getNumRunningScrapingSessions
int sutil.getNumRunningScrapingSessions ()
int sutil.getNumRunningScrapingSessions ( String scrapingSessionName )
Description
Gets the number of scraping sessions that are currently being run.
Parameters
scrapingSessionName Narrows the scope to a given scraping session, if this parameter is passed in.
Return Values
An int representing the number of running scraping sessions.
Gets a DataSet containing each of the elements of a <select> tag. The returned DataRecords will contain a key for the text found between the tags (possibly with html tags removed), a value indicating if it was the selected option, and the value to submit for the specific option. Note that this only looks for option tags, and as such passing in text containing more than a single select tag will produce false output.
Parameters
options The text containing the options HTML from the select tag
ignoreLabels (or ignoreLabel) (optional) Text value(s) to ignore in the output set. Usually this would include the strings like "Please select a category"
tidyRecords (optional) Should the TEXT be tidied before being stored in the resulting DataRecords
Return Values
A DataSet with one record per option. Values extracted will be stored in
VALUE : The value the browser would submit for this option
TEXT : The text that was between the tags
SELECTED : A boolean that is true if this option was selected by default
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
Search each option from a dropdown menu
String options = dataRecord.get("ITEM_OPTIONS");
// We don't want the value for "Select an option" because that doesn't go to a search results page
DataSet items = sutil.getOptionSet(options, "Select an option", true);
for(int i =0; i < items.getNumDataRecords(); i++) {
    DataRecord next = items.getDataRecord(i);
    session.setVariable("ITEM_VALUE", next.get("VALUE"));
    session.log("Now scraping results for "+ next.get("TEXT"));
    session.scrapeFile("Search Results");
}
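To illustrate the VALUE, TEXT, and SELECTED fields described above, here is a rough regular-expression sketch in plain Java that turns option tags into a list of maps. It is not the sutil implementation: the class OptionSetSketch is hypothetical, it ignores only a single label, it does no tidying, and it would misread an option whose value attribute is literally "selected".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionSetSketch {
    private static final Pattern OPTION = Pattern.compile(
            "<option\\b([^>]*)>(.*?)</option>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern VALUE = Pattern.compile(
            "\\bvalue\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern SELECTED = Pattern.compile(
            "\\bselected\\b", Pattern.CASE_INSENSITIVE);

    // Returns one record per option tag, skipping the option whose text equals ignoreLabel
    static List<Map<String, Object>> parse(String optionsHtml, String ignoreLabel) {
        List<Map<String, Object>> records = new ArrayList<>();
        Matcher option = OPTION.matcher(optionsHtml);
        while (option.find()) {
            String attributes = option.group(1);
            String text = option.group(2).trim();
            if (text.equals(ignoreLabel)) {
                continue;
            }
            Matcher value = VALUE.matcher(attributes);
            Map<String, Object> record = new LinkedHashMap<>();
            // Browsers submit the option text when no value attribute is present
            record.put("VALUE", value.find() ? value.group(1) : text);
            record.put("TEXT", text);
            record.put("SELECTED", SELECTED.matcher(attributes).find());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String html = "<option value=\"\">Select an option</option>"
                + "<option value=\"1\" selected>Books</option>"
                + "<option value=\"2\">Music</option>";
        System.out.println(parse(html, "Select an option"));
    }
}
```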
Gets all the options from a radio button group. The values are returned in a data record. Any labels that are to be ignored will not be included in the returned set. Not all buttons have a label, as radio buttons do not require a label, and it would be difficult to know in a regular expression exactly what to extract as the label unless there is a label tag.
Parameters
buttons The text containing the buttons
buttonName The name of the buttons that should be grabbed, as a Regex pattern
ignoreLabels (or ignoreLabel) (optional) Any labels that should be excluded from the resulting set
tidyRecords (optional) Should the TEXT be tidied before being stored in the resulting DataRecords
Return Value
DataSet containing one record for each of the extracted radio buttons. Values will be stored in
VALUE : The value the browser would submit for this radio button
TEXT : The text that represents this button, or null if no label could be found for it
SELECTED : A boolean that is true if this button was selected by default
ID : The ID of the radio button, or null if no ID was found
Change Log
Version
Description
5.5.29a
Available in all editions.
Examples
Search each radio button from a radio button group
String options = dataRecord.get("RADIO_BUTTONS");
// Get all the radio buttons with the name attribute "selection"
DataSet items = sutil.getOptionSet(options, "selection");
for(int i =0; i < items.getNumDataRecords(); i++) {
    DataRecord next = items.getDataRecord(i);
    session.setVariable("BUTTON_VALUE", next.get("VALUE"));
    session.log("Now scraping results for "+ next.get("TEXT"));
    session.scrapeFile("Search Results");
}
getRandomReferrer
String sutil.getRandomReferrer ( )
Description
Gets a random referrer page from a list of many different search engine web sites and a few other pages.
Parameters
This method does not receive any parameters.
Return Values
Returns a random referrer URL.
Change Log
Version
Description
6.0.1a
Introduced for all editions.
getRandomUserAgent
String sutil.getRandomUserAgent ( )
Description
Returns a random User Agent. The list isn't closely monitored, so it may not include newer user agents, and may include extremely old ones as well.
Parameters
This method does not receive any parameters.
Return Values
Returns a random user agent.
Change Log
Version
Description
6.0.1a
Introduced for all editions.
getScreenScraperEdition
String sutil.getScreenScraperEdition ( )
Description
Get edition of screen-scraper instance.
Parameters
This method does not receive any parameters.
Return Values
Returns the edition name, as a string.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Write Edition to Log
// Write the current edition to log.
session.log("Current edition: "+ sutil.getScreenScraperEdition());
getScreenScraperVersion
String sutil.getScreenScraperVersion ( )
Description
Get version of screen-scraper instance.
Parameters
This method does not receive any parameters.
Return Values
Returns the version number, as a string.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Write Version to Log
// Write the current version to log.
session.log("Current version: "+ sutil.getScreenScraperVersion());
isInt
boolean sutil.isInt ( String string )
Description
Determine if the value of a string is an integer.
Parameters
string String to be tested for containing an integer.
Return Values
Returns true if the string is an integer; otherwise, it returns false. If it is passed an object that is not a string, including an integer, an error will be thrown.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Check String Value
// Does the GUESTS variable contain an integer?
if(!sutil.isInt( session.getv("GUESTS"))) {
    session.logWarn("Could not get the number of guests!");
}
Returns true if the value of the object is null or an empty string; otherwise, it returns false.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Warning for Empty Variable
// Give warning and stop scrape if variable is empty
if( sutil.isNullOrEmptyString( session.getv("NAME"))) {
    session.log("The NAME variable was blank.");
    session.stopScraping();
}
isPlatformLinux
boolean sutil.isPlatformLinux ( )
Description
Determine if operating system is a Linux platform.
Parameters
This method does not receive parameters.
Return Values
Returns true if the operating system is Linux; otherwise, it returns false.
Merges two data records by copying all values from the second record over values of the first record, and returning a new DataRecord with these values. Doesn't modify either original record
Parameters
first The first DataRecord, into which the values from the second record will be copied
second The second DataRecord, whose values will be copied into the first
saveNonEmptyString True if blank values should not overwrite non-blank values, whether the non-blank value is in the first or second record. If both records contain a value that is not blank for the same key, the value in the first record is saved and the value in the second record discarded. If false, all values in the second record will overwrite any values in the first record.
Return Values
A new DataRecord with the merged values
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
Combine values from the current dataRecord with a previous one
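The merge rule described for saveNonEmptyString can be sketched with plain java.util.Map objects standing in for DataRecords. This is a minimal illustration, not the sutil implementation; the class MergeSketch is hypothetical, and treating whitespace-only strings as blank is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MergeSketch {
    // Assumption: null, empty, and whitespace-only values all count as blank
    static boolean isBlank(Object value) {
        return value == null || value.toString().trim().isEmpty();
    }

    // Returns a new map; neither input map is modified
    static Map<String, Object> merge(Map<String, Object> first,
                                     Map<String, Object> second,
                                     boolean saveNonEmptyString) {
        Map<String, Object> merged = new LinkedHashMap<>(first);
        for (Map.Entry<String, Object> entry : second.entrySet()) {
            Object existing = merged.get(entry.getKey());
            // When saving non-empty strings, only blank values in the first map are replaced
            if (!saveNonEmptyString || isBlank(existing)) {
                merged.put(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("NAME", "Ann");
        first.put("CITY", "");
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("NAME", "Bob");
        second.put("CITY", "Provo");
        System.out.println(merge(first, second, true));
        System.out.println(merge(first, second, false));
    }
}
```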
Returns an empty string if the value of the object is null; otherwise, returns the value of the toString method of the object.
Change Log
Version
Description
5.0
Added for all editions.
Examples
Get String Value of Variable
// Always Specify Suffix even if not selected
suffix = sutil.nullToEmptyString( session.getv("SUFFIX"));
parseName
Name sutil.parseName ( String name ) (pro and enterprise editions only)
Description
Attempts to parse a string to a name. The parser is not perfect and works best on English-formatted names (for example, "John Smith Jr." or "Guerrero, Antonio K"). This uses standard settings for the parser. To get more control over how the name is parsed, use the EnglishNameParser class.
dr.put("FIRST_NAME", name.getFirstName());
dr.put("MIDDLE_NAME", name.getMiddleName());
dr.put("LAST_NAME", name.getLastName());
//dr.put( "SUFFIX", name.getAllSuffixString() );
} catch(Exception e ) {
    // The parser may throw an exception if it can't
    // parse the name. If this occurs we want to know about it.
    log.warn("Error parsing name: "+ e.getMessage());
}
}
Address sutil.parseUSAddress ( String address ) (pro and enterprise editions only)
Description
Attempts to parse a string to an address. The parser is not perfect and works best on US addresses. Most likely other address formats can be parsed with the USAddressParser class by providing different constraints in the builder. This method is here for convenience in working with US addresses.
// if all of these four are blank then save only the raw address
// else save what we can
if(
    sutil.isNullOrEmptyString( address.getStreet()) &&
    sutil.isNullOrEmptyString( address.getState()) &&
    sutil.isNullOrEmptyString( address.getCity()) &&
    sutil.isNullOrEmptyString( address.getZipCode())
) {
    dr.put("ADDRESS", addressRaw );
} else {
    dr.put("ADDRESS", address.getStreet());
    dr.put("ADDRESS2", address.getSuiteOrApartment());
    dr.put("STATE", address.getState());
    dr.put("CITY", address.getCity());
    dr.put("ZIP", address.getZipCode());
}
session.setv("DR_ADDRESS", dr );
} catch(Exception e ) {
    // If there was a parsing error, notify so it can be dealt with
    log.warn("Exception parsing address: "+ e.getMessage());
}
void sutil.pause ( long milliseconds ) (professional and enterprise editions only)
Description
Pause scraping session.
Parameters
milliseconds Length of the pause, in milliseconds.
Return Values
Returns void.
Change Log
Version
Description
5.0
Moved from session to sutil.
4.5
Available for professional and enterprise editions.
Pausing the scraping session also pauses the execution of the scripts including the one that initiates the pause.
Examples
Pause Scrape on Server Overload
// It should be noted that a status code of 503 is not
// always a temporary overloading of a server.
// Check status code
if(scrapeableFile.statusCode()==503) {
    // Pause Scraping for 5 seconds
    sutil.pause(5000);
    // Continue/Rescrape file
    ...
}
randomPause
void sutil.randomPause ( long min, long max ) (professional and enterprise editions only)
Description
Pauses for a random amount of time. This is also set up to stop immediately if the stop scrape button is clicked, and to allow breakpoints to be triggered while it is pausing.
Parameters
min The minimum duration of the pause, in milliseconds
max The maximum duration of the pause, in milliseconds
Return Value
Returns void.
Change Log
Version
Description
5.5.29a
Available in professional and enterprise editions.
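Picking a pause length between min and max can be sketched with ThreadLocalRandom. This is only an illustration of the idea: the class RandomPauseSketch is hypothetical, treating both bounds as inclusive is an assumption, and the sketch does not reproduce the stop-button or breakpoint handling that sutil.randomPause provides.

```java
import java.util.concurrent.ThreadLocalRandom;

class RandomPauseSketch {
    // Picks a delay in milliseconds between min and max (both inclusive, by assumption)
    static long pickDelay(long min, long max) {
        if (min < 0 || max < min) {
            throw new IllegalArgumentException("Require 0 <= min <= max");
        }
        return ThreadLocalRandom.current().nextLong(min, max + 1);
    }

    // Sleeps the current thread for the chosen delay
    static void randomPause(long min, long max) throws InterruptedException {
        Thread.sleep(pickDelay(min, max));
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        randomPause(100, 300);
        System.out.println("Paused for about " + (System.currentTimeMillis() - start) + " ms");
    }
}
```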
dateFormatFrom (optional) The format of the date that is being reformatted. The date format follows Sun's SimpleDateFormat.
dateFormatTo The format that the date is being changed to. If dateFormatFrom is used, this should also follow Sun's SimpleDateFormat. If dateFormatFrom is left off, the date format should follow PHP's date format. In the latter case you can also use timestamp as the value of this parameter, and the method will return the timestamp corresponding to the date. Note also how PHP treats dashes and dots: "Dates in the m/d/y or d-m-y formats are disambiguated by looking at the separator between the various components: if the separator is a slash (/), then the American m/d/y is assumed; whereas if the separator is a dash (-) or a dot (.), then the European d-m-y format is assumed."
Return Values
Returns formatted date according to the specified format, as a string.
Change Log
Version
Description
5.0
Moved from session to sutil.
4.5
Available for professional and enterprise editions. Unspecified source format available for enterprise edition.
The date formats are not the same for the two methods. Read carefully.
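When both formats are given as SimpleDateFormat patterns, the conversion amounts to a parse followed by a format. The sketch below shows that idea with the standard library only; the class ReformatDateSketch and the sample input "01/01/2010" are hypothetical, and the PHP-style variant without a source format is not covered.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class ReformatDateSketch {
    // Parses the date with formatFrom, then writes it out again with formatTo
    static String reformat(String date, String formatFrom, String formatTo) {
        SimpleDateFormat from = new SimpleDateFormat(formatFrom, Locale.US);
        from.setLenient(false); // reject impossible dates such as 13/45/2010
        try {
            Date parsed = from.parse(date);
            return new SimpleDateFormat(formatTo, Locale.US).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Date does not match " + formatFrom, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reformat("01/01/2010", "MM/dd/yyyy", "yyyy-MM-dd"));
    }
}
```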
Examples
Reformat Date from Specified Format
// Reformats the date shown to the format "2010-01-01".
// This uses Sun's Date Formats
Send an email using SMTP mail server specified in the settings.
Parameters
subject Subject line of the email, as a string.
body The content of the email, as a string.
recipients Comma-delimited list of email addresses to which the email will be sent, as a string.
contentType The content type as a valid MIME type.
attachments Comma-delimited list of local file paths to files that should be attached, as a string. If you do not have any attachments, use the value null.
headers Tab-delimited SMTP headers to be used when sending the email, as a string. If you don't have any headers to send, use the value null.
Return Values
Returns void. If it runs into any problems while attempting to send the email an error will be thrown.
Change Log
Version
Description
6.0.35a
Now supports alternate content types.
5.0
Moved from session to sutil.
4.5
Available for enterprise edition.
Examples
Send Email at End of Scrape
// In script called "After scraping session ends"
// Sends an email message with the parameters shown.
String message ="The '"+ session.getName()+"' scrape is now finished.";
sutil.sendMail("Status Report: Scrape Finished", message, "[email protected]", null, null);
sortSet
List sutil.sortSet ( Set set )
List sutil.sortSet ( Set set, boolean ignoreCase )
List sutil.sortSet ( Set set, Comparator comparator )
Description
Sorts the elements in a set into an ordered list.
Parameters
set The set whose elements should be sorted
ignoreCase (optional) True if case is irrelevant when sorting strings
comparator (optional) The Comparator used to compare objects in the set to determine order
Return Values
This method returns a sorted list of elements that are in the set.
Change Log
Version
Description
5.5.26a
Available in all editions.
Examples
Output all the values in a DataRecord in alphabetical order
// Generally when a sorted set or map is needed, a data structure should be chosen that stores the values
// in a sorted way, such as TreeSet or TreeMap. However, sometimes the set or map is returned by a library
// and may not have sorted values, although sorted values are needed.
List keys = sutil.sortSet(dataRecord.keySet(), true);
for(int i =0; i < keys.size(); i++) {
    key = keys.get(i);
    session.log(key +" : "+ dataRecord.get(key));
}
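For comparison, copying a set into a list and sorting it takes only a few lines of standard Java. The sketch below mirrors the set and ignoreCase parameters for sets of strings; the class SortSetSketch is hypothetical and is not the sutil implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SortSetSketch {
    // Copies the set into a list and sorts it, optionally ignoring case
    static List<String> sort(Set<String> set, boolean ignoreCase) {
        List<String> sorted = new ArrayList<>(set);
        sorted.sort(ignoreCase ? String.CASE_INSENSITIVE_ORDER
                               : Comparator.<String>naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        Set<String> keys = new HashSet<>(Arrays.asList("b", "A", "C"));
        System.out.println(sort(keys, true));  // [A, b, C]
        System.out.println(sort(keys, false)); // [A, C, b]
    }
}
```

The case-sensitive order places all uppercase letters before lowercase ones, which is why "C" precedes "b" in the second call.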
Tidies the DataRecord by performing actions based on the values of the settings map given. Each value in the record that is a string will be tidied. Keys are not modified. The record given will not be modified; a new record with the tidied values is returned. If no settings are given, the values obtained from sutil.getDefaultTidySettings() are used.
Parameters
record The DataRecord to tidy (values in the record will not be overwritten with the tidied values)
scrapeableFile (optional) The current ScrapeableFile, used for resolving relative URLs when tidying links
settings (optional) The operations to perform when tidying, using a Map<String, Boolean>
The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.
Map Key | Default Value | Description of operation performed
trim | true | Trims whitespace from values
convertNullStringToLiteral | true | Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null")
convertLinks | true | Preserves links by converting <a href="link">text</a> to text (link), will try to resolve urls if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing
removeTags | true | Remove html tags, and attempts to convert line break HTML tags such as <br> to a new line in the result
removeSurroundingQuotes | true | Remove quotes from values surrounded by them -- "value" becomes value
convertEntities (professional and enterprise editions only) | true | Convert html entities
removeNewLines | false | Remove all new lines from the text. Replaces them with a space
removeMultipleSpaces | true | Convert multiple spaces to a single space, and preserve new lines
convertBlankToNull | false | Convert blank strings to null literal
ignoreLowerCaseKeys (optional) True if values with keys containing lowercase characters should be ignored
Return Values
A new DataRecord containing all the tidied values and any values that were not Strings in the original record. The values that were Strings but were not tidied as well as the DATARECORD value will not be in the returned record.
Change Log
Version
Description
5.5.26a
Available in all editions.
5.5.28a
Now uses a Map for the settings, rather than bit flags.
String sutil.tidyString ( String value ) (professional and enterprise editions only)
String sutil.tidyString ( String value, Map<String, Boolean> settings ) (professional and enterprise editions only)
String sutil.tidyString ( ScrapeableFile scrapeableFile, String value ) (professional and enterprise editions only)
String sutil.tidyString ( ScrapeableFile scrapeableFile, String value, Map<String, Boolean> settings ) (professional and enterprise editions only)
Description
Tidies the string by performing actions based on the values of the settings map.
Parameters
value The String to tidy
settings (optional) The operations to perform when tidying, using a Map<String, Boolean>
The tidy settings and their default values are given below. If a key is missing in the settings map, that operation will not be performed.
Map Key | Default Value | Description of operation performed
trim | true | Trims whitespace from values
convertNullStringToLiteral | true | Converts the string 'null' (without quotes) to the null literal (unless it has quotes around it, such as "null")
convertLinks | true | Preserves links by converting <a href="link">text</a> to text (link), will try to resolve urls if scrapeableFile isn't null. Note that if there isn't a start and end <a> tag, this will do nothing
removeTags | true | Remove html tags, and attempts to convert line break HTML tags such as <br> to a new line in the result
removeSurroundingQuotes | true | Remove quotes from values surrounded by them -- "value" becomes value
convertEntities (professional and enterprise editions only) | true | Convert html entities
removeNewLines | false | Remove all new lines from the text. Replaces them with a space
removeMultipleSpaces | true | Convert multiple spaces to a single space, and preserve new lines
convertBlankToNull | false | Convert blank strings to null literal
scrapeableFile (optional) The current ScrapeableFile, used for resolving relative URLs when tidying links
Return Values
The tidied string
Change Log
Version
Description
5.5.26a
Available in all editions.
5.5.28a
Now uses a Map for the settings, rather than bit flags.
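To show how a settings map of this shape can drive the tidying steps, here is a simplified sketch in plain Java. It covers only five of the keys listed above (convertLinks, removeTags, removeNewLines, removeMultipleSpaces, trim), does not resolve relative URLs, and applies the steps in an order chosen for the sketch. The class TidySketch and its methods are hypothetical and are not the sutil implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TidySketch {
    // Documented defaults for the subset of keys this sketch implements
    static Map<String, Boolean> defaults() {
        Map<String, Boolean> settings = new LinkedHashMap<>();
        settings.put("convertLinks", true);
        settings.put("removeTags", true);
        settings.put("removeNewLines", false);
        settings.put("removeMultipleSpaces", true);
        settings.put("trim", true);
        return settings;
    }

    // A missing key means the operation is not performed
    static boolean on(Map<String, Boolean> settings, String key) {
        return Boolean.TRUE.equals(settings.get(key));
    }

    static String tidy(String value, Map<String, Boolean> settings) {
        String result = value;
        if (on(settings, "convertLinks")) {
            // <a href="link">text</a> becomes text (link)
            result = result.replaceAll(
                    "(?is)<a\\b[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>", "$2 ($1)");
        }
        if (on(settings, "removeTags")) {
            result = result.replaceAll("(?i)<br\\s*/?>", "\n"); // keep line breaks
            result = result.replaceAll("<[^>]+>", "");
        }
        if (on(settings, "removeNewLines")) {
            result = result.replaceAll("\\r?\\n", " ");
        }
        if (on(settings, "removeMultipleSpaces")) {
            result = result.replaceAll("[ \\t]{2,}", " ");
        }
        if (on(settings, "trim")) {
            result = result.trim();
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "  <a href=\"http://x.com\">This</a>   was <b>great</b><br />ok ";
        System.out.println(tidy(html, defaults()));
    }
}
```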
Examples
Tidy a comment extracted from a website
Assuming the extracted text's HTML code was:
<a href="http://www.somelink.com">This</a> was great because of these reasons:<br />
1 - Some reason<br />
2 - Another reason<br />
3 - Final reason