The scripting engine requests files, which it then parses, manipulates, and/or stores according to user-defined processes. It is the heart of screen-scraper and has been optimized throughout its development to be as efficient as possible. It is made up of multiple parts which can be manipulated using the workbench.
The rest of this section contains information about using screen-scraper, through the workbench, to achieve different goals. These can be difficult to understand without some exposure to the software. That is why we would like to encourage you to go through our first few tutorials before continuing.
screen-scraper allows Java libraries to be added to its classpath. Simply copy any jar files you'd like to reference into the lib/ext folder found in screen-scraper's install directory. The next time you start screen-scraper it will automatically add the jar files to its classpath. Note that you'll still need to use an import statement within your scripts to refer to specific classes:
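(The following is a minimal sketch in interpreted Java. It assumes you've copied the Apache commons-lang jar, which is not bundled with screen-scraper, into lib/ext; any other library works the same way.)

```java
// The jar providing this class must be in lib/ext for the import to resolve.
import org.apache.commons.lang.StringUtils;

String rawValue = "  Widget 123  ";
session.log("Cleaned value: " + StringUtils.trimToEmpty(rawValue));
```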
screen-scraper was built on a Java 1.5 platform. Your Java scripts must be able to compile and run under a version 1.5 JRE in order to work properly.
Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site blocks IP addresses that make too many requests.
There are a few different ways to go about this using screen-scraper, described in the sections that follow.
If you choose to control anonymized scraping from an external script, it is worth reading through the documentation on controlling anonymization externally.
Aside from the above methods, you might find our blog posting on how to surf and screen-scrape anonymously helpful. It's slightly dated, but still very relevant.
The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. The end effect of this is that the site you're scraping will see any request you make as coming from one of several different IP addresses, rather than your actual IP address. These HTTP proxy servers are actually virtual machines that get spawned and terminated as you need them. You'll use screen-scraper to either manually or automatically spawn and terminate the proxy servers.
Note: When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.
While the automatic anonymization service provides an excellent way to cloak your IP address, it is still possible that the target web site will block enough of the anonymized IP addresses that the anonymization fails. Unfortunately we can't guarantee that you won't get blocked; however, using the automatic anonymization service reduces the chances of getting blocked dramatically.
The anonymous proxy servers will be set up in such a way that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. This configuration is tied to your password. For more on restricting connections see documentation on managing the screen-scraper server.
If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on and you are using the workbench, you can click the Get the IP address for this computer button to determine your current IP address.
Anonymization settings can be configured using screen-scraper's workbench, in the anonymous proxy section of the settings dialog box.
When you sign up for the anonymization service you'll be given the password that allows your instance of screen-scraper to manage anonymous proxies for you. You'll enter it into the Password textbox in the settings.
Because proxy servers get spawned and terminated as needed, it's a good idea to establish the maximum number of running proxy servers you'd like to allow; this is done via the Max running servers setting. You pay for proxy servers by the hour, so if your scraping session isn't set up to shut them down automatically when it finishes, use the Terminate all running proxy servers button to do so.
We find that between five and ten proxy servers are adequate for most situations.
If you're setting this value in a GUI-less environment (i.e., a server with no graphical interface), you'll want to set these values in the resource/conf/screen-scraper.properties file (if a given property is not already in the file you'll need to add it).
Acceptable values are http://anon.screen-scraper.com and http://anon2.screen-scraper.com.
Be sure to modify the resource/conf/screen-scraper.properties file only when screen-scraper is not running.
Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the anonymization tab of your scraping session.
Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.
As your anonymous scraping session runs, you'll notice that screen-scraper will automatically regulate the pool of proxy servers. For example, if screen-scraper gets a timed out connection or a 403 response (authorization denied), it will terminate the current proxy server, and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script. When this method is called the current proxy server will be shut down and replaced by another.
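For example, a short script run after each file is scraped could report a blocked proxy (a sketch; the "Access Denied" marker is an assumption about how the target site responds once it has blocked an IP):

```java
// Report the current proxy as blocked when the response looks like a block page.
if (scrapeableFile.getContentAsString().indexOf("Access Denied") != -1) {
    session.log("Current proxy appears to be blocked; requesting a replacement.");
    session.currentProxyServerIsBad();
}
```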
If the automatic anonymization method isn't right for you, the next best alternative might be to work directly with screen-scraper's built-in ProxyServerPool object. The basic approach involves running a script at the beginning of your scraping session that sets up the pool (a sketch follows the references below), then calling session.currentProxyServerIsBad() as you find that proxy servers are getting blocked. In order to use a proxy pool, you'll also need to get a list of anonymous proxy servers. Generally you can find these by Googling around a bit.
See available methods:

- ProxyServerPool
- Anonymization API
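The setup script might look something like the following (a sketch in interpreted Java; the proxies.txt file name is a placeholder, and the package and method names should be verified against the ProxyServerPool documentation referenced above):

```java
// Run this script "Before scraping session begins".
import com.screenscraper.util.*;

// Create a proxy pool tied to the current scraping session.
ProxyServerPool proxyServerPool = new ProxyServerPool(session);

// Populate the pool from a text file listing one proxy (host:port) per line.
proxyServerPool.populateFromFile("proxies.txt");

// Optionally write the proxies out to the log so you can verify them.
proxyServerPool.outputProxyServersToLog();

// Tell the session to route its requests through proxies from the pool.
session.setUseProxyFromPool(true);
```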
That's about all there is to it. Aside from occasionally calling session.currentProxyServerIsBad(), you may also want to call session.setUseProxyFromPool( true ) or session.setUseProxyFromPool( false ) to turn anonymization on and off within the scraping session.
The web interface is only available for enterprise edition users of screen-scraper.
The mapping tab allows you to alter extracted values. Often once you extract data from a web page you need to put it into a consistent format. For example, you may want products with very similar names to have identical names.
screen-scraper makes use of mapping sets when determining how to map a given extracted value. A mapping set may contain any number of mappings, which screen-scraper will analyze in sequence until it finds a match, or runs out of mappings. As such, you'll often want to put more specific mappings higher in sequence than more general mappings.
Consider the screen-shot of the mapping tab: if the extracted value were Widget 123 screen-scraper would first try to match using the Widget 1 mapping. Because this is an equals match the mapping wouldn't occur, so screen-scraper would proceed to the second mapping. The second mapping would match because a contains type was designated. That is, the text Widget 123 contains the text Widget. As such, the extracted data Widget 123 would become Product ABC, because that is the To value designated for the second mapping.
When using regular expressions in your mapping you can also make use of back references. Back references allow you to preserve values in the original text when mapped to the To value. For example, if you were mapping the value Widget 123 you could use the regular expression Widget (\d*). In the To column you could then enter the value Product \1, which, when mapped, would convert Widget 123 to Product 123. The value in parentheses in the From column gets inserted via the \1 marker found in the To column.
This feature is only available to Professional and Enterprise editions of screen-scraper.
It is possible to run a scraping session from within a scraping session that is already running. This is done with the RunnableScrapingSession class. Detailed documentation on the methods available for the RunnableScrapingSession class is available in our API documentation. Here's a specific example of how the RunnableScrapingSession might be used in a screen-scraper script:
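(A sketch in interpreted Java; the scraping session name and the variable passed to it are placeholders, and the exact constructor signature is described in the API documentation.)

```java
import com.screenscraper.scraper.*;

// Create a session that will run inside the currently running one. Passing the
// current session causes the new session to write to the same log.
RunnableScrapingSession runnableScrapingSession = new RunnableScrapingSession("My Session", session);

// Seed the new session with a value before it starts.
runnableScrapingSession.setVariable("SEARCH_TERM", "widgets");

// Run it.
runnableScrapingSession.scrape();
```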
Scripts attached to a scraping session are exported along with it. When you subsequently import that scraping session into another instance of screen-scraper it might overwrite existing scripts in that instance. In some cases, though, you might have a series of general scripts shared by many scraping sessions. In these cases you often want to ensure that the very latest versions of these general scripts get retained in a given instance.
In the main pane of the script there is an Overwrite this script on import checkbox. When it's checked, any name clash between an existing script and an imported version will prompt you to choose whether or not the script should be overwritten. If the local version's Overwrite this script on import checkbox is unchecked, the script will not be overwritten even if you click to have it overwritten.
Checking Overwrite this script on import requires access to the screen-scraper workbench, which you may not have if screen-scraper is running in a GUI-less environment. In these cases you can make use of the ForceOverwriteScripts property in the resource/conf/screen-scraper.properties file to allow scripts that have this box un-checked to be overwritten. In order to overwrite such scripts in a GUI-less environment you would follow these steps:

1. Stop screen-scraper.
2. Open the resource/conf/screen-scraper.properties file and set ForceOverwriteScripts=true (adding the property if it isn't already there).
3. Start screen-scraper and perform the import.
Once you're finished importing you may want to stop screen-scraper, set the property back to false, then start up again. Note that when you import scripts with the ForceOverwriteScripts property set to true screen-scraper will import the scripts regardless of whether or not the Overwrite this script on import checkbox is checked.
Extractor patterns allow you to pinpoint select snippets of data that you want extracted from a web page. An extractor pattern is a block of text (usually HTML) that contains special tokens that will match the pieces of data you're interested in extracting. These tokens are text labels surrounded by the delimiters ~@ and @~ (e.g. ~@NAME@~). The identifier between the delimiters can contain only alpha-numeric characters and underscores.
Extractor patterns are added to scrapeable files under the extractor patterns tab.
You can think of an extractor pattern like a stencil. A stencil is an image in cut-out form, often made of thin cardboard. As you place a stencil over a piece of paper, apply paint to it, then remove the stencil, the paint remains only where there were holes in the stencil. Analogously, you can think of placing an extractor pattern over the HTML of a web page. The tokens correspond to the holes where the paint would pass through. After an extractor pattern is applied it reveals only the portions of the web page you'd like to extract.
Extractor tokens designate regions where data elements are to be captured. For example, given the following HTML snippet:
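(The original snippet isn't shown here; for illustration, suppose the page contains this markup:)

```html
<p>Here is the <b>text to extract</b> from the page.</p>
```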
you would extract the piece of text by creating an extractor pattern with a token positioned like so:
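(Again illustrative, matching the snippet above:)

```
<b>~@EXTRACTED_TEXT@~</b>
```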
The extracted text could then be accessed via the identifier EXTRACTED_TEXT.
If you haven't done so already, we'd recommend going through our first tutorial to get a better feel for using extractor patterns.
If an extractor pattern takes too long to match a block of text it will time out. The timeout setting may be adjusted on the general tab of the settings dialog box, accessible from the menu. If you find that your extractor pattern is timing out you might try adjusting it to use more precise regular expressions.
screen-scraper's scraping engine allows you to associate custom scripts with various events in the scraping process. It is recommended that you read about managing and scripting in screen-scraper before continuing.
Depending on what event triggers a script to be run, different objects will be in scope. Scraping session triggers are added on the general tab of the scraping session, file request/response triggers are associated on the properties tab of the scrapeable file, and extractor pattern triggers are set in the scripts section of the main tab under the extractor patterns tab of the scrapeable file.
Scripts can also run other scripts using the session.executeScript method.
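For example (a sketch; the script name is a placeholder for one of your own scripts):

```java
// Run another screen-scraper script by name from within this one.
session.executeScript("Initialize session variables");
```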
screen-scraper offers a few objects that you can work with in a script in the scraping engine. See the variable scope section and/or API documentation for more details.
Depending on when a script gets run different variables may be in or out of scope. When associating a script with an object, such as a scraping session or scrapeable file, you're asked to specify when the script is to be run. The table that follows specifies what variables will be in scope depending on when a given script is run. Only variables that are in scope are accessible to the script.
| When Script is Run | session in scope | scrapeableFile in scope | dataSet in scope | dataRecord in scope |
|---|---|---|---|---|
| Before scraping session begins | X | | | |
| After scraping session ends | X | | | |
| Before file is scraped | X | X | | |
| After file is scraped | X | X | | |
| Before pattern is applied | X | X | | |
| After pattern is applied | X | X | X | |
| Once if pattern matches | X | X | X | X |
| Once if no matches | X | X | | |
| After each pattern match | X | X | X | X |
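For instance, a script run After each pattern match has all four objects in scope, so it can read values from the current match (a sketch; NAME is a hypothetical extractor token):

```java
// Read a value captured by the ~@NAME@~ extractor token for the current match.
String name = (String) dataRecord.get("NAME");

// Stash it in a session variable for use by later files or scripts, and log it.
session.setVariable("CURRENT_NAME", name);
session.log("Matched record for: " + name);
```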
One of the best ways to fix errors is to simply watch the scraping session log and the error.log file (located in the log directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.
When screen-scraper is running as a server it will automatically generate individual log files in the log directory for each running scraping session (this can be disabled in the settings window). An error.log file will also be generated in that same directory when internal screen-scraper errors occur.
The breakpoint window can also be invaluable in debugging scripts. You can invoke it by inserting the line session.breakpoint() into your script.
Session variables allow you to store values that will persist across the life of a scraping session.
There are a few different ways to set session variables.
As with setting session variables, there is more than one way to retrieve values of session variables.
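For example, from any script where the session object is in scope you can set a variable and read it back later (a minimal sketch; the variable name QUERY_PARAM matches the URL example below):

```java
// Store a value that will persist for the rest of the scraping session.
session.setVariable("QUERY_PARAM", "widgets");

// Later, in the same or another script, read it back.
String query = (String) session.getVariable("QUERY_PARAM");
session.log("Current query: " + query);
```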
If you have a session variable identified by QUERY_PARAM you might embed it into the URL field of a scrapeable file using http://www.mydomain.com/myscript.php?query=~#QUERY_PARAM#~. screen-scraper will automatically replace the ~#QUERY_PARAM#~ with the value of the session variable.
Sub-extractor patterns allow you to extract data in the context of an extractor pattern, providing significantly more flexibility in pinpointing the specific pieces you're after. Consider a search results page consisting of rows and columns of data. Using normal extractor patterns you would use a single pattern to extract the data from all columns for a single row. In many cases this works just fine; however, the process gets more complicated when each row differs significantly. For example, certain cell rows may be in different colors or their contents may be completely missing. With a normal extractor pattern it would be difficult to account for the variability in the cells. By using sub-extractor patterns you could create a normal extractor pattern to extract an entire row, then use individual sub-extractor patterns to pull out the individual cells.
When using sub-extractor patterns only the first match will be used. That is, even if a sub-extractor pattern could match multiple times, only the data corresponding to the first match will be extracted. Because of this sub-extractor patterns are not always the correct method for getting data within a larger context. To get multiple matches in a larger context, like all rows in a table, you would instead use manual extractor patterns.
Consider the following HTML table:
| Name | Phone | Address |
|---|---|---|
| Juan Ferrero | 111-222-3333 | 123 Elm St. |
| Joe Bloggs | No contact information available | |
| Sherry Lloyd | 234-5678 (needs area code) | 456 Maple Rd. |
Here is the corresponding HTML source:
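(The original source isn't reproduced here, but markup along the following lines fits the description; the class attributes, which drive the differing cell colors, are illustrative:)

```html
<table>
<tr><td class="header">Name</td><td class="header">Phone</td><td class="header">Address</td></tr>
<tr class="data"><td class="name">Juan Ferrero</td><td class="phone">111-222-3333</td><td class="address">123 Elm St.</td></tr>
<tr class="data"><td class="name">Joe Bloggs</td><td colspan="2">No contact information available</td></tr>
<tr class="data"><td class="name">Sherry Lloyd</td><td class="phone partial">234-5678 (needs area code)</td><td class="address">456 Maple Rd.</td></tr>
</table>
```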
It would be difficult to write a single extractor pattern that would extract the information for each row because the contents of the cells differ so significantly. The different colored cells and the cell spanning two columns make the data too inconsistent to be easily extracted using a single pattern (which would require lots of regular expressions and might still prove impossible or inconsistent).
Consider this extractor pattern:
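(With the illustrative markup above, it might look like this:)

```
<tr class="data">~@DATARECORD@~</tr>
```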
The ~@DATARECORD@~ extractor pattern token is special in that it defines the block of data to which you wish to apply sub-extractor patterns. Sub-extractor patterns cannot be applied to a token with a name other than DATARECORD.
If applied to the HTML above the extractor pattern would produce the following three matches:
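(Using the illustrative markup, the matches would be the contents of the three data rows:)

```
<td class="name">Juan Ferrero</td><td class="phone">111-222-3333</td><td class="address">123 Elm St.</td>
<td class="name">Joe Bloggs</td><td colspan="2">No contact information available</td>
<td class="name">Sherry Lloyd</td><td class="phone partial">234-5678 (needs area code)</td><td class="address">456 Maple Rd.</td>
```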
Sub-extractor patterns would allow you to extract individual pieces of information from each row. For example, consider this sub-extractor pattern:
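(Continuing the illustration, a sub-extractor pattern for the name might be:)

```
<td class="name">~@NAME@~</td>
```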
If applied to each of the individual extracted rows above the following three pieces of information would be extracted:
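(One NAME value per extracted row:)

```
Juan Ferrero
Joe Bloggs
Sherry Lloyd
```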
This is a simple case. Now consider the extractor pattern for the phone number:
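(In the illustration, a first attempt might be:)

```
<td class="phone">~@PHONE@~</td>
```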
If applied to each of the individual extracted rows above the following three pieces of information would be extracted:
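(Only Juan Ferrero's row contains a cell matching the pattern exactly:)

```
111-222-3333
(no match)
(no match)
```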
In the case of Sherry Lloyd this presents a serious problem because she does have a phone number listed. It is not selected because of the additional class. Let's adjust the sub-extractor pattern slightly:
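(Again illustrative:)

```
<td class="phone~@nondoublequotes@~">~@PHONE@~</td>
```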
The ~@nondoublequotes@~ token uses the Non-double quotes regular expression, [^"]*, which matches anything from the point where the token sits until it encounters a double quote. With this change Sherry's phone number also gets extracted.
We now have the case of the cell in the second row that spans two columns, which would not get extracted by our current sub-extractor patterns. We may still want this information, however, so we create the following sub-extractor pattern, just in case the cell exists:
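(Matching the spanning cell in the illustrative markup:)

```
<td colspan="2">~@PHONE@~</td>
```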
If applied to our data we'd get the following results:
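(Only Joe Bloggs' row contains the spanning cell:)

```
(no match)
No contact information available
(no match)
```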
When multiple sub-extractor patterns hold a token with the same name (in this case, PHONE), the last one to match determines the value of the token. In this example only one or the other will match for any given row. If both could match, we would want the pattern that extracts the actual phone number to be ordered after the no-data-available pattern.
Sub-extractor patterns aggregate everything that's extracted into a single data set. Using all of our extractor and sub-extractor patterns together we'd get the following data set:
| Data record # | Name | Phone |
|---|---|---|
| Data record #1 | Juan Ferrero | 111-222-3333 |
| Data record #2 | Joe Bloggs | No contact information available |
| Data record #3 | Sherry Lloyd | 234-5678 (needs area code) |