Tips & Suggestions

How to set up special features, best practices, and more.

Why do I keep getting an OutOfMemoryError in screen-scraper's "log/error.log" file?

This usually happens when there is a lot of data in the "last response". First, for the scrapeable file that requests this page, set the tidy option on the "Advanced" tab to "don't tidy HTML". After that, it's likely a matter of gradually increasing the memory allocated to screen-scraper until it's able to load the HTML file. You can adjust memory settings in the "Settings" dialog box (click the wrench icon). Remember that you'll need to restart screen-scraper for memory changes to take effect.

How can I run screen-scraper via a batch file when running on Microsoft Windows Server?

Because of the additional security policies in place in Windows Vista and Windows Server 2008, you may need to interact with screen-scraper in different ways when running a scraping session from a batch file.

Here are some recommendations.

I'm unable to start screen-scraper in server mode. How can I troubleshoot this?

The most likely cause of this is that the Java Runtime Environment that ships with screen-scraper is not compatible with your system. See our Installation Instructions page for help in troubleshooting this.

Screen-scraper won't open, with the message "The JVM could not be started." How can I fix this?

This was probably caused by setting screen-scraper's maximum memory allocation too high. Specifically, in the 32-bit version of screen-scraper running on Windows, the memory allocation must be at or below about 1500 MB in order to open and run properly.

To correct this, open the screen-scraper.vmoptions file in a plain text editor and set the following value to 1500m or lower:

-Xmx1500m
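
If you want to see how a given -Xmx value is picked up by a JVM, a small standalone Java program can report it. The sketch below is not part of screen-scraper; the class name HeapCheck and the helper toMegabytes are made up for illustration. It relies only on the standard Runtime.maxMemory() call, and the figure it prints may be somewhat lower than the -Xmx value because the JVM reserves part of the heap for its own use.

```java
public class HeapCheck {
    // Converts a byte count to whole megabytes (hypothetical helper).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will attempt to use,
        // which is governed by the -Xmx option it was started with.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMegabytes(maxBytes) + " MB");
    }
}
```

Assuming a Java 11 or later runtime is on your path, this can be run directly with "java -Xmx1500m HeapCheck.java". If the JVM refuses to start with that value, the same value is likely too high for screen-scraper on that machine as well.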

Any recommendations on how to handle projects that involve large numbers of scraping sessions?

In cases where you're dealing with large numbers of scraping sessions, it becomes too cumbersome to retain them all in the workbench. Even if you organize them neatly into folders, there will likely still be too many to work with comfortably. Rather than keep all scraping sessions in the workbench at once, we generally find it useful to export and save them all to a central directory, which ideally is under version control using something like Subversion or CVS. When you need to work with a particular scraping session, you simply import it from the repository.

How do I send dynamic POST parameters in screen-scraper?

If you've gone through our first few tutorials, you know that session variables can be embedded in URLs by using a token like this: ~#FOO#~ (see this page for a detailed example). The very same technique can be used with POST parameters. When you create a scrapeable file that uses POST parameters, they'll be displayed under the "Parameters" tab for that scrapeable file. In any of those POST parameters you can use the same type of token.
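
To make the idea of token substitution concrete, here is a minimal standalone Java sketch of how a ~#NAME#~ token could be swapped for a variable's value. It is only an illustration of the concept, not screen-scraper's own implementation: the class TokenDemo, the method expand, and the choice to leave unknown tokens untouched are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDemo {
    // Matches tokens of the form ~#NAME#~ and captures NAME.
    private static final Pattern TOKEN = Pattern.compile("~#(.+?)#~");

    // Replaces each token with the matching value from vars.
    // Tokens with no matching variable are left unchanged (a choice of this sketch).
    static String expand(String template, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in values from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("FOO", "widgets");

        // A POST parameter value consisting only of a token.
        System.out.println(expand("~#FOO#~", vars));
        // A token embedded in a longer value, plus one with no variable set.
        System.out.println(expand("search?q=~#FOO#~&page=~#PAGE#~", vars));
    }
}
```

In screen-scraper itself you would not write any of this; you would simply type the token into the parameter's value under the "Parameters" tab and make sure the session variable is set before the scrapeable file is requested.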

I'm trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?

This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working with these:

How do I extract data from two tables that are basically identical in structure?

This isn't a scenario you'll run into too often, but it's common enough that we decided to include it in the FAQ. At times you may run into a page containing various tables of data. All of the tables are essentially identical in structure, but when you extract the data you want to be able to tell which rows of data came from which tables.