How can I optimize screen-scraper's performance?
Here are some tips:
- Provide more bandwidth to screen-scraper. Bandwidth is by far the biggest factor limiting how fast screen-scraper runs.
- Allocate more memory to screen-scraper. This can be done under the "Settings" dialog box (click the wrench icon) via the "Maximum memory allocation" setting.
- Run long scrapes either from the command line or in server mode. The workbench is designed for creating scraping sessions, not for running them at length; if you run long scrapes from it you may encounter memory problems.
- Only save values in session variables when you have to. This is especially true for data sets extracted by extractor patterns. Each time you save a value in a session variable, screen-scraper keeps it in memory for the life of the scraping session unless you explicitly null it out. For an extractor pattern, checking the "Automatically save the data set generated by this extractor pattern in a session variable" box under the "Advanced" tab tells screen-scraper to retain that entire data set in memory. This is fine for relatively small data sets, but should be avoided for large ones. You can mitigate the performance hit by also checking the "Cache the data set" box (also under the "Advanced" tab), but screen-scraper will still need to read the data set into memory temporarily whenever the variable's value is requested.
- Write data out as it gets extracted. This is a corollary to the previous point. Rather than saving data sets in memory you should instead write scripts that will either write the data out to a file or insert it into a database as it gets extracted. A common way of doing this is to write compiled Java code that takes a DataRecord containing extracted data, and handles inserting it into a database. See "I'd like to insert the data screen-scraper extracts into a database. How do I do that?" for more on this.
- Don't tidy HTML. Skipping the tidy step can make working with extractor patterns a bit trickier, but can save a fair amount of CPU time. You can tell screen-scraper not to tidy HTML by unchecking the "Tidy HTML after scraping?" box found under the "Advanced" tab for a scrapeable file.
- Reuse objects. This is a general principle of programming, and should be followed when using screen-scraper. For example, if you're connecting to a database within screen-scraper scripts, don't disconnect and reconnect each time you need to issue a SQL statement. Instead, keep a connection object in a session variable so that it can be reused, or use a connection pooling library.
- Use compiled code where possible. This will generally mean writing Java code, compiling it into a jar file, then placing it into screen-scraper's "lib/ext" folder. The jar will then be automatically added to screen-scraper's classpath such that you can refer to it in your scripts (e.g., you can include "import" statements in your scripts in order to use your classes).
- Reduce the number of scraping sessions you run in parallel. screen-scraper has the ability to run multiple scraping sessions simultaneously. This is often necessary and desirable, but it can also have an impact on memory usage and the performance of each scraping session. You can set the number of scraping sessions you'd like to allow screen-scraper to run simultaneously by opening the "Settings" dialog box (click on the wrench icon), then adjusting the value labeled "Maximum number of concurrent running scraping sessions".
- Avoid requesting files that are unnecessary. Oftentimes in order to get to the page containing the data you'd like to extract screen-scraper will need to first request a few other pages (e.g., one that handles logging in to the site). It's often worth it to experiment a bit by disabling certain files that you would normally request in your web browser (e.g., frames in a frameset) to see if they're actually required in order to be able to request the page containing the data you want.
- Fix extractor patterns that are timing out. To see if your extractor patterns are timing out, look for a message like this in your log: "Warning! The operation timed out while applying the extractor pattern, so it is being skipped." If you see it, try adding regular expressions to more of the pattern's tokens so that the match is more precise. You can also often avoid timeouts by using sub-extractor patterns instead of full extractor patterns. This lets the extraction be done piecemeal, which is more efficient.
- Disable logging. This can be done in the "Settings" window (click on the wrench icon) under the "Servers" section, by un-checking the box labeled "Generate log files". Only do this once you're satisfied that your scraping sessions are all working as you'd like them to.
- You may also wish to read a blog entry written by one of screen-scraper's developers about how to optimize large scrapes, specifically involving web-page iteration: "Techniques for Scraping Large Datasets".
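
As an illustration of the "write data out as it gets extracted" and "reuse objects" tips above, here is a minimal Java sketch of a helper that could be compiled into a jar and placed in "lib/ext". It opens one output file, keeps that writer open, and writes each record the moment it is handed over, so no data set builds up in memory. The class name RecordCsvWriter, its methods, and the use of a plain Map in place of screen-scraper's DataRecord are assumptions made for this example; they are not part of screen-scraper's API.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Hypothetical helper: streams extracted records to a CSV file one row at a time.
 * Declare the class public when packaging it into a jar for use from scripts.
 */
final class RecordCsvWriter implements AutoCloseable {
    private final BufferedWriter out;    // opened once and reused for every record
    private final List<String> columns;  // fixed column order for every row

    RecordCsvWriter(Path file, List<String> columns) throws IOException {
        this.columns = List.copyOf(columns);
        this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Writes one record immediately; missing fields become empty cells. */
    void write(Map<String, String> record) throws IOException {
        String line = columns.stream()
                .map(column -> escape(record.getOrDefault(column, "")))
                .collect(Collectors.joining(","));
        out.write(line);
        out.newLine();
    }

    /** Quotes a value if it contains a comma, a double quote, or a line break. */
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Flushes buffered rows and releases the file handle. */
    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

In a scraping session, one such object would typically be created at the start, kept in a session variable, called once per extracted record, and closed when the session ends. How you copy fields from a DataRecord into the Map depends on your extractor pattern's token names. The same shape applies to a database: hold one connection open and issue an insert per record.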