Screen-scraping is the practice of extracting information from web sites so that it can be used in other contexts. It has its roots in an earlier practice of reading the display of a mainframe terminal, then re-purposing that information via character recognition or some other method in order to preserve the functionality of legacy applications.
Where possible, the preferred method for getting information presented on a web site is an RSS or other XML-based feed. Because data extracted from web sites is often used directly in existing applications, SOAP is another possible route to the needed information. Unfortunately, it's not always possible to get information via RSS or SOAP, and that is where screen-scraping comes in. Take a look at our solutions page for specific examples of screen-scraping.
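By way of illustration, the sketch below reads items from an RSS 2.0 feed using only Python's standard library; the feed URL is hypothetical and stands in for whatever feed a given site actually provides.

```python
# Minimal sketch: pull items from an RSS 2.0 feed using only the standard library.
# The feed URL is hypothetical; substitute the feed you actually need.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/news/feed.xml"  # hypothetical feed URL

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

# Standard RSS 2.0 places items under channel/item, with title and link children.
for item in tree.getroot().findall("./channel/item"):
    title = item.findtext("title", default="")
    link = item.findtext("link", default="")
    print(f"{title}: {link}")
```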
While it's typically fairly easy for a person to log in to a web site, navigate to a particular page, and copy information out of a document, a machine needs a lot more help. Web pages are obviously designed to be viewed and used by humans, so in screen-scraping we typically need to take the same actions a human would take when copying data from a web page. There are typically three phases in scraping information from a given page: first, the page containing the data is requested and downloaded; second, the desired information is extracted from the HTML, often with patterns such as regular expressions; and third, the extracted data is saved or handed off to another application.
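As a concrete illustration, the sketch below walks through those three phases using Python's standard library; the target URL and the HTML pattern it matches are hypothetical, so both would need to be adapted to the actual page being scraped.

```python
# Minimal sketch of the three phases, assuming a hypothetical page that lists
# product names inside <h2 class="product"> tags.
import csv
import re
import urllib.request

PAGE_URL = "https://example.com/products"  # hypothetical target page

# Phase 1: request and download the page, as a browser would.
with urllib.request.urlopen(PAGE_URL) as response:
    html = response.read().decode("utf-8", errors="replace")

# Phase 2: extract the desired pieces of data from the HTML.
products = re.findall(r'<h2 class="product">(.*?)</h2>', html, re.DOTALL)

# Phase 3: do something with the extracted data -- here, save it to a CSV file.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["product"])
    for name in products:
        writer.writerow([name.strip()])
```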
The purpose of screen-scraper is to dramatically reduce the time required to perform all of these steps, so that you can focus on what to do with the extracted information.
A good portion of the information on the web is copyrighted, which obviously has legal implications for screen-scraping. Use discretion when extracting data from web sites that will be re-purposed. Read more about screen-scraping ethics.