Overview of screen-scraping
What is screen-scraping?
Screen-scraping is the practice of extracting information from web sites so that it can be used in other contexts. It has its roots in an earlier practice: reading the display of a mainframe terminal, then re-purposing that information, via character recognition or some other method, in order to preserve the functionality of legacy applications.
Why do screen-scraping?
If possible, the preferred method for getting information presented on a web site is via something structured, such as an RSS or other XML-based feed. Because data that's extracted from web sites is often used directly in existing applications, a SOAP web service is another possible way to get at the needed information. Unfortunately, it's not always possible to get information via RSS or SOAP, which is where screen-scraping comes in. Take a look at our solutions page for specific examples of screen-scraping.
The basic approach
While it's typically fairly easy for a person to log in to a web site, navigate to a particular page, and copy information out of a document, a machine needs a lot more help. Web pages are obviously designed to be viewed and used by humans, so in screen-scraping we typically need to take the same actions that a human would take when copying data from a web page. There are typically three phases in scraping information from a given page:
- Request the page. This first step may actually be more complex than it sounds. Oftentimes the page that's needed can only be accessed after logging in to a site and following a series of specific links. Your web browser will typically handle things such as tracking cookies and submitting all of the elements of a form for you, but it becomes more of a manual process when done by a computer (see the request sketch after this list).
- Extract the information. Once the web page has been retrieved, the next step is to parse the HTML so that specific pieces of data can be extracted and used within computer code. There are several ways to go about this. One possibility is to apply regular expressions, which often work well since they allow for relatively "fuzzy" searches. Another is to convert the HTML into well-formed XML so that it can be queried using methods such as XPath (see the extraction sketch below).
- Do something with the extracted data. From here the information might be inserted into a database (sketched below) or re-formatted in some way to be presented to a user.
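To make the first phase concrete, here is a minimal sketch in Python using the third-party requests library. The login URL, form field names, and target page are hypothetical; the details will vary from site to site.

```python
import requests

# A Session object keeps cookies across requests, much as a browser would.
session = requests.Session()

# Log in first (hypothetical URL and form field names).
session.post(
    "https://example.com/login",
    data={"username": "alice", "password": "secret"},
)

# With the session cookie in place, request the page that holds the data.
response = session.get("https://example.com/members/report")
html = response.text
```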
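For the extraction phase, the same value can often be pulled out either with a regular expression or by parsing the HTML into a tree and querying it with XPath. A rough sketch, assuming the page contains cells like `<td class="price">19.95</td>` (the lxml library handles the HTML parsing):

```python
import re
from lxml import html as lxml_html

page = '<table><tr><td class="price">19.95</td></tr></table>'

# Approach 1: a regular expression, tolerant of surrounding noise.
prices = re.findall(r'<td class="price">([\d.]+)</td>', page)

# Approach 2: parse the HTML into a tree and query it with XPath.
tree = lxml_html.fromstring(page)
prices_via_xpath = tree.xpath('//td[@class="price"]/text()')

print(prices, prices_via_xpath)  # ['19.95'] ['19.95']
```

In practice the regular-expression approach tends to survive small markup irregularities better, while XPath is easier to read once the document structure is known.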
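And for the final phase, a common destination is a database. A sketch using Python's built-in sqlite3 module, with a hypothetical prices table:

```python
import sqlite3

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS prices (value REAL)")

# Insert the values extracted in the previous step.
prices = ["19.95"]
conn.executemany(
    "INSERT INTO prices (value) VALUES (?)",
    [(float(p),) for p in prices],
)
conn.commit()
conn.close()
```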
The purpose of screen-scraper is to dramatically reduce the time required to perform all of these steps, so that you can focus on what to do with the extracted information.
Legal issues
A good portion of the information on the web is copyrighted, which obviously has legal implications for screen-scraping. Use discretion when extracting data from web sites for re-purposing elsewhere. Read more about screen-scraping ethics.