Getting Started Using screen-scraper
Overview
Using screen-scraper to extract information from web sites typically consists of four main steps:
- Use the proxy server to determine which files to scrape. It's frequently necessary to request a few files before you can get at the file that contains the data you need (e.g. you may need to log in to the site first). The proxy server allows you to surf a site as you normally would, then easily select files you need to have scraped.
- Organize and configure files to be scraped. Once you've selected the files to scrape you'll typically need to organize and sequence them. You'll also usually tweak information related to the files, such as POST data to be sent or authentication tokens.
- Create extractor patterns. Extractor patterns provide an intuitive way to selectively identify snippets of data you want extracted from individual pages.
- Create scripts. Scripts let you do something with the data that gets extracted. This might be writing the data out to a formatted file or inserting the information into a database.
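To make the last two steps concrete, here is a minimal sketch in plain Python of the general idea: pull named fields out of a page with a pattern, then hand the results to a small routine that writes them to a formatted file. It does not use screen-scraper's own pattern syntax or scripting API. The sample page, the regular expression, and the function names (extract_records, write_csv) are all hypothetical stand-ins chosen for illustration.

```python
import csv
import io
import re

# Hypothetical page content standing in for a file that has been scraped.
SAMPLE_PAGE = """
<table>
<tr><td class="name">Widget</td><td class="price">$4.50</td></tr>
<tr><td class="name">Gadget</td><td class="price">$12.00</td></tr>
</table>
"""

# A regular expression playing the role of an extractor pattern:
# each named group marks one snippet of data to pull from the page.
ROW_PATTERN = re.compile(
    r'<td class="name">(?P<name>[^<]*)</td>'
    r'<td class="price">\$(?P<price>[^<]*)</td>'
)


def extract_records(html):
    """Return one dict per pattern match, keyed by group name."""
    return [match.groupdict() for match in ROW_PATTERN.finditer(html)]


def write_csv(records, out):
    """Play the role of a script: write extracted records as CSV."""
    writer = csv.DictWriter(out, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)


records = extract_records(SAMPLE_PAGE)
buffer = io.StringIO()  # an in-memory file; a real script might open a file on disk
write_csv(records, buffer)
print(buffer.getvalue())
```

Keeping extraction and output in separate functions mirrors the split between extractor patterns and scripts described above: the pattern only identifies data, and the script decides what to do with it (here a CSV file, though it could just as well be a database insert).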
The best way to learn to use screen-scraper is by going through our tutorials.
Helpful Links
These links give you a general feel for screen-scraper. They are not representative of everything it can do. Each link simply takes you to another section of this documentation.
On the proxy server:
On the scraping engine:
- An overview of the scraping engine
- A complete example of a simple site set up for scraping
- Running screen-scraper as a server
- Interacting with screen-scraper externally
On extractor patterns:
On scripts:
scraper on 07/16/2010 at 4:26 pm