A scrapeable file is a URL-accessible file that you want to have retrieved as part of a scraping session. These files are the core of screen-scraping as they determine what files will be available to extract data from.
In addition to working with files on remote servers, screen-scraper can also handle files on local file systems. For example, the following is a valid path to designate in the URL field: C:\wwwroot\myweb\my_file.htm.
The objects tree shows which scrapeable files are scraped manually and which are in sequence. Sequenced scrapeable files are displayed with a pound sign (#) on them.
GET parameters can also be embedded in the URL field under the Properties tab.
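As an illustration of what "embedded in the URL" means, the following Python sketch splits a URL into its base address and its GET parameters, then rebuilds it. It only uses the Python standard library and is not part of screen-scraper; the function names and the example.com address are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def split_get_parameters(url):
    """Return (base_url, [(key, value), ...]) for a URL with embedded GET parameters."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters such as "flag=" that have no value
    params = parse_qsl(parts.query, keep_blank_values=True)
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base_url, params


def embed_get_parameters(base_url, params):
    """Append key/value pairs to base_url as a URL-encoded query string."""
    query = urlencode(params)
    return f"{base_url}?{query}" if query else base_url


# Hypothetical URL used only for illustration
base, params = split_get_parameters("http://www.example.com/search.php?q=widgets&page=2")
rebuilt = embed_get_parameters(base, params)
```

Either form describes the same request: the parameters can be listed separately as key/value pairs or written directly into the URL.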
Parameters can be deleted by selecting them and either hitting the Delete key on the keyboard, or by right-clicking and selecting Delete.
Session variables can be used in the Key and Value fields. For example, if you have a POST parameter, username, you might embed a USERNAME session variable in the Value field with the token ~#USERNAME#~. This would cause the value of the USERNAME session variable to be substituted in place of the token at run time.
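The substitution described above can be sketched in Python as follows. This is only a minimal model of the idea, not screen-scraper's own implementation: the `substitute_tokens` function, the set of characters allowed in a variable name, and the choice to leave unknown tokens untouched are all assumptions made for the example.

```python
import re

# Matches tokens of the form ~#NAME#~ (allowed name characters are an assumption)
TOKEN_PATTERN = re.compile(r"~#([A-Za-z0-9_]+)#~")


def substitute_tokens(text, session_variables):
    """Replace each ~#NAME#~ token in text with the value of session variable NAME."""

    def replace(match):
        name = match.group(1)
        # Assumption: a token with no matching session variable is left as-is
        return str(session_variables.get(name, match.group(0)))

    return TOKEN_PATTERN.sub(replace, text)


# Hypothetical session variables and POST parameters
session_variables = {"USERNAME": "jsmith"}
post_parameters = {"username": "~#USERNAME#~", "remember": "1"}
resolved = {key: substitute_tokens(value, session_variables)
            for key, value in post_parameters.items()}
```

Here `resolved["username"]` holds the session variable's value, while parameters without tokens pass through unchanged.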
In the enterprise edition of screen-scraper you can also designate files to be uploaded. This is done by designating FILE as the parameter type. The Key column would contain the name of the parameter (as found in the corresponding HTML form), and the value would be the local path to the file you'd like to upload (e.g., C:\myfiles\this_file.txt).
This button is grayed out if no extractor pattern is currently copied.
This tab holds the various extractor patterns that will be applied to the HTML of this scrapeable file. The inner frame is covered in more detail in the discussion of extractor patterns.
This can be very helpful for pages that are particular about request settings, or where you are getting unexpected results from the page. It is the best place to start when you run into this type of issue.
This tab displays the raw HTTP request from the last time this file was retrieved. It can be useful for debugging and for inspecting the POST and GET parameters that were sent to the server.
The contents shown under this tab might appear different from the original HTML of the page. screen-scraper has the ability to tidy the HTML, which is done to facilitate data extraction. See using extractor patterns for more details.
The most common use for this tab is in generating and testing extractor patterns. You can generate an extractor pattern by highlighting a block of text or HTML, right-clicking, and selecting Generate extractor pattern from selected text.
You can generally recognize when a web site requires this type of authentication because, after requesting the page, a small box will pop up requesting a username and password.
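At the HTTP level, a browser normally shows that pop-up box when the server answers with status 401 and a WWW-Authenticate header. The Python sketch below checks a response for that combination and reports the requested scheme. It is an illustrative helper only; the function name and the sample header values are hypothetical and not taken from screen-scraper.

```python
def auth_challenge_scheme(status_code, headers):
    """Return the scheme named in a 401 challenge (e.g. "Basic"), or None.

    headers is a plain dict of response header names to values.
    """
    if status_code != 401:
        return None
    for name, value in headers.items():
        # Header names are case-insensitive in HTTP
        if name.lower() == "www-authenticate" and value.strip():
            # The scheme is the first whitespace-delimited word of the value
            return value.split(None, 1)[0]
    return None


# Hypothetical response values used only for illustration
scheme = auth_challenge_scheme(401, {"WWW-Authenticate": 'Basic realm="Members"'})
```

If the raw response for a page shows this kind of challenge, the site is asking for a username and password before it will return the content.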
A minor performance hit is incurred, however, when tidying. In cases where performance is critical, Don't Tidy HTML should be selected.