How do I extract data from two tables that are basically identical in structure?

This isn't a scenario you'll see too often, but it's common enough that we decided to include it in the FAQ. At times you may run into a page containing several tables of data. All of the tables are essentially identical in structure, but when you extract the data you want to be able to tell which rows came from which table.

For example, consider this page. If you view the HTML from the page you'll notice that the structure of the two tables is basically the same. If you use a normal extractor pattern that matches a row of data, though, you're going to get all four rows of data, and won't be able to tell which row came from which table. That is, your first inclination might be to use an extractor pattern like this:

It matches the data just fine, but you don't know which table each row came from.
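To see the problem concretely, here is a minimal sketch in Python rather than in screen-scraper's extractor-pattern syntax. The HTML is a hypothetical stand-in for the sample page (two tables, two rows each), and the names `HTML` and `ROW` are made up for illustration. A single generic row pattern returns all four rows in one flat list, with nothing to say which table each came from.

```python
import re

# Hypothetical page: two tables with identical structure.
# This is NOT the FAQ's actual sample page, only an assumed stand-in.
HTML = """
<table>
<tr><td>alpha</td><td>beta</td></tr>
<tr><td>gamma</td><td>delta</td></tr>
</table>
<table>
<tr><td>onyx</td><td>flax</td></tr>
<tr><td>helix</td><td>wax</td></tr>
</table>
"""

# A generic row pattern: it matches any two-cell row in any table.
ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")

rows = ROW.findall(HTML)

# All four rows come back together; the table of origin is lost.
for row in rows:
    print(row)
```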

In situations like these there are two possible approaches. The first is to use regular expressions that match the data in such a way that you are able to differentiate between the table rows. For example, download this scraping session and import it into screen-scraper. If you run it, you'll notice that it extracts the data from each table separately. It does this by using regular expressions that differentiate the data in the first table (whose cells all end with the letter "a") from the data in the second table (whose cells all end with the letter "x"). You can see this by opening the "Table 1 row" or "Table 2 row" extractor patterns, and editing the properties on any of the tokens (e.g., ~@CELL_DATA1@~). If you look under the "Regular Expression" tab, you'll see the expression that makes the match.
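The idea behind the first approach can be sketched in Python as follows. This is only an illustration under assumptions: the HTML is a hypothetical stand-in for the sample page, and the two patterns play the role of the regular expressions attached to the tokens in the "Table 1 row" and "Table 2 row" extractor patterns, whose actual expressions are not shown here. One pattern accepts only cells ending in "a", the other only cells ending in "x".

```python
import re

# Hypothetical page (an assumption, not the FAQ's actual sample page):
# cells in the first table end with "a", cells in the second end with "x".
HTML = """
<table>
<tr><td>alpha</td><td>beta</td></tr>
<tr><td>gamma</td><td>delta</td></tr>
</table>
<table>
<tr><td>onyx</td><td>flax</td></tr>
<tr><td>helix</td><td>wax</td></tr>
</table>
"""

# Each cell group is restricted by a regular expression, much as a token's
# regular expression restricts what that token may match.
TABLE1_ROW = re.compile(r"<tr><td>(\w*a)</td><td>(\w*a)</td></tr>")
TABLE2_ROW = re.compile(r"<tr><td>(\w*x)</td><td>(\w*x)</td></tr>")

table1_rows = TABLE1_ROW.findall(HTML)
table2_rows = TABLE2_ROW.findall(HTML)

print("Table 1:", table1_rows)
print("Table 2:", table2_rows)
```

This works only because the cell contents happen to differ in a way a regular expression can detect; when the two tables hold data of the same shape, the patterns cannot be told apart this way.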

Unfortunately, regular expressions won't always let you distinguish between table rows. The alternative is to handle the data extraction in scripts. Note that this approach works only in the Professional and Enterprise Editions of screen-scraper, and makes use of the scrapeableFile.extractData method. Download this scraping session and import it into screen-scraper. Again, if you run it, you'll notice that it extracts the data from the two tables separately. The scripts here provide the key to extracting the data. Take a look at the "Similar tables--extract table 1 data" script. It gets invoked after the "Table 1 data" extractor pattern matches.
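The general shape of this second approach is a two-step extraction: first capture each table as a block of text, then apply a row pattern to that block alone. The sketch below shows the idea in plain Python under assumptions. The HTML is a hypothetical stand-in for the sample page, and `rows_by_table` is an illustrative helper, not part of screen-scraper. In screen-scraper the second step is the job of the scrapeableFile.extractData call made from the script; its exact usage is not reproduced here.

```python
import re

# Hypothetical page (an assumption, not the FAQ's actual sample page).
HTML = """
<table>
<tr><td>alpha</td><td>beta</td></tr>
<tr><td>gamma</td><td>delta</td></tr>
</table>
<table>
<tr><td>onyx</td><td>flax</td></tr>
<tr><td>helix</td><td>wax</td></tr>
</table>
"""

# Step 1: an outer pattern captures the contents of each table.
TABLE = re.compile(r"<table>(.*?)</table>", re.DOTALL)

# Step 2: a generic row pattern, applied only to one table's text at a time.
ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")


def rows_by_table(html):
    """Return one list of rows per table, in page order (illustrative helper)."""
    return [ROW.findall(table_html) for table_html in TABLE.findall(html)]


tables = rows_by_table(HTML)
for number, table_rows in enumerate(tables, start=1):
    print("Table", number, table_rows)
```

Because the row pattern only ever sees one table's text, the rows stay grouped by table even when the cell contents give no clue about where they came from.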

If you've encountered a similar situation to the one presented here it's possible you can use these examples to tackle the task. Take a careful look through the extractor patterns and scripts to see how they're set up. If you have questions on them or run into any trouble, don't hesitate to post to our support forum.

scraper on 07/03/2008 at 12:34 pm