Extracting data from tables with varying row counts

I am new to screen-scraping, have limited knowledge of HTML, and no knowledge of scripting/Java, etc. I need help creating an extractor pattern for the following:

• I have multiple web pages.
• Each web page contains a single table.
• Every table has exactly the same format on each page; only the data in the table cells changes.
• Each table contains a different number of rows (e.g., the table on page 1 has 5 rows, the one on page 2 has 46 rows, and so on).
• Each row in each table constitutes a single record.

What I want to do is extract the data from each table, keeping the record/row data together, and export the data to an Excel spreadsheet. Currently, I am using Excel's web query, which works fine, but I do not want to repeat this action over 200 times to get all the data I need (tedious, to say the least).

So far, I have been unable to extract the data properly - either the extractor pattern returns only a single row, or it returns all the rows strung together as one very long row, which completely ruins the individual records.

Below is source from one of the pages:

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

National UFO Reporting Center
Monthly Report Index For 08/1943
Click on links for details

NUFORC Home

Date / Time | City | State | Shape | Duration | Summary | Posted
8/15/43 00:00 | unknown | LA | Light | 10-15 minutes | 1942 Louisiana midnight light like airplane 10-15 minutes, then corkscrewing up into the atmosphere | 2/14/08
8/13/43 22:00 | Portsmouth (UK/England) | | Sphere | 10mins | vertical lines of 5-6 crescent moons, some 10+ rows in a night sky over naval base. | 4/16/05

NOTE: I am not interested in keeping the DATE/TIME as a hyperlink; I only want the actual data as a date/time. Also, I do not need the posted-date info at all, but if it is easier to leave it in, I can clean the data later. The most important thing is to keep the record data together, with each row being a separate record. Please be very specific and thorough in your answer, as I am a noob and just learning. Thank you!

Longwinded (sorry)

Hello! Solving exactly this kind of problem is why we built screen-scraper :)

Quick rundown: Extractor patterns search over the page and find as many matches as they can. Thus, a single pattern could match 600 times on a single page if it can find that many.

The HTML is formatted so that each table row begins with a <tr> tag, and it just so happens that every data row carries valign="TOP", which helps identify the rows you're after.

There are two ways you could approach this, and they're about equally good, so we'll only worry about the simpler one:

Since the pattern is going to match multiple times all by itself, you may as well just make the pattern consist of a single row:

<tr valign="TOP"><td ~@junk_parameters@~><a href="~@junk_some_url@~">~@DATETIME@~</a></td><td ~@junk_parameters@~>~@CITY@~</td><td ~@junk_parameters@~>~@STATE@~</td><td ~@junk_parameters@~>~@SHAPE@~</td><td ~@junk_parameters@~>~@DURATION@~</td><td ~@junk_parameters@~>~@SUMMARY@~</td><td ~@junk_parameters@~>~@POSTED@~</td>

The patterns I'm using in each of those "tokens" are as follows:
  • junk_parameters: [^>]* (one of the default entries in the dropdown box "HTML tag parameters")
  • junk_some_url: [^"]* (default entry "Non double-quote")
  • DATETIME, POSTED: [^<>]* (Default entry "Non HTML tags")
  • CITY, STATE, SHAPE, DURATION, SUMMARY: Don't put anything in the regex textbox for these tokens' patterns. Sometimes the page is missing data for a cell, like for a state or shape. Leaving it blank will allow the pattern to simply match whatever's between the starting point of the pattern and the very next "</td>" tag.

Now it will match as many times as it can (which should be equal to the number of rows on the page).
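
If it helps to see what's going on under the hood, an extractor pattern behaves much like a plain Java regular expression where every token becomes a capturing group and the matcher is applied repeatedly. Here is a rough standalone sketch (the sample HTML and group patterns are simplified stand-ins for illustration, not screen-scraper's actual internals):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowMatchDemo
{
    public static void main(String[] args)
    {
        // Made-up two-row sample shaped like the page's table HTML.
        String html =
            "<tr valign=\"TOP\"><td>8/15/43 00:00</td><td>unknown</td></tr>"
          + "<tr valign=\"TOP\"><td>8/13/43 22:00</td><td>Portsmouth</td></tr>";

        // Each ~@TOKEN@~ acts like a capturing group; the default
        // "Non HTML tags" pattern is [^<>]*.
        Pattern row = Pattern.compile(
            "<tr valign=\"TOP\"><td>([^<>]*)</td><td>([^<>]*)</td></tr>");

        // find() keeps going until the page runs out of rows -- one
        // match per row, which is why one pattern yields one record per row.
        Matcher m = row.matcher(html);
        while (m.find())
            System.out.println(m.group(1) + " | " + m.group(2));
    }
}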

From there, you can make two little scripts: one to get rid of the extra HTML in your data (right now, the empty cells actually contain a "<br />"), and a second one to write your data out to a CSV file (a simple delimited-text format that Excel opens directly).

Script number one gets rid of those silly "<br />" tags. Set it to run after your pattern, and make sure the "When to run" box says "After each pattern application".
// Set the language of the script using the drop-down/combo box
// just above the big text field for the script; set it to JAVA.
//
// You can copy this script straight out of this reply and paste
// it into screen-scraper.

import java.util.Enumeration;

// Walk every variable in the current data record and strip any HTML
// tags from its value (this removes the stray "<br />" in empty cells).
for (Enumeration e = dataRecord.keys(); e.hasMoreElements(); )
{
    String varName = (String) e.nextElement();
    Object value = dataRecord.get(varName);
    if (value != null)
    {
        dataRecord.put(varName, value.toString().replaceAll("<[^>]*>", ""));
    }
}
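
If you want to convince yourself the regex is safe before wiring it in, you can test the replaceAll() call on a made-up cell value (in screen-scraper's interpreted Java, or any Java scratch file):

// A quick sanity check of the tag-stripping regex (hypothetical value):
String cell = "Portsmouth<br />";
System.out.println(cell.replaceAll("<[^>]*>", "")); // prints "Portsmouth"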

Aaaaand, script number 2. Set this one to run after the pattern, and make sure the "When to run" box says "After pattern is applied":
// Again, set the script language to JAVA. This one is a single line:

dataSet.writeToFile( "C:/site_data/extracted_data.csv" );

In that second one, you can change the last line to point at whatever folder you want, and screen-scraper will put your spreadsheet output there. (Don't use backslashes between folder names; use forward slashes as in the example.)
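
The reason is that a backslash has to be doubled inside a Java string literal, so if you insist on backslashes the path must look like the second line below; both point at the same file:

// Equivalent paths -- the forward-slash version is easier to read:
dataSet.writeToFile( "C:/site_data/extracted_data.csv" );
dataSet.writeToFile( "C:\\site_data\\extracted_data.csv" );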

As a final note, you should know that the second script simply dumps all of your extractor pattern variables into the CSV file. Consequently, our two junk variables, "junk_parameters" and "junk_some_url", will get their own columns in the CSV, too. With this quick method of writing out to a CSV file, you'll have to delete those columns manually, since you probably don't want them :)

If you need help with anything more, be sure to let me know!

Tim

Cannot get scripts to work...

I cannot get either script to work at all. I pasted both scripts exactly as written, and they appear to run in the session, but then... nothing. No output, and no "<br />" removal.

A sample

I admit that I had not tested the scripts before posting my reply, but they seem to be performing properly in my setup.

I had written out more instructions here, but I decided to edit my post and just upload a sample of the scraping session that I have on my machine.

This one runs through every page on the site you supplied, and goes through the tables and writes them all out to CSV files, broken up by month (identical to the way the pages are found on the website).
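
To give you a rough idea of how a session can walk the whole site, here is the general shape of a looping script (the scrapeable-file name, URL variables, and date range below are illustrative assumptions, not the exact contents of my attachment):

// Sketch only: "Monthly index" is a hypothetical scrapeable file whose
// URL would reference ~#YEAR#~ and ~#MONTH#~ as session variables.
for (int year = 1940; year <= 2008; year++)
{
    for (int month = 1; month <= 12; month++)
    {
        // Zero-pad the month, e.g., 8 becomes "08".
        String mm = (month < 10 ? "0" : "") + month;
        session.setVariable("YEAR", String.valueOf(year));
        session.setVariable("MONTH", mm);
        session.scrapeFile("Monthly index");
    }
}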

Often the best way to learn something is to do it yourself, but there are many other instances where having a model to work from is far more beneficial. I hope this helps you tackle your project!

Even if you want to build the scrape yourself for the sake of learning, you should know that the "output" script I supplied above will only work in the Pro or Enterprise editions of screen-scraper; I had forgotten about that limitation in Basic. However, the scraping session attached to this post has an updated script that will work no matter which edition you are running.

In addition, you will no longer have to erase undesired columns. Only the ones you want will appear.

To change the folder your output files are saved to, change the path in the first line of the "Write to CSV" script. I've left it defaulted to "C:/".
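
For anyone reading along without downloading the attachment, the heart of a version-independent "Write to CSV" script looks roughly like this (run it "After each pattern application"; treat it as a sketch using the token names from earlier in the thread, not the exact script in the download):

import java.io.FileWriter;

// Append one CSV line per data record, keeping only the wanted columns.
FileWriter out = new FileWriter("C:/site_data/ufo_reports.csv", true);

String[] columns = { "DATETIME", "CITY", "STATE", "SHAPE", "DURATION", "SUMMARY" };
for (int i = 0; i < columns.length; i++)
{
    Object value = dataRecord.get(columns[i]);
    String text = (value == null) ? "" : value.toString();
    // Quote each field so commas inside summaries don't split columns.
    out.write("\"" + text.replaceAll("\"", "\"\"") + "\"");
    if (i < columns.length - 1)
        out.write(",");
}
out.write("\n");
out.close();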

Tim

Right-click and "Save as..." to download the scraping session I've quickly put together. Then import it using screen-scraper's "File" menu.

Can't Imagine Life Before SS -

Amazing product - hands-down a necessity for those of us who design databases professionally. In an enterprise setting, the potential applications would be invaluable, and SS would quickly become indispensable. I have forwarded this information to contacts engaged in enterprise work with private contractors - again, an amazing program...

Thank You.