trouble parsing similar tables

Hey all,

I'm attempting to scrape the staff information from a page similar to this:

I can get the information for the superintendent just fine, but only because it is the first table in the list.

When I try to get the data for the athletic director (AD), the extractor pattern first finds the name & phone from the superintendent, then it finds the 'Athletic Director' label, and (kind of) gets the correct data after that.

So... in other words, screen-scraper doesn't appear to 'look ahead' and make sure that the HTML in the file it's scraping matches the entire extractor pattern. If it finds a partial match, it uses that, then continues through the HTML file to the rest of the pattern. Is this how it's supposed to work?

The only real difference in these patterns is the value of the 'Position' field in the middle of the table.

extractor pattern for the superintendent (works):

Name

~@SUPER@~

Phone

~@PHONES@~

Position

Superintendent

Fax

~@FAXS@~

Email

~@IGNORE@~

extractor pattern for the AD (doesn't work):


Name	~@NAMEAD@~	Phone	~@PHONEAD@~
Position	Athletic Director	Fax	~@FAXAD@~
Email	~@IGNORE@~

and last of all, session variables from the AD extractor pattern:

Form submission: Extracting data for pattern "AD"
Form submission: The following data elements were found:
AD--DataRecord 0:
FAXAD=(928) 773-8247
Storing this value in a session variable.
NAMEAD=Dr. Kevin Brown
Storing this value in a session variable.
[email protected]
PHONEAD=(928) 527-6001 Position Superintendent Fax (928) 527-6026 Email [email protected]


Name	Dave Roth	Phone	(982) 773-8212
Position	Principal	Fax	(928) 773-8247
Email	[email protected]


Name	Becky Gonzales	Phone	(928) 773-8212
Position	Principal Secretary	Fax	(928) 773-8247
Email	[email protected]


Name	Dana Gruver	Phone	(928) 773-8200
Position	Director
Email	[email protected]


Name	Steve Bonderud	Phone	(928) 773-8216
Position	Assistant Principal
Email	[email protected]


Name	Joe Tissaw	Phone	(928) 773-8205
Position	Assistant Principal
Email	[email protected]

Name

GEORGE MOATE

Phone

(928) 773-8215
Storing this value in a session variable.

So... screen-scraper finds the beginning of the AD pattern in the superintendent (SI) table, grabs the SI Name, then grabs all the HTML between the SI Phone and the AD Phone and puts it in the PHONEAD variable, then grabs the (correct) AD fax and email.
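The behavior described above can be reproduced with an ordinary regular expression. This is only a sketch: it assumes each unconstrained token acts roughly like a lazy wildcard, and the `page` string is a simplified, tag-stripped stand-in for the real page, not its actual markup.

```python
import re

# Hypothetical, simplified stand-in for the staff page (HTML tags stripped).
page = (
    "Name Dr. Kevin Brown Phone (928) 527-6001 "
    "Position Superintendent Fax (928) 527-6026 "
    "Name GEORGE MOATE Phone (928) 773-8215 "
    "Position Athletic Director Fax (928) 773-8247"
)

# Assumption: an unconstrained token behaves like a lazy "match anything".
pattern = re.compile(
    r"Name (?P<NAMEAD>.*?) Phone (?P<PHONEAD>.*?) "
    r"Position Athletic Director Fax (?P<FAXAD>.*)"
)

match = pattern.search(page)
name = match.group("NAMEAD")    # taken from the first "Name" on the page
phone = match.group("PHONEAD")  # swallows everything up to the AD's Position
fax = match.group("FAXAD")

print("NAMEAD =", name)
print("PHONEAD =", phone)
print("FAXAD =", fax)
```

The match is anchored at the first "Name" on the page because nothing stops the PHONEAD wildcard from stretching across the superintendent's remaining fields until the literal "Athletic Director" finally appears.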

I've tried changing how much I included in the patterns, tried using the tidied data generated by screen-scraper, etc. I worked on it for about an hour and still haven't figured it out. I also played around with sub-extractor patterns, but didn't get anything to work. The tutorial only covers simple examples, too.

Any other ideas? thanks,
-Dave

(sorry for the long post, I feel more information > not enough information)

hooziewhatsit on 03/06/2006 at 5:14 pm

Not a problem, Dave. Good luck with your project.

Kind regards,

Todd

todd on 03/08/2006 at 7:19 pm

Todd,

Thanks for the reply, even though it's not the one I necessarily wanted :D

I was hoping to be able to only pull the data out of certain tables; not all of them.

However, since I posted this, we found a completely different way to find the data we need, so I won't be needing screen-scraper anymore (it's still a very cool program though!).

If I were to continue... I would probably use the subextractor patterns like you suggested. Then, when I went to print out only the records I wanted, I would probably do a check on the Position variable, and act accordingly.
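A rough sketch of that check, written in Python rather than as a screen-scraper script. The `records` list and the `WANTED` set are hypothetical; they stand in for whatever the sub-extractor patterns would produce.

```python
# Hypothetical records, shaped like the output of a sub-extractor pass.
records = [
    {"Name": "Dr. Kevin Brown", "Position": "Superintendent", "Phone": "(928) 527-6001"},
    {"Name": "Dana Gruver", "Position": "Director", "Phone": "(928) 773-8200"},
    {"Name": "GEORGE MOATE", "Position": "Athletic Director", "Phone": "(928) 773-8215"},
]

# Positions worth keeping; everything else is extracted but ignored.
WANTED = {"Superintendent", "Athletic Director"}

selected = [r for r in records if r["Position"] in WANTED]

for r in selected:
    print(f'{r["Position"]}: {r["Name"]}, {r["Phone"]}')
```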

thanks for the help,
-Dave

hooziewhatsit on 03/08/2006 at 6:40 pm

Hi Dave,

Any elements that will be variable between the records you want to extract should have extractor pattern tokens. In your "superintendent" extractor pattern you have this line:

<td width="203">Superintendent</td>

You'd actually want to do something like this instead:

<td width="203">~@POSITION@~</td>

Since the position will differ between the records (e.g., Superintendent, Athletic Director).

In this particular case it also looks like your extractor pattern is picking up a lot of HTML that you don't want. Since none of the fields you're extracting will have HTML in them I would recommend using the "Non-HTML tags" regular expression.
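To illustrate why constraining a token helps, here is a small Python sketch. It assumes the "Non-HTML tags" option behaves roughly like the character class `[^<>]*`, and the `html` string is hypothetical markup loosely modeled on the cells quoted in this thread.

```python
import re

# Hypothetical markup, loosely modeled on the table cells in this thread.
html = (
    '<td>Name</td><td width="203">Dr. Kevin Brown</td>'
    '<td>Position</td><td width="203">Superintendent</td>'
    '<td>Name</td><td width="203">GEORGE MOATE</td>'
    '<td>Position</td><td width="203">Athletic Director</td>'
)

TAIL = r'</td><td>Position</td><td width="203">Athletic Director</td>'

# Unconstrained token: may run across tags into the next record.
unconstrained = re.compile(r'<td>Name</td><td width="203">(.*?)' + TAIL)

# Constrained token: may not contain "<" or ">", so it stays inside one cell.
constrained = re.compile(r'<td>Name</td><td width="203">([^<>]*)' + TAIL)

loose = unconstrained.search(html).group(1)
tight = constrained.search(html).group(1)

print("unconstrained:", loose)
print("constrained:  ", tight)
```

With the constrained token, the attempt that starts in the superintendent's table fails outright, so the match moves on to the table that really contains "Athletic Director".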

I took a shot at extracting the data from this page, and I think you'd actually be better off using sub-extractor patterns (see our third tutorial for an example). In some cases the fax number is missing, which would cause your extractor pattern to miss some entries if you're just using one large pattern to grab all of the data.

For example, I used this as my extractor pattern:

<table width="100%" border="0" cellspacing="2" cellpadding="5">
<tr bgcolor="#CCCCCC">
~@DATARECORD@~
</table>

And these as sub-extractor patterns:

<strong>Phone</strong></td>
<td width="203">~@PHONE@~</td>

<strong>Position</strong></td>
<td width="203">~@POSITION@~</td>

Which seemed to extract the data just fine.
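The same two-stage idea can be sketched outside screen-scraper in Python. This is an approximation under stated assumptions: the `html` fragment is hypothetical, the helper names (`RECORD`, `field`, `extract`) are made up for illustration, and the field tokens are assumed to exclude tag characters.

```python
import re

# Hypothetical page fragment; the second record has no Fax row.
html = """
<table width="100%" border="0" cellspacing="2" cellpadding="5">
<tr bgcolor="#CCCCCC">
<td><strong>Phone</strong></td>
<td width="203">(928) 527-6001</td>
<td><strong>Position</strong></td>
<td width="203">Superintendent</td>
<td><strong>Fax</strong></td>
<td width="203">(928) 527-6026</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="2" cellpadding="5">
<tr bgcolor="#CCCCCC">
<td><strong>Phone</strong></td>
<td width="203">(928) 773-8200</td>
<td><strong>Position</strong></td>
<td width="203">Director</td>
</tr>
</table>
"""

# Stage 1: one match per table, like the ~@DATARECORD@~ pattern.
RECORD = re.compile(
    r'<table width="100%" border="0" cellspacing="2" cellpadding="5">\s*'
    r'<tr bgcolor="#CCCCCC">(.*?)</table>',
    re.DOTALL,
)

def field(label):
    """Stage 2 pattern: the value cell that follows a bold label cell."""
    return re.compile(
        r"<strong>" + re.escape(label)
        + r'</strong></td>\s*<td width="203">([^<>]*)</td>'
    )

FIELDS = {name: field(name) for name in ("Phone", "Position", "Fax")}

def extract(page):
    records = []
    for body in RECORD.findall(page):
        record = {}
        for name, pattern in FIELDS.items():
            m = pattern.search(body)
            # A missing field yields None instead of losing the whole record.
            record[name] = m.group(1).strip() if m else None
        records.append(record)
    return records

records = extract(html)
for record in records:
    print(record)
```

Because each field is searched for independently inside one record, a table without a fax number still yields its phone and position.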

Kind regards,

Todd Wilson

todd on 03/07/2006 at 10:33 am
