trouble parsing similar tables
Hey all,
I'm attempting to scrape the staff information from a page similar to this:
I can get the information for the superintendent just fine, but only because it is the first table in the list.
When I try to get the data for the athletic director (AD), the extractor pattern first finds the name & phone from the superintendent, then it finds the 'Athletic Director' label, and (kind of) gets the correct data after that.
So.... in other words, screen-scraper doesn't appear to 'look ahead' and make sure that the html in the file it's scraping matches the entire extractor pattern. If it finds a partial match, it uses that, then continues through the html file to the rest of the pattern. Is this how it's supposed to work?
The only real difference in these patterns is the value of the 'Position' field in the middle of the table.
extractor pattern for the superintendent (works):
extractor pattern for the AD (doesn't work):
Name | ~@NAMEAD@~ | Phone | ~@PHONEAD@~ |
Position | Athletic Director | Fax | ~@FAXAD@~ |
~@IGNORE@~ |
and last of all, session variables from the AD extractor pattern:
Form submission: The following data elements were found:
AD--DataRecord 0:
FAXAD=(928) 773-8247
Storing this value in a session variable.
NAMEAD=Dr. Kevin Brown
Storing this value in a session variable.
[email protected]
PHONEAD=(928) 527-6001
Name | Dave Roth | Phone | (982) 773-8212 |
Position | Principal | Fax | (928) 773-8247 |
[email protected] |
Name | Becky Gonzales | Phone | (928) 773-8212 |
Position | Principal Secretary | Fax | (928) 773-8247 |
[email protected] |
Name | Dana Gruver | Phone | (928) 773-8200 |
Position | Director | ||
[email protected] |
Name | Steve Bonderud | Phone | (928) 773-8216 |
Position | Assistant Principal | ||
[email protected] |
Name | Joe Tissaw | Phone | (928) 773-8205 |
Position | Assistant Principal | ||
[email protected] |
Name | GEORGE MOATE | Phone | (928) 773-8215
Storing this value in a session variable.
So... screen-scraper finds the beginning of the AD pattern in the superintendent (SI) table, grabs the SI Name, then grabs all the html between the SI Phone and the AD Phone and puts it in the PHONEAD variable, then grabs the (correct) AD fax and email.
I've tried changing how much I included in the patterns; tried using the tidied data generated by screen-scraper; etc. Worked for about an hour and still haven't figured it out. Also played around with sub-extractor patterns, but didn't get anything to work. The tutorial only goes over simple examples also.
Any other ideas?
thanks,
(sorry for the long post, I feel more information > not enough information)
trouble parsing similar tables
Not a problem, Dave. Good luck with your project.
Kind regards,
Todd
trouble parsing similar tables
Todd,
Thanks for the reply, even though it's not the one I necessarily wanted :D
I was hoping to be able to only pull the data out of certain tables; not all of them.
However, since I posted this, we found a completely different way to find the data we need, so I won't be needing screen-scraper anymore (it's still a very cool program though!).
If I were to continue... I would probably use the subextractor patterns like you suggested. Then, when I went to print out only the records I wanted, I would probably do a check on the Position variable, and act accordingly.
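As a rough illustration of that last step (not screen-scraper code, just a plain Python sketch): assume each extracted record ends up as a dictionary with a POSITION key. The record values and the `select_by_position` helper below are made up for the example.

```python
# Hypothetical records, shaped like the output of a NAME/PHONE/POSITION extraction.
records = [
    {"NAME": "Pat Example", "PHONE": "(555) 010-0001", "POSITION": "Superintendent"},
    {"NAME": "Sam Sample", "PHONE": "(555) 010-0002", "POSITION": "Principal"},
    {"NAME": "Lee Placeholder", "PHONE": "(555) 010-0003", "POSITION": " athletic director "},
]

# Positions we actually care about, stored lower-case for comparison.
WANTED = {"superintendent", "athletic director"}


def select_by_position(rows, wanted):
    """Keep only rows whose POSITION (trimmed, case-insensitive) is in `wanted`."""
    selected = []
    for row in rows:
        position = (row.get("POSITION") or "").strip().lower()
        if position in wanted:
            selected.append(row)
    return selected


selected = select_by_position(records, WANTED)
for row in selected:
    print(row["NAME"], "|", row["POSITION"].strip())
```

Everything gets extracted first, and the filtering happens afterwards, so the extractor pattern itself never needs to know which position it is looking for.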
thanks for the help,
-Dave
trouble parsing similar tables
Hi Dave,
Any elements that will be variable between the records you want to extract should have extractor pattern tokens. In your "superintendent" extractor pattern you have this line:
<td width="203">Superintendent</td>
You'd actually want to do something like this instead:
<td width="203">~@POSITION@~</td>
Since the position will differ between the records (e.g., Superintendent, Athletic Director).
In this particular case it also looks like your extractor pattern is picking up a lot of HTML that you don't want. Since none of the fields you're extracting will have HTML in them, I would recommend using the "Non-HTML tags" regular expression for those tokens.
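To see why restricting a token matters, here is a small Python sketch of the idea. It is not screen-scraper itself: it assumes an unrestricted token behaves roughly like a lazy wildcard (`.*?`) and that a "Non-HTML tags" style token behaves roughly like `[^<>]*`. The markup and names are made up.

```python
import re

# Hypothetical two-table page, loosely modelled on a staff listing.
HTML = (
    "<table><tr><td>Name</td><td>Pat Example</td></tr>"
    "<tr><td>Position</td><td>Superintendent</td></tr></table>"
    "<table><tr><td>Name</td><td>Lee Placeholder</td></tr>"
    "<tr><td>Position</td><td>Athletic Director</td></tr></table>"
)


def build_pattern(token_regex):
    """Build a 'Name ... Athletic Director' pattern with the given token regex."""
    return re.compile(
        r"<td>Name</td><td>(" + token_regex + r")</td></tr>"
        r"<tr><td>Position</td><td>Athletic Director</td>",
        re.DOTALL,
    )


# Unrestricted token: free to swallow tags, so it can start in the wrong table.
loose_name = build_pattern(r".*?").search(HTML).group(1)

# Restricted token: cannot cross a tag, so it must sit right next to the label.
strict_name = build_pattern(r"[^<>]*").search(HTML).group(1)

print("loose :", loose_name)
print("strict:", strict_name)
```

The loose version starts matching at the first "Name" cell and stretches across the table boundary until it reaches "Athletic Director", which is the same kind of over-capture that shows up in PHONEAD above. The strict version cannot match in the first table at all, so it only picks up the name that directly precedes the Athletic Director row.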
I took a shot at extracting the data from this page, and I think you'd actually be better off using sub-extractor patterns (see our third tutorial for an example). In some cases the fax number is missing, which would cause your extractor pattern to miss some entries if you're just using one large pattern to grab all of the data.
For example, I used this as my extractor pattern:
<tr bgcolor="#CCCCCC">
~@DATARECORD@~
</table>
And these as sub-extractor patterns:
<td width="203">~@NAME@~</td>
<td width="203">~@PHONE@~</td>
<td width="203">~@POSITION@~</td>
Which seemed to extract the data just fine.
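For anyone who wants to see the two-stage idea outside of screen-scraper, here is a rough Python sketch. The outer regex plays the role of the DATARECORD pattern and the inner ones play the role of sub-extractor patterns. The markup, the label cells used to tell the fields apart, and the sample data are all assumptions for illustration, not the real page.

```python
import re

# Hypothetical markup: two staff tables, the first one without a fax number.
HTML = """
<table>
<tr bgcolor="#CCCCCC"><td>Name</td><td width="203">Sam Sample</td>
<td>Phone</td><td width="203">(555) 010-0002</td></tr>
<tr><td>Position</td><td width="203">Director</td></tr>
</table>
<table>
<tr bgcolor="#CCCCCC"><td>Name</td><td width="203">Lee Placeholder</td>
<td>Phone</td><td width="203">(555) 010-0003</td></tr>
<tr><td>Position</td><td width="203">Athletic Director</td>
<td>Fax</td><td width="203">(555) 010-0009</td></tr>
</table>
"""

# Outer pattern: one match per record, from the shaded row to the end of its table.
RECORD = re.compile(r'<tr bgcolor="#CCCCCC">(.*?)</table>', re.DOTALL)


def field(label):
    """Inner pattern: a label cell followed by a value cell containing no tags."""
    return re.compile(r"<td>" + label + r'</td>\s*<td width="203">([^<>]*)</td>')


FIELDS = {
    "NAME": field("Name"),
    "PHONE": field("Phone"),
    "POSITION": field("Position"),
    "FAX": field("Fax"),
}


def extract(html):
    """Apply each inner pattern independently inside every record block."""
    rows = []
    for block in RECORD.findall(html):
        row = {}
        for key, pattern in FIELDS.items():
            match = pattern.search(block)
            row[key] = match.group(1).strip() if match else None
        rows.append(row)
    return rows


records = extract(HTML)
for row in records:
    print(row)
```

Because each inner pattern is applied on its own, a missing fax just leaves that field empty instead of causing the whole record to be skipped.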
Kind regards,
Todd Wilson