How to compose a hierarchy?
We've been using our own tools to scrape some sites, and now that we've seen the light of screen-scraper.com we're really excited about moving in that direction.
I'm currently doing a scrape manually in which I turn the data on an HTML page into an XML document with a hierarchy that can be transformed into something else.
In any case, I'm not clear on how to do this with screen-scraper yet. I can get a dataset back, but it seems to be two-dimensional (rows and columns), with no way to nest one record inside another.
Here is an example of a page that could be scraped -- how would you return the data? We're not databasing or saving anything, merely taking a scraped page and passing it on as understandable data.
http://www.fnirt.com/_yankee/sampledata.html
BTW, I'm implementing all of this using the SOAP interface and I'm really loving it. SOO much easier than doing everything by hand!
Oh yeah, just to make this post a complete headache -- any advice on scraping a binary file (e.g. PDF or TIF) and passing that on through the SOAP interface? In old-skool .Net we're passing a byte array through a web service. Thoughts?
(With the TONS of money I'm saving our company I'm not going to ask for a cash bonus, but for a license for screen-scraper for home use!)
How to compose a hierarchy?
Hi,
That's great to hear that screen-scraper is working out for you so far. Feel free to continue to send along any issues we can help with.
This particular case is a bit tricky, but it can certainly be done. If I were to approach it, I would probably do most of the extraction in a script, using the scrapeableFile.extractData( String text, String name ) method. The basic approach is to define extractor patterns that grab the major pieces, such as entire tables, and then apply other extractor patterns to the text of each table to grab its rows. If you perform all of the extraction within scripts, you have pretty fine-grained control over how the data gets saved and structured. For example, you might just write the data out to an XML file as it gets scraped.
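To make the two-stage idea concrete, here is a rough stand-alone sketch in Python. It does not use screen-scraper's API at all; plain regular expressions stand in for the extractor patterns, and the HTML, the pattern names, and the XML element names (page, section, item) are all made up for illustration, since the markup of the sample page isn't reproduced here. The outer pattern grabs each table, the inner pattern is then run only against that table's text, and the nesting of the two loops is what gives the XML its hierarchy.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page content; the real sample page's markup is an unknown here.
SAMPLE_HTML = """
<h2>Pitchers</h2>
<table>
<tr><td>Smith</td><td>12</td></tr>
<tr><td>Jones</td><td>7</td></tr>
</table>
<h2>Catchers</h2>
<table>
<tr><td>Lee</td><td>3</td></tr>
</table>
"""

# Outer "extractor pattern" (assumed layout): a heading followed by a whole table.
TABLE_PATTERN = re.compile(
    r"<h2>(?P<title>.*?)</h2>\s*<table>(?P<body>.*?)</table>", re.S
)
# Inner "extractor pattern": one two-cell row, applied only to a table's body.
ROW_PATTERN = re.compile(
    r"<tr><td>(?P<name>.*?)</td><td>(?P<value>.*?)</td></tr>", re.S
)


def build_hierarchy(html):
    """Return an XML tree with one <section> per table and one <item> per row."""
    root = ET.Element("page")
    for table in TABLE_PATTERN.finditer(html):
        section = ET.SubElement(root, "section", title=table.group("title"))
        # Second pass: extract rows from the text captured by the first pass.
        for row in ROW_PATTERN.finditer(table.group("body")):
            item = ET.SubElement(section, "item", name=row.group("name"))
            item.text = row.group("value")
    return root


if __name__ == "__main__":
    print(ET.tostring(build_hierarchy(SAMPLE_HTML), encoding="unicode"))
```

Inside a screen-scraper script the same shape would apply, with the extractData call taking the place of each regular expression pass; the exact pattern names and return handling depend on how your scraping session is set up.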
Does that help? Feel free to post a reply if I can clarify.
Kind regards,
Todd Wilson