'font' gets replaced with 'span'

I'm getting some strange results when I examine my last scraped data.

I am scraping two pages from a site, which are identical apart from being for two different products. One gives:

 <h2><b>Price:</b> <span style="color: 990000">£1349.00</span></h2>

The other gives:

 <h2><b>Price: </b><font color="990000">&pound;239.00</font></h2>

But if you look at the source in a web browser or web debug utility, the response from the server is always in the second format. It is as if screen scraper is sometimes replacing the formatting, and this causes my extractor pattern to sometimes fail.

HTML tidy is turned off, so I can't see where these 'span' tags are coming from; they are certainly not being returned by the site.

Can anyone help ?

Yeah, that still looks as if

Yeah, that still looks as if it's being tidied. Usually that's where those goofy characters come from.

I know it sounds silly, but can you verify that tidy is turned off, save, restart screen-scraper, run it again, and check again. Sometimes we've seen weird things with settings not being saved (though usually that happens when the server and workbench are both running from the same installation of screen-scraper).

Can you give me the URL to that particular page? (or if it requires a search to be performed with POST parameters, tell me the process.)

Perhaps we've got a bug with disabled HTML tidy in certain situations.

Tim

Thanks for that. I managed

Thanks for that. I managed to get tidying turned off by your method (also required upgrade to 4.5). I now get consistent results.

Which is great... ...but it will mean I now have to re-do all my extractor patterns, as most were created with tidy on.

Leaving tidy on still produces different results for similar pages.

Eg http://tinyurl.com/scrapeclassic gives


Absolute Price: £1349.00

http://tinyurl.com/scrape2a gives


Absolute Price: £239.00

though Firefox gives the source as


Absolute Price: £1349.00



Absolute Price: £239.00

So tidying is still inconsistent.

Any ideas, before I redo all my patterns ???

Cheers.

I was just proxying those two

I was just proxying those two urls, and they come out the same when I turn off tidying.

Something of note is that the second scrapeableFile will fail to tidy the HTML if tidying is left on, while the first link does not fail, and will successfully return tidied HTML.

That first scrapeableFile is certainly still tidying... Do you run the server simultaneous to running the workbench? I'm at a loss for what to think. "Tidy HTML" is a per-scrapeableFile setting, so make sure it's unchecked on all of the scrapeableFiles in question.

>I was just proxying those

>I was just proxying those two urls, and they come out the same when I turn off tidying.

Agreed.

>Something of note is that the second scrapeableFile will fail to tidy the HTML if tidying is left on, while the first link does not fail, and will successfully return tidied HTML.

Forgive my ignorance, but I'm not sure what you mean?

>That first scrapeableFile is certainly still tidying... Do you run the server simultaneous to running the workbench? I'm at a loss for what to think. "Tidy HTML" is a per-scrapeableFile setting, so make sure it's unchecked on all of the scrapeableFiles in question.

I'm only running the standalone workbench at the moment. Are you sure 'tidy' is set per scrapeable file? I can only find the global option in settings.

Thanks for your help so far...

Also, just found in the log

Also, I just found 'Sorry, tidying HTML failed. Returning the original HTML' in the log for one URL but not the other. So that would explain it. But why would tidying fail for one and not the other when they are almost identical?

I also just read that 'tidy per scrapeable file' is only in Professional, not Basic. I guess I need to upgrade?

Ah, I am sorry for

Ah, I am sorry for overlooking this. Yes, if you have the Pro / Enterprise version, then there is a checkbox on the "advanced" tab of each scrapeableFile. The Basic edition is intended to be just that: for scraping basic pages which don't require attention to special Unicode characters, etc.

Tidying fails at different times for rather subtle reasons. The long and short of it is that if the HTML is badly malformed at some point and tidy hits an error it can't recover from, then it'll fail entirely and return the unaltered HTML to you.
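To make the "fail entirely and return the original" behaviour concrete, here is a rough Python sketch. It is not screen-scraper's code: `toy_tidy` and `tidy_or_original` are hypothetical names, and the toy tidier only imitates the font-to-span rewrite seen in this thread and gives up on unbalanced angle brackets.

```python
def toy_tidy(raw_html):
    """Hypothetical stand-in for a tidier; NOT the real tidy implementation."""
    # Assumption for illustration: unbalanced angle brackets count as "malformed".
    if raw_html.count("<") != raw_html.count(">"):
        raise ValueError("malformed HTML")
    # Imitate the rewrite reported in this thread: font/color becomes span/style.
    rewritten = raw_html.replace('<font color="990000">', '<span style="color: 990000">')
    return rewritten.replace("</font>", "</span>")


def tidy_or_original(raw_html, tidy=toy_tidy):
    """Return tidied HTML, or the untouched input if tidying raises an error."""
    try:
        return tidy(raw_html)
    except Exception:
        # All-or-nothing fallback: the caller gets the original markup back.
        return raw_html


well_formed = '<h2><font color="990000">&pound;239.00</font></h2>'
malformed = '<h2><font color="990000">&pound;239.00</font</h2>'

print(tidy_or_original(well_formed))  # span/style markup
print(tidy_or_original(malformed))    # unchanged font/color markup
```

Two pages that look almost identical can therefore come back with different markup if only one of them trips the tidier.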

>>> I guess I need to upgrade?
Well, there is a sale, currently :)

Sorry for the trouble

Tim

Thanks again. I've not found

Thanks again.

I've not found much detail about what 'tidy' actually does, but from what you imply, it's only really needed for special pages, e.g. those with Unicode?

I guess simple pages would not get much benefit from its use ?

Cheers.

Well, it's actually a

Well, it's actually a hindrance to those pages that use Unicode characters.

What it's good for is simplifying HTML and removing whitespace from it. For instance, if the developers of that website decide to put in one extra space or newline, then your extractor pattern is broken. Tidy will remove that stuff, and try to contract tag attributes (such as in <font color="red" margin="2px">) into a simple "style" attribute. That's why you were seeing the "font" tag and its "color" attribute turning into a "span" tag with a "style" attribute. "Font" tags are by all means deprecated, when held against the CSS styling movement. Therefore it changed "font" to "span", and just gave the "span" tag some inline CSS.

This normally helps out a lot, since any variable number of tag attributes can usually all contract into one inline "style" attribute.

The problem you're encountering is that you're using Basic, and one of the pages fails to Tidy from time to time, so you're ending up with two sets of HTML. If the website had all tidied HTML, it wouldn't be a problem. If the website had all untidied HTML, it wouldn't be a problem. But yours is mixed :P (just to make us all mad, I'm sure)

If you want to get really ambitious, you could actually make a token over the HTML tag name that accepts either "span" or "font", like the following isolated example:

<~@font_or_span@~~@junk_parameters@~>~@PRICE@~

~@font_or_span@~'s token regular expression should be font|span.

At least that way, you wouldn't have to actually purchase a full professional product if you don't have the means to do that.
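If you want to try the same idea outside screen-scraper, here is a rough Python equivalent of that pattern. It is only a sketch: `PRICE_PATTERN` and `extract_price` are made-up names, and the regex stands in for the extractor pattern tokens rather than reproducing how screen-scraper evaluates them.

```python
import html
import re

# Hypothetical regex standing in for the extractor pattern above:
#   (?:font|span)  plays the role of ~@font_or_span@~
#   [^>]*          plays the role of ~@junk_parameters@~
#   ([^<]+)        plays the role of ~@PRICE@~
PRICE_PATTERN = re.compile(
    r"<(?:font|span)\b[^>]*>([^<]+)</(?:font|span)>",
    re.IGNORECASE,
)


def extract_price(page_html):
    """Return the price text from tidied or untidied markup, or None if absent."""
    match = PRICE_PATTERN.search(page_html)
    if match is None:
        return None
    # Decode entities such as &pound; so both variants yield the same text.
    return html.unescape(match.group(1)).strip()


# The two variants quoted at the top of this thread.
tidied = '<h2><b>Price:</b> <span style="color: 990000">£1349.00</span></h2>'
untidied = '<h2><b>Price: </b><font color="990000">&pound;239.00</font></h2>'

print(extract_price(tidied))
print(extract_price(untidied))
```

The alternation is the whole trick: one pattern covers both the "font" and the "span" form, so it no longer matters whether tidying succeeded for a given page.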

Thanks again. I've got round

Thanks again.

I've got round it for now by turning off tidying and re-doing my extractor patterns, which didn't take as long as I thought.

We'll consider buying the pro version - it depends on how seriously my bosses are taking this project - whether it's as vital as they say, or whether it will be forgotten in a month!

Cheers!