Why is Tidy HTML Failing?

I have created a series of scrapeable files with extractor patterns that depend on Tidy HTML successfully cleaning responses. A recursive pattern is used to obtain subcategory data, which appears differently as successive subcategories are obtained. As such, rewriting my logic to handle responses that have not been Tidy'd successfully isn't an option.

Tidy HTML fails with the message "Sorry, tidying HTML failed. Returning the original HTML." when attempting to Tidy the following URL: http://www.thebay.com/stores/shop/catalog/en/bay/10001/25513213/Category. However, it successfully processes http://www.zellers.com/stores/shop/catalog/en/zellers/10001/1000002/Cate..., which has a nearly identical HTML structure.

In addition, it appears as though the document actually is Tidy-able. See the following URL for a web based Tidy application, which is able to successfully Tidy both URLs: http://infohound.net/tidy/tidy.pl?_function=tidy&_url=http%3A%2F%2Fwww.t...

I'd like to find out why Screen-Scraper 4.0, Professional Edition is not able to Tidy http://www.thebay.com/stores/shop/catalog/en/bay/10001/25513213/Category. Any suggestions or feedback would be greatly appreciated.

Re: Why is Tidy HTML Failing?

Hi,

I wish I had a better response for you, but, unfortunately, a page can either be tidied or it can't. There are lots of tidiers out there (as you've discovered), and the one we use obviously isn't perfect. In some cases the very same dynamic page may be tidied successfully with certain data in it, but fail with other data. One approach is to disable tidying if you know you're dealing with a page that tidies only some of the time. Unfortunately, you might discover this after the fact, which might mean redoing certain extractor patterns. Another approach is to include two sets of extractor patterns: one that works when the page tidies and one that works when it doesn't. This might work best in your case.
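
To illustrate the two-pattern idea outside of screen-scraper, here is a minimal Python sketch. It is not screen-scraper's API, and the patterns, the function name extract_categories, and the sample markup are all hypothetical. It assumes tidied output has lowercase tags and quoted attributes, while raw output may not. The strict pattern is tried first, and the tolerant one is used only if the strict one finds nothing.

```python
import re

# Hypothetical pattern for tidied markup: lowercase tags, quoted attributes.
TIDIED_PATTERN = re.compile(r'<a href="(?P<url>[^"]+)">(?P<name>[^<]+)</a>')

# Hypothetical pattern for raw markup: any case, optional quotes,
# extra attributes, and stray whitespace around the link text.
RAW_PATTERN = re.compile(
    r"<a\s+href\s*=\s*['\"]?(?P<url>[^'\"\s>]+)['\"]?[^>]*>"
    r"\s*(?P<name>[^<]+?)\s*</a>",
    re.IGNORECASE,
)


def extract_categories(html):
    """Return (url, name) pairs from the first pattern that matches anything."""
    for pattern in (TIDIED_PATTERN, RAW_PATTERN):
        matches = [(m.group("url"), m.group("name")) for m in pattern.finditer(html)]
        if matches:
            return matches
    return []
```

Note that the first pattern to produce any match wins, so a page that is only partly well formed would return just the strict matches. If that is a concern, the results of both patterns could be merged instead.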

We've considered building in a feature to screen-scraper that will allow you to select the tidier you'd like to use. In the interest of backward compatibility, we can't just switch to a new tidier (which would likely break old scraping sessions), but we can provide access to a different tidier which would likely be more robust than our current tidier. I'll consider this posting a vote for that feature.

Hopefully that addresses the issue. Feel free to reply if I can clarify anything.

Thanks,

Todd Wilson