Binary Data
I'm trying to collect the same data as in the following forum topic:
http://community.screen-scraper.com/node/1131
I found the JS file okay, and used the referrer and everything, but when I scrape the file, all I get is this:
HTTP/1.1 200 OK
Content-Type: application/javascript
Last-Modified: Sat, 05 Dec 2009 05:30:21 GMT
ETag: "12af-44aba-479f4853a1540"
Accept-Ranges: bytes
Content-Encoding: gzip
Date: Sun, 06 Dec 2009 07:04:02 GMT
Expires: Mon, 07 Dec 2009 07:04:02 GMT
Cache-Control: max-age=86400
Server: Apache/2.2.13 (Unix) mod_jk/1.2.27
Content-Length: 42476
Vary: Accept-Encoding,User-Agent
[Binary Data]
The URL is a bit different now (https://www.bankofthewest.com/static_files/botw2/home/special-publish/br...), but it shows up fine in a browser, even with just cut and paste. What is it that the browser can interpret that screen-scraper is having problems with?
Hi Chris,
Glad to see you're still scrapin'. This was a result of screen-scraper not checking for a specific content-type header. Try it in version 4.5.24a, and let us know if it still doesn't work.
Thanks,
Todd
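To illustrate the kind of check Todd describes, here is a minimal sketch of deciding from the Content-Type header whether a response should be treated as text. It is not screen-scraper's actual logic; the function name `is_textual` and the list of text-like `application/*` types are assumptions made for the example.

```python
# Hypothetical sketch: NOT screen-scraper's real implementation.
# Some "application/*" types are really text and should not be
# reported as binary. This list is an assumption for illustration.
TEXTUAL_APPLICATION_TYPES = {
    "application/javascript",
    "application/x-javascript",
    "application/json",
    "application/xml",
    "application/xhtml+xml",
}


def is_textual(content_type_header):
    """Return True if a Content-Type header value names a text format."""
    if not content_type_header:
        return False  # no header at all: the caller has to inspect the body
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/") or media_type in TEXTUAL_APPLICATION_TYPES


print(is_textual("application/javascript"))
print(is_textual("image/png"))
```

With a check like this, the `application/javascript` response shown in the first post would be handled as text rather than shown as [Binary Data].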
Hi,
I've tried version 4.5.24a but still get [Binary Data]. The strange thing is that the first few levels of a site scrape fine, but then at, for example, the 3rd level I only receive [Binary Data], while the 4th level and beyond are fine again. I've compared the HTML and the headers, and they are all the same. Is there any way around this?
We have version 4.5.36a now; have you tried that? Could you let me get to a page that returns this just so I can test it?
Thank you for your response!
It's still not working with the latest update. Here is the link:
http://tiny.cc/K4q9x
I can scrape other pages, but it is just this page that can't be scraped. It is the same whether logged in or not.
You're right that there is a problem here. The web server isn't providing a content type (which it really should), and our method of determining it is getting it wrong right now. We just came up with an idea though, so watch for version 4.5.38a today or tomorrow, and the fix will be integrated.
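When a server sends no content type, the client has to guess from the body itself. The sketch below shows one common heuristic for that guess. It is only an illustration and not the fix that went into screen-scraper; the function name `looks_like_text`, the sample size, and the 5% threshold are all assumptions.

```python
def looks_like_text(body, sample_size=1024):
    """Guess whether raw response bytes are text (hypothetical heuristic)."""
    sample = body[:sample_size]
    if not sample:
        return True  # an empty body is harmless to treat as text
    if b"\x00" in sample:
        return False  # NUL bytes almost never appear in 8-bit text
    # Count control bytes other than tab (9), newline (10), carriage return (13).
    control = sum(1 for byte in sample if byte < 32 and byte not in (9, 10, 13))
    # Assumed threshold: more than 5% control bytes suggests binary data.
    return control / len(sample) < 0.05


print(looks_like_text(b"<html><body>Hello</body></html>"))
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

A heuristic like this can still be wrong, for example on UTF-16 text, which contains many NUL bytes. That is one reason an explicit Content-Type header from the server is preferable.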
Yep, that worked perfectly. Thanks!
The binary data just refers to the response not being text/HTML. It could be an applet, Flash, image ... hard to tell from here. You could use session.downloadFile() to capture the response to see what it is.
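Once the response has been saved to disk, its first few bytes usually reveal what it is. The following is a rough sketch of that inspection, run outside screen-scraper; the `identify` function and the signature table are assumptions for illustration and cover only a handful of formats.

```python
import gzip

# A few well-known file signatures ("magic numbers"). Not exhaustive.
SIGNATURES = [
    (b"\x1f\x8b", "gzip-compressed data"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"FWS", "Flash (SWF)"),
    (b"CWS", "Flash (SWF, compressed)"),
    (b"\xca\xfe\xba\xbe", "Java class file"),
    (b"PK\x03\x04", "ZIP or JAR archive"),
]


def identify(data):
    """Return a rough label for raw bytes based on their leading signature."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown (possibly plain text)"


# Example: a gzip-compressed JavaScript body is recognized as gzip.
print(identify(gzip.compress(b"var x = 1;")))
print(identify(b"var x = 1;"))
```

To check a saved response, read the file in binary mode and pass its contents to `identify`, e.g. `identify(open(path, "rb").read())`, where `path` is wherever the file was written.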
It's just text, but screen-scraper thinks it's some sort of application. It would be great to be able to tell screen-scraper what the response type should be. Is that possible now?
Chris,
Right now the exceptions are stored internally in screen-scraper. We started off using the W3C's list of content types that would be binary, but over time we have come across these outliers.
Maybe in time we'll make this configurable, but it's such a rarity that it's not a very high priority.
-Scott