Web site encoding
Hi,
How does screen-scraper know which character encoding to use when scraping a site? Does it use the HTTP response header information? If not, is there a way to tell screen-scraper which encoding to use for a particular site?
Thanks,
Brendan
Web site encoding
Hi AkiPaki,
It's great to hear that it worked out. As for upgrading, we try to encourage people to stay on the stable versions, which is why you had to upgrade to 2.5 (the latest stable) first and then to 2.5.0.2a (the latest unstable; actually, we've just released 2.5.0.3a).
That's an excellent suggestion to set the encoding at the scraping session level. We'll add that to our list. For now, we just wanted to get the feature out in an alpha version so that people could try it out.
Best,
Todd
Web site encoding
Hi AkiPaki,
We've just implemented a feature in screen-scraper that should allow you to handle the characters on this site. If you upgrade to version 2.5.0.2a of screen-scraper, you'll see a couple of new settings in the "Settings" dialog box. These settings let you select the font and character encoding you'd like to use. You may be able to leave the default font as is. For the encoding, I'd suggest trying UTF-8.
Please try that out when you get a chance and let us know how it goes.
Kind regards,
Todd Wilson
Web site encoding
Hi AkiPaki,
As it turns out, we're actually in the process of working through internationalization issues in screen-scraper right now. We'll use the site you give as a test, and I'll post a reply once we have something concrete for you to try out. Thanks for your patience in the meantime.
Kind regards,
Todd Wilson
Web site encoding
Hi Brendan,
We don't actually specify any encoding, so it would be whatever the web site happens to be using. screen-scraper is written in Java, which uses Unicode throughout, so it should be able to handle any character set. Admittedly, though, we've only gone as far as testing with Finnish, which does contain characters outside of the ASCII character set.
Having said all of that, if you encounter situations where screen-scraper doesn't seem to be handling character encodings properly, we'd love to hear about them so that we can resolve any problems (which may include adding the ability to explicitly specify the encoding to use).
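For anyone wondering what honoring the HTTP response header could look like in Java, here is a minimal sketch. It is not screen-scraper's actual code: the class and method names (CharsetSketch, charsetFromContentType, decode) are hypothetical, and the ISO-8859-1 fallback is an assumption based on the historical HTTP/1.1 default for text content.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; not part of screen-scraper.
final class CharsetSketch {

    // Looks for a "charset=" parameter in a Content-Type header value,
    // e.g. "text/html; charset=UTF-8". Returns the fallback when the header
    // is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes the raw response bytes into a Java String (Unicode internally).
    // Assumption: fall back to ISO-8859-1 when the server declares nothing.
    static String decode(byte[] body, String contentType) {
        Charset cs = charsetFromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new String(body, cs);
    }
}
```

The point of the sketch is that the bytes on the wire only become characters once a charset is chosen. The Finnish letter "ä", for example, is one byte in ISO-8859-1 but two bytes in UTF-8, so decoding with the wrong charset produces garbled text even though Java strings themselves can hold any Unicode character.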
Kind regards,
Todd Wilson