Character encoding problem
I recently upgraded Screen-Scraper from version 4.5 to 5.5, and now I have some problems with character encoding, especially with the Cyrillic alphabet. I'm trying to harvest a site that uses UTF-8 encoding. I also set the encoding manually with the 'Settings.setDefaultCharacterSet("UTF-8");' function, but in the 'Last response' tab I get:
7421019 УÑлуги в облаÑÑ'и архиÑ'екÑ'уры прочие
instead of:
7421019 Услуги в области архитектуры прочие
The logs are:
Starting scraper.
Running scraping session: xxx
Processing scripts before scraping session begins.
Processing script: "xxx"
Using the following OS default encoding: UTF-8
Using the following SS defaultCharacterSet: UTF-8
Processing script: "xxx"
Switching SS defaultCharacterSet to: UTF-8
Processing scripts after scraping session has ended.
Processing script: "xxx Search"
[...]
Running scraping session: xxx
Processing scripts before scraping session begins.
Processing script: "xxx"
Using the following OS default encoding: UTF-8
Using the following SS defaultCharacterSet: UTF-8
Processing script: "xxx"
Switching SS defaultCharacterSet to: UTF-8
Processing scripts after scraping session has ended.
Processing script: "xxx Search"
[...]
Also, the default character set is UTF-8 and the default font is Courier 10 Pitch.
In version 4.5, the session works properly with the same settings.
The items in last response
The items in last response are HTML entities, and they are most likely put there by JTidy. I generally have to turn Tidy off in cases like this.
Do you have the encoding set on the scrapeable file level?
I know that they are HTML
I know that they are HTML entities, but I don't think they are the correct ones: when I tried to decode them, the result wasn't the original Russian text.
I also tried without Tidy, and I set the encoding at the scrapeable file level (I saw that this is a new feature in recent versions of Screen-Scraper). Nothing works...
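For anyone who wants to reproduce the symptom, here is a minimal Python sketch of one way such "wrong" entities could arise. It rests on two assumptions that are not confirmed anywhere in this thread: that the UTF-8 bytes get misread as ISO-8859-1 (one character per byte), and that each resulting non-ASCII character is then written out as a numeric entity. The helper `decode_numeric_entities` is hypothetical and exists only for this illustration.

```python
import re

original = "Услуги"  # sample word taken from the page text

# Assumption: the UTF-8 bytes are misread as ISO-8859-1, one character per byte.
misread = original.encode("utf-8").decode("latin-1")

# Assumption: each non-ASCII character is then written as a numeric entity.
entities = "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in misread)

def decode_numeric_entities(text):
    """Replace decimal references such as &#209; with the character they name."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

# Decoding the entities gives back the misread characters, not the Russian text.
decoded = decode_numeric_entities(entities)

# Undoing the misreading: back to the raw bytes, then decode them as UTF-8.
repaired = decoded.encode("latin-1").decode("utf-8")

print(entities)
print(decoded == original, repaired == original)
```

If the entities in the last response behave like `entities` here, decoding them alone will never return the Russian text, which would be consistent with what I saw. It does not prove that this is what Tidy does.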
Ionut, When you disable Tidy
Ionut,
When you disable Tidy, are you still getting HTML entities in the last response? If not, then this is going to be a character encoding issue.
Character encoding can be a little tricky to figure out. One thing that makes it easier is to make sure you have the proper language pack installed for your operating system. Then, make sure you have the right font installed. We usually recommend Arial Unicode MS, as it's a pretty inclusive font.
Once you have the right font in place, you need to experiment until you find the right character set. Since you're dealing with Russian characters, I would recommend trying each of the Supported Encodings related to the Cyrillic alphabet, for example Cp1251 and ISO8859_5. But don't limit yourself to the most obvious ones. You may be surprised to find that the correct encoding isn't the first, most obvious choice.
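One way to run that experiment outside of screen-scraper is to decode the same raw bytes with each candidate character set and compare the results by eye. The following is only a sketch, in Python rather than a screen-scraper script. The sample bytes, the candidate list, and the function name `decode_with_candidates` are assumptions made for illustration; `cp1251` and `iso8859_5` are Python's codec names for the Cp1251 and ISO8859_5 encodings mentioned above.

```python
# Hypothetical sample: the UTF-8 bytes of a phrase from the page.
# In practice these would be the raw, undecoded bytes of the response.
raw = "Услуги в области архитектуры прочие".encode("utf-8")

# Candidate character sets to try (an illustrative list, not exhaustive).
CANDIDATES = ["utf-8", "cp1251", "iso8859_5", "koi8_r"]

def decode_with_candidates(data, candidates=CANDIDATES):
    """Map each charset name to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this character set.
            results[name] = None
    return results

results = decode_with_candidates(raw)
for name, text in results.items():
    print(f"{name:10} {text!r}")
```

Most single-byte character sets will decode almost any byte sequence without raising an error, so a successful decode proves little. The right candidate is the one whose output actually reads as Russian.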
I haven't tried it yet, but you could try the "Universal Encoding Detector". It's a Python 2 app that claims to be able to automatically detect the encoding of a site.
Here's our FAQ on the topic for your reference, as well.
-Scott
If I disable Tidy in the last
If I disable Tidy, in the last response tab I get the Russian text, not the HTML entities.
This is a bit odd, because in version 4.5, with the 'Tidy HTML after scraping?' option checked and the same default font, the response is not encoded.
We made some changes to how
We made some changes to how Tidy works between 4.5 and 5.5. Let us know if you were able to find the right encoding for the Cyrillic characters.
I use UTF-8 and without Tidy
I use UTF-8, and without Tidy the Cyrillic characters are displayed properly in the 'Last Response' tab.