How do I handle international character encoding issues?

screen-scraper can work with a variety of international character sets, including those that use Asian characters and even writing systems that run right to left. There are certain considerations that should be taken into account, though, when dealing with non-Roman character sets.

Disable tidying or use the Jericho tidier

By default screen-scraper will tidy HTML in order to make it more consistent, which aids in extracting content. The default tidier, JTidy, however, doesn't work with many non-Roman character sets. As such, you'll often need to change this setting so that screen-scraper either doesn't tidy at all or uses the Jericho tidier, which does handle international character sets correctly. This setting can be changed for any given scrapeable file under its "Advanced" tab.

screen-scraper will do its best to determine the character set of HTML content, but servers often indicate a character set that differs from the actual character set of the content. This is why, when dealing with international character sets, you may see garbled characters or question marks in the content. To deal with this, screen-scraper provides a few methods for overriding the character set indicated by a server.
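
To make the symptom concrete, here is a minimal standalone Java sketch (not screen-scraper code; the class and method names are made up for illustration). It shows how the same bytes turn into garbled characters when decoded with the wrong character set, and how question marks appear when text is forced into a character set that cannot represent it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses the standard Java charset APIs.
public class CharsetMismatchDemo {

    // Decode raw bytes using the named character set.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Encode text into a character set and decode it again. Characters the
    // character set cannot represent are replaced (with '?' for ISO-8859-1).
    public static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file
        // encoding does not matter.
        String original = "\u65E5\u672C\u8A9E";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Correct character set: the text survives intact.
        System.out.println(decode(raw, "UTF-8").equals(original));

        // Wrong character set: each of the 9 UTF-8 bytes becomes its own
        // (garbled) character.
        System.out.println(decode(raw, "ISO-8859-1").length());

        // Character set that cannot hold the text: question marks.
        System.out.println(roundTrip(original, "ISO-8859-1"));
    }
}
```

If the content you scrape looks like either of the last two results, the character set being used most likely does not match the one the content was actually written in.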

Manually set the character set

You have three options for altering what character set screen-scraper uses. Here they are listed in order of precedence.

  1. Scrapeable File (Advanced tab or scrapeableFile.setCharacterSet method)
  2. Session (Advanced tab or session.setCharacterSet method)
  3. screen-scraper general setting (Settings dialog or DefaultCharacterSet in screen-scraper.properties file)
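
As a rough illustration of what "order of precedence" means here, the following standalone Java sketch picks the first character set that has been set, checking the scrapeable file, then the session, then the general setting. It assumes the list above runs from highest to lowest precedence. The class and method are hypothetical and are not part of screen-scraper's API or a description of its internals.

```java
// Hypothetical sketch of a "first one set wins" lookup; not screen-scraper code.
public class CharsetPrecedenceDemo {

    // Returns the first non-empty value in the order: scrapeable file,
    // session, general default. Returns null when nothing has been set,
    // meaning no override applies.
    public static String resolve(String fileCharset, String sessionCharset, String defaultCharset) {
        if (isSet(fileCharset)) {
            return fileCharset;
        }
        if (isSet(sessionCharset)) {
            return sessionCharset;
        }
        if (isSet(defaultCharset)) {
            return defaultCharset;
        }
        return null;
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        // The scrapeable file value overrides the other two.
        System.out.println(resolve("Shift_JIS", "UTF-8", "ISO-8859-1"));
        // With no scrapeable file value, the session value is used.
        System.out.println(resolve(null, "UTF-8", "ISO-8859-1"));
        // With neither set, the general setting applies.
        System.out.println(resolve(null, null, "ISO-8859-1"));
    }
}
```

In practice this means a character set configured on a single scrapeable file lets you handle one misbehaving page without changing the session-wide or application-wide value.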

You'll also want to be sure that you have a font selected in the workbench that will correctly render whatever character set you're dealing with. This can be set via the "Settings" dialog box, under "Default font". If you have it installed on your computer, "Arial Unicode MS" is a font that will display virtually all characters.

Once screen-scraper has correctly extracted content, you may also need to indicate a character set when saving that content. For a database this may mean ensuring that the database tables you've created support the character set you're dealing with. When writing information out to a file you may need to specify the character set. For example, the traditional FileWriter constructors always use the platform's default encoding and give you no way to indicate a character set (constructors that accept one were only added in Java 11). As an alternative, you may want to wrap a few different writers together, like this:

BufferedWriter writer = new BufferedWriter(
    new OutputStreamWriter(
        new FileOutputStream("output/my_data.txt"),
        "UTF-8"
    )
);
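
The same wrapping can be tried outside of screen-scraper. The sketch below is a standalone Java example under a few assumptions: the class name, helper methods, and temporary file are all made up for illustration. It writes a short Arabic string as UTF-8, closes the writer with try-with-resources so the content is flushed to disk, and reads it back to confirm nothing was lost.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class; it only uses standard Java I/O classes.
public class Utf8WriteDemo {

    // Write text to a file as UTF-8. try-with-resources closes (and
    // therefore flushes) the writer even if an exception is thrown.
    public static void writeUtf8(File file, String text) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            writer.write(text);
        }
    }

    // Read the whole file back, decoding it as UTF-8.
    public static String readUtf8(File file) throws IOException {
        return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
    }

    // Write the text to a temporary file, read it back, and clean up.
    public static String roundTrip(String text) {
        try {
            File file = File.createTempFile("my_data", ".txt");
            try {
                writeUtf8(file, text);
                return readUtf8(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Arabic greeting, written as escapes so the source file encoding
        // does not matter.
        String text = "\u0645\u0631\u062D\u0628\u0627";
        System.out.println(roundTrip(text).equals(text));
    }
}
```

Whichever approach you use, remember to close the writer when you're done; otherwise buffered content may never reach the file.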

If you're having trouble with a particular site, please feel free to contact us so that we can look into it for you.