Character encoding problem, and writing to file...

Hi, first of all, great program. Keep up the good work, guys and girls!

Now to my question: I'm having trouble extracting some data that is encoded either in UTF-8 or in some other encoding. I can't find the encoding declared in the HTML code.
I read the forum and tried all kinds of different solutions that you posted, but I still couldn't make it work.
I get a question mark (?) in place of every special character. For instance, this is the text on the web page: "Preduzeće bavi se informatičkim inžinjeringom". In my text file and in the screen-scraper log I get "Preduze?e bavi se informati?kim in?injeringom".

This is an example URL: http://www.adresar.ba/detalji_klijenta.aspx?id=22117

I tried changing the encoding in Options -> Settings -> General -> Default Character Set to UTF-8, windows-1250, and ISO 8859-2.

I also tried changing the "write to file" script to:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(outputFile),"windows-1250");
or to this one:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"windows-1250"));

But I still get the question marks...

What am I doing wrong?
Please help me.
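
For reference, here is a minimal sketch of writing text with an explicit encoding, in the spirit of the two writer lines above. It is standalone Java, not a screen-scraper script, and the class and method names (EncodingWriteSketch, writeText, roundTrip) are made up for illustration. It also shows one common way question marks appear: encoding into a character set that lacks the characters. ISO-8859-1 is used for that part only because every JVM supports it; whether this is the actual cause in this case is an assumption.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingWriteSketch {

    // Writes text to a file with an explicit charset instead of the platform default.
    static void writeText(Path outputFile, String text, Charset charset) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(outputFile), charset))) {
            out.write(text);
        }
    }

    // Encodes and decodes with the same charset. Characters the charset
    // cannot represent are replaced during encoding, typically with '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws IOException {
        // "Preduzeće bavi se informatičkim inžinjeringom", written with escapes
        // so the source file's own encoding does not matter.
        String text = "Preduze\u0107e bavi se informati\u010dkim in\u017einjeringom";

        Path file = Files.createTempFile("encoding-sketch", ".txt");
        writeText(file, text, StandardCharsets.UTF_8);
        String readBack = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(text)); // prints true

        // ISO-8859-1 has no ć, č or ž, so each one is replaced on encoding.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));
        // prints Preduze?e bavi se informati?kim in?injeringom
    }
}
```

If the string already contains question marks before it reaches the writer, changing the writer's encoding cannot bring the original characters back.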

ivan82 on 11/29/2008 at 7:13 am

screen-scraper public support

Often, the problem is just

Often, the problem is just that you are not using a font that can display the characters. When we scrape Chinese or Japanese text, all we have to do is use a "Unicode" font.

Windows comes loaded with the Arial Unicode MS and Lucida Sans Unicode fonts.

If you don't have one of those, you can try googling for a full Unicode font. They are quite large, though, so be prepared for a 22 MB download if you have to go find one!

After changing the font in the screen-scraper settings, restart the program to have the change take effect.

Does that solve the problem?

timv on 12/01/2008 at 2:21 pm

Arial Unicode MS

Hi Tim,
I changed the font to Arial Unicode MS (True Type), but I still get the same problem.
Result in the log and the CSV export:

BRA?E JUGOVI? 45

ivan82 on 12/03/2008 at 11:03 am

You'll have to check on the

You'll have to check the data earlier than the CSV in order to find this issue. You've got three steps before the CSV where things could be going wrong:

1. After setting SS's font and restarting, check the "Last Response" tab of the scrapeableFile with the data in question. Don't display it in the browser, just examine it. Do the characters appear on this page?
2. Push the "Apply pattern to last extracted data" button on your extractor pattern for this data. Do the characters appear properly in the dataSet?
3. Open the output in a basic text editor like Notepad (making sure you're on a Unicode font). Do the characters appear properly?

Somewhere down the line it's breaking. We need to figure out which step is the broken link in the chain.
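
One way to tell a display (font) problem from a data problem at any of these steps is to print the code point of each character. The sketch below is standalone Java with made-up names (CodePointDump, describe), not part of screen-scraper. The idea is that if the value holds U+0107 (ć) the data is intact and only the display is off, while U+003F means the character was already replaced by a real question mark before that step.

```java
public class CodePointDump {

    // Returns each character followed by its Unicode code point,
    // for example "a(U+0061) ?(U+003F)".
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            sb.appendCodePoint(cp).append(String.format("(U+%04X) ", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Intact data: the third character is U+0107 even if the font cannot draw it.
        System.out.println(describe("ze\u0107e"));
        // Damaged data: the third character is a literal question mark, U+003F.
        System.out.println(describe("ze?e"));
    }
}
```

Running the same check on the value at each step should show where U+003F first appears.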

timv on 12/04/2008 at 2:36 pm
