FileWriter Output Encoding problems
Hi,
Before I get into my problems I've got to say this is a really impressive program. Really appreciate the flexibility in the basic edition.
I'm scraping a Chinese web site and I'm having trouble getting non-ASCII characters into the final output; I get question marks instead. I've done the following:
- Checked the FAQ, set UTF-8 as my default, Using Arial Unicode for the font
- Checked the forums, learned that HTML Tidy can't be disabled in basic (wish that was in the FAQ!)
- Upgraded to the professional Trial, disabled HTML tidy
- I've tried forcing UTF-8 in my text editor in case there wasn't a BOM
When I apply the extractor patterns to the last scraped data, I see the proper non-ASCII data within the screen-scraper UI. So everything up to that point is good, and I think that covers all the different points in the forums and FAQs. However, when I actually write the data out using FileWriter and out.write, I lose the non-ASCII characters.
Note I'm using OS X. I'm wondering if my problem is that OS X actually defaults to using MacRoman with Java output:
http://developer.apple.com/DOCUMENTATION/Java/Conceptual/Java14Developme...
If that's the case, what's the proper way to get my script to force the output to UTF-8? If that's not the problem here what else could be the cause?
Here's the basic script I'm using, copied from the tutorials:
FileWriter out = null;
session.log( "Writing data to a file." );
// Open up the file to be appended to.
out = new FileWriter( "Cpod_Tome.txt", true );
// Write out the data to the file.
out.write( "
// Close up the file.
out.close();
EDIT:
I found a partial solution. If I replace this line:
out = new FileWriter( "Cpod_Tome.txt", true );
With either this:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("cpodtome.txt"),"UTF-8");
OR:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("cpodtome.txt"),"UTF8"));
Then I can get it to output UTF-8 and the output looks good.
However, OutputStreamWriter works a bit differently from FileWriter. Instead of appending to the file like FileWriter, it just replaces the existing file. I haven't been able to find any code examples that show how I can get OutputStreamWriter to append. Would appreciate any help, thanks!
Found the cause, which is
Found the cause, which is more or less along the lines of what I expected, but it looks like it affects all platforms. FileWriter uses the default system encoding, and apparently this can't be changed.
I just followed the other recent thread on encoding problems and changed this line:
out = new FileWriter( "Cpod_Tome.txt", true );
To this:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("cpodtome.txt"),"UTF-8");
Works like a charm, but perhaps the FAQ should be updated?
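For anyone who wants to see the default-encoding effect described above, here is a minimal standalone sketch (plain Java, not a screen-scraper script). The class name DefaultEncodingCheck and the helper encode are made up for illustration, and US-ASCII is used only as a stand-in for a default encoding that can't represent Chinese characters; it is not necessarily what your system uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of screen-scraper or its tutorials.
public class DefaultEncodingCheck {

    // Encodes the text with the given charset. String.getBytes(Charset)
    // substitutes the charset's replacement byte for characters it cannot
    // represent; for US-ASCII that byte is '?'.
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // The charset the JVM falls back on when none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Two Chinese characters, written as escapes so the source file's
        // own encoding does not matter.
        String chinese = "\u4e2d\u6587";

        byte[] ascii = encode(chinese, StandardCharsets.US_ASCII);
        byte[] utf8 = encode(chinese, StandardCharsets.UTF_8);

        // Each character collapses to a single '?' byte in US-ASCII,
        // while UTF-8 keeps the full multi-byte sequences.
        System.out.println("As US-ASCII: " + new String(ascii, StandardCharsets.US_ASCII));
        System.out.println("UTF-8 byte count: " + utf8.length);
    }
}
```

If the first line printed is something other than UTF-8, that would be consistent with the question marks showing up in the output file.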
Yes, that would be a good
Yes, that would be a good idea. Thanks for doing that research :)
The first fix I posted wasn't
The first fix I posted wasn't quite right, as it doesn't append to the output file like the tutorials do. Here is the proper code:
Replace (from the tutorial):
out = new FileWriter( "filename.txt", true );
With:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("filename.txt", true),"UTF-8");
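To try the append behavior outside of screen-scraper, here is a small self-contained sketch of the same idea in plain Java. The class name Utf8AppendExample and the helper appendUtf8 are invented for this example, and it uses try-with-resources, which I'm assuming may not be available in screen-scraper's script interpreter; inside a script, the explicit out.close() from the tutorial is the safer choice.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of screen-scraper or its tutorials.
public class Utf8AppendExample {

    // Appends text to the file at the given path, always encoding as UTF-8.
    public static void appendUtf8(String path, String text) throws IOException {
        // Passing true as the second FileOutputStream argument opens the
        // file in append mode instead of truncating it.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path, true), StandardCharsets.UTF_8)) {
            out.write(text);
        } // The writer is flushed and closed here, even if write() throws.
    }

    public static void main(String[] args) throws IOException {
        // Each call adds to the end of the file rather than replacing it.
        appendUtf8("filename.txt", "\u4e2d\u6587\n");
        appendUtf8("filename.txt", "second line\n");
    }
}
```

Running main twice should leave four lines in filename.txt, since nothing is overwritten between runs.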