Scraping data in the form of Korean characters
Hello,
I am trying to scrape a site containing Korean characters.
I have the Arial Unicode MS font selected, and I have tried both UTF-8 and EUC-KR encoding. I have the "Tidy HTML after scraping" option turned off; otherwise the Korean characters are displayed as ?????'s.
When I run the scrape, I can see the Korean characters in the "Last Response" tab, and I have an extractor pattern that extracts the Korean characters and displays them correctly in the scraping session log.
The problem is that I get ??'s in the output .txt/.csv file rather than the Korean characters.
Any idea what I'm doing wrong?
try "no tidy" on these asia
try "no tidy" on these asia websites.
Hi-- I was trying to dig up the example I've used a few times, but I couldn't find it :P
The problem is that when you write to a file, most languages fall back to whatever your local system's default encoding is. So you have to specify the encoding explicitly on the output stream, or the text gets misinterpreted when it's written out.
Try this:
// Interpreted Java
import java.io.*;
// Open the output stream with an explicit character encoding
// instead of relying on the platform default.
File outFile = new File("path/to/file.txt");
OutputStreamWriter fileWriter = new OutputStreamWriter(new FileOutputStream(outFile, true), "EUC-KR");
fileWriter.write("some string data");
fileWriter.close();
I haven't checked whether the encoding name you used ("EUC-KR") is valid in Java, but I would assume that it is.
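If you want to double-check that on your end, this one-off snippet (just a quick test, not part of the scrape itself) should print true on a typical desktop JRE, where EUC-KR ships with the extended charsets:
// Interpreted Java
// Prints whether the running JVM recognizes the "EUC-KR" charset name.
System.out.println(java.nio.charset.Charset.isSupported("EUC-KR"));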
Let me know if that helps! You've already gotten the hardest parts out of the way!
Tim
Hi Tim,
Tried this with both UTF-8 and EUC-KR specified, with no luck.
My log:
Starting scraper.
Running scraping session: Korail
Processing scripts before scraping session begins.
Scraping file: "Form Submission"
Form Submission: Preliminary URL: http://logis.korail.go.kr/driveinfo/TrainInfop.jsp
Form Submission: Using strict mode.
Form Submission: Resolved URL: http://logis.korail.go.kr/driveinfo/TrainInfop.jsp?opsDd=20090426&trnNo=1781
Form Submission: Sending request.
Form Submission: Processing scripts before all pattern applications.
Form Submission: Extracting data for pattern "Untitled Extractor Pattern"
Form Submission: The following data elements were found:
Untitled Extractor Pattern--DataRecord 0:
FORM_SUBMITTED_TEXT=여객
Storing this value in a session variable.
Form Submission: Processing scripts after a pattern application.
Form Submission: Processing scripts after all pattern applications.
Processing script: "Write extracted data to a file"
Writing data to a file.
Processing scripts after scraping session has ended.
Scraping session "Korail" finished.
My script:
// Interpreted Java
import java.io.*;
// Output a message to the log so we know that we'll be writing the text out to a file.
session.log( "Writing data to a file." );
// Create an OutputStreamWriter that we'll use to write out the text with an explicit encoding.
File outFile = new File("form_submitted_text.xls"); //or .txt or whatever
OutputStreamWriter fileWriter = new OutputStreamWriter(new FileOutputStream(outFile, true), "UTF-8");
// Write out the text.
fileWriter.write(session.getVariable( "FORM_SUBMITTED_TEXT" ));
// Close the file.
fileWriter.close();
The resulting output was "ì—¬ê°" instead of "여객". Am I doing anything wrong?
Bummer. The problem obviously stems from the file-writing leg of the scrape. For the script, I would stick to whatever character encoding the scraping session itself is set to, so the two stay consistent.
This is kind of a stretch, but do you have another editor you could open it with, one that can handle unusual encodings? (Notepad may or may not do the job.) Even if Excel can normally display Korean characters, I wonder if Excel is just trying to guess the encoding and failing.
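For what it's worth, "ì—¬ê°" is exactly what you get when the UTF-8 bytes for "여객" are decoded as a Western charset like windows-1252, which would mean the file itself is probably fine and whatever is opening it is guessing wrong. Here's a little standalone sketch (a demonstration only, not part of the scrape) that reproduces the effect:
// Interpreted Java
// Encode the Korean text as UTF-8, then decode those same bytes
// with windows-1252 to reproduce the garbage you're seeing.
String korean = "여객";
byte[] utf8Bytes = korean.getBytes("UTF-8");
String garbled = new String(utf8Bytes, "windows-1252");
System.out.println(garbled); // prints "ì—¬ê°" plus an invisible control character
If that is what's happening, one common workaround when the file is headed for Excel or Notepad is to write a UTF-8 byte-order mark as the very first character of the file -- fileWriter.write('\uFEFF'); -- which usually convinces them to decode it as UTF-8.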
I wish I could help more-- I don't seem to have an option for selecting the EUC-KR encoding, so screen-scraper isn't detecting it automatically, and I can't actually scrape the page and see the proper Korean characters on my end.
Let me know if that changes anything for you-- I'm not sure what else to do without being able to properly test on my end.
Tim