Scraping a foreign-language site
The sites I’m trying to scrape are Korean (and so use Korean fonts). I have set ‘Default character set’ to ‘euc_kr’ and ‘Default font’ to ‘Arial Unicode MS’. I do get token results from the scrapeable file, but they show up as unreadable text (symbols and squares). When I transfer the token results to a database, they display correctly in Korean. But this isn't good enough. I really need to see the results in the scraper program before they are transferred to the database, so that I know exactly what is being scraped.
I can solve this problem by un-checking ‘Tidy HTML after scraping’ in the Advanced tab. However, that creates a new problem: no results are found. The error message reads ‘Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.’
Does anyone know how I can see the results in readable Korean in the scraper program without transferring them to a database?
Your help is very much appreciated.
Thank you.
Helpful foreign language tool
When scraping sites in foreign languages, there are a few tools available to you. One that we have recently come across is a translation addon for Firefox that allows you to convert Chinese pages to English. The addon can be found at https://addons.mozilla.org/en-US/firefox/addon/3349
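On the Korean display question above, it may help to see why correctly scraped text can still look like symbols. The following is only a minimal sketch in standalone Python, not anything from the scraper program itself; the sample string is made up for illustration. It shows that the same EUC-KR bytes are readable or garbled depending purely on which character set is used to decode them, which would be consistent with the database showing proper Korean while the scraper's display does not.

```python
# Hypothetical sample text standing in for a scraped token value.
sample = "안녕하세요"

# The bytes as an EUC-KR encoded page would deliver them.
raw = sample.encode("euc_kr")

# Decoding with the wrong character set yields unreadable symbols,
# even though the underlying bytes are intact.
garbled = raw.decode("latin-1")

# Decoding with the page's actual character set restores the text.
readable = raw.decode("euc_kr")

print("wrong charset:", garbled)
print("right charset:", readable)
```

If the second line prints readable Korean, the data itself is fine and the issue is limited to how the viewing program decodes or renders it.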