XML file with Shift_JIS character set
Hi Guys,
After scraping a website that uses the Shift_JIS character set, I have what is probably a simple problem to fix. In the screen-scraper log the results look correct:
//_LINK_: KN2300060600394539
//ENTITY_NAME: KOKO・TOYOTA
//ENTITY_ADDRESS: 〒471-0034 愛知県豊田市小坂本町4丁目1−4
//PHONE: 0838-26-5200
//_LINK_: KN3500060700059393
//ENTITY_NAME: ファミリーtoyota
//ENTITY_ADDRESS: 〒758-0011 山口県萩市大字椿東無田ケ原2884−1
//PHONE: 0120-060861
//_LINK_: KN2307011300001766
//ENTITY_NAME: トヨタすまいるライフ株式会社/レジデンス・THE・TOYOTAマンションパビリオン
//ENTITY_ADDRESS: 〒471-0878 愛知県豊田市下林町1丁目3−3−1501
//PHONE: 0565-37-8567
but the XML file output looks like this:
<_LINK_>KN2300060600394539
<_LINK_>KN3500060700059393
<_LINK_>KN2307011300001766
Can you please help me sort out this problem? How should I set up the XmlWriter?
Best Regards,
Radek
We checked, and as of right now the XmlWriter doesn't allow setting the character set. It should be there, though, so we're going to add it. Watch the blog for a note when we release an alpha version with this feature in the next day or two.
We have this added in version 5.5.3a. If you upgrade to this version, you can use this session to see how it works.
First, the site serves its pages in Shift_JIS, but the writer needs to be set to UTF-8 to output the characters correctly.
Copy the text below into your editor, save the file as "XML Writer.sss", and import it into screen-scraper.
<scraping-session use-strict-mode="true"><script-instances><script-instances when-to-run="20" sequence="1" enabled="true"><script><script-text>xmlWriter =
new com.screenscraper.xml.XmlWriter
(
"output/test.xml",
"root_element",
"This is the root element",
null,
//"Shift_JIS"
"UTF-8"
);
xmlWriter.addElement( "foo", session.getv( "SAMPLE" ) );
xmlWriter.close();</script-text><name>XML Writer--go</name><language>Interpreted Java</language></script></script-instances><owner-type>ScrapingSession</owner-type><owner-name>XML Writer</owner-name></script-instances><name>XML Writer</name><notes></notes><cookiePolicy>0</cookiePolicy><maxHTTPRequests>1</maxHTTPRequests><external_proxy_username></external_proxy_username><external_proxy_password></external_proxy_password><external_proxy_host></external_proxy_host><external_proxy_port></external_proxy_port><external_nt_proxy_username></external_nt_proxy_username><external_nt_proxy_password></external_nt_proxy_password><external_nt_proxy_domain></external_nt_proxy_domain><external_nt_proxy_host></external_nt_proxy_host><anonymize>false</anonymize><terminate_proxies_on_completion>false</terminate_proxies_on_completion><number_of_required_proxies>5</number_of_required_proxies><originator_edition>2</originator_edition><logging_level>1</logging_level><date_exported>April 28, 2011 10:09:33</date_exported><character_set>Shift_JIS</character_set><scrapeable-files sequence="1" will-be-invoked-manually="false" tidy-html="jericho"><last-scraped-data></last-scraped-data><URL>http://www.phdcc.com/fiscd/japan.htm</URL><BASICAuthenticationUsername></BASICAuthenticationUsername><last-request></last-request><name>Sample</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>&lt;TITLE&gt;
~@SAMPLE@~
&lt;/TITLE&gt;
</pattern-text><identifier>Sample</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^&lt;&gt;]*</regular-expression><identifier>SAMPLE</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Sample</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Sample</owner-name></script-instances></scrapeable-files></scraping-session>
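In case it helps to see why the writer is set to UTF-8 even though the site is Shift_JIS: once the page bytes have been decoded, the text is ordinary Unicode, and the output file can use whatever encoding you choose as long as the XML declaration matches. Here is a rough plain-Java sketch of that idea. It does not use screen-scraper's XmlWriter at all; the class name CharsetDemo, its helper methods, and the sample string are made up for illustration, and it assumes your JVM ships the Shift_JIS charset (standard JDKs normally do).

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of screen-scraper.
class CharsetDemo {
    // Decode raw page bytes using the character set the site declares.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Escape the characters that would otherwise break XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Write a single-element XML file as UTF-8 with a matching declaration.
    static void writeUtf8Xml(Path out, String element, String text) throws IOException {
        try (Writer w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<" + element + ">" + escape(text) + "</" + element + ">\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // Katakana "famirii" followed by "toyota", written with Unicode
        // escapes so the source file encoding does not matter.
        String name = "\u30D5\u30A1\u30DF\u30EA\u30FCtoyota";

        // Simulate the bytes as the Shift_JIS site would send them.
        byte[] fromSite = name.getBytes(Charset.forName("Shift_JIS"));

        // Decode with the site's charset, then write the file as UTF-8.
        String decoded = decode(fromSite, "Shift_JIS");
        Path out = Files.createTempFile("test", ".xml");
        writeUtf8Xml(out, "ENTITY_NAME", decoded);

        System.out.println("Wrote " + out);
    }
}
```

The key point is that the input charset (Shift_JIS) and the output charset (UTF-8) are independent; trouble starts only when bytes are decoded or written with a charset that does not match what they really are.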
Thank you very much, Jason.
Radek
P.S. Thank you for the very quick response. If you don't mind, could you show me a syntax example of how to use it now?
Best Regards