Code problem

I am currently testing SS Basic Edition and I have trouble extracting a page that contains German characters like "äöü". Normally these are represented via &auml; &ouml; etc., but this page contains them in "raw" format: http://immoads.oe24.at/. So far I have had no luck trying different encodings and tidy/no tidy settings: the German characters are either replaced by question marks or appear as square symbols.

best regards

set encoding in ss basic

I was under the impression that encoding can be set to different values even in the basic edition. Is that correct? As I said, I have already tried using different settings with no luck.

This is what I get in the Last Response tab with code=ISO-8859-1 / tidy=off:

<title>Wohnung, Haus, Grundst�ck kaufen und mieten - Immobilien in �sterreich

This is what I get in the Last Response tab with code=ISO-8859-1 / tidy=on:

<title>Wohnung, Haus, Grundst?ck kaufen und mieten - Immobilien in ?sterreich

best regards Christian

cpieler, You're absolutely

cpieler,

You're absolutely right. You can set the character set in Basic Edition using session.setCharacterSet(). Jason and I stand corrected.

I did a simple test where I scraped the home page of http://immoads.oe24.at/. I ran a script before the scraping session begins and in that script I called:

session.setCharacterSet( "ISO-8859-1" );

I created an extractor pattern for the page title and the result was:

Wohnung, Haus, Grundst&uuml;ck kaufen und mieten - Immobilien in &Ouml;sterreich

I did not alter the tidier at all. Are you seeing something different if you do the same on your end?

-Scott

cpieler, Thank you for

cpieler,

Thank you for pointing out the issue with screen-scraper Basic Edition not using the character set settings in the settings dialog box. I've asked one of our developers to look into this.

-Scott

Solved

Scott, thank you. That did it! Until now I had used Options/Settings to select the encoding. It appears that this setting is being ignored by SS.

As for tidy on/off: I get your result ONLY with tidy=on. Tidy=off delivers the "raw" German characters in the response. Both cases can be handled.

best regards

Christian

If you were to try

If you were to try Professional or Enterprise Edition, they both have an option to set character encoding, and if you set screen-scraper to use the same encoding as your characters, you should see them correctly.
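To illustrate why the encoding has to match, here is a small stand-alone Java sketch (plain JDK, not screen-scraper code). The class and method names are made up for this example. It takes the ISO-8859-1 bytes of a sample string and decodes them once with the right charset and once with UTF-8. The second half shows one way a "?" can arise in Java, namely encoding to a charset that cannot represent the character; whether that is what happens inside screen-scraper or the tidier is only an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of screen-scraper.
final class CharsetMismatchDemo {

    // Decode the bytes as ISO-8859-1, the encoding they were written in.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decode the same bytes as UTF-8, which is the wrong charset here.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Round-trip through US-ASCII, which has no umlauts at all.
    static String throughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "Grundst\u00FCck in \u00D6sterreich";
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        // Matching charset: the text survives unchanged.
        System.out.println(decodeLatin1(raw).equals(original));

        // Wrong charset: the umlaut bytes are invalid UTF-8 and are
        // replaced by U+FFFD, the replacement character.
        System.out.println(decodeUtf8(raw).contains("\uFFFD"));

        // Unmappable characters are replaced by '?' when encoding.
        System.out.println(throughAscii(original).contains("?"));
    }
}
```

The bytes are the same in all three cases; only the charset used to interpret them changes.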

cpieler, Set your character

cpieler,

Set your character set to ISO-8859-1 and you should see

<h1>Wohnung, Haus und Grundst&uuml;ck zu mieten und kaufen in ganz &Ouml;sterreich</h1>

You can then convert the HTML entities using either session.convertHTMLEntitiesInVariable() or sutil.convertHTMLEntities().
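For anyone curious what such a conversion amounts to, here is a minimal stand-alone Java sketch. It does not show how screen-scraper implements the two methods above; the class name EntitySketch and its small lookup table are assumptions made for illustration and cover only the German entities from this thread. Inside screen-scraper you would simply call the built-in methods.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it is not part of the screen-scraper API.
final class EntitySketch {

    // Deliberately tiny table: only the German named entities.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&auml;", "\u00E4");
        ENTITIES.put("&ouml;", "\u00F6");
        ENTITIES.put("&uuml;", "\u00FC");
        ENTITIES.put("&Auml;", "\u00C4");
        ENTITIES.put("&Ouml;", "\u00D6");
        ENTITIES.put("&Uuml;", "\u00DC");
        ENTITIES.put("&szlig;", "\u00DF");
    }

    // Replace every known entity with the character it stands for.
    // Entity names are case-sensitive, and so is String.replace.
    static String decode(String html) {
        String out = html;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        String title = "Wohnung, Haus, Grundst&uuml;ck kaufen und mieten"
                + " - Immobilien in &Ouml;sterreich";
        String expected = "Wohnung, Haus, Grundst\u00FCck kaufen und mieten"
                + " - Immobilien in \u00D6sterreich";
        // Compare instead of printing the text, so the result does not
        // depend on the console encoding.
        System.out.println(decode(title).equals(expected));
    }
}
```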

-Scott