New line character should be replaced by single space character
While scraping a site I noticed the problem that a "new line" character is being replaced by an empty character.
Example:
The last response:
We love the chill attitude these jeans have. Roll 'em up or roll 'em
down for the ultimate laissez-faire look.
The extracted data:
We love the chill attitude these jeans have. Roll 'em up or roll 'emdown for the ultimate laissez-faire look.
The "Trim white spaces" setting is turned off, so it can't be what removed the space between the words "em" and "down". Is this a bug?
I can't think of many situations, if any, where I wouldn't want the "new line" character to be replaced by a single space character.
Regards,
Edgar
That does imply that the site has either a \n or a \r\n there, but I don't know which. What I do most of the time I see this is to make a script that will run before pattern is applied that will replace all the new line characters with a tag or a space.
What kind of script?
"What I do most of the time I see this is to make a script that will run before pattern is applied"
How do you do this without first getting the data using an extractor pattern? The only way I could think of was to call .getContentAsString, then .extractData, but that seems kind of clunky. How are you doing this?
Hey Chris,
Here is a script I use when I want to scrape stuff from ugly JavaScript, and would rather work with a readable format:
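// Assumes Apache Commons Lang's StringEscapeUtils is imported in the script.
test = StringEscapeUtils.unescapeJavaScript(scrapeableFile.getContentAsString());
// Replace the "last response" so the extractor pattern runs against the unescaped text.
scrapeableFile.setLastScrapedData(test);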
I set that on the first extractor, to run before the pattern is applied, and my "last response" is now filled with less user hostility.
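For the original new line problem, the same trick could probably be used with a replace instead of an unescape. Below is a minimal sketch in plain Java so it can be run outside the scraper; the class name NewlineNormalizer and the method newlinesToSpaces are made up for illustration, and the wiring into scrapeableFile shown in the comment is an assumption based on the two calls in the script above, not something I have tested.

```java
import java.util.regex.Pattern;

public class NewlineNormalizer {
    // Matches one line break of any common flavor: \r\n, \n, or \r.
    // \r\n is listed first so a Windows line break becomes one space, not two.
    private static final Pattern LINE_BREAK = Pattern.compile("\\r\\n|\\n|\\r");

    /** Replaces each line break in the text with a single space. */
    public static String newlinesToSpaces(String text) {
        return LINE_BREAK.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String response = "Roll 'em up or roll 'em\r\ndown for the ultimate laissez-faire look.";
        System.out.println(newlinesToSpaces(response));

        // Assumption: inside a script set to run before the pattern is applied,
        // the equivalent would be something like
        //   text = scrapeableFile.getContentAsString();
        //   scrapeableFile.setLastScrapedData(newlinesToSpaces(text));
    }
}
```

Each line break becomes exactly one space here; if the site uses several blank lines in a row and you want those collapsed to one space too, adding a + after the group in the pattern should do it.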
Ah, I didn't know about the .setLastScrapedData method; in fact, I don't see it in the documentation either. That could be useful in many situations. Are there other methods we don't know about? What are you hiding from us?? ;)
Thanks.