Scraping data from <textarea>

I'm trying to extract address data from a text area (<textarea></textarea>) form field on a web page. Each line of the address is shown on a different line in the text box.

When I scrape the data, I lose these line breaks, so the address gets joined together (e.g. Old Farm BarnPetworth RoadWisborough GreenWest SussexRH14 0BJ).

I have tried scraping with tidy HTML on and off, but it makes no difference. Any suggestions would be much appreciated. Thanks

replace all and then write to file

Hi Scraper

Thanks for the suggestion. I have tried this, but there are still no line breaks in the output file, no matter what I open it with (Notepad, Wordpad, Word, Excel). Any more thoughts?

Thanks, Gary

Part of the suggestion to

Part of the suggestion to replace the "\n" with "\r\n" is because Windows won't make a line break if only the "\n" is present. It needs both "\r\n" together like that in order for the line break to appear.
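As a rough illustration of that conversion in plain Java (the class and method names here are made up for the example, not part of screen-scraper), the sketch below turns any mix of line endings into "\r\n". One thing to watch: the second argument of replaceAll should be the real "\r\n" characters. A doubly escaped "\\r\\n" is read by replaceAll as the literal letters "rn".

```java
public class LineEndings {
    // Convert any mix of "\r\n", lone "\r" or lone "\n" to Windows-style "\r\n".
    static String toCrlf(String text) {
        // The alternation tries "\r\n" first so an existing pair is not doubled.
        // The replacement is the actual CR LF character pair.
        return text.replaceAll("\\r\\n|\\r|\\n", "\r\n");
    }

    public static void main(String[] args) {
        String address = "Old Farm Barn\nPetworth Road\nWisborough Green";
        // Spell the control characters out so the result is visible on a console.
        System.out.println(toCrlf(address).replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

Matching "\r\n" as well as a bare "\n" means a value that already has Windows line endings is left unchanged rather than ending up with "\r\r\n".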

Notepad and Wordpad are likely victims of this behavior, but I frequently use "Notepad++" to open my files when a problem like this pops up. I know it's never fun to go on a software hunt just for debugging purposes, but it might help us see the issue more clearly. (It's either that, or a hex-editor, but that's far less fun.)

Assuming that even the Notepad++ test fails, are you processing the variable at all between scraping it and writing it to a file? We've requested that you process it with those replaceAll lines, but I mean other than that. There really shouldn't be a reason for screen-scraper to remove those newline characters unless instructed to.

Notepad++ should hopefully reveal if the "\n"-only newline is there or not. If it is, then it's a simple matter of making it turn into a "\r\n". If it's not there, then I'll have to scratch my head for a moment and try to figure out why it's going away.
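If installing another editor is a nuisance, a small helper along these lines is one way to check which line terminators, if any, a value contains. This is only a sketch in plain Java with hypothetical names; inside a screen-scraper script the returned string could presumably be passed to session.log instead of System.out.

```java
public class ShowLineBreaks {
    // Return the text with CR and LF spelled out as visible "\r" and "\n" markers.
    static String visible(String text) {
        return text.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Sample values only: one with Windows line endings, one with none at all.
        System.out.println(visible("104 Southwick Street\r\nSouthwick\r\nBrighton"));
        System.out.println(visible("104 Southwick StreetSouthwickBrighton"));
    }
}
```

If the markers never show up in the output, the line breaks were lost before the value reached the script, and no amount of replacing afterwards will bring them back.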

Can you post a URL to an example page? I'm curious to take a look at it.

Tim

More info...

Hi Tim

I've tried with Notepad++, but the line breaks definitely aren't there. Also, I'm not doing any other processing.

Here's a sample URL: http://62.189.207.183/PublicAccess/tdc/DcApplication/application_detailview.aspx?caseno=KJ7WMDCB01M00

Also, here's an extract from my log:

APPLICATIONNO=ADC/0185/09
ADDRESS=104 Southwick StreetSouthwickBrightonWest SussexBN42 4TJ
PROPOSAL=Widening of vehicular access
TYPEAPP=Full Application
DATERECEIVED=06/05/2009
DATEVALID=12/05/2009
APPLICANTNAME=Mr Kenneth Woods
APPLICANTADDRESS=104 Southwick StreetSouthwickBrightonWest SussexBN42 4TJ
Planning Details: Processing scripts after a pattern application.
Processing script: "CAPS write data to a file"
Writing data to a file.

Thanks for your time, Gary

Well, I'm blinking in

Well, I'm blinking in astonishment at the lack of newline characters. I'm going to file a bug report about it, but I'm really not sure what else I can do until I can get the lead developer questioned about it.

Hopefully will have news for you soon.

Tim

Any news?

Hi Tim

Did you manage to find out anything from your lead developer?

Thanks, Gary

Sorry for such a long delay

Sorry for such a long delay in correspondence.

The issue relates to the way that screen-scraper strips out unnecessary whitespace. It treats the whitespace inside the textarea as unimportant, so the line breaks are gone before you get a chance to do anything about them. It also seems that this is rather difficult to change, because doing so would break backwards compatibility for scraping projects that rely on the current behavior.

This page might be of some help, though: http://community.screen-scraper.com/FAQ/WhiteSpace

There is a link to download an example script of how one might work around this issue. The idea is pretty simple:

// Run this script after the page request, but before your extractor patterns begin to be applied:
scrapeableFile.setLastScrapedData( scrapeableFile.getLastScrapedData().replaceAll( "\\r\\n", "###" ));

and then, whenever one of your patterns matches on the scrapeableFile, you have to manually convert the "###" back into a newline character, or else you will have lots of extraneous "###"s in your data.
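To make the round trip concrete, here is a minimal sketch in plain Java. The class, the method names and the stripLineBreaks stand-in are all hypothetical: the stand-in only imitates the whitespace stripping described above so the example can run outside screen-scraper. It also matches "\r?\n" rather than only "\r\n", on the assumption that some pages use bare "\n" line endings.

```java
public class PlaceholderNewlines {
    static final String MARKER = "###";

    // Step 1: protect line breaks with a marker before whitespace is stripped.
    static String protect(String html) {
        return html.replaceAll("\\r?\\n", MARKER);
    }

    // Hypothetical stand-in for the stripping step: drops every line break.
    static String stripLineBreaks(String html) {
        return html.replaceAll("[\\r\\n]+", "");
    }

    // Step 2: turn the markers in an extracted value back into line breaks.
    static String restore(String value, String lineBreak) {
        return value.replace(MARKER, lineBreak);
    }

    public static void main(String[] args) {
        String raw = "104 Southwick Street\r\nSouthwick\r\nBrighton";
        String extracted = stripLineBreaks(protect(raw));
        System.out.println(extracted);
        System.out.println(restore(extracted, System.lineSeparator()));
    }
}
```

The marker has to be something that never occurs in the page itself. If "###" could appear in the real data, a longer and more unusual string would be a safer choice.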

I look forward to hearing

I look forward to hearing what you find out....

Many thanks for your help. Gary

Scraping data from <textarea>

Hi Tim

You've grasped exactly what I'm trying to do.

Thanks for the suggestion, which I've tried to no avail. Looking at the log, it seems like the line breaks are being lost when the page is scraped, e.g. ADDRESS=22 Manor RoadLancingWest SussexBN15 0EY

Any more thoughts (I'm using the basic edition)?

Thanks, Gary

replace all and then write to file

Garaldo,

Thanks for the question.

Here's my suggestion. Building on Tim's idea, do the replaceAll, write that variable out to a file, and then open the file. The line breaks may be there but not displayed by some programs (e.g. Notepad handles line breaks differently than Word does).

So, if you're on Windows, run the following line before you write the variable out to a file.

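// The replacement must be the real "\r\n" characters; an escaped "\\r\\n" would insert the letters "rn".
session.setVariable("ADDRESS", session.getVariable("ADDRESS").replaceAll("\\r?\\n", "\r\n"));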

Then use the following code to get it out to a file:


// Output a message to the log so we know that we'll be writing the text out to a file.
session.log( "Writing data to a file." );

// Create a FileWriter object that we'll use to write out the text.
out = new FileWriter( "Test.txt" );

// Write out the text.
out.write( session.getVariable( "ADDRESS" ) );

// Close the file.
out.close();

This will write it out to the installation directory in screen-scraper. So, check where you put screen-scraper on your computer and look for a Test.txt file after you run this. Open it in Word (reply back if you don't have Word) and see if the line breaks are there.

If this still doesn't work we'll find another solution.

Thanks
Scraper

Ran into similar myself

I have a similar issue extracting newline characters from a text area. I would love to know whether this has been fixed in a more recent edition, or what plans there are, as I'm currently at a loss as to how I can reintroduce the line breaks.

Hello-- If I understand

Hello--

If I understand correctly, you're matching the entire contents of the textarea tag with a single token in your extractor pattern, correct?

Assuming that you're matching it out with something close to , then the newlines found on the page should be preserved in your saved value. There's a good chance, though, that if you're using Windows or a Mac, it's simply not breaking out the lines as expected.

If you're on Windows, try using a script to do the following operation:

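// The replacement must be the real "\r\n" characters; an escaped "\\r\\n" would insert the letters "rn".
session.setVariable("ADDRESS", session.getVariable("ADDRESS").replaceAll("\\r?\\n", "\r\n"));

If you're using a Mac, just change the last part of the above line:

... .replaceAll("\\r?\\n", "\r"));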

You would only see this take effect when writing out to a file. The log in screen-scraper won't do anything with those newline characters. (The reason is that the log is actually in HTML, where newlines don't normally mean anything.)

Alternatively, you could try to match the contents of your address out with separate tokens:
Extractor pattern text:



sub extractor patterns:
    ~@ADDRESS@~
... etc. It may or may not work with the latter approach, but then at least you can try to isolate the data into various tokens and render it however you wish.

Hope that helps out!

Tim