Saving tidied data using sutil.tidyDataRecord
I have got the sutil tidier to tidy my DataRecord, but I am unsure how to save the tidied data in a CSV file (my script is below). It still saves the original untidied record.
I call this script after each pattern match for my DATARECORD:
DataRecord tidied = sutil.tidyDataRecord(dataRecord);
// Run code here to save the tidied record
Then I use the standard csv writer as below:
// Retrieve CsvWriter from session variable
writer = session.getv( "WRITER" );
// Write dataRecord to the file (headers already set)
writer.write(dataRecord);
// Flush record to file (write it now)
writer.flush();
The log shows that the data is being tidied, but the data in the file is the untidied data. It is hopefully something very obvious and daft I am doing wrong, but I have spent hours and got nowhere with this. It will save me hours of work trying to tidy unexpected html codes in extracted data!
Many thanks
Jason
What precisely do you see that you don't expect? Usually Tidy is just for the HTML that you're extracting from ...
I want to save a tidied record
Where there is a variety of HTML that is unpredictable, I can select a whole table, for example, but would then get all the HTML tags etc. as well as the text I am after. I simply need to strip out these tags before writing the values to the CSV file.
So you're using sutil.tidyDataRecord()?
That has a setting to remove HTML. You need to make a script that runs "After the pattern is applied":
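// c is the number of DataRecords in the current DataSet
c = dataSet.getNumDataRecords();
for (i=0; i<c; i++)
{
// Fetch the i-th record, then replace it with its tidied copy
dr = dataSet.getDataRecord(i);
dr = sutil.tidyDataRecord(dr);
// Now write dr
}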
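For anyone wondering what "remove HTML" amounts to, here is a rough standalone Java sketch of tag stripping. It is only an illustration and not how sutil.tidyDataRecord is implemented; the class name TagStripSketch and the method stripTags are made up for this example, and the real tidier does more (the log further down shows it turning list items into "- " prefixes, which this does not).

```java
import java.util.regex.Pattern;

class TagStripSketch {
    // Matches anything that looks like an HTML tag, e.g. <li> or </span>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Matches runs of whitespace left behind once tags are gone.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Hypothetical helper: replace each tag with a space, then collapse
    // whitespace. This is NOT the screen-scraper implementation.
    static String stripTags(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return SPACES.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "<li><span style=\"font-size:18px;\">Single axle</span></li>"
                   + "<li><span style=\"font-size:18px;\">Super singles</span></li>";
        System.out.println(stripTags(raw));
    }
}
```

A regex like this is fine for simple markup, but it will trip over things such as ">" inside attribute values, so treat it as a picture of the idea only.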
I think I have done this but I get the following error:
DESCRIPTION=<li><span style="font-size:18px;">Single axle</span></li><li><span style="font-size:18px;">Super singles</span></li><li><span style="font-size:18px;">£2450 + VAT</span></li>
PandHDetailsPage: DATARECORD: Processing scripts after a pattern application.
The token "REF" in sub-extractor pattern #3 has no regular expression.
The token "DESCRIPTION2" in sub-extractor pattern #5 has no regular expression.
The token "DESCRIPTION" in sub-extractor pattern #14 has no regular expression.
The token "junk" in sub-extractor pattern #8 has no regular expression.
The token "TITLE" in sub-extractor pattern #2 has no regular expression.
The token "DESCRIPTION3" in sub-extractor pattern #6 has no regular expression.
The token "CATEGORY" in sub-extractor pattern #1 has no regular expression.
The token "ID" in sub-extractor pattern #13 has no regular expression.
The token "DESCRIPTION1" in sub-extractor pattern #4 has no regular expression.
The token "DESCRIPTION4" in sub-extractor pattern #7 has no regular expression.
PandHDetailsPage: DATARECORD: Processing scripts once if pattern matches.
PandHDetailsPage: DATARECORD: Processing scripts after all pattern applications.
Processing script: "PandHTidyPostWrite"
Tidying DataRecord
Tidying value for key: DESCRIPTION4
Tidied String from "2450 + VAT" to "2450 + VAT"
Tidying value for key: DESCRIPTION3
Tidied String from "Super singles" to "Super singles"
Tidying value for key: DESCRIPTION2
Tidied String from "Single axle" to "Single axle"
Tidying value for key: DESCRIPTION1
Tidied String from "20 FT Hiab trailer" to "20 FT Hiab trailer"
Tidying value for key: DESCRIPTION
Tidied String from "<li><span style="font-size:18px;">Single axle</span></li><li><span style="font-size:18px;">Super singles</span></li><li><span style="font-size:18px;">£2450 + VAT</span></li>" to "- Single axle - Super singles - £2450 + VAT"
Tidying value for key: TITLE
Tidied String from "20ft Hiab used trailer" to "20ft Hiab used trailer"
Tidying value for key: PIC2
Tidied String from "/ekmps/shops/phmsh/images/20ft-hiab-used-trailer-[2]-384-p[ekm]100x75[ekm].jpg" to "/ekmps/shops/phmsh/images/20ft-hiab-used-trailer-[2]-384-p[ekm]100x75[ekm].jpg"
Tidying value for key: PIC1
Tidied String from "/ekmps/shops/phmsh/images/20ft-hiab-used-trailer-384-p[ekm]300x225[ekm].jpg" to "/ekmps/shops/phmsh/images/20ft-hiab-used-trailer-384-p[ekm]300x225[ekm].jpg"
Tidying value for key: ID
Tidied String from "384" to "384"
Skipping DATARECORD and excluding from tidied record
Tidying value for key: REF
Tidied String from "TRAI150" to "TRAI150"
Key junk wasn't all uppercase, excluding from tidied record
ERROR--PandH: An error occurred while processing the script: PandHTidyPostWrite
PandH: The error message was: class bsh.EvalError (line 11): writer .write ( dataRecord ) -- Error in method invocation: Attempt to pass void argument (position 0) to method: write
My script that runs 'after pattern is applied' is below:
for (i=0; i<c; i++)
{
dr = dataSet.getDataRecord(i);
dr = sutil.tidyDataRecord(dr);
// Retrieve CsvWriter from session variable
writer = session.getv( "WRITER" );
// Write dataRecord to the file (headers already set)
writer.write(dataRecord);
// Flush record to file (write it now)
writer.flush();
}
I also tried just using the standard writer 'after each pattern match', but it still wrote the original string, complete with HTML.
Jason
The line
writer.write(dataRecord);
Should read
writer.write(dr);
Because "dr" is the DataRecord that you've just tidied.
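The general point is that the tidy call hands back a new record and leaves the one you passed in alone, so whatever variable holds the returned value is the one to give to the writer. Here is a small standalone Java sketch of that pattern. The tidy and write methods below are hypothetical stand-ins for sutil.tidyDataRecord and the CsvWriter (plain maps and a StringWriter, no quoting), not the real screen-scraper classes.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

class WriteTidiedSketch {
    // Hypothetical stand-in for sutil.tidyDataRecord: returns a NEW map
    // with tags stripped and leaves the original record untouched.
    static Map<String, String> tidy(Map<String, String> record) {
        Map<String, String> tidied = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            String clean = e.getValue()
                    .replaceAll("<[^>]*>", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
            tidied.put(e.getKey(), clean);
        }
        return tidied;
    }

    // Hypothetical stand-in for the CSV writer: joins the values in
    // insertion order with commas (no quoting or escaping).
    static void write(StringWriter out, Map<String, String> record) {
        out.write(String.join(",", record.values()));
        out.write("\n");
    }

    static String demo() {
        Map<String, String> dataRecord = new LinkedHashMap<>();
        dataRecord.put("REF", "TRAI150");
        dataRecord.put("DESCRIPTION", "<li>Single axle</li><li>Super singles</li>");

        // Keep the returned record and write THAT one, not dataRecord.
        Map<String, String> dr = tidy(dataRecord);
        StringWriter out = new StringWriter();
        write(out, dr);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Passing dataRecord to write here instead of dr would put the raw HTML in the output, which is the same symptom as in the first post.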
Thanks - Works perfectly
To help novices like me, perhaps the CSV writer documentation could show this as an option, as I imagine not many people want to keep the HTML in the file.
Thanks again
Jason