Altering extracted data in the DATARECORD
Is there a way to alter the data in the DATARECORD, and then reinsert that altered data back into the dataRecord after each pattern match of an extractor pattern? In other words, can I make alterations to a DATARECORD array, and have those changes write out to a file after each pattern match without using a session variable?
For example, I am extracting a large list of products by using a sub extractor pattern that grabs all of the products in one big chunk. There is a lot of extra HTML I want to discard before the data writes out to my file, so I then use a script after each pattern match to use a third level extractor pattern that extracts out the products.
I want to then insert that newly extracted/cleaned up data back into my dataRecord and replace the original extraction with this newly cleaned version, so I can write that dataRecord out to a file without having to use a session variable.
I've tried using the dataRecord.put("DATARECORD_SUB EXTRACTOR TOKEN NAME", cleaned_values) at the end of my script, but the result was that the original sub extractor pattern html wrote out to the file, not the newly altered and cleaned up version.
Is it possible for me to make alterations to the dataRecord after each pattern match, and then have those alterations overwrite the existing data already contained in the dataRecord?
Thanks!
EDIT: I think I figured it out. I need to save the new values into a session variable, and then put those values into the dataRecord after each pattern match. I wanted to do it avoiding the session variable because I am using csvWriter to flush the complete DataSet to a csv file, but I don't think it can be done the way I want to do it unless you guys know a trick to alter the dataRecord without using session variables. Thanks!
I have a thing I use a lot. When you use the Jericho tidy, it adds lots of whitespace to the HTML, and of course I don't want that in my output, so I have this script that I run after each pattern match. I call it "pretty dataRecord".
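// Decode entities, strip tags, and collapse the whitespace in one value.
String fixStr(String str)
{
    if (str != null)
    {
        str = sutil.convertHTMLEntities(str);    // decode HTML entities
        str = str.replaceAll("<[^<>]*>", " ");   // replace each tag with a space
        str = str.replaceAll("\\s{2,}", " ");    // collapse runs of whitespace
        str = str.trim();
    }
    return str;
}

// Clean every value in the current dataRecord and put it back under the same key.
enumeration = dataRecord.keys();
while (enumeration.hasMoreElements())
{
    key = enumeration.nextElement();
    value = fixStr(dataRecord.get(key));
    dataRecord.put(key, value);
}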
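If you want to try the cleanup outside of screen-scraper first, here is a rough standalone Java sketch of the same idea. It is only an illustration: the PrettyRecord class, the prettify and decodeEntities helpers, and the PRODUCT key are made up for this example, a plain Hashtable stands in for the dataRecord, and decodeEntities is a crude substitute for sutil.convertHTMLEntities that handles only a handful of entities.

```java
import java.util.Hashtable;
import java.util.Map;

class PrettyRecord {

    // Crude stand-in for sutil.convertHTMLEntities (assumption: only a few
    // common entities are handled here; the real method covers far more).
    static String decodeEntities(String s) {
        return s.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // last, so "&amp;lt;" is not decoded twice
    }

    // Same steps as fixStr in the script above.
    static String fixStr(String str) {
        if (str != null) {
            str = decodeEntities(str);
            str = str.replaceAll("<[^<>]*>", " "); // replace each tag with a space
            str = str.replaceAll("\\s{2,}", " ");  // collapse runs of whitespace
            str = str.trim();
        }
        return str;
    }

    // Overwrite every String value in the record with its cleaned version.
    // Non-String values are left untouched.
    static void prettify(Map<String, Object> record) {
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            if (entry.getValue() instanceof String) {
                entry.setValue(fixStr((String) entry.getValue()));
            }
        }
    }

    public static void main(String[] args) {
        // A Hashtable standing in for the dataRecord (hypothetical data).
        Map<String, Object> record = new Hashtable<>();
        record.put("PRODUCT", "  <td>\n   <b>Blue Widget</b> &amp; Case\n  </td> ");
        prettify(record);
        System.out.println("[" + record.get("PRODUCT") + "]"); // prints [Blue Widget & Case]
    }
}
```

The values are replaced in place under their existing keys, so whatever writes the record out afterwards sees the cleaned strings.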
Thanks Jason!
The object approach to tidying the string is solid. Thank you.