Scrape empty cells

Hi,

I have a completely regular table to scrape, but sometimes the cells are empty, in which case that row is not scraped. Is there a way to scrape all the cells and return a value for the empty cell? Here's an example (pls ignore ~ symbol, I've put it to stop forum software from converting to html):

<~tr>
<~td>1<~/td>
<~td>Sweden<~/td>
<~td><~/td>
<~td>Peter Hanson<~/td>
<~td>66<~/td>
<~td>67<~/td>
<~td><~/td>
<~td><~/td>
<~td>266<~/td>
<~/tr>

cheers
Ian

How about sutil.nullToEmptyString

Hi Ian,

Just to confirm, your extractor pattern includes a token for each line, correct? (So you'd have something like ~@COUNTRY@~ where Sweden is in the 3rd line.) If so, you can create a script that uses a handy utility called nullToEmptyString. When nothing is extracted for a line, SS picks up a null value, and this utility converts that null to an empty string that can be written out. Assuming the ~@COUNTRY@~ token, the use would look something like this:

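// nullToEmptyString hands back the converted value, so keep the result;
// a non-null value comes back as it was, so no separate null check is needed
String country = sutil.nullToEmptyString(session.getVariable("COUNTRY"));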
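
In case it helps to see the idea outside of SS, here is a rough stand-alone sketch of what the conversion does. The nullToEmpty helper below is a hypothetical stand-in I wrote for illustration; it is not the actual code behind sutil.nullToEmptyString, and the class name is made up too.

```java
public class NullToEmptyDemo {
    // Hypothetical stand-in for the utility: null becomes "", anything else is kept as text.
    static String nullToEmpty(Object value) {
        return value == null ? "" : value.toString();
    }

    public static void main(String[] args) {
        String emptyCell = null;        // what a token with no extracted value would give
        String filledCell = "Sweden";   // a token that did match something

        System.out.println("[" + nullToEmpty(emptyCell) + "]");   // prints []
        System.out.println("[" + nullToEmpty(filledCell) + "]");  // prints [Sweden]
    }
}
```

The key point is that the converted value is returned, so it only helps if you store it or write it out.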

HTH and keep on scrapin'!
Justin

Test pattern drops the row

Hi Justin,

Thanks for your reply. I'm afraid I find this software really difficult to work with, so please forgive any stupidities!

I tried to include this in my 'after each pattern match' script but it doesn't do the trick. But indeed when I test the pattern of the extractor the row with the blank cell is not shown - so I imagine that the code you're suggesting would not pick up an empty cell that had already been ignored by the scrape.

The problem as I see it is to influence the scrape in some way before it takes place, so that the empty <~td><~/td> cell is seen as having something in it.

I'm a little worried that this is a problem that other people must have had and I've completely misunderstood the instructions!

Thanks
Ian

Are there always 9 rows?

Hi Ian,

No problem - we all start off the same way. I think (and maybe one of the SS guys can confirm this) that when you have an extractor pattern where only some of the tokens are matching and you then click "Test Extractor Pattern", only the tokens with matching values show up in the list/grid. The ones with null values (i.e., empty rows) might not appear since there's no value for them.

Out of curiosity, are there always 9 rows between the < TR > and < /TR > or does the number of rows sent back vary depending on the situation? I'm just wondering if a better way to handle this might be via a set of Sub-Extractor patterns...

Regards,
Justin

Yes

Yes the number of rows is consistent, just the cells are sometimes empty,
cheers
Ian

OK, what about this?

So your extractor pattern looks something like this?

<~tr>
<~td>~@FIELD_1@~<~/td>
<~td>~@FIELD_2@~<~/td>
<~td>~@FIELD_3@~<~/td>
<~td>~@FIELD_4@~<~/td>
<~td>~@FIELD_5@~<~/td>
<~td>~@FIELD_6@~<~/td>
<~td>~@FIELD_7@~<~/td>
<~td>~@FIELD_8@~<~/td>
<~td>~@FIELD_9@~<~/td>
<~/tr>

If so, you could try creating a script with code similar to the following that is run after the pattern is applied:

//evaluate & convert the first field
String retreivedVariable_STR1 = sutil.nullToEmptyString(session.getVariable("FIELD_1"));
session.saveVariable("FIELD_1", retreivedVariable_STR1);
//evaluate & convert the second field
String retreivedVariable_STR2 = sutil.nullToEmptyString(session.getVariable("FIELD_2"));
session.saveVariable("FIELD_2", retreivedVariable_STR2);
...
//evaluate & convert the last (9th) field
String retreivedVariable_STR9 = sutil.nullToEmptyString(session.getVariable("FIELD_9"));
session.saveVariable("FIELD_9", retreivedVariable_STR9);

What this script does is get each variable (FIELD_1 to FIELD_9) in turn and assign its value to a string (these are the retreivedVariable_STR variables). During this process, the script checks whether the variable is null. If it is, the null is converted to an empty string (""); if the variable has any non-null value, that value is left unchanged. The script then re-saves the variable with either the value or the empty string before moving on. Once this is done, you can write the variable values out, and any variable that used to be null will now hold an empty string.

Note that you need to change the fields from FIELD_1 to whatever your tokens are and replace the ... with iterations of the above. Also note that my code is pretty rough and isn't meant to be pretty or perfect as I'm not really a programmer by training...but I guess that's pretty obvious! ;)
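
If nine near-identical blocks feel like too much typing, the same idea can be written as a loop. This is only a rough sketch in plain Java: a Map stands in for the session (my assumption, so that it runs on its own), and the NormalizeFields class and normalize method are names I made up. In a real script you would use the session's own get and set calls instead of the Map.

```java
import java.util.HashMap;
import java.util.Map;

public class NormalizeFields {
    // Replace a null (or never-set) value for FIELD_1..FIELD_n with an empty string.
    static void normalize(Map<String, Object> vars, int fieldCount) {
        for (int i = 1; i <= fieldCount; i++) {
            String name = "FIELD_" + i;
            Object value = vars.get(name);   // null if the token matched nothing
            vars.put(name, value == null ? "" : value.toString());
        }
    }

    public static void main(String[] args) {
        // The Map is a stand-in for session variables, purely for illustration.
        Map<String, Object> vars = new HashMap<>();
        vars.put("FIELD_1", "1");
        vars.put("FIELD_2", "Sweden");
        // FIELD_3 is deliberately left unset, like an empty cell.

        normalize(vars, 3);
        System.out.println("FIELD_2=[" + vars.get("FIELD_2") + "]");  // prints FIELD_2=[Sweden]
        System.out.println("FIELD_3=[" + vars.get("FIELD_3") + "]");  // prints FIELD_3=[]
    }
}
```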

Error message

Thanks Justin,

I've added your code to my Write_to_CSV script which is set to run after each pattern match, but I get the following error message:

ERROR--scriptA: An error occurred while processing the script: write scriptA
scriptA: The error message was: class bsh.EvalError (line 30): session .saveVariable ( "Position" , gotPosition ) -- Error in method invocation: Method saveVariable( java.lang.String, java.lang.String ) not found in class'com.screenscraper.scraper.ScrapingSession'

Once again, thanks for your patience,
Ian

Is there a space between session & saveVariable

Hi Ian,

In looking over the code relating to the error, I noticed that it looks like there's a space between session & saveVariable: can you check this out? If this is the case, remove this space so the line reads as follows:

session.saveVariable ( "Position" , gotPosition );

Just FYI, the error message is telling you that the standard Java libraries installed with SS don't contain any functionality named "saveVariable", which is true, as this is provided through SS via the session keyword preceding it. That is why a space between session and saveVariable might be causing this error.

HTH,
Justin

session.saveVariable

Thanks for your reply, but in fact there's no space - I had copied and pasted your code correctly. The error message is misquoting for some reason, adding a space!

Here's a snippet:

//evaluate & convert the first field
String gotPosition = sutil.nullToEmptyString(session.getVariable("Position"));
session.saveVariable("Position", gotPosition);

Any other thoughts?
cheers
Ian

Two more thoughts (one of which shows that I was half-asleep...)

Hi Ian,

OK, I generated an HTML page using your markup above and then tested this with the code that I gave you, which allowed me to replicate your error. I think I know what you need to do to fix it:

1. In your extractor pattern, check that the "Save in session variable" option is checked for each token. This ensures that your script can access the information picked up by the extractor pattern.

2. Replace all instances of
session.saveVariable(...);
with
session.setVariable(...);
so for example
session.saveVariable("Position", gotPosition);
becomes
session.setVariable("Position", gotPosition);

FWIW, I think that the problem was due to my original code quoting "saveVariable" rather than "setVariable" as the former isn't the correct name for this functionality in SS. My apologies for this - I really don't know what I was thinking...but for now my excuse is that it was early in the morning and I hadn't had my 2nd coffee yet! ;)

Let me know if this works...

Still no luck

Hi Justin,

Well, I've tried every permutation I can think of without success. No matter what I do, if any cell is blank the row is dropped, so only a full match will work. In other words, it seems that only the rows showing up in the 'test pattern' are available to the script.

Also, using your code as shown below overwrites the existing data and creates a null so I'm clearly not implementing it correctly.

With your test html page, did you have some rows with full data and some with partial data? If it worked for you, then perhaps I'm not applying your code in the right place. I've been using this code, set to run 'after each pattern match'.

I've been putting your code in between the **** JUSTIN'S CODE START **** and **** JUSTIN'S CODE END **** markers:

// Set name of file to write to. the session variable "CSV_NAME" should be declared in another script such as an init script.
outputFile = session.getVariable ("CSV_NAME");

// another convenient way to set up your output file is to name the output after the scraping session.
// outputFile = session.getName() + ".csv";

// Error catching.
try
{
//the following code is necessary to set up the file to be written to.
File file = new File( outputFile );
fileExists = file.exists();

// Open up the file to be appended to.
out = new FileWriter( outputFile, true );
session.log( "Writing data to a file." );

// **** JUSTIN'S CODE START ****

String gotRound5 = sutil.nullToEmptyString(session.getVariable("Round5"));
session.setVariable("Round5", gotRound5);

String gotRound6 = sutil.nullToEmptyString(session.getVariable("Round6"));
session.setVariable("Round6", gotRound6);

// **** JUSTIN'S CODE END ****

//this piece of code is responsible to write out the headers only 1 time.
if (!fileExists)
{
out.write("\"" + "Position" + "\"" + ",");
out.write("\"" + "Rank" + "\"" + ",");
out.write("\"" + "Start" + "\"" + ",");
out.write("\"" + "Country" + "\"" + ",");
out.write("CountryPic" + ",");
out.write("\"" + "Name" + "\"" + ",");
out.write("\"" + "toPar" + "\"" + ",");
out.write("\"" + "Thru" + "\"" + ",");
out.write("\"" + "Round1" + "\"" + ",");
out.write("\"" + "Round2" + "\"" + ",");
out.write("\"" + "Round3" + "\"" + ",");
out.write("\"" + "Round4" + "\"" + ",");
out.write("\"" + "Round5" + "\"" + ",");
out.write("\"" + "Round6" + "\"" + ",");
out.write("\"" + "Total" + "\"" + ",");
out.write( "\n" );
}

// Write columns.
// the important part of this code is where the variable comes from (dataRecord or session variable)
// if the variable was not saved as a session variable but instead this script was invoked after a dataRecord match you would use this code
// out.write( prepareStringForOutput(dataRecord.get("VIN")) + "," );
out.write( session.getVariable( "Start" )+ "," );
out.write( session.getVariable( "Country" )+ "," );
out.write( session.getVariable( "toPar" )+ "," );
out.write( session.getVariable( "Thru" )+ "," );
out.write( session.getVariable( "Round1" ) + "," );
out.write( session.getVariable( "Round2" ) + "," );
out.write( session.getVariable( "Round3" ) + "," );
out.write( session.getVariable( "Round4" ) + "," );
out.write( session.getVariable( "gotRound5" ) + "," );
out.write( session.getVariable( "gotRound6" ) + "," );
out.write( session.getVariable( "Total" )+ "," );

//if you would like to include the URL as a field of your csv you'd use this command
out.write( scrapeableFile.getCurrentURL()); //note that the last out.write doesn't have a comma
out.write( "\n" );

// Close up the file.
out.close();

// Clear variables. you only need to clear session variables because dataRecord variables don't persist
session.setVariable("Position","");
session.setVariable("Start","");
session.setVariable("Country","");
session.setVariable("Name","");
session.setVariable("toPar","");
session.setVariable("Thru","");
session.setVariable("Round1","");
session.setVariable("Round2","");
session.setVariable("Round3","");
session.setVariable("Round4","");
session.setVariable("gotRound5","");
session.setVariable("gotRound6","");
session.setVariable("Total","");
}
catch( Exception e )
{
session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}

As always, thank you for your patience,

cheers
Ian
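
A possible contributor to the null values in the script above, offered only as a guess: the converted value is stored back under "Round5", but the write step looks up "gotRound5", which is the name of the local string rather than of a session variable. The sketch below uses a plain Map as a stand-in for the session (an assumption; the real session object may behave differently), and the VariableNameDemo class and lookup method are made-up names.

```java
import java.util.HashMap;
import java.util.Map;

public class VariableNameDemo {
    // Stand-in for a session lookup: returns null when no variable has that name.
    static Object lookup(Map<String, Object> sessionVars, String name) {
        return sessionVars.get(name);
    }

    public static void main(String[] args) {
        Map<String, Object> sessionVars = new HashMap<>();

        // The local string holds the converted value...
        String gotRound5 = "";
        // ...and it is stored under the token's name, "Round5".
        sessionVars.put("Round5", gotRound5);

        // Looking up the local variable's name finds nothing.
        System.out.println(lookup(sessionVars, "gotRound5") + ",");  // prints null,
        // Looking up the name it was stored under finds the empty string.
        System.out.println(lookup(sessionVars, "Round5") + ",");     // prints ,
    }
}
```

If that is what is happening, writing out "Round5" and "Round6" (or the local strings directly) would avoid the stray nulls.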

Solved!

Hi Justin,

I went right back to basics and checked the extractor pattern carefully, testing the pattern for each configuration. I found that if I set all my variables to "non-HTML tags", regardless of whether they were numeric or not, I got a hit on all the rows. My code then worked fine, without needing your additions.
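
For anyone hitting the same thing, here is a rough sketch of why the token setting can matter. It uses plain java.util.regex rather than SS itself, and it assumes that the "non-HTML tags" setting behaves like [^<>]* (which can match nothing at all) while a numeric setting behaves like \d+ (which needs at least one digit). Both expressions are my assumptions about what the settings expand to, and the EmptyCellRegexDemo class and countRows method are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyCellRegexDemo {
    // Count the rows matched when every cell must satisfy cellRegex (three cells per row here).
    static int countRows(String html, String cellRegex) {
        String cell = "<td>(" + cellRegex + ")</td>";
        Pattern row = Pattern.compile("<tr>" + cell + cell + cell + "</tr>");
        Matcher m = row.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The first row has an empty last cell; the second row is fully populated.
        String html = "<tr><td>66</td><td>67</td><td></td></tr>"
                    + "<tr><td>70</td><td>71</td><td>72</td></tr>";

        System.out.println(countRows(html, "\\d+"));     // prints 1: the row with the empty cell is dropped
        System.out.println(countRows(html, "[^<>]*"));   // prints 2: the empty cell still matches
    }
}
```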

Many thanks for all your help in this - I hope you don't feel your time has been wasted too much ...

cheers
Ian

Great!

Hi Ian,

Glad to hear that you figured out this problem! Trying to fix weird issues w/extractor patterns can be maddening sometimes, I agree. Take care and keep scraping!

Regards,
Justin