Unknown Number of Fields to a CSV

I am looking for a bit of help with a scrape that grabs highway closures from a website and writes them to a CSV, with the six variables below in each row. The entries are listed by increasing highway number, but the site is updated unpredictably, so the number of entries is never fixed (though it will never go above 300). I'm wondering if there is some type of script or loop I could use to pull out the information. The website I'm scraping contains an unknown and varying number of entries of the following form:

Hwy ~@HIGHWAYNUMBER@~~@HIGHWAYLOCATION@~

~@CONDITION@~

~@COMMENT@~

~@DATE@~

~@TIME@~

transportguy on 05/11/2009 at 1:42 pm

screen-scraper public support

Where are you having the problem? Is it with making the scrape pattern flexible enough to handle the potentially missing entries? Or is it with writing the data out to the CSV file? If it's the latter, you should be able to do it, but there are a couple of things to watch out for, as you might end up trying to access null values.

If you're saving the data to session variables instead of using dataset or datarecord then make sure you initialise them first so that when you go to access them you don't risk accessing a null variable.

i.e.:

import java.io.FileWriter;

// begin loop
// Initialise every variable so that a missing field is written as an empty string, not null.
session.setVariable("CONDITION", "");
session.setVariable("COMMENT", "");
session.setVariable("DATE", "");
session.setVariable("TIME", "");

// session.scrapeFile("your scrapeable file");

// Open in append mode (second argument true) so each pass adds a row instead of overwriting the file.
FileWriter output = new FileWriter("C:\\output.csv", true);
output.write(session.getVariable("CONDITION") + ",");
output.write(session.getVariable("COMMENT") + ",");
output.write(session.getVariable("DATE") + ",");
output.write(session.getVariable("TIME") + "\r\n");
output.close();
// end loop

This way, if you've got blank fields, you'll just get a blank entry in the CSV, e.g.:
goodcondition,,,today
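
As a minimal sketch of the same idea in plain Java (outside screen-scraper), the helper below turns a null into an empty string so the column count never changes, and wraps a field in quotes when it contains a comma, a quote, or a line break, which a free-text field such as COMMENT may well do. The class and method names (CsvRowSketch, escape, toCsvRow) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of screen-scraper.
class CsvRowSketch {

    // Returns the field as CSV text: null becomes "", and a value containing
    // a comma, double quote, or line break is wrapped in double quotes with
    // any embedded quotes doubled.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the fields with commas and ends the row with CRLF.
    static String toCsvRow(List<String> fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(fields.get(i)));
        }
        return row.append("\r\n").toString();
    }

    public static void main(String[] args) {
        // Two missing fields still produce four columns: goodcondition,,,today
        System.out.print(toCsvRow(Arrays.asList("goodcondition", null, null, "today")));
    }
}
```

Keeping the separators for empty fields means every row lines up under the same columns when the file is opened in a spreadsheet.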

If you're using DataRecords make sure you check for a null value first i.e.:

import java.io.FileWriter;

// Append mode, so each data record adds one row to the file.
FileWriter output = new FileWriter("C:\\output.csv", true);
String outText = "";

// The separator is written even when a field is null, so the columns stay aligned.
if (dataRecord.get("CONDITION") != null) outText = dataRecord.get("CONDITION").toString();
output.write(outText + ",");

outText = "";
if (dataRecord.get("COMMENT") != null) outText = dataRecord.get("COMMENT").toString();
output.write(outText + ",");

outText = "";
if (dataRecord.get("DATE") != null) outText = dataRecord.get("DATE").toString();
output.write(outText + ",");

outText = "";
if (dataRecord.get("TIME") != null) outText = dataRecord.get("TIME").toString();
output.write(outText + "\r\n");

output.close();
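
The repeated null checks can also be folded into a loop over the field names. This is only a sketch: a plain java.util.Map stands in for the data record (an assumption, so that the example runs outside screen-scraper), and RecordRowSketch and buildRow are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a Map stands in for the data record.
class RecordRowSketch {

    // Column order for each CSV row.
    static final String[] FIELDS = {"CONDITION", "COMMENT", "DATE", "TIME"};

    // Builds one CSV row; a missing or null field yields an empty column,
    // so every row has the same number of separators.
    static String buildRow(Map<String, Object> record) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < FIELDS.length; i++) {
            Object value = record.get(FIELDS[i]);
            row.append(value == null ? "" : value.toString());
            row.append(i < FIELDS.length - 1 ? "," : "\r\n");
        }
        return row.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<String, Object>();
        record.put("CONDITION", "Closed");
        record.put("TIME", "1:42 pm");
        // COMMENT and DATE were never set, so they come out as empty columns.
        System.out.print(buildRow(record));
    }
}
```

Adding a seventh field then means adding one name to FIELDS rather than another if block.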

shadders on 05/11/2009 at 8:09 pm

Accessing all variables.

I am using the free version and therefore cannot use dataRecord.get. I'm having trouble storing and printing out the unknown number of variables. For example, if the website just posted the closed highway names and the reasons for each closure in a row of a table (where the code surrounding each row is identical), it could look like:

~@HWY/REASON1@~
~@HWY/REASON2@~
~@HWY/REASON3@~
~@HWY/REASON4@~

or on another day:

~@HWY/REASON1@~
~@HWY/REASON2@~
~@HWY/REASON3@~
~@HWY/REASON4@~
~@HWY/REASON5@~ ... all the way to a maximum of ~@HWY/REASON300@~

My main extractor pattern contains the whole section of the table that holds these variables (regardless of how many). If I write a sub-extractor pattern that includes one section of the code that surrounds a HWY/REASON variable (since each HWY/REASON variable is surrounded by the same code), the sub-extractor pattern only finds and fills the ~@HWY/REASON1@~ variable and stops, and then I can write a script to write out HWY/REASON1 to a CSV.

If I were to write sub-extractor patterns of long lengths (ex: long enough to fill several HWY/REASON session variables) then the sub-extractor pattern would have to be the exact length of the table to be a) identified within the main extractor pattern, and b) able to fill all the session variables. Yet this sub-extractor pattern (and the script used to print each HWY/REASON out to CSV) would be very long, and would only work if I knew the exact table length (which I won't since it changes as they update the website). It would also take a long time to declare 300 session variables.

Somehow I need to create a loop using only one session variable (~@HWY/REASON@~) where a loop works its way down the table: filling the variable with each HWY/REASON, printing it to a CSV row, clearing the ~@HWY/REASON@~ variable and moving to the next HWY/REASON variable to fill and print (until it reaches the bottom of the table).

Thanks for the help, sorry about the length of the post.

transportguy on 05/12/2009 at 11:55 am

Are you sure?

I'm pretty sure dataRecord.get("varname") works in basic edition...

The "best" solution to this problem is only available in the pro or enterprise editions, though (scrapeableFile.extractData(searchText, extractorPatternName)). It allows you to extract a block of text (which would be the whole list of REASON entries), and then call another extractor pattern on it, which would match more than just the one time.

But, since that won't work for Basic editions...

Let's see..

I think the only "easy" way to do this is to use a script for everything. In other words, use Java/Python/whatever to accomplish the effect of sub-extractor patterns:

// Interpreted Java
// Call this "After each pattern application" on your main extractor pattern

import java.util.ArrayList;
import java.util.regex.*;

// The full text matched by the main extractor pattern (the whole list of reasons).
String reasonsText = (String) dataRecord.get("DATARECORD");

ArrayList reasons = new ArrayList();

Matcher reasonMatcher = Pattern.compile("regular expression for a single Reason").matcher(reasonsText);
while (reasonMatcher.find())
{
    // group(1) is whatever the first pair of parentheses in the regex captured.
    String reason = reasonMatcher.group(1);
    reasons.add(reason);
}

// At this point, you have an ArrayList of your 'reasons'.
// You can turn this into a normal array with the following line:
String[] reasonsArray = (String[]) reasons.toArray(new String[reasons.size()]);

// Now you can do whatever you want. To just store the information for later,
// you can put it back into the in-scope dataRecord:
dataRecord.put("REASONS", reasonsArray);

So. All you have to do is get that line near the top that says regular expression for a single Reason figured out. Regular expressions in scripts are much like in tokens, except that you have to double your backslashes. For example, "\w" must become "\\w". Also, put a pair of parentheses around the actual "reason" text.

An example of the regular expression might be:
... Pattern.compile(" ([^<>]+) ") ...

(Note the parentheses.)
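
Here is a self-contained sketch of that loop, runnable as ordinary Java. It assumes, purely for illustration, that each reason sits in its own <td> cell; the real tags have to come from the page source. ReasonMatcherSketch and extractReasons are made-up names. Note the doubled backslash in "\\s*".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the <td> markup is an assumption.
class ReasonMatcherSketch {

    // "\\s*" is the regex \s* written as a Java string literal.
    // The parentheses capture the reason text itself, without the
    // surrounding whitespace or tags.
    static final Pattern REASON = Pattern.compile("<td>\\s*([^<>]+?)\\s*</td>");

    static List<String> extractReasons(String html) {
        List<String> reasons = new ArrayList<String>();
        Matcher m = REASON.matcher(html);
        while (m.find()) {            // one iteration per matching cell
            reasons.add(m.group(1));  // text captured by the parentheses
        }
        return reasons;
    }

    public static void main(String[] args) {
        // Made-up sample rows.
        String html = "<tr><td> Hwy 1: flooding </td></tr>\n"
                    + "<tr><td>Hwy 16: avalanche</td></tr>";
        // prints [Hwy 1: flooding, Hwy 16: avalanche]
        System.out.println(extractReasons(html));
    }
}
```

Because find() keeps going until no match is left, the same code handles 4 entries or 300 without any change.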

And then you'd be good to go!

Let me know if you would like any more explanation.

Tim

timv on 05/14/2009 at 11:39 am

Does this work for more than one variable pattern of code?

When I simplified my question, I took out the fact that there are many variables I want to pull out of each section of code. I want to pull out all of the six variables in the following code and then print each group of 6 to a row in a CSV. The pattern of code that repeats is:

<tr><td valign="top" width="191">
<b>Hwy ~@HWY@~</b><br>~@LOC@~
</td>
<td valign="top" width="171">
~@CONDITION@~
</td>
<td valign="top" width="171">
~@COMMENT@~
</td>
<td valign="top">
~@DATE@~
<br/>
~@TIME@~
</td>
</tr>

This section of code repeats for all the highways and their conditions/comments for when they are closed. Is there a way to use the Matcher class to pull out all 6 of these variables and put them in a row in the csv? How would the regex have more than one [^<>]?

transportguy on 05/21/2009 at 8:37 am

The magic has to do with the parentheses surrounding stuff in the regex. Basically you'd put them around each part of your base text that you want to save, and then you can refer to each "captured group" via reasonMatcher.group( n ) (where "n" is the number corresponding to the group you want to retrieve).
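
Applied to the table row posted above, a single regex with six pairs of parentheses might look like the sketch below. It is plain Java, the sample values in main are made up, and HighwayRowSketch and extractRows are hypothetical names; the pattern assumes the markup is exactly as posted, so it would need adjusting if the real page differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class built around the row markup posted above.
class HighwayRowSketch {

    // One pair of parentheses per field, in the order they appear in a row:
    // 1=HWY, 2=LOC, 3=CONDITION, 4=COMMENT, 5=DATE, 6=TIME.
    // [^<>]*? allows a field to be empty; \\s* absorbs the line breaks.
    static final Pattern ROW = Pattern.compile(
            "<b>Hwy\\s*([^<>]*?)</b><br>\\s*([^<>]*?)\\s*</td>\\s*"
          + "<td[^>]*>\\s*([^<>]*?)\\s*</td>\\s*"
          + "<td[^>]*>\\s*([^<>]*?)\\s*</td>\\s*"
          + "<td[^>]*>\\s*([^<>]*?)\\s*<br/>\\s*([^<>]*?)\\s*</td>");

    // Returns one String[6] per table row found in the HTML.
    static List<String[]> extractRows(String html) {
        List<String[]> rows = new ArrayList<String[]>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            String[] row = new String[m.groupCount()];
            for (int g = 1; g <= m.groupCount(); g++) {
                row[g - 1] = m.group(g);   // groups are numbered from 1
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample row in the posted layout.
        String html = "<tr><td valign=\"top\" width=\"191\">\n"
                + "<b>Hwy 16</b><br>Somewhere to Elsewhere\n</td>\n"
                + "<td valign=\"top\" width=\"171\">\nClosed\n</td>\n"
                + "<td valign=\"top\" width=\"171\">\nAvalanche control\n</td>\n"
                + "<td valign=\"top\">\n05/11/2009\n<br/>\n1:42 pm\n</td>\n</tr>";
        for (String[] row : extractRows(html)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Each pass through the while loop yields one complete group of six values, which maps directly onto one CSV row.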

You may find a need to nest this routine a few times... In other words, you might find that you have to keep the first part of the example I gave in my other post, but then start another, separate pattern on the text it captured:

// So now that you've got "reasonMatcher.group(1)", you can throw another Matcher on it:

String subText = reasonMatcher.group(1);
Matcher matcher_sub_sub = Pattern.compile("more (stuff) with capturing (groups)").matcher( subText );

// .. And then continue with more sub-sub-matching:
while (matcher_sub_sub.find())
{
// Do what you need... save it to an array, etc...
String someVar = matcher_sub_sub.group(1);
String anotherVar = matcher_sub_sub.group(2);
}

This all starts getting confusing, but if you can successfully store the data you're trying to extract, you can flush it out into a file in the order you're trying to achieve. You might even find it best to write to your CSV as you go through these Matcher object loops.
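
To make the "write as you go" option concrete, here is a rough sketch in plain Java. It assumes each entry is a <tr> containing <td> cells, writes cell contents verbatim (no CSV quoting, no stripping of inner tags), and uses made-up names (NestedMatchSketch, writeRows). In real use the PrintWriter would wrap a FileWriter instead of a StringWriter.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the row and cell markup is an assumption.
class NestedMatchSketch {

    // Outer pattern: isolates one table row. DOTALL lets "." span line breaks.
    static final Pattern ROW = Pattern.compile("<tr>(.*?)</tr>", Pattern.DOTALL);

    // Inner pattern: run against the text of a single row, one match per cell.
    static final Pattern CELL =
            Pattern.compile("<td[^>]*>\\s*(.*?)\\s*</td>", Pattern.DOTALL);

    // Writes one CSV line per table row as soon as that row has been matched.
    static void writeRows(String html, PrintWriter out) {
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            // Second Matcher, applied only to what the first one captured.
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            int cells = 0;
            while (cellMatcher.find()) {
                if (cells > 0) {
                    out.print(',');
                }
                out.print(cellMatcher.group(1));
                cells++;
            }
            out.print("\r\n");
        }
        out.flush();
    }

    public static void main(String[] args) {
        // Made-up sample input: two rows, the second with an empty cell.
        String html = "<tr><td>16</td><td>Closed</td></tr>\n"
                    + "<tr><td>97</td><td></td></tr>";
        StringWriter buffer = new StringWriter();
        writeRows(html, new PrintWriter(buffer));
        System.out.print(buffer);
    }
}
```

Writing inside the outer loop keeps the order simple: one outer match, one line in the file, and nothing has to be held in memory between rows.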

Does that point you in the right direction? The complicated part will probably be just wrapping your head around the order of writing it out.

Tim

timv on 05/22/2009 at 4:54 pm

Thanks

Thanks for all the help - the program works well and will save about 2 man hours per week.

transportguy on 05/25/2009 at 3:30 pm
