Extracting Several Tables to a CSV File
Dear Screen-Scraper Community,
I've been wrestling with this for days now. It seems fairly basic, but I can't figure out a way to make it work.
I want to extract this data to a CSV file:
<tr>
<td>
<hr size="1" width="100%" noshade="noshade" />
</td>
</tr>
<tr valign="top">
<td><b>Steve Abraham</b><br />
Yellow-Checker Cab Co., Inc.<br />
P.O. Box 25123<br />
Albuquerque, NM 87125<br />
Reservations Phone Number: <b>505-247-8888</b><br />
Fax: <b>505-243-7499</b><br />
Email: <a href="mailto:[email protected]"><b>[email protected]</b></a><br />
<br />
</td>
</tr>
<tr valign="top">
<td>
<table border="0" cellspacing="0" cellpadding="0">
<tr valign="top">
<td>Fleet Information - </td>
<td>Limousines: <b>2</b><br />
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
<hr size="1" width="100%" noshade="noshade" />
</td>
</tr>
<tr valign="top">
<td><b>John Acierno</b><br />
The Executive Transportation Group<br />
1440 39th St.<br />
Brooklyn, NY 11218<br />
Reservations Phone Number: <b>718-438-1100</b><br />
Fax: <b>718-438-2930</b><br />
Email: <a href="mailto:[email protected]"><b>[email protected]</b></a><br />
Website: <a href="http://www.executivecharge.com" target="_blank"><b>www.executivecharge.com</b></a><br />
<br />
</td>
</tr>
<tr valign="top">
<td>
<table border="0" cellspacing="0" cellpadding="0">
<tr valign="top">
<td>Fleet Information - </td>
<td>Limousines: <b>1500</b><br />
</td>
</tr>
</table>
</td>
</tr>
I need a CSV file with the correct headers (Name, Company, Address, etc.). As you can see, the second table has a "Website" line that the first one lacks. FYI, there are plenty of other tables like these that I need to extract.
So I have a first extractor pattern, to which a script and a second extractor pattern are applied, like so:
<td>
<hr size="1" width="100%" noshade="noshade" />
</td>
</tr>
<tr valign="top">
<td><b>~@Name@~</b><br />
~@Company@~<br />
~@Address1@~<br />
~@Address2@~<br />
Reservations Phone Number: <b>~@Phone@~</b><br />
Fax: <b>~@Fax@~</b><br />
Email: <a href="mailto:~@Email@~"><b>~@Email@~</b></a><br />
Website: <a href="~@Website@~" target="_blank"><b>~@Website@~</b></a><br />
<br />
<strong>Member Service Description:</strong> ~@Desc@~<br />
</td>
</tr>
<tr valign="top">
<td>
<table border="0" cellspacing="0" cellpadding="0">
<tr valign="top">
<td>Fleet Information - </td>
<td>Limousines: <b>~@Num@~</b><br />
</td>
</tr>
</table>
</td>
</tr>
[...]
dataSet.writeToFile( "C:/extracted_data.csv" );
And I really don't know where to go from there: the file is created properly, but there's nothing in it. I've tried a lot of things that I'm too ashamed to post, haha.
Can anyone enlighten me?
Oh, and I also tried http://community.screen-scraper.com/script_repository/Write_to_CSV, but it doesn't seem to work either.
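For what it's worth, here's the kind of output I'm after, boiled down to plain Java (a rough sketch of my own; `CsvSketch`, `escape`, and `toRow` are names I made up, not screen-scraper API): a header row, then one row per listing, with missing fields like Website left as empty columns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvSketch {
    // Quote a field if it contains a comma, quote, or newline (RFC 4180 style);
    // a missing (null) field becomes an empty column so rows stay aligned.
    static String escape(String field) {
        if (field == null) return "";
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Build one CSV row by looking each header up in the record map.
    static String toRow(String[] headers, Map<String, String> record) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < headers.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(record.get(headers[i])));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] headers = {"Name", "Company", "Website"};
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("Name", "Steve Abraham");
        rec.put("Company", "Yellow-Checker Cab Co., Inc.");
        // No "Website" for this listing -> empty last column.
        // In the real scrape this would go to a FileWriter instead of stdout.
        System.out.println(String.join(",", headers));
        System.out.println(toRow(headers, rec));
    }
}
```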
Would an example help? Things to note:
To set it up:
Here's the scrape:
<scraping-session use-strict-mode="true"><script-instances><script-instances when-to-run="10" sequence="1" enabled="true"><script><script-text>// Create CsvWriter with timestamp
CsvWriter writer = new CsvWriter("output/yelp.csv", true);
// Create Headers Array
String[] header = {"Name", "Date", "Business"};
// Set Headers
writer.setHeader(header);
// Save in session variable for general access
session.setVariable( "WRITER", writer);</script-text><name>Yelp--start CSV</name><language>Interpreted Java</language></script></script-instances><script-instances when-to-run="10" sequence="2" enabled="true"><script><script-text>import java.util.*;
import java.text.*;
// Set number of days to go back
addDays = 50;
Calendar rightNow = Calendar.getInstance();
rightNow.add(Calendar.DATE, addDays*-1);
Date oldestDesired = rightNow.getTime();
// Output the new date.
session.log("+++Seeking reviews newer than " + oldestDesired);
session.setVariable("OLDEST_DESIRED", oldestDesired);
// Manually setting a list of users to check
String[] peopleToCheck = {
"http://www.yelp.com/user_details?userid=-h8OOTM2JQBvjnH8mf8i5w",
"http://www.yelp.com/user_details?userid=k3Oopx0QniRDHGlLA4W2XQ",
"http://www.yelp.com/user_details?userid=tind8sTPbu_i2jLit5Ro4A",
"http://surlyjason.yelp.com/"
};
// Request each person
for (i=0; i<peopleToCheck.length; i++)
{
session.log("Checking person #" + i);
url = peopleToCheck[i];
session.setv("URL", url);
session.scrapeFile("Reviews");
}</script-text><name>Yelp--init</name><language>Interpreted Java</language></script></script-instances><script-instances when-to-run="20" sequence="3" enabled="true"><script><script-text>//scraping session close script
CsvWriter writer = session.getVariable("WRITER");
writer.close();</script-text><name>CSV close</name><language>Interpreted Java</language></script></script-instances><owner-type>ScrapingSession</owner-type><owner-name>Yelp</owner-name></script-instances><name>Yelp</name><notes></notes><cookiePolicy>0</cookiePolicy><maxHTTPRequests>1</maxHTTPRequests><external_proxy_username></external_proxy_username><external_proxy_password></external_proxy_password><external_proxy_host></external_proxy_host><external_proxy_port></external_proxy_port><external_nt_proxy_username></external_nt_proxy_username><external_nt_proxy_password></external_nt_proxy_password><external_nt_proxy_domain></external_nt_proxy_domain><external_nt_proxy_host></external_nt_proxy_host><anonymize>false</anonymize><terminate_proxies_on_completion>false</terminate_proxies_on_completion><number_of_required_proxies>5</number_of_required_proxies><originator_edition>2</originator_edition><logging_level>1</logging_level><date_exported>July 12, 2011 09:32:21</date_exported><character_set>ISO-8859-1</character_set><scrapeable-files sequence="1" will-be-invoked-manually="false" tidy-html="jtidy"><last-scraped-data></last-scraped-data><URL>~#URL#~</URL><last-request></last-request><name>Next page</name><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Next page</owner-name></script-instances></scrapeable-files><scrapeable-files sequence="-1" will-be-invoked-manually="true" tidy-html="dont"><last-scraped-data></last-scraped-data><URL>~#URL#~</URL><BASICAuthenticationUsername></BASICAuthenticationUsername><last-request></last-request><name>Reviews</name><extractor-patterns sequence="3" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><a href="~@URL@~"~@junk@~><span>More &raquo;</pattern-text><identifier>Next page</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" 
strip-html="false" resolve-relative-url="true" replace-html-entities="true" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>/user_details_reviews_self[^"]*</regular-expression><identifier>URL</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><regular-expression>[^<>]*</regular-expression><identifier>junk</identifier></extractor-pattern-tokens><script-instances><script-instances when-to-run="80" sequence="1" enabled="true"><script><script-text>if (session.getv("ITERATE_PAGES"))
{
session.log("Want more results");
session.setv("URL", dataRecord.get("URL"));
}
else
{
session.log("Done with this guy");
}
</script-text><name>Yelp--iterate pages</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Next page</owner-name></script-instances></extractor-patterns><extractor-patterns sequence="2" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><div class="review clearfix">
~@DATARECORD@~
>Link to this Review<</pattern-text><identifier>Review</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><identifier>DATARECORD</identifier></extractor-pattern-tokens><extractor-patterns sequence="2" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>class="smaller">~@REVIEW_DATE@~<</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>\d{1,2}[-/. ]+\d{1,2}[-/. ]+\d{2,4}</regular-expression><identifier>REVIEW_DATE</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><h4>
~@ws@~<a href="~@LINK@~">~@BUSINESS@~<</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><regular-expression>[^"]*</regular-expression><identifier>LINK</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[\n\t\s]*</regular-expression><identifier>ws</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="true" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="3"><regular-expression></regular-expression><identifier>BUSINESS</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><script-instances><script-instances when-to-run="60" sequence="1" enabled="true"><script><script-text>import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.text.ParseException;
import java.util.Date;
// Set oldest desired date
oldestDesired = session.getv("OLDEST_DESIRED");
// Parse the newest review date
newestDate = dataSet.get(0, "REVIEW_DATE");
DateFormat df = new SimpleDateFormat("M/d/yyyy");
reviewDate = df.parse(newestDate);
// Formatting line
line = "=";
while (line.length()<90)
line += "=";
// Compare the dates
if (reviewDate.after(oldestDesired) || reviewDate.equals(oldestDesired))
{
// Within threshold
session.log(line);
session.log("Want this guy's reviews");
numReviews = dataSet.getNumDataRecords();
session.log("Found " + numReviews + " reviews");
for (i=0; i<numReviews; i++)
{
oneReview = dataSet.getDataRecord(i);
// Prep the values
date = oneReview.get("REVIEW_DATE");
date = sutil.reformatDate(date, "M/d/yyyy", "yyyy-MM-dd");
business = oneReview.get("BUSINESS");
session.log(date + ": " + business);
// Concatenate the items to write
HashMap hm = new HashMap();
hm.put("NAME", session.getv("NAME"));
hm.put("DATE", date);
hm.put("BUSINESS", business);
// Get existing writer
writer = session.getv("WRITER");
// Write dataRecord to the file (headers already set)
writer.write(hm);
// Flush record to file (write it now)
writer.flush();
}
session.log(line);
session.setv("ITERATE_PAGES", true);
}
else
{
// Too old
session.log(line);
session.log("This guy is inactive");
session.log(line);
session.setv("ITERATE_PAGES", false);
}</script-text><name>Yelp--check date</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Review</owner-name></script-instances></extractor-patterns><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>>~@ws@~~@NAME@~'s Profile</pattern-text><identifier>Name</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[\n\t\s]*</regular-expression><identifier>ws</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="true" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><regular-expression>[^<>]*</regular-expression><identifier>NAME</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Name</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Reviews</owner-name></script-instances></scrapeable-files></scraping-session>
Attempt to invoke method: getNumDataRecords() on undefined variable
Hey,
So I tried everything to adapt it to my problem, but I'm getting an "Attempt to invoke method: getNumDataRecords() on undefined variable or class name" error.
Do you mind taking a quick look? It seems like I'm really close, but I can't figure out the cause of this problem.
<scraping-session use-strict-mode="true"><script-instances><script-instances when-to-run="10" sequence="1" enabled="true"><script><script-text>// Create CsvWriter with timestamp
CsvWriter writer = new CsvWriter("C:/TLPA_Extract.csv", true);
// Create Headers Array
String[] header = {"Name", "Company", "Address1", "Address2", "Phone", "Free Phone", "Fax", "Email", "Website", "Desc"};
// Set Headers
writer.setHeader(header);
// Save in session variable for general access
session.setVariable( "WRITER", writer);</script-text><name>TLPA CSV Start</name><language>Interpreted Java</language></script></script-instances><script-instances when-to-run="20" sequence="2" enabled="true"><script><script-text>import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.text.ParseException;
import java.util.Date;
// Set oldest desired date
//oldestDesired = session.getv("OLDEST_DESIRED");
/* Parse the newest review date
newestDate = dataSet.get(0, "REVIEW_DATE");
DateFormat df = new SimpleDateFormat("M/d/yyyy");
reviewDate = df.parse(newestDate);*/
// Formatting line
line = "=";
while (line.length()<90)
line += "=";
/* Compare the dates
if (reviewDate.after(oldestDesired) || reviewDate.equals(oldestDesired))
{*/
// Within threshold
session.log(line);
session.log("Want this guy's reviews");
numReviews = dataSet.getNumDataRecords();
session.log("Found " + numReviews + " reviews");
for (i=0; i<numReviews; i++)
{
oneItem = dataSet.getDataRecord(i);
// Prep the values
Name = oneItem.get("Name");
Company = oneItem.get("Company");
Address1 = oneItem.get("Address1");
Address2 = oneItem.get("Address2");
Phone = oneItem.get("Phone");
freePhone = oneItem.get("freePhone");
Fax = oneItem.get("Fax");
Email = oneItem.get("Email");
Website = oneItem.get("Website");
Desc = oneItem.get("Desc");
// Concatenate the items to write
HashMap hm = new HashMap();
hm.put("Name", session.getv("Name"));
hm.put("Company", Company);
hm.put("Address1", Address1);
hm.put("Address2", Address2);
hm.put("Phone", Phone);
hm.put("freePhone", freePhone);
hm.put("Fax", Fax);
hm.put("Email", Email);
hm.put("Website", Website);
hm.put("Desc", Desc);
// Get existing writer
writer = session.getv("WRITER");
// Write dataRecord to the file (headers already set)
writer.write(hm);
// Flush record to file (write it now)
writer.flush();
}
session.log(line);
//session.setv("ITERATE_PAGES", true);
</script-text><name>Check</name><language>Interpreted Java</language></script></script-instances><script-instances when-to-run="20" sequence="3" enabled="true"><script><script-text>//scraping session close script
CsvWriter writer = session.getVariable("WRITER");
writer.close();</script-text><name>CSV close</name><language>Interpreted Java</language></script></script-instances><owner-type>ScrapingSession</owner-type><owner-name>NewTLPA</owner-name></script-instances><name>NewTLPA</name><notes></notes><cookiePolicy>0</cookiePolicy><maxHTTPRequests>1</maxHTTPRequests><external_proxy_username></external_proxy_username><external_proxy_password></external_proxy_password><external_proxy_host></external_proxy_host><external_proxy_port></external_proxy_port><external_nt_proxy_username></external_nt_proxy_username><external_nt_proxy_password></external_nt_proxy_password><external_nt_proxy_domain></external_nt_proxy_domain><external_nt_proxy_host></external_nt_proxy_host><anonymize>false</anonymize><terminate_proxies_on_completion>false</terminate_proxies_on_completion><number_of_required_proxies>5</number_of_required_proxies><originator_edition>1</originator_edition><logging_level>1</logging_level><date_exported>juillet 13, 2011 19:47:24</date_exported><character_set>UTF-8</character_set><scrapeable-files sequence="1" will-be-invoked-manually="false" tidy-html="jtidy"><last-scraped-data></last-scraped-data><URL>http://www.tlpa.org/members/directory.cfm</URL><last-request></last-request><name>Copy of File from New Proxy Session</name><HTTPParameters sequence="2"><key>login_pass</key><type>POST</type><value>Berthome</value></HTTPParameters><HTTPParameters sequence="1"><key>login_user</key><type>POST</type><value>3114</value></HTTPParameters><HTTPParameters sequence="3"><key>submit</key><type>POST</type><value>Login >></value></HTTPParameters><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Copy of File from New Proxy Session</owner-name></script-instances></scrapeable-files><scrapeable-files sequence="2" will-be-invoked-manually="false" tidy-html="jtidy"><last-scraped-data></last-scraped-data><URL>http://www.tlpa.org/members/directoryUSA.cfm</URL><last-request></last-request><name>Copy of File from New Proxy 
Session1</name><extractor-patterns sequence="2" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text></pattern-text><identifier>Untitled Extractor Pattern</identifier><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>
</pattern-text><script-instances/></extractor-patterns><script-instances><script-instances when-to-run="60" sequence="1" enabled="false"><script><script-text>import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.text.ParseException;
import java.util.Date;
// Set oldest desired date
//oldestDesired = session.getv("OLDEST_DESIRED");
/* Parse the newest review date
newestDate = dataSet.get(0, "REVIEW_DATE");
DateFormat df = new SimpleDateFormat("M/d/yyyy");
reviewDate = df.parse(newestDate);*/
// Formatting line
line = "=";
while (line.length()<90)
line += "=";
/* Compare the dates
if (reviewDate.after(oldestDesired) || reviewDate.equals(oldestDesired))
{*/
// Within threshold
session.log(line);
session.log("Want this guy's reviews");
numReviews = dataSet.getNumDataRecords();
session.log("Found " + numReviews + " reviews");
for (i=0; i<numReviews; i++)
{
oneItem = dataSet.getDataRecord(i);
// Prep the values
Name = oneItem.get("Name");
Company = oneItem.get("Company");
Address1 = oneItem.get("Address1");
Address2 = oneItem.get("Address2");
Phone = oneItem.get("Phone");
freePhone = oneItem.get("freePhone");
Fax = oneItem.get("Fax");
Email = oneItem.get("Email");
Website = oneItem.get("Website");
Desc = oneItem.get("Desc");
// Concatenate the items to write
HashMap hm = new HashMap();
hm.put("Name", session.getv("Name"));
hm.put("Company", Company);
hm.put("Address1", Address1);
hm.put("Address2", Address2);
hm.put("Phone", Phone);
hm.put("freePhone", freePhone);
hm.put("Fax", Fax);
hm.put("Email", Email);
hm.put("Website", Website);
hm.put("Desc", Desc);
// Get existing writer
writer = session.getv("WRITER");
// Write dataRecord to the file (headers already set)
writer.write(hm);
// Flush record to file (write it now)
writer.flush();
}
session.log(line);
//session.setv("ITERATE_PAGES", true);
</script-text><name>Check</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Untitled Extractor Pattern</owner-name></script-instances></extractor-patterns><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><hr size="1" width="100%" noshade="noshade" />
</td>
</tr>
<tr valign="top">
~@DATARECORD@~
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
</pattern-text><identifier>DataPattern</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression></regular-expression><identifier>DATARECORD</identifier></extractor-pattern-tokens><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><td><b>~@myName@~</b><br />
~@Company@~<br />
~@Address1@~<br />
~@Address2@~<br />
Reservations Phone Number: <b>~@resPhone@~</b><br />
Fax: <b>~@Fax@~</b><br />
Email: <a href="mailto:~@Email@~"><b>~@Email@~</b></a><br />
Website: <a href="http://~@Website@~" target="_blank"><b>~@Website@~</b></a><br />
<br />
<strong>Member Service Description:</strong> ~@Desc@~<br />
</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="6"><identifier>Fax</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="3"><identifier>Address1</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><identifier>Company</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="11"><identifier>Desc</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><identifier>myName</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" 
sequence="10"><regular-expression>[^"]*</regular-expression><identifier>Website</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="4"><identifier>Address2</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="5"><identifier>resPhone</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="8"><regular-expression>[^"]*</regular-expression><identifier>Email</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><script-instances><script-instances when-to-run="80" sequence="1" enabled="true"><script><script-text>import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.text.ParseException;
import java.util.Date;
// Set oldest desired date
//oldestDesired = session.getv("OLDEST_DESIRED");
/* Parse the newest review date
newestDate = dataSet.get(0, "REVIEW_DATE");
DateFormat df = new SimpleDateFormat("M/d/yyyy");
reviewDate = df.parse(newestDate);*/
// Formatting line
line = "=";
while (line.length()<90)
line += "=";
/* Compare the dates
if (reviewDate.after(oldestDesired) || reviewDate.equals(oldestDesired))
{*/
// Within threshold
session.log(line);
session.log("Want this guy's reviews");
numReviews = dataSet.getNumDataRecords();
session.log("Found " + numReviews + " reviews");
for (i=0; i<numReviews; i++)
{
oneItem = dataSet.getDataRecord(i);
// Prep the values
Name = oneItem.get("Name");
Company = oneItem.get("Company");
Address1 = oneItem.get("Address1");
Address2 = oneItem.get("Address2");
Phone = oneItem.get("Phone");
freePhone = oneItem.get("freePhone");
Fax = oneItem.get("Fax");
Email = oneItem.get("Email");
Website = oneItem.get("Website");
Desc = oneItem.get("Desc");
// Concatenate the items to write
HashMap hm = new HashMap();
hm.put("Name", session.getv("Name"));
hm.put("Company", Company);
hm.put("Address1", Address1);
hm.put("Address2", Address2);
hm.put("Phone", Phone);
hm.put("freePhone", freePhone);
hm.put("Fax", Fax);
hm.put("Email", Email);
hm.put("Website", Website);
hm.put("Desc", Desc);
// Get existing writer
writer = session.getv("WRITER");
// Write dataRecord to the file (headers already set)
writer.write(hm);
// Flush record to file (write it now)
writer.flush();
}
session.log(line);
//session.setv("ITERATE_PAGES", true);
</script-text><name>Check</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>DataPattern</owner-name></script-instances></extractor-patterns><HTTPParameters sequence="4"><key>City</key><type>POST</type><value></value></HTTPParameters><HTTPParameters sequence="5"><key>State</key><type>POST</type><value></value></HTTPParameters><HTTPParameters sequence="7"><key>SortBy</key><type>POST</type><value>LastName</value></HTTPParameters><HTTPParameters sequence="2"><key>FirstName</key><type>POST</type><value></value></HTTPParameters><HTTPParameters sequence="1"><key>LastName</key><type>POST</type><value></value></HTTPParameters><HTTPParameters sequence="6"><key>limosearch</key><type>POST</type><value>YES</value></HTTPParameters><HTTPParameters sequence="3"><key>Company</key><type>POST</type><value></value></HTTPParameters><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Copy of File from New Proxy Session1</owner-name></script-instances></scrapeable-files></scraping-session>
I don't understand all of that...
Hey Jason,
Thanks for the example; unfortunately, it's a little too complicated for my current understanding of the software. I managed to run your script and it works great, but when I tried adapting it to my website I failed miserably. I'll try again tomorrow.
Regards,
Tom