IndexOutOfBoundsException Error
Hi there,
I have a script that writes the recordset to a database doing a for( i = 0; i < dataSet.getNumDataRecords(); i++ ) loop.
Everything was working fine until I introduced these changes in the script:
1. At the start of the database write script (before the for loop), I store the first datarecord:
CurrentDataRecord = dataSet.getDataRecord( 0 );
2. I store one of the values of that first datarecord (a date) in a session variable.
3. I call a JavaScript script that uses that session variable to compute the difference from today's date.
4. I store the date difference in a session variable.
5. Back in the database write script, I put the whole loop mentioned before inside an if clause, which only gets executed if the date difference is less than x days.
Now when I execute the scrape, every now and then I get this error:
The error message was: IndexOutOfBoundsException (line 9): Index: 0, Size: 0-- Method Invocation dataSet.getDataRecord
Line 9 is the line mentioned at step 1.
Any clue what might be going wrong?
cheers,
boga
You were right. On some
You were right. On some occasions the date is null ;-)
thanks!
I would need to see the whole
I would need to see the whole scrape to be sure, but I think that something is out of scope. Is there any reason you can't attach the scrape here so we could poke at it?
This is the Write to database
This is the Write to database script, which gets run after the extractor pattern is applied. That pattern matches the values in a table of votes on each member's profile:
import java.sql.*;
CurrentDataRecord = dataSet.getDataRecord( 0 );
session.setVariable("LATEST_VOTE_DATE", CurrentDataRecord.get("DATE"));
session.executeScript("JS_CalculateDateDifference");
// Write this person's votes to the database only if his last vote is not older than 30 days
if (session.getVariable("DAYS_SINCE_LAST_VOTE") <= 30){
//Set up a connection and a drivermanager.
Class.forName("com.mysql.jdbc.Driver").newInstance();
Connection conn;
conn = DriverManager.getConnection("jdbc:mysql://" + session.getVariable("MYSQL_SERVER_URL") + ":"+session.getVariable("MYSQL_SERVER_PORT") + "/" + session.getVariable("MYSQL_DATABASE"), session.getVariable("MYSQL_SERVER_USER"), session.getVariable("MYSQL_SERVER_PASSWORD"));
for( i = 0; i < dataSet.getNumDataRecords(); i++ )
{
// Store the current data record in the variable CurrentDataRecord.
CurrentDataRecord = dataSet.getDataRecord( i );
date = CurrentDataRecord.get("DATE");
if (CurrentDataRecord.get("TOPVOTE").indexOf("strongly") > 0){
topVote = 1;
}else{
topVote = 0;
}
member = session.getVariable("MEMBER_NAME");
switch(CurrentDataRecord.get("RATING")){
case "one" : rating = 1; break;
case "two" : rating = 2; break;
case "three" : rating = 3; break;
case "four" : rating = 4; break;
case "five" : rating = 5; break;
}
movie = CurrentDataRecord.get("MOVIE");
switch(CurrentDataRecord.get("VOTE")){
case "up" : vote = 1; break;
case "down" : vote = 0; break;
}
Statement stmt = null;
stmt = conn.createStatement();
tabletowrite = session.getVariable("TABLE_TO_WRITE");
mysqlstring="INSERT IGNORE INTO " + tabletowrite + " (date, movie, member, rating, vote, topvote) VALUES(str_to_date('" + date + "','%m/%d/%y'),'" + movie + "','" + member + "','" + rating + "','" + vote + "','" + topVote + "')";
stmt.executeUpdate(mysqlstring);
stmt.close();
}
conn.close();
}
And the JavaScript script that gets called to calculate the date difference:
var ONE_DAY = 86400000; // 1000 * 60 * 60 * 24
var todaysDate = new Date();
var latestVoteDateString = session.getVariable("LATEST_VOTE_DATE");
var idx = latestVoteDateString.lastIndexOf('/') + 1;
latestVoteDateString = latestVoteDateString.substr(0,idx) + '20' + latestVoteDateString.substr(idx);
var latestVoteDate = new Date(latestVoteDateString);
// Calculate difference, convert it from milliseconds to days and set session variable
session.setVariable("DAYS_SINCE_LAST_VOTE", Math.round(Math.abs((todaysDate.getTime() - latestVoteDate.getTime())/(ONE_DAY))));
Thanks for your help :-)
boga
boga, I believe Jason's first
boga,
I believe Jason's first instinct was correct: your IndexOutOfBoundsException occurs because the dataSet object is, in fact, out of scope. Take a look at the Variable Scope table and you'll see that you need to be calling your first script under one of three possible conditions.
You may consider using Jason's second suggestion as a template, unless you're more comfortable using JavaScript for its built-in functions.
-Scott
I still can't tell what
I still can't tell what you're doing with the dates, so here is how I would do it.
1 - At the beginning, figure out the oldest date I want
2 - After I scrape each date, parse it, and compare to the oldest desired.
I have a sample here. Copy this text into an editor, and save as "Yelp.sss" and import it to your screen-scraper, and you'll be able to see what I did.
<scraping-session use-strict-mode="true"><script-instances><script-instances when-to-run="10" sequence="1" enabled="true"><script><script-text>import java.util.*;
import java.text.*;
// Set number of days to go back
addDays = 30;
Calendar rightNow = Calendar.getInstance();
rightNow.add(Calendar.DATE, addDays*-1);
Date oldestDesired = rightNow.getTime();
// Output the new date.
session.log("+++Seeking reviews newer than " + oldestDesired);
session.setVariable("OLDEST_DESIRED", oldestDesired);</script-text><name>Yelp--init</name><language>Interpreted Java</language></script></script-instances><owner-type>ScrapingSession</owner-type><owner-name>Yelp</owner-name></script-instances><name>Yelp</name><notes></notes><cookiePolicy>0</cookiePolicy><maxHTTPRequests>1</maxHTTPRequests><external_proxy_username></external_proxy_username><external_proxy_password></external_proxy_password><external_proxy_host></external_proxy_host><external_proxy_port></external_proxy_port><external_nt_proxy_username></external_nt_proxy_username><external_nt_proxy_password></external_nt_proxy_password><external_nt_proxy_domain></external_nt_proxy_domain><external_nt_proxy_host></external_nt_proxy_host><anonymize>false</anonymize><terminate_proxies_on_completion>false</terminate_proxies_on_completion><number_of_required_proxies>5</number_of_required_proxies><originator_edition>2</originator_edition><logging_level>1</logging_level><date_exported>June 01, 2011 09:39:03</date_exported><character_set>ISO-8859-1</character_set><scrapeable-files sequence="1" will-be-invoked-manually="false" tidy-html="dont"><last-scraped-data></last-scraped-data><URL>http://surlyjason.yelp.com</URL><BASICAuthenticationUsername></BASICAuthenticationUsername><last-request></last-request><name>Reviews</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><div class="review clearfix">
~@DATARECORD@~
>Link to this Review<</pattern-text><identifier>Review</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><identifier>DATARECORD</identifier></extractor-pattern-tokens><extractor-patterns sequence="2" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>class="smaller">~@REVIEW_DATE@~<</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>\d{1,2}[-/. ]+\d{1,2}[-/. ]+\d{2,4}</regular-expression><identifier>REVIEW_DATE</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><h4>
~@ws@~<a href="~@LINK@~">~@BUSINESS@~<</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><regular-expression>[^"]*</regular-expression><identifier>LINK</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[\n\t\s]*</regular-expression><identifier>ws</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="true" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="3"><regular-expression></regular-expression><identifier>BUSINESS</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><script-instances><script-instances when-to-run="80" sequence="1" enabled="true"><script><script-text>import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.text.ParseException;
import java.util.Date;
// Set oldest desired date
oldestDesired = session.getv("OLDEST_DESIRED");
// Parse the review date
DateFormat df = new SimpleDateFormat("M/d/yyyy");
reviewDate = df.parse(dataRecord.get("REVIEW_DATE"));
// Formatting line
line = "=";
while (line.length()<90)
line += "=";
// Compare the dates
if (reviewDate.after(oldestDesired) || reviewDate.equals(oldestDesired))
{
// Within 30 days
session.log(line);
session.log("Writing review");
session.log(line);
}
else
{
// Too old
session.log(line);
session.log("Don't want this review");
session.log(line);
}</script-text><name>Yelp--check date</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Review</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Reviews</owner-name></script-instances></scrapeable-files></scraping-session>
@ Jason, many thanks. To
@ Jason, many thanks. To explain, the write to database script is called after scraping each member's profile page. Each member's profile page has a table listing every movie that the member has voted for. The rows are already sorted by date, so the most recent vote appears first in the table. I have an extractor pattern that matches each line of this table, and "after pattern is applied" I write the rows to the database with the script above.
I added the date check because there is a large number of inactive members whose last vote was cast years ago. I want to skip those inactive members and keep their votes from being written to the database.
I will import and look into Jason's code later to see if I understand it ;-)
@ Jason & Scott: About the dataset being out of scope, there is something that doesn't make sense to me. The write to database script was not giving any error before I added the logic that checks the first datarecord's date (the member's most recent vote) and only loops through the rest of the votes if that date is less than 30 days old.
In fact, I have commented out the first reference to the dataset, the call to the JavaScript script and the if logic, like this:
import java.sql.*;
/*
CurrentDataRecord = dataSet.getDataRecord( 0 );
session.setVariable("LATEST_VOTE_DATE", CurrentDataRecord.get("DATE"));
session.executeScript("JS_CalculateDateDifference");
// Write this person's votes to the database only if his last vote is not older than 30 days
if (session.getVariable("DAYS_SINCE_LAST_VOTE") <= 30){
*/
//Set up a connection and a drivermanager.
Class.forName("com.mysql.jdbc.Driver").newInstance();
Connection conn;
conn = DriverManager.getConnection("jdbc:mysql://" + session.getVariable("MYSQL_SERVER_URL") + ":"+session.getVariable("MYSQL_SERVER_PORT") + "/" + session.getVariable("MYSQL_DATABASE"), session.getVariable("MYSQL_SERVER_USER"), session.getVariable("MYSQL_SERVER_PASSWORD"));
for( i = 0; i < dataSet.getNumDataRecords(); i++ )
{
// Store the current data record in the variable CurrentDataRecord.
CurrentDataRecord = dataSet.getDataRecord( i );
date = CurrentDataRecord.get("DATE");
if (CurrentDataRecord.get("TOPVOTE").indexOf("strongly") > 0){
topVote = 1;
}else{
topVote = 0;
}
member = session.getVariable("MEMBER_NAME");
switch(CurrentDataRecord.get("RATING")){
case "one" : rating = 1; break;
case "two" : rating = 2; break;
case "three" : rating = 3; break;
case "four" : rating = 4; break;
case "five" : rating = 5; break;
}
movie = CurrentDataRecord.get("MOVIE");
switch(CurrentDataRecord.get("VOTE")){
case "up" : vote = 1; break;
case "down" : vote = 0; break;
}
Statement stmt = null;
stmt = conn.createStatement();
tabletowrite = session.getVariable("TABLE_TO_WRITE");
mysqlstring="INSERT IGNORE INTO " + tabletowrite + " (date, movie, member, rating, vote, topvote) VALUES(str_to_date('" + date + "','%m/%d/%y'),'" + movie + "','" + member + "','" + rating + "','" + vote + "','" + topVote + "')";
stmt.executeUpdate(mysqlstring);
stmt.close();
}
conn.close();
// }
With these offending lines commented out, the script does not throw any error and completes successfully. And as you can see, I make extensive use of the dataset object inside the for loop. This is why it doesn't make sense to me that this is an out of scope issue with the dataset object.
Do I make sense?
I am still waiting for a
I am still waiting for a reply to my assessment that the error cannot come from an out of scope dataset object, as I explained in my previous message.
Any other guess?
thanks,
Boga
I still can't tell looking at
I still can't tell looking at this. Did you look at my scrape? I think it will straighten things out.
I cannot follow your scrape
I cannot follow your scrape because I am not familiar with the language. From what I can understand, though, your scrape rejects every individual review that is older than 30 days and writes to the database every review that is newer than 30 days.
What I am trying to do is reject all of a member's reviews if his latest review (which will be the first datarecord in the dataset) is older than 30 days. That is why I check the date stored in dataSet.getDataRecord( 0 ) before going any further. If it is not older than 30 days, I loop through each of that user's reviews and write them to the database.
All the date comparison logic
All the date comparison logic stays the same. I tweaked the scrape to do more as you indicate though.
<scraping-session use-strict-mode="true"><script-instances><script-instances when-to-run="10" sequence="1" enabled="true"><script><script-text>import java.util.*;
import java.text.*;
// Set number of days to go back
addDays = 30;
Calendar rightNow = Calendar.getInstance();
rightNow.add(Calendar.DATE, addDays*-1);
Date oldestDesired = rightNow.getTime();
// Output the new date.
session.log("+++Seeking reviews newer than " + oldestDesired);
session.setVariable("OLDEST_DESIRED", oldestDesired);
// Manually setting a list of users to check
String[] peopleToCheck = {
"http://www.yelp.com/user_details?userid=-h8OOTM2JQBvjnH8mf8i5w",
"http://surlyjason.yelp.com/",
"http://www.yelp.com/user_details?userid=k3Oopx0QniRDHGlLA4W2XQ",
"http://www.yelp.com/user_details?userid=tind8sTPbu_i2jLit5Ro4A"
};
// Request each person
for (i=0; i<peopleToCheck.length; i++)
{
session.log("Checking person #" + i);
url = peopleToCheck[i];
session.setv("URL", url);
session.scrapeFile("Reviews");
}</script-text><name>Yelp--init</name><language>Interpreted Java</language></script></script-instances><owner-type>ScrapingSession</owner-type><owner-name>Yelp</owner-name></script-instances><name>Yelp</name><notes></notes><cookiePolicy>0</cookiePolicy><maxHTTPRequests>1</maxHTTPRequests><external_proxy_username></external_proxy_username><external_proxy_password></external_proxy_password><external_proxy_host></external_proxy_host><external_proxy_port></external_proxy_port><external_nt_proxy_username></external_nt_proxy_username><external_nt_proxy_password></external_nt_proxy_password><external_nt_proxy_domain></external_nt_proxy_domain><external_nt_proxy_host></external_nt_proxy_host><anonymize>false</anonymize><terminate_proxies_on_completion>false</terminate_proxies_on_completion><number_of_required_proxies>5</number_of_required_proxies><originator_edition>2</originator_edition><logging_level>1</logging_level><date_exported>June 08, 2011 10:03:24</date_exported><character_set>ISO-8859-1</character_set><scrapeable-files sequence="-1" will-be-invoked-manually="true" tidy-html="dont"><last-scraped-data></last-scraped-data><URL>~#URL#~</URL><BASICAuthenticationUsername></BASICAuthenticationUsername><last-request></last-request><name>Reviews</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><div class="review clearfix">
~@DATARECORD@~
>Link to this Review<</pattern-text><identifier>Review</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><identifier>DATARECORD</identifier></extractor-pattern-tokens><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><h4>
~@ws@~<a href="~@LINK@~">~@BUSINESS@~<</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="true" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="3"><regular-expression></regular-expression><identifier>BUSINESS</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[\n\t\s]*</regular-expression><identifier>ws</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><regular-expression>[^"]*</regular-expression><identifier>LINK</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><extractor-patterns sequence="2" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>class="smaller">~@REVIEW_DATE@~<</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>\d{1,2}[-/. ]+\d{1,2}[-/. 
]+\d{2,4}</regular-expression><identifier>REVIEW_DATE</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><script-instances><script-instances when-to-run="60" sequence="1" enabled="true"><script><script-text>import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.text.ParseException;
import java.util.Date;
// Set oldest desired date
oldestDesired = session.getv("OLDEST_DESIRED");
// Parse the newest review date
newestDate = dataSet.get(0, "REVIEW_DATE");
DateFormat df = new SimpleDateFormat("M/d/yyyy");
reviewDate = df.parse(newestDate);
// Formatting line
line = "=";
while (line.length()<90)
line += "=";
// Compare the dates
if (reviewDate.after(oldestDesired) || reviewDate.equals(oldestDesired))
{
// Within 30 days
session.log(line);
session.log("Want this guy's reviews");
session.log(line);
}
else
{
// Too old
session.log(line);
session.log("This guy is inactive");
session.log(line);
}</script-text><name>Yelp--check date</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Review</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Reviews</owner-name></script-instances></scrapeable-files></scraping-session>
Thank you Jason. I think I
Thank you Jason. I think I managed to understand your suggested approach and I inserted your code in mine.
I seemed to get it to work, except that the review dates I grab from the website I scrape (mine, not yours) have the year with 2 digits :-(
So I get "06/07/11" parsed as "Sun Jun 07 00:00:00 CET 11" (Yikes!), therefore the comparison always goes to the "else" part and all users are evaluated as inactive :-(
(That's why in my original code, in my JavaScript function, I was adding "20" to latestVoteDateString, even though this solution sucks)
How would I solve this in your approach?
Many thanks again,
boga
In the script named
In the script named "Yelp--check date" on line 11, you can see where it's indicating the date format it's dealing with. You'd want to change the date format to be:
See that this is a 2 digit year instead of 4.
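For anyone reading along, here is a minimal sketch of the difference between a 4-letter and a 2-letter year in java.text.SimpleDateFormat. The exact pattern from the reply above is not preserved in this thread, so "M/d/yy" below is an assumption derived from the original "M/d/yyyy" and the sample date "06/07/11". The class name TwoDigitYearExample and the helper parsedYear are made up for illustration and are not part of screen-scraper.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class TwoDigitYearExample {
    // Hypothetical helper: parses text with the given pattern and
    // returns the calendar year of the resulting Date.
    static int parsedYear(String pattern, String text) {
        SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.US);
        try {
            Date parsed = df.parse(text);
            Calendar cal = new GregorianCalendar();
            cal.setTime(parsed);
            return cal.get(Calendar.YEAR);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        // "yyyy" reads the year literally, so "11" becomes the year 11.
        System.out.println(parsedYear("M/d/yyyy", "06/07/11"));
        // "yy" maps a two-digit year into the century around the current date.
        System.out.println(parsedYear("M/d/yy", "06/07/11"));
    }
}
```

With "yy", SimpleDateFormat places a two-digit year within 80 years before and 20 years after the moment the formatter is created, which is why no manual "20" prefix should be needed.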
right on! End of story...
right on! End of story... everything works as it is supposed to now, plus I learned something ;-)
thanks.
actually I have bad news. I
actually I have bad news. I checked the log of ScreenScraper and I have:
The error message was: IndexOutOfBoundsException (line 13): Index: 0, Size: 0-- Method Invocation dataSet.get
An error occurred while processing the script: Write to database
The error message was: IndexOutOfBoundsException (line 13): Index: 0, Size: 0-- Method Invocation dataSet.get
line 13 is:
:-(
Either the script "write to
Either the script "write to database" is being invoked at the wrong time (I think it should be "after the pattern is applied") or your dataSet is empty.
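If the dataSet turns out to be empty for some members, one way to avoid the exception is to check the record count before touching index 0, and to treat a missing date as "nothing to write". Below is a rough, self-contained sketch of that guard. It uses a plain List of Maps as a stand-in for screen-scraper's dataSet, and the class and method names are invented for the example. In the actual script the equivalent check would presumably be dataSet.getNumDataRecords() > 0 before calling dataSet.getDataRecord( 0 ).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmptyDataSetGuard {
    // Stand-in for reading the newest vote date from the first record.
    // Returns null when there are no records or the record has no DATE,
    // instead of throwing IndexOutOfBoundsException.
    static String newestDateOrNull(List<Map<String, String>> records) {
        if (records == null || records.isEmpty()) {
            return null;
        }
        return records.get(0).get("DATE");
    }

    public static void main(String[] args) {
        // A member whose profile table matched nothing.
        List<Map<String, String>> noVotes = new ArrayList<>();

        // A member with one vote.
        List<Map<String, String>> oneVote = new ArrayList<>();
        Map<String, String> record = new HashMap<>();
        record.put("DATE", "06/07/11");
        oneVote.add(record);

        System.out.println(newestDateOrNull(noVotes));
        System.out.println(newestDateOrNull(oneVote));
    }
}
```

The caller can then skip the date comparison and the database loop whenever the helper returns null, which also covers the case mentioned earlier in the thread where the date is sometimes null.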