Scrape Only Recent Information
This script checks how recent a post or advertisement is. If you are gathering time-sensitive information and only want to reach back a few days, it will come in handy. After the date is evaluated, there is a section for calling other scripts from inside this one.
import java.util.Date;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.lang.*;
import java.util.*;
import java.io.*;
// Function to parse the passed string into a Date. Returns null if there is nothing to parse.
makeDate(date)
{
    // This is the format for your date. It is in the April 20, 1999 format
    formatter = new SimpleDateFormat("MMM d, yyyy");
    // Some other options instead of BLANK could be N/A, an empty string, etc. It depends on how the site is structured.
    if (date == null || date.equals("BLANK")){
        session.log(" ---NO ATTEMPT TO PARSE BLANK DATE");
        return null;
    }
    // If it is not blank, parse the text using the format above and print the result to the log.
    parsed = formatter.parse(date);
    session.log(" +++Parsed date " + parsed);
    return parsed;
}
// Function to get the oldest date you are willing to accept
oldestDate(){
    // Set the number of days to subtract from the current date.
    minusDays = -5;
    // Get a Calendar set to the current date and time, then add a negative number of days.
    // Adding a negative amount is how Calendar moves a date into the past.
    Calendar rightNow = Calendar.getInstance();
    rightNow.add( Calendar.DATE, minusDays );
    // Reset the time of day to midnight so a post dated exactly 5 days ago still counts.
    rightNow.set( Calendar.HOUR_OF_DAY, 0 );
    rightNow.set( Calendar.MINUTE, 0 );
    rightNow.set( Calendar.SECOND, 0 );
    rightNow.set( Calendar.MILLISECOND, 0 );
    // Store the result in a Date named endDate, because that makes more sense to
    // return than a variable named rightNow that is 5 days in the past.
    Date endDate = rightNow.getTime();
    session.log("The end date is: " + endDate);
    return endDate;
}
// Parse the posted date. Here the posted date comes from a dataRecord.
// If you were getting it from a session variable it would be session.getVariable("POSTED_DATE")
posted = makeDate(dataRecord.get("POSTED_DATE"));
// Work out the oldest acceptable date as a Date you can compare to the advertisement or post date.
desired = oldestDate();
// Compare the two. A blank date never counts as fresh.
if (posted instanceof Date && (posted.after(desired) || posted.equals(desired)))
{
    session.log ("AD IS FRESH. SCRAPING DETAILS.");
    // If you are keeping track of URLs this will get it from the scrapeable file.
    session.setVariable ("SOURCE_URL", scrapeableFile.getCurrentURL() );
    // This is the place in the code where you would execute additional scripts.
    session.executeScript("Your script name here");
    session.executeScript("Your second script name here");
}
else{
    session.log("Posted date is missing or too old");
}
Hopefully it is evident that the code above is useful for comparing today's date against an earlier one. Depending on your needs, you might consider developing a script that moves your scraping session on once it reaches a certain date in a listing. For example, if you were scraping an auction website for many search terms, you might want to move on to the next term after reaching a specified date in the listings. What are some other ways this script could be useful?
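As one possible take on the move-on idea, here is a minimal sketch in plain Java rather than a screen-scraper script. It assumes the listing is sorted newest first, so the first post older than the cutoff means everything after it can be skipped. The class name RecentListings and the methods parseOrNull, cutoff, and takeWhileFresh are hypothetical names invented for this illustration; they are not part of screen-scraper.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Locale;

class RecentListings {

    // Same pattern as the script above, e.g. "Apr 20, 1999" or "April 20, 1999".
    private static final String DATE_PATTERN = "MMM d, yyyy";

    // Returns the parsed date, or null when the text is missing, "BLANK", or unparseable.
    public static Date parseOrNull(String text) {
        if (text == null || text.equals("BLANK")) {
            return null;
        }
        try {
            // SimpleDateFormat is not thread-safe, so a new one is created per call.
            return new SimpleDateFormat(DATE_PATTERN, Locale.ENGLISH).parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    // Returns midnight of the day that lies maxAgeDays before "now".
    public static Date cutoff(Date now, int maxAgeDays) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(now);
        cal.add(Calendar.DATE, -maxAgeDays);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTime();
    }

    // Keeps posted dates until the first one older than the cutoff, then stops.
    // Entries without a usable date are skipped rather than treated as old.
    public static List<String> takeWhileFresh(List<String> postedDates, Date cutoff) {
        List<String> fresh = new ArrayList<>();
        for (String text : postedDates) {
            Date posted = parseOrNull(text);
            if (posted == null) {
                continue; // no date to judge by, so look at the next entry
            }
            if (posted.before(cutoff)) {
                break; // sorted newest first: the rest is older still, so move on
            }
            fresh.add(text);
        }
        return fresh;
    }

    public static void main(String[] args) {
        Date oldest = cutoff(new Date(), 5);
        // Hypothetical listing values; in a real session these would come from the scraped page.
        List<String> listing = Arrays.asList("Apr 19, 1999", "BLANK", "Apr 10, 1999");
        System.out.println("Fresh entries: " + takeWhileFresh(listing, oldest));
    }
}
```

In a scraping session, the point where this sketch breaks out of the loop is where you would stop requesting further result pages for the current term and start on the next one. If a site does not sort its listings by date, the early exit is not safe and each post has to be checked individually, as the script above does.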