Scrape only new content
Hi, is there a way to scrape only new content?
I need to scrape some urls daily, but each url has a table with a lot of rows, so it takes a while. But only a few rows are added daily, so if I could only scrape the new content it would be very quick.
Is it possible? How?
There are ways to do it, but the method varies depending on what data is available.
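If the data you want has a posted date/time, you can compare it with the date of your last scrape:
// postDate and lastScrapedDate are placeholder names
if (postDate.after(lastScrapedDate))
{
// Write output
}
else
{
// Skip
}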
If not, you'll need a database to track what items you already have, and check it for each post to add only new ones.
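Both checks can be sketched in plain Java. This is only an illustration under assumptions: the class name NewRowFilter, its method names, and the "game-N" row keys are made up for the example, and an in-memory Set stands in for the database lookup.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewRowFilter {
    /** Dated rows: keep only dates strictly after the last scrape. */
    public static List<LocalDate> newerThan(List<LocalDate> rowDates, LocalDate lastScraped) {
        List<LocalDate> fresh = new ArrayList<>();
        for (LocalDate d : rowDates) {
            if (d.isAfter(lastScraped)) {
                fresh.add(d); // write output
            }
            // otherwise skip: an earlier run already stored this row
        }
        return fresh;
    }

    /** Undated rows: keep only keys not seen before, and remember them. */
    public static List<String> unseen(List<String> rowKeys, Set<String> seen) {
        List<String> fresh = new ArrayList<>();
        for (String key : rowKeys) {
            if (seen.add(key)) { // add() returns false if the key was already present
                fresh.add(key);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        LocalDate lastScraped = LocalDate.of(2012, 1, 17);
        List<LocalDate> dates = Arrays.asList(
                LocalDate.of(2012, 1, 16), LocalDate.of(2012, 1, 18), LocalDate.of(2012, 1, 20));
        System.out.println(newerThan(dates, lastScraped)); // [2012-01-18, 2012-01-20]

        // The set plays the role of the database table of already-stored items.
        Set<String> seen = new HashSet<>(Arrays.asList("game-1", "game-2"));
        System.out.println(unseen(Arrays.asList("game-2", "game-3"), seen)); // [game-3]
    }
}
```

In a real scrape the set would be replaced by a query against your table (or a unique key plus INSERT IGNORE), but the decision per row is the same.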
First of all, thanks for the support.
I got everything working except the skip. How do I tell the program to stop scraping the current page and go to the next one?
I run the script after each pattern match.
It is probably something simple, but I can't figure it out :(
Thank you
I'd need to see your script to tell you. Web pages vary so much that no single approach works for all of them.
I scrape several similar pages, each with a bunch of rows. The way it is now it compares date with last scraped date and only saves the new content, but I want it to "abort" scraping the current page and go to the next.
I have a txt with a bunch of URLs:
// The text file that lists the URLs to scrape, one per line.
File inputFile = new File( "jogadores.txt" );
// These two objects are needed to read the file.
FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );
// Read the file in line-by-line. Each line in the text file
// will contain a search term.
while( ( url_jogador = buffRead.readLine() )!=null)
{
// Set a session variable corresponding to the search term.
session.setVariable( "url_jogador", url_jogador );
// Get search results for this particular search term.
session.scrapeFile( "Stats" );
}
// Close up the objects to indicate we're done reading the file.
in.close();
buffRead.close();
The extractor pattern matches a single row:
<tr class="~@odd@~">
<td class="dt">~@data@~</td>
<td class="op"><span class="atvs">~@local@~</span><a href="/~@oponente@~/" target="_parent">~@sigla_oponente@~</a><sup></sup></td>
<td class="rs"><a target="_parent" href="~@box_score@~">~@resultado@~</a></td>
<td>~@minutos@~</td>
<td>~@fg@~</td>
<td>~@3p@~</td>
<td>~@ft@~</td>
<td>~@off@~</td>
<td>~@def@~</td>
<td>~@rebotes@~</td>
<td>~@assists@~</td>
<td>~@steals@~</td>
<td>~@blocks@~</td>
<td>~@turnovers@~</td>
<td>~@pf@~</td>
<td>~@pontos@~</td>
</tr>
Then, after each pattern match, I run this script to save in a mysql db:
Class.forName("com.mysql.jdbc.Driver").newInstance();
Connection conn;
conn = DriverManager.getConnection("jdbc:mysql://" + session.getVariable("MYSQL_SERVER_URL") + ":"+session.getVariable("MYSQL_SERVER_PORT") + "/" + session.getVariable("MYSQL_DATABASE"), session.getVariable("MYSQL_SERVER_USER"), session.getVariable("MYSQL_SERVER_PASSWORD"));
nome = session.getVariable( "nome" );
data = dataRecord.get( "data" );
local = dataRecord.get( "local" );
oponente = dataRecord.get( "oponente" );
result = dataRecord.get( "resultado" );
result = result.substring(0,1);
minutos = dataRecord.get( "minutos" );
rebotes = dataRecord.get( "rebotes" );
assists = dataRecord.get( "assists" );
steals = dataRecord.get( "steals" );
blocks = dataRecord.get( "blocks" );
turnovers = dataRecord.get( "turnovers" );
pontos = dataRecord.get( "pontos" );
reb = Double.valueOf(rebotes);
reb = reb.doubleValue();
ast = Double.valueOf(assists);
ast = ast.doubleValue();
stl = Double.valueOf(steals);
stl = stl.doubleValue();
blk = Double.valueOf(blocks);
blk = blk.doubleValue();
to = Double.valueOf(turnovers);
to = to.doubleValue();
pts = Double.valueOf(pontos);
pts = pts.doubleValue();
double fp = ((reb+ast)*0.8)+((stl+blk)*1.5)-(to*0.5)+(pts*0.6);
fp = Math.round(fp);
int fps = (int)fp;
fpoints = String.valueOf(fps);
SimpleDateFormat sdf = new SimpleDateFormat("MMM d");
Calendar c = Calendar.getInstance();
c.setTime(sdf.parse(data));
java.sql.Date data2 = new java.sql.Date(sdf.parse(data).getTime());
java.sql.Date outubro = java.sql.Date.valueOf("1970-10-01");
if (data2.after(outubro)){
c.add(Calendar.YEAR, 41);
}
else {
c.add(Calendar.YEAR, 42);
}
data = new java.sql.Date(c.getTime().getTime());
java.sql.Date lastScrapedDate = java.sql.Date.valueOf("2012-01-17");
String nome_certo = nome.replaceAll("'","\\\\\'");
if (data.after(lastScrapedDate))
{
//Create statements and run queries
// on your database.
Statement stmt = null;
stmt = conn.createStatement();
mysqlstring="INSERT IGNORE INTO jogos VALUES('"+nome_certo+"','"+data+"','"+local+"','"+oponente+"','"+result+"','"+minutos+"','"+rebotes+"','"+assists+"','"+steals+"','"+blocks+"','"+turnovers+"','"+pontos+"','"+fpoints+"')";
stmt.executeUpdate(mysqlstring);
//Be sure to close up your
// statements and connection.
stmt.close();
}
else {
//skip
}
conn.close();
Since there is no navigation in this, I gather you want to halt the extractor pattern once the date check fails? I don't really have a way to stop mid-extractor, though there are some things you could do to stop the script from running on the remaining rows.
// Skip the rest of the script if an earlier row already set HALT.
if (Boolean.TRUE.equals(session.getVariable("HALT")))
break;
Class.forName("com.mysql.jdbc.Driver").newInstance();
Connection conn;
conn = DriverManager.getConnection("jdbc:mysql://" + session.getVariable("MYSQL_SERVER_URL") + ":"+session.getVariable("MYSQL_SERVER_PORT") + "/" + session.getVariable("MYSQL_DATABASE"), session.getVariable("MYSQL_SERVER_USER"), session.getVariable("MYSQL_SERVER_PASSWORD"));
nome = session.getVariable( "nome" );
data = dataRecord.get( "data" );
local = dataRecord.get( "local" );
oponente = dataRecord.get( "oponente" );
result = dataRecord.get( "resultado" );
result = result.substring(0,1);
minutos = dataRecord.get( "minutos" );
rebotes = dataRecord.get( "rebotes" );
assists = dataRecord.get( "assists" );
steals = dataRecord.get( "steals" );
blocks = dataRecord.get( "blocks" );
turnovers = dataRecord.get( "turnovers" );
pontos = dataRecord.get( "pontos" );
reb = Double.valueOf(rebotes);
reb = reb.doubleValue();
ast = Double.valueOf(assists);
ast = ast.doubleValue();
stl = Double.valueOf(steals);
stl = stl.doubleValue();
blk = Double.valueOf(blocks);
blk = blk.doubleValue();
to = Double.valueOf(turnovers);
to = to.doubleValue();
pts = Double.valueOf(pontos);
pts = pts.doubleValue();
double fp = ((reb+ast)*0.8)+((stl+blk)*1.5)-(to*0.5)+(pts*0.6);
fp = Math.round(fp);
int fps = (int)fp;
fpoints = String.valueOf(fps);
SimpleDateFormat sdf = new SimpleDateFormat("MMM d");
Calendar c = Calendar.getInstance();
c.setTime(sdf.parse(data));
java.sql.Date data2 = new java.sql.Date(sdf.parse(data).getTime());
java.sql.Date outubro = java.sql.Date.valueOf("1970-10-01");
if (data2.after(outubro))
{
c.add(Calendar.YEAR, 41);
}
else {
c.add(Calendar.YEAR, 42);
}
data = new java.sql.Date(c.getTime().getTime());
java.sql.Date lastScrapedDate = java.sql.Date.valueOf("2012-01-17");
String nome_certo = nome.replaceAll("'","\\\\\'");
if (data.after(lastScrapedDate))
{
//Create statements and run queries
// on your database.
Statement stmt = null;
stmt = conn.createStatement();
mysqlstring="INSERT IGNORE INTO jogos VALUES('"+nome_certo+"','"+data+"','"+local+"','"+oponente+"','"+result+"','"+minutos+"','"+rebotes+"','"+assists+"','"+steals+"','"+blocks+"','"+turnovers+"','"+pontos+"','"+fpoints+"')";
stmt.executeUpdate(mysqlstring);
//Be sure to close up your
// statements and connection.
stmt.close();
}
else
{
// Stop script from running again
session.setv("HALT", true);
}
conn.close();
You'd just need to set that "HALT" back to false when you want to resume, for example right before each session.scrapeFile( "Stats" ) call in your URL loop.
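To show how the flag behaves across pages, here is a small standalone Java simulation. It is a sketch, not screen-scraper code: a plain Map stands in for the session variables, processRow stands in for the script that runs after each pattern match, and the class and method names are invented for the example. It also assumes each page lists its rows newest first; if the rows were oldest first, the first old row would set HALT and the newer rows after it would be missed.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HaltFlagSketch {
    /** Stand-in for the per-row script; returns true if the row was saved. */
    public static boolean processRow(Map<String, Object> vars, LocalDate rowDate,
                                     LocalDate lastScraped, List<LocalDate> saved) {
        if (Boolean.TRUE.equals(vars.get("HALT"))) {
            return false; // an earlier row on this page already hit old data
        }
        if (rowDate.isAfter(lastScraped)) {
            saved.add(rowDate); // the INSERT would go here
            return true;
        }
        vars.put("HALT", true); // later rows on this page do no work
        return false;
    }

    /** Stand-in for the URL loop: clears HALT before each page, then feeds its rows. */
    public static List<LocalDate> scrapeAll(List<List<LocalDate>> pages, LocalDate lastScraped) {
        Map<String, Object> vars = new HashMap<>();
        List<LocalDate> saved = new ArrayList<>();
        for (List<LocalDate> page : pages) {
            vars.put("HALT", false); // resume for the next page
            for (LocalDate rowDate : page) {
                processRow(vars, rowDate, lastScraped, saved);
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        LocalDate lastScraped = LocalDate.of(2012, 1, 17);
        List<List<LocalDate>> pages = Arrays.asList(
                Arrays.asList(LocalDate.of(2012, 1, 19), LocalDate.of(2012, 1, 16), LocalDate.of(2012, 1, 14)),
                Arrays.asList(LocalDate.of(2012, 1, 18), LocalDate.of(2012, 1, 15)));
        System.out.println(scrapeAll(pages, lastScraped)); // [2012-01-19, 2012-01-18]
    }
}
```

Note that the extractor still matches every row; the flag only keeps the script from opening a connection and doing the date work for rows that are already known to be old.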
Thank you
That was my fear, that the whole extractor ran before testing for matches :(
But it is ok, the program does everything I need, I'm very happy.
I just need some help with auto-running it via Windows Task Scheduler. I made a bat with this:
@echo off
jre\bin\java -Xmx1024M -jar screen-scraper.jar -s "FFBL" > "log\ffbl.log"
It doesn't work.
I tried this:
C:\"Program Files (x86)"\"screen-scraper basic edition"\jre\bin\java -Xmx1024M -jar screen-scraper.jar -s "FFBL" > "log\ffbl.log"
but it says "Unable to access jarfile screen-scraper.jar".
I'm running Windows 7.
Just run the command from the directory "\program files (x86)\screen-scraper basic edition", so that screen-scraper.jar is in the current directory. Then your command should look like:
jre\bin\java -Xmx1024M -jar screen-scraper.jar -s "FFBL" > "log\ffbl.log"
The bat file runs fine when I run it, but it doesn't work if I schedule it with Windows Task Scheduler.
Is there any other way to schedule it or make it work with WTS?
thank you
Add a line to your batch file to change to the screen-scraper directory:
cd /d "C:\Program Files (x86)\screen-scraper basic edition"
jre\bin\java -Xmx1024M -jar screen-scraper.jar -s "FFBL" > "log\ffbl.log"
If that doesn't work, please get me the precise error you see. It may show up in the Event Viewer.