Scraping all URLs on a site

Hey All,

I have created a screen-scraper session that finds and logs all URLs found on a site. I will use this later with JMeter to do load testing.

My issue is that this process takes a very long time: 30 minutes for 750 URLs (the server does 10 req/sec, give or take).

This is the process:

Init -> scrape homepage -> call write-URLs script -> load next page -> scrape URLs -> call write-URLs script, etc.

The problem is that for each URL found it calls my "Write URLs" script, so the script gets called very often, especially for the pages that appear in every menu.
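For context, the write-URLs script itself is only a few lines of interpreted Java, roughly like this (a sketch; it assumes the extractor pattern saves each link under a token named URL, and that an init script stored an empty java.util.ArrayList in a session variable named FOUND_URLS):

import java.util.List;

// Runs "after each pattern match": record every link the pattern extracts.
String url = (String) dataRecord.get( "URL" );
List foundUrls = (List) session.getVariable( "FOUND_URLS" );
foundUrls.add( url );
session.log( "Found URL: " + url );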

My second approach was to use a RunnableScrapingSession.

But then I have two issues:

When called like this:

RunnableScrapingSession runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "scrape" );
I get 'session cannot be null'

And when called like this:
RunnableScrapingSession runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "scrape" );
I get an error saying it is only available in the Professional edition.

A RunnableScrapingSession would allow me to just collect the URLs, parse them later as a list, and start a scraping session only once for every unique page.
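For reference, the usage I was aiming for is the documented pattern (a sketch; "scrape" is the name of my scraping session, URL is a session variable that session expects, and the example URL is a placeholder; this only works in the Professional or Enterprise edition):

import com.screenscraper.scraper.RunnableScrapingSession;

// Spawn one scraping session per unique page (Professional/Enterprise only).
String nextUrl = "http://www.example.com/somepage"; // placeholder: the page to fetch
RunnableScrapingSession runnableSession = new RunnableScrapingSession( "scrape" );
runnableSession.setVariable( "URL", nextUrl );
runnableSession.scrape();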

Comparing the two methods:

1) The 'standard scraper' calls my script for each and every URL, and this takes a very long time.
2) The 'scripted version' starts a scraping session for each and every unique page, but this doesn't work because I am not using the Professional edition.

So my question is: what is the recommended approach to get ALL URLs from a website?

Ries

There's no particular best

There's no particular best method to do this. I would use a database to record the URLs I found, visit each new one within the scrape itself (rather than as runnable scraping sessions), and use the DB to make sure I wasn't visiting the same URL many times, to avoid an infinite loop.
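As a sketch of what I mean, inside a single scraping session (I'm using an in-memory Set and queue here rather than a real database, and I'm assuming a scrapeable file named "Page" whose extractor pattern appends every link it finds to the QUEUE session variable via a small after-each-pattern-match script):

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

// Breadth-first crawl within one scraping session.
Set visited = new HashSet();
LinkedList queue = new LinkedList();
queue.add( "http://www.example.com/" );
session.setVariable( "QUEUE", queue );

while ( !queue.isEmpty() )
{
    String url = (String) queue.removeFirst();
    if ( visited.contains( url ) )
        continue;                      // already seen: skip to avoid loops
    visited.add( url );
    session.setVariable( "URL", url ); // "Page" uses ~#URL#~ as its URL
    session.scrapeFile( "Page" );      // newly found links get added to QUEUE
}

session.log( "Done: " + visited.size() + " unique URLs found." );

The after-each-pattern-match script then only needs to append dataRecord.get( "URL" ) to the QUEUE session variable.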

Jason, thank you for your

Jason,

thank you for your response. Currently my 'DB' is just a Java List, and since there are only a couple of hundred URLs this easily fits into memory; no need for a database.

My issue is that the screen-scraper session always finds a lot of duplicate URLs that call up my script; this happens particularly often with menu links. And it seems that calling interpreted Java is very resource intensive.
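The only cheap win I have found so far is to bail out of the script as early as possible on duplicates, e.g. by using a java.util.HashSet instead of the List so the duplicate check is cheap (a sketch, same FOUND_URLS idea as before):

import java.util.Set;

// "After each pattern match" script: return immediately on a duplicate
// so the interpreter does as little work as possible.
String url = (String) dataRecord.get( "URL" );
Set foundUrls = (Set) session.getVariable( "FOUND_URLS" );
if ( !foundUrls.add( url ) ) // add() returns false if url was already present
    return;
session.log( "New URL: " + url );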

I would love to use runnable scraping sessions, but this doesn't work because I either get the 'need a professional version' error or, for no apparent reason, 'session is null'.

Ries

The RunnableScrapingSession

The RunnableScrapingSession is only available in the Professional or Enterprise edition, so the "need a professional version" error makes sense.

I'd need to see what you're running to get "session is null", but I imagine it's related.