lots-o-files to parse

Hey all,

I've modified the 'hello world' tutorial to parse a couple things from some html files I have.

I have roughly 250 HTML files, already downloaded, that I need to parse. The file names don't follow a regular pattern, so I can't easily loop through them by name.

Currently, I have a C# program that reads all the HTML file names and uses them to write a script file in VBScript. I then copy the script file into screen-scraper and attempt to run it. Below is a sample of the output from my C# program.

In the scraping session, the entire URL is set to ~#URL#~.

...
Set runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Hello World" )
Call RunnableScrapingSession.SetVariable( "URL" , "C:\test\html\school004c.html")
Call RunnableScrapingSession.Scrape

Set runnableScrapingSession = CreateBean( "com.screenscraper.scraper.RunnableScrapingSession", "Hello World" )
Call RunnableScrapingSession.SetVariable( "URL" , "C:\test\html\school0055.html")
Call RunnableScrapingSession.Scrape
....

It works just fine for a few files, but when I paste in all ~250 of them, I only get about a dozen outputs in my file before screen-scraper crashes and closes. With 256MB of RAM allowed, I got the 'beep' and a crash. I upped that to 512MB, and now it doesn't do that anymore.

I know there must be a better way to do this. I've thought about putting a delay between sessions, but that doesn't appear to be easy to do.
I've also looked for the 'limit sessions to 1' option in the settings box, but that seems to have been in an earlier version? (I don't see that option.)

Any other ideas would be greatly appreciated,

thanks in advance,
-Dave

lots-o-files to parse

That's great, Dave. Thanks for posting your solution.

Best wishes,

Todd

lots-o-files to parse

Hi Todd,

I got it to work, but in a different way than I expected.

I tried the same method as above, but this time I used Java rather than VBScript. The same thing happened: after a dozen or so files, screen-scraper would just close.

Then I saw that one of the tutorials covers starting a scrape from the command line. I modified that code and got it working for a couple of files.

Next, I modified my C# program to create a batch file containing one command per file, so that all ~250 HTML files are scraped sequentially. I originally had a small delay (sleep 1) between each invocation of screen-scraper, but it appears to work without it as well.

I'm sure this isn't the fastest method, but it works for what I need, so I'm happy :D

Here is what I ended up doing, for anyone else in the same boat.

Script file in screen-scraper, in Interpreted Java:
// Generate a new "Hello World" scraping session.
runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "Hello World" );

// Copy the "URL" parameter passed to this script into a session variable so the scraping session can reference it.
runnableScrapingSession.setVariable( "URL", params.get( "URL" ) );

// Tell the scraping session to scrape.
runnableScrapingSession.scrape();

Snippet from the batch file:
@echo off
echo Hello
jre\bin\java -jar screen-scraper.jar --run-script "java batch" --params "URL=C:\test\html\school004c.html"
jre\bin\java -jar screen-scraper.jar --run-script "java batch" --params "URL=C:\test\html\school0055.html"
...
pause
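
For anyone who would rather not write the generator in C#, here is a rough sketch of the same idea in plain Java. It is not my actual program, just an illustration: the class name BatchFileGenerator, the helper method names, and the output file name run-all.bat are all made up for the example, and the quoting of --params (one pair of quotes around the whole URL=... value) is an assumption. It lists the *.html files in a folder, so the file names don't need to follow any pattern, and writes one screen-scraper command per file.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: generates a batch file with one scrape command per HTML file.
class BatchFileGenerator {

    // Builds the command line for a single HTML file.
    // Assumption: the whole "URL=..." pair is wrapped in one set of quotes.
    static String commandFor(String htmlPath) {
        return "jre\\bin\\java -jar screen-scraper.jar --run-script \"java batch\""
                + " --params \"URL=" + htmlPath + "\"";
    }

    // Wraps the per-file commands in the same header and footer as the batch snippet.
    static List<String> buildBatchLines(List<String> htmlPaths) {
        List<String> lines = new ArrayList<>();
        lines.add("@echo off");
        for (String path : htmlPaths) {
            lines.add(commandFor(path));
        }
        lines.add("pause");
        return lines;
    }

    // Collects the absolute paths of all *.html files directly inside dir, sorted by name.
    static List<String> listHtmlFiles(Path dir) throws IOException {
        List<String> paths = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.html")) {
            for (Path p : stream) {
                paths.add(p.toAbsolutePath().toString());
            }
        }
        Collections.sort(paths);
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Defaults are placeholders; pass the HTML folder and output file as arguments.
        Path htmlDir = Paths.get(args.length > 0 ? args[0] : "C:\\test\\html");
        Path batchFile = Paths.get(args.length > 1 ? args[1] : "run-all.bat");
        Files.write(batchFile, buildBatchLines(listHtmlFiles(htmlDir)));
    }
}
```

The generated file would need to be run from the screen-scraper install folder, since the commands use the relative path jre\bin\java.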

Thanks for pointing me in the right direction,
-Dave

lots-o-files to parse

Hi Dave,

This may be a result of your using VBScript as the scripting language. Please see this FAQ for a few more details: here. If you're planning on running a large number of scraping sessions you should use Interpreted Java.

I assume you're using the basic edition of screen-scraper. If you're open to using the professional edition, an even better solution would be to run screen-scraper as a server, then execute your scraping sessions remotely from a .NET app.

Hopefully that helps. Please feel free to post back if we can offer any other suggestions.

Kind regards,

Todd Wilson