Pull entire page contents?
Hi all-
I am stuck on this problem. The individual pages I am trying to scrape (after various parsing of link lists, etc.) are actually plain text;
that is to say, there are no HTML tags, headers, or footers.
I need to save the entire contents of the page (as one data field is fine),
but I can't figure out an extractor pattern to simply pull the entire contents of the page...
thoughts?
Pull entire page contents?
schnitz411,
Well, tie me to an anthill and smother my ears with honey. That's two strikes for me. Yeah, sure, I'm updating the documentation as we go along here but that's no help to you.
Um, I may be out of rabbits here. Rather than walk you any further down this path let me give us both a possible reprieve. If what you're wanting to do is download complete documents as links from a web page may I ask that you take a look at [url=http://www.httrack.com/]HTTrack[/url]?
I've never used it but it's free and it looks like it may work for what you need.
Firefox add-on alternatives:
[url=https://addons.mozilla.org/en-US/firefox/addon/1616]SpiderZilla[/url]
[url=https://addons.mozilla.org/en-US/firefox/addon/201]DownThemAll![/url]
[url=https://addons.mozilla.org/en-US/firefox/addon/427]Scrapbook[/url]
Jeesh. Sorry for this.
-Scott
Pull entire page contents?
Not a problem at all... one last small problem: it seems to think this function isn't available in the basic version.
ArticlePage: Processing scripts before a file is scraped.
Processing script: "SaveFirst"
ArticlePage: Returning becase fileToSaveToBeforeTidying was called in the basic edition.
Pull entire page contents?
schnitz411,
'scuse me while I wipe the last bit of yolk off my brow. I led you astray. My apologies. Rather than using a backslash after the drive letter you'll want to use either a single forward slash or two backslashes.
Seems crazy but it's so the script interpreter (Beanshell) knows which way is up.
Either...
scrapeableFile.saveFileBeforeTidying( "C:/" + FileName );
or...
scrapeableFile.saveFileBeforeTidying( "C:\\" + FileName );
Thanks for putting up with my indignance earlier.
-Scott
Pull entire page contents?
Oh, that was actually an attempt to clear the similar error I got using exactly your code.
ERROR FROM LOG after : "\"C:\\\" + FileName );".
Processing script: "ScriptArticlePage"
Scraping file: "ArticlePage"
ArticlePage: Processing scripts before a file is scraped.
Processing script: "SaveFirst"
An error occurred while processing the script: SaveFirst
The error message was: Token Parsing Error: Lexical error at line 5, column 58. Encountered:
CODE:
date=new java.util.Date();
FileName = "myFileName_" + (date.getYear()+1900) + (date.getMonth()+1) + date.getDate() + date.getHours() + date.getMinutes() + ".txt";
scrapeableFile.saveFileBeforeTidying( "C:\" + FileName );
Pull entire page contents?
schnitz411,
I recommend you not include the drive letter as part of the file name.
Careful how you're modifying the sample code I posted.
-Scott
Pull entire page contents?
Thanks for the continuing help... almost there, but I'm running into a weird error that doesn't make any sense. It's hitting a "lexical error" at the end of the FileName line.
CURRENT SCRIPT (in Interpreted Java)
date=new java.util.Date();
FileName = "C:\" + (date.getYear()+1900) + (date.getMonth()+1) + date.getDate() + date.getHours() + date.getMinutes() + ".txt";
scrapeableFile.saveFileBeforeTidying( FileName );
ERROR
Scraping file: "ArticlePage"
ArticlePage: Processing scripts before a file is scraped.
Processing script: "SaveFirst"
An error occurred while processing the script: SaveFirst
The error message was: Token Parsing Error: Lexical error at line 3, column 128. Encountered: "\n" (10), after : "\";".
Pull entire page contents?
schnitz411,
Yes, it would overwrite if you use the same file name. You have the ability to alter the file name. Here's one approach you may consider.
FileName = "myFileName_" + (date.getYear()+1900) + (date.getMonth()+1) + date.getDate() + date.getHours() + date.getMinutes() + ".txt";
scrapeableFile.saveFileBeforeTidying( "C:\" + FileName );
This would give you a file name like...
myFileName_2008331100.txt
Ensuring that the file names are unique.
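(One caveat, as a hedged aside not from the thread: concatenating unpadded date fields can collide, e.g. day 3 + hour 1 + minute 10 and day 3 + hour 11 + minute 0 both concatenate to "3110". A zero-padded pattern via SimpleDateFormat gives every field a fixed width, which avoids that:)

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampName {
    public static void main(String[] args) {
        // "yyyyMMddHHmm" zero-pads every field, so the result is always
        // exactly 12 digits and parses unambiguously, e.g. 200803311000.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm");
        String fileName = "myFileName_" + fmt.format(new Date()) + ".txt";
        System.out.println(fileName);
    }
}
```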
-Scott
Pull entire page contents?
One more question...
How do I actually use this?
If I use it on multiple files, will it overwrite itself?
// Causes the non-tidied HTML from the scrapeable file
// to be output to the file path.
scrapeableFile.saveFileBeforeTidying( "C:\non-tidied.html" );
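(An editor's aside worth flagging when copying that snippet: in Java/Beanshell string literals, `\n` is the newline escape, so `"C:\non-tidied.html"` contains no backslash at all. A quick demonstration:)

```java
public class EscapePitfall {
    public static void main(String[] args) {
        // "\n" is an escape sequence: this literal is really
        // "C:" + newline + "on-tidied.html", not a Windows path.
        String path = "C:\non-tidied.html";
        System.out.println(path.contains("\\")); // no backslash survives

        // A forward slash (or a doubled backslash) gives the intended path.
        String fixed = "C:/non-tidied.html";
        System.out.println(fixed);
    }
}
```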
Pull entire page contents?
schnitz411,
Whoop. My apologies. I meant to point you to [url=http://screen-scraper.com/support/docs/api_documentation.php#saveFileBeforeTidying]saveFileBeforeTidying[/url].
Try that,
-Scott
Pull entire page contents?
According to the documentation, getNonTidiedHTML() isn't in the free version.
Pull entire page contents?
schnitz411,
I'd recommend not using extractor patterns to do this. Instead, we have a handful of methods that could accomplish what you're after one way or another. The only one that's available in the free basic edition looks to be [url=http://screen-scraper.com/support/docs/api_documentation.php#scrapeableFile.getNonTidiedHTML]getNonTidiedHTML()[/url].
At first this may not look like what you'd want but it should accomplish the same thing for you as using [url=http://screen-scraper.com/support/docs/api_documentation.php#downloadFile]downloadFile()[/url].
Take a look and let us know how it goes.
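(For readers of the archive, a rough sketch of how the pieces could fit together in a script. The scrapeableFile object only exists inside screen-scraper, so a placeholder string stands in for getNonTidiedHTML() here; everything else is plain Java I/O:)

```java
import java.io.FileWriter;
import java.io.IOException;

public class SaveRawPage {
    public static void main(String[] args) throws IOException {
        // Placeholder for scrapeableFile.getNonTidiedHTML(), which would
        // return the raw, untouched contents of the scraped page.
        String nonTidiedHtml = "plain text page contents";

        // Write the raw contents straight to disk as one file.
        FileWriter out = new FileWriter("page-contents.txt");
        out.write(nonTidiedHtml);
        out.close();
    }
}
```
This is only a sketch under the assumption above, not the library's own recipe; inside screen-scraper the saveFileBeforeTidying() approach discussed earlier in the thread does the writing for you.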
-Scott