Pull entire page contents?
Hi all-
I am stuck on this problem. The individual pages I am trying to scrape (after various parsing of link lists, etc.) are actually plain text;
that is to say, there are no HTML tags, headers, or footers.
I need to save the entire contents of the page (as one data field is fine),
but I can't figure out an extractor pattern to simply pull the entire contents of the page...
thoughts?
Pull entire page contents?
schnitz411,
Well, tie me to an anthill and smother my ears with honey. That's two strikes for me. Yeah, sure, I'm updating the documentation as we go along here but that's no help to you.
Um, I may be out of rabbits here. Rather than walk you any further down this path let me give us both a possible reprieve. If what you're wanting to do is download complete documents as links from a web page may I ask that you take a look at [url=http://www.httrack.com/]HTTrack[/url]?
I've never used it but it's free and it looks like it may work for what you need.
Firefox add-on alternatives:
[url=https://addons.mozilla.org/en-US/firefox/addon/1616]SpiderZilla[/url]
[url=https://addons.mozilla.org/en-US/firefox/addon/201]DownThemAll![/url]
[url=https://addons.mozilla.org/en-US/firefox/addon/427]Scrapbook[/url]
Jeesh. Sorry for this.
-Scott
Pull entire page contents?
Not a problem at all... one last small problem: it seems to think this function isn't available in the basic version.
ArticlePage: Processing scripts before a file is scraped.
Processing script: "SaveFirst"
ArticlePage: Returning becase fileToSaveToBeforeTidying was called in the basic edition.
Pull entire page contents?
schnitz411,
'scuse me while I wipe the last bit of yolk off my brow. I led you astray. My apologies. Rather than using a backslash after the drive letter you'll want to use either a single forward slash or two backslashes.
Seems crazy but it's so the script interpreter (Beanshell) knows which way is up.
Either...
scrapeableFile.saveFileBeforeTidying( "C:/" + FileName );
or...
scrapeableFile.saveFileBeforeTidying( "C:\\" + FileName );
Thanks for putting up with my indignance earlier.
-Scott
Pull entire page contents?
Oh, that was actually an attempt to clear the similar error I got using exactly your code.
ERROR FROM LOG after : "\"C:\\\" + FileName );".
Processing script: "ScriptArticlePage"
Scraping file: "ArticlePage"
ArticlePage: Processing scripts before a file is scraped.
Processing script: "SaveFirst"
An error occurred while processing the script: SaveFirst
The error message was: Token Parsing Error: Lexical error at line 5, column 58. Encountered:
CODE:
date=new java.util.Date();
FileName = "myFileName_" + (date.getYear()+1900) + (date.getMonth()+1) + date.getDate() + date.getHours() + date.getMinutes() + ".txt";
scrapeableFile.saveFileBeforeTidying( "C:\" + FileName );
Pull entire page contents?
schnitz411,
I recommend you not include the drive letter as part of the file name.
Careful how you're modifying the sample code I posted.
-Scott
Pull entire page contents?
Thanks for the continuing help... almost there, but I'm running into a weird error that doesn't make any sense. It's hitting a "lexical error" at the end of the FileName line.
CURRENT SCRIPT (in Interpreted Java)
date=new java.util.Date();
FileName = "C:\" + (date.getYear()+1900) + (date.getMonth()+1) + date.getDate() + date.getHours() + date.getMinutes() + ".txt";
scrapeableFile.saveFileBeforeTidying( FileName );
ERROR
Scraping file: "ArticlePage"
ArticlePage: Processing scripts before a file is scraped.
Processing script: "SaveFirst"
An error occurred while processing the script: SaveFirst
The error message was: Token Parsing Error: Lexical error at line 3, column 128. Encountered: "\n" (10), after : "\";".
Pull entire page contents?
schnitz411,
Yes, it would overwrite if you use the same file name. You have the ability to alter the file name. Here's one approach you may consider.
FileName = "myFileName_" + (date.getYear()+1900) + (date.getMonth()+1) + date.getDate() + date.getHours() + date.getMinutes() + ".txt";
scrapeableFile.saveFileBeforeTidying( "C:\" + FileName );
This would give you a file name like...
myFileName_2008331100.txt
Ensuring that the file names are unique.
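(One caveat, as a hedged aside not from the thread: concatenating unpadded date fields can collide, e.g. day 3 + hour 1 + minute 10 and day 3 + hour 11 + minute 0 both concatenate to "3110". A zero-padded pattern via SimpleDateFormat gives every field a fixed width, which avoids that:)

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampName {
    public static void main(String[] args) {
        // "yyyyMMddHHmm" zero-pads every field, so the result is always
        // exactly 12 digits and parses unambiguously, e.g. 200803311000.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm");
        String fileName = "myFileName_" + fmt.format(new Date()) + ".txt";
        System.out.println(fileName);
    }
}
```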
-Scott
Pull entire page contents?
One more question...
How do I actually use this?
If I use it on multiple files, will it overwrite itself?
// Causes the non-tidied HTML from the scrapeable file
// to be output to the file path.
scrapeableFile.saveFileBeforeTidying( "C:\non-tidied.html" );
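(An editor's aside worth flagging when copying that snippet: in Java/Beanshell string literals, `\n` is the newline escape, so `"C:\non-tidied.html"` contains no backslash at all. A quick demonstration:)

```java
public class EscapePitfall {
    public static void main(String[] args) {
        // "\n" is an escape sequence: this literal is really
        // "C:" + newline + "on-tidied.html", not a Windows path.
        String path = "C:\non-tidied.html";
        System.out.println(path.contains("\\")); // no backslash survives

        // A forward slash (or a doubled backslash) gives the intended path.
        String fixed = "C:/non-tidied.html";
        System.out.println(fixed);
    }
}
```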
Pull entire page contents?
schnitz411,
Whoop. My apologies. I meant to point you to [url=http://screen-scraper.com/support/docs/api_documentation.php#saveFileBeforeTidying]saveFileBeforeTidying[/url].
Try that,
-Scott
Pull entire page contents?
According to the documentation, getNonTidiedHTML() isn't in the free version.
Pull entire page contents?
schnitz411,
I'd recommend not using extractor patterns to do this. Instead, we have a handful of methods that could accomplish what you're after one way or another. The only one that's available in the free basic edition looks to be [url=http://screen-scraper.com/support/docs/api_documentation.php#scrapeableFile.getNonTidiedHTML]getNonTidiedHTML()[/url].
At first this may not look like what you'd want but it should accomplish the same thing for you as using [url=http://screen-scraper.com/support/docs/api_documentation.php#downloadFile]downloadFile()[/url].
Take a look and let us know how it goes.
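(For readers of the archive, a rough sketch of how the pieces could fit together in a script. The scrapeableFile object only exists inside screen-scraper, so a placeholder string stands in for getNonTidiedHTML() here; everything else is plain Java I/O:)

```java
import java.io.FileWriter;
import java.io.IOException;

public class SaveRawPage {
    public static void main(String[] args) throws IOException {
        // Placeholder for scrapeableFile.getNonTidiedHTML(), which would
        // return the raw, untouched contents of the scraped page.
        String nonTidiedHtml = "plain text page contents";

        // Write the raw contents straight to disk as one file.
        FileWriter out = new FileWriter("page-contents.txt");
        out.write(nonTidiedHtml);
        out.close();
    }
}
```
This is only a sketch under the assumption above, not the library's own recipe; inside screen-scraper the saveFileBeforeTidying() approach discussed earlier in the thread does the writing for you.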
-Scott