Writing output straight to a text file

Hi Everyone.

I am scraping a website that's just a .txt file, so all I want to do is scrape it with no tidying and send whatever comes in straight to a text file. Is there a way for me to write a script and attach it to run after the scrape? If so, what JavaScript do I use? I am just starting to learn JavaScript. Thanks.

Scrape Student on 11/21/2012 at 1:29 pm

screen-scraper public support

Just to confirm my understanding...

OK, so if I understand you correctly, the website/page you are visiting is just displaying a text file, correct? (i.e., the contents of the text file are 'wrapped' within HTML to make it into a page)

If this is correct, would you want to:
a) download the text file directly from the site?
b) grab all the text and dump it into a text file?

I think that SS can do either a) or b), so could you indicate your preference to help us help you? OTOH, if my understanding of what you want to do is NOT correct, let me know where I'm wrong.

Thanks & regards,
Justin

Justin_S on 11/23/2012 at 11:29 am

I want to do a) download the

I want to do
a) download the text file directly from the site.

Thing is that I don't see any HTML wrapped around the text. If I click on source for the site, it's just the text without HTML. That's why I get all my text after the scrape if I turn tidying off. Now I just want to save all that to a text file on my drive. I am thinking of a small output program using JavaScript, but I don't know the correct syntax. Thanks so much for the help and let me know if you need more info.

Scrape Student on 11/24/2012 at 12:45 pm

Is there a .txt in the URL?

Hi again,

Out of curiosity, does the URL contain the name of the text file? (e.g., www.fubar.com/Hello_World.txt) If so, you could set up a script to simply download the background text file once the 'page' is loaded. I have done this for JPG images on a page, so if this is the case let me know and I'll share my code.

Otherwise, I think you could create a single extractor pattern that only contains a token without any regular expressions (although I guess you could set the regular expression to Non-HTML Tags just to be safe...). This should capture all the text on the screen and you would then 'write' the contents of this extractor pattern to a text file using the sample Java script in tutorial 1 (see http://community.screen-scraper.com/tutorials/tutorial_1/6_write_script).
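
For what it's worth, the 'write the contents to a text file' step could look roughly like the sketch below. This is plain Java rather than the tutorial's script, and ScrapedTextWriter, writeText and the output path are names made up for illustration. Inside screen-scraper the text would come from whatever token name you gave the extractor pattern, which is also an assumption here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of screen-scraper: writes a captured string to a text file.
class ScrapedTextWriter {

    // Writes the text as UTF-8, creating the file or overwriting an existing one.
    static void writeText(String text, Path outputFile) throws IOException {
        Files.write(outputFile, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the scraped text; in a screen-scraper script this would be
        // the value captured by the extractor pattern's token (assumption).
        String scrapedText = "first line of the page\nsecond line of the page\n";

        // Hypothetical output location; change it to wherever you want the file.
        Path outputFile = Paths.get("scraped_output.txt");

        writeText(scrapedText, outputFile);
        System.out.println("Wrote " + outputFile.toAbsolutePath());
    }
}
```

Writing the whole string in one call keeps the sketch short; for very large pages a buffered writer would probably be the better choice.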

HTH!
Justin

Justin_S on 11/26/2012 at 8:31 am

Hi Justin You are exactly

Hi Justin

You are exactly right in the first instance. The URL contains the name of the text file. (www.blahblah/something.txt) I set my scrape to not tidy and ran it. The results show me the exact text from the site. If you have a sample script that can save this to my hard drive that would rock! Thanks!

Scrape Student on 11/26/2012 at 3:26 pm

Here's some code I use to download JPGs

OK, so here's part of some code I adapted to download a JPG file from a URL. You can probably adapt some of this to download your text file without too much effort, and I've included the source of my code in the code comments.

Since you're only seeing part of the code, bear the following in mind:
- the code is used for downloading a JPG, which is why there are references to 'images' all over the place
- DestinationFile_STR is a string variable I use to hold the full path (folder/directory plus file name) where the file will be saved
- Image_Directory_Exists_BOOL is a boolean that I use to determine whether the file at DestinationFile_STR already exists or not
- Image_URL_STR is a string variable I use to hold the URL where the JPG file is located, which is similar to your situation in that my URL will be www.blahblah.com/1234567.jpg; if you have SS already at the webpage, you could use the handy scrapeableFile.getCurrentURL method to bring this URL into the script as a string
- you'll need to import some Java libraries for doing the file transfer (InputStream, OutputStream & File) and to resolve URLs (java.net.URL); these are part of Java's standard library, so you should already have them
- step 5a. sets up the parameters needed to access and download the file while step 5b. handles the actual process of downloading & writing out the file
- I've encased the code in a try/catch block but I didn't include the catch portion; you can include this if you want.

NOTE: The following code assumes you are using the Basic version of SS; if you're using the Pro/Enterprise versions, there's a dedicated method for downloading files - see http://community.screen-scraper.com/documentation/api/session/downloadFile

Hope this helps!
Justin

//Pre-flight check: import the java libraries needed
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.File;
import java.net.URL;
//following are addn libraries needed for file conversion
import java.awt.image.RenderedImage;
import javax.imageio.ImageIO;

//... my other code went here ...
//(the omitted code opens the 'if' that checks whether the mark is an image file;
// the 'else' at the bottom of this excerpt belongs to it)

//5. Retrieve the image from the stream and save it to the file
//NOTE: Following code adapted from http://www.avajava.com/tutorials/lessons/how-do-i-save-an-image-from-a-url-to-a-file.html

    try {
        //a. Set up what is needed to access and download the file
        //   First check to see if the destination file exists
        Image_Directory_Exists_BOOL = new File(DestinationFile_STR).exists();
        if (Image_Directory_Exists_BOOL == false)
        {
            //If the file doesn't exist, create an empty one
            OutputStream out = new FileOutputStream(DestinationFile_STR);
            out.close();
            session.logInfo("--- Image Access script: image download path created! ---"); //for testing
        }

        //   Then set up the URL and image streams needed
        URL url = new URL(Image_URL_STR);
        InputStream is = url.openStream();
        OutputStream os = new FileOutputStream(DestinationFile_STR);

        //b. Whether the file already existed or has been newly created,
        //   write the image data to the file, 2048 bytes at a time
        byte[] b = new byte[2048];
        int length;

        while ((length = is.read(b)) != -1)
        {
            os.write(b, 0, length);
        }
        is.close();
        os.close();
        session.logInfo("--- Image Access script: Image mark written to JPG file " + DestinationFile_STR + " ---"); // for testing
    }
    //(catch portion not included)
} else {
    //c. if the mark is NOT an image file, just move onto the next one
    session.logInfo("--- Image Access script: not an image file ---");
}
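
Adapted to a plain text file, the same download idea might be sketched as below. Treat it as a rough standalone Java example under a few assumptions, not something lifted from screen-scraper: TextFileDownloader and download are made-up names, the URL and destination are passed on the command line, and Files.copy from the Java standard library stands in for the manual byte loop in the excerpt above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of screen-scraper: saves whatever a URL returns to a local file.
class TextFileDownloader {

    // Copies the bytes behind sourceUrl into destination, replacing any
    // existing file, and returns the number of bytes written.
    static long download(URL sourceUrl, Path destination) throws IOException {
        // Create the destination folder first if it is missing.
        Path parent = destination.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // try-with-resources closes the stream even if the copy fails.
        try (InputStream in = sourceUrl.openStream()) {
            return Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: TextFileDownloader <url> <destination file>");
            return;
        }
        long bytes = download(new URL(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + bytes + " bytes to " + args[1]);
    }
}
```

Because the bytes are copied as-is, the saved file should match the .txt on the server exactly, with no tidying or character conversion in between.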

Justin_S on 11/27/2012 at 7:20 am
