Start a scraping session from a text file / CSV
Hi, I'm trying to find a simple explanation or example of how to use URLs stored in a text file as the target URLs for scraping.
I can see from various discussions that it is possible, but I'm unable to find any clear instructions on the forum as to exactly what to enter as the URL under the Properties tab of the scrapeable file.
For example, I have a file called linkstoscrape.txt with the following links to be scraped:
http://www.somesite.com/1.htm
http://www.somesite.com/2.htm
etc.
What is the correct way to get the program to open this file and begin to scrape each link in sequence?
Can anyone tell me how to do this or help me find the appropriate post?
bcb,
Take a look at either of the "Input from CSV" sample scripts we have. As you parse your CSV you will be setting each of the URLs from your CSV to a session variable.
Let's say you set each one to the variable "URL". You would then enter a reference to that session variable into the URL field under the Properties tab of your scrapeable file, so that the scrapeable file uses the current value of "URL" each time it is requested.
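As an illustration, the script side boils down to setting the session variable and then invoking the scrapeable file whose URL field references it ("Details page" below is just a placeholder name for whatever you call your scrapeable file):

// Minimal sketch: set the session variable that the scrapeable file's URL
// field references, then request that file. "Details page" is a placeholder.
session.setVariable( "URL", "http://www.somesite.com/1.htm" );
session.scrapeFile( "Details page" );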
-Scott
Input links from Text/CSV. Thanks Scott! + Working example
Scott, thanks for the answer. I didn't realize I needed some code in between to get me to the file.
Here is my working example in case anyone else needs a little help.
As described by Scott in the previous post, I used ~@URL@~ as my URL target when setting up the scrape. I then created a script from the code below, which I configured to run "Before scraping session begins". Credit for the script goes to the author of the original "Input from CSV" post; I just cobbled this bit together.
If anyone can modify it so that any iterative/next pages a URL has are also scraped, it would be a major improvement to this script. As it stands, it will scrape through all of the links in the CSV file, but if a page or category has next/iterative pages they will be ignored and you will have to scrape them separately. The basic functionality of the script is good, though.
// Declare any additional session variables here.
session.setVariable( "PAGE", "p" );
session.setVariable( "NUMBER", "1" );

// Point the input file in the right direction. This is a relative path to an
// input folder inside the folder where you installed screen-scraper.
session.setVariable( "INPUT_FILE", "input/listoflinksforscraping.csv" );

// This BufferedReader reads the CSV one line at a time. Your CSV needs to
// have one URL per line.
BufferedReader buffer = new BufferedReader( new FileReader( session.getVariable( "INPUT_FILE" ) ) );
String line = null;

// Loop over the file. As long as the line read is not null, set it as the
// "URL" session variable and call the "nameoffiletobescraped" scrapeable file.
while( ( line = buffer.readLine() ) != null )
{
    session.setVariable( "URL", line );
    session.log( "***BeginningLog " + session.getVariable( "URL" ) );
    session.scrapeFile( "nameoffiletobescraped" );
}

buffer.close();
bcb,
From the same script repository, under Tips, Tricks & Samples, you will find a few sample "Next page" scripts that might just solve your problem.
-Scott
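For reference, here is a rough sketch of how such a next-page script can be wired up. This is not the actual sample from the repository; it assumes an extractor pattern on "nameoffiletobescraped" saves the "next" link's address into a session variable called NEXT_PAGE, and that the script is set to run "After file is scraped":

// Rough sketch only. Assumes an extractor pattern saves the "next" link into
// the session variable NEXT_PAGE, and that this script runs "After file is
// scraped" on the "nameoffiletobescraped" scrapeable file.
String nextPage = (String) session.getVariable( "NEXT_PAGE" );

if( nextPage != null && !nextPage.equals( "" ) )
{
    // Clear the variable so the chain stops once no "next" link is extracted.
    session.setVariable( "NEXT_PAGE", null );

    // Point the URL session variable at the next page and scrape it again.
    // Because this runs after every scrape of the file, it walks the
    // pagination until NEXT_PAGE is no longer set.
    session.setVariable( "URL", nextPage );
    session.scrapeFile( "nameoffiletobescraped" );
}

If the extracted link is relative, it would need to be resolved against the site's base URL before being assigned to "URL".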