scraping a list of imported filenames (from text file)
I'm a new user with very little experience writing interpreted Java code.
I have a scraping session set up that scrapes correctly and outputs my tab-delimited text file.
I have set the initializing script to set a session variable:
runnableScrapingSession.setVariable( "FILE", "1" );
This variable is referenced by the scraping session URL:
C:\pages\~#FILE#~.html
This works fine. Now I need to extend it to go through a list of filenames.
How can I load a list of names from a text file, have it scrape each one, and append the scraped data to my results text file?
Any ideas would be greatly appreciated. Here are each of the scripts I am using, for your reference:
The scrape session is called from the begin_scraping script:
// Generate a new scraping session.
runnableScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "Scrape_Site" );
// Set the session variables for URL: http://www.buycostumes.com/ProductDetail.aspx?ProductID=12353
//runnableScrapingSession.setVariable( "ProductIDNumber", "12353" );
runnableScrapingSession.setVariable( "FILE", "1" );
// Tell the scraping session to scrape.
runnableScrapingSession.scrape();
Then the extractor pattern runs the following Write_Data_To_File script (after pattern is applied):
// Output a message to the log so we know that we'll be writing the text out to a file.
session.log( "Writing data to a file." );
// Create a FileWriter object that we'll use to write out the text.
out = new FileWriter( "extracted.txt" );
// Write out the text.
out.write( session.getVariable( "product_name" )+ "\t");
out.write( session.getVariable( "product_number" )+ "\t");
out.write( session.getVariable( "retail_price" )+ "\t");
out.write( session.getVariable( "product_price" )+ "\t");
out.write( session.getVariable( "short_description" )+ "\t");
out.write( session.getVariable( "long_description" )+ "\t");
out.write( session.getVariable( "product_availability" )+ "\t");
out.write( session.getVariable( "product_theme" )+ "\t");
out.write( session.getVariable( "included_accessories" )+ "\t");
out.write( session.getVariable( "product_image" )+ "\t");
//write out the URL of the page
out.write( session.getVariable( "~#ProductIDNumber#~" )+ "\t");
// Close the file.
out.close();
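For reference, the file-reading half of the question can be sketched in plain Java. This is only a minimal sketch under stated assumptions: the class name PageListReader and its two helper methods are hypothetical, the list file is assumed to hold one page name per line, and the base directory is taken from the C:\pages\~#FILE#~.html pattern quoted above. It does not call screen-scraper at all.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of screen-scraper.
class PageListReader {

    // Reads one page name per line, trimming whitespace and skipping blank lines.
    static List<String> readPageNames(Path listFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(listFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    // Mirrors the URL pattern from the post: <baseDir><FILE>.html
    static String toPagePath(String baseDir, String name) {
        return baseDir + name + ".html";
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary list file so the sketch runs anywhere.
        Path listFile = Files.createTempFile("pages", ".txt");
        Files.write(listFile, Arrays.asList("1", "2", "", " 3 "), StandardCharsets.UTF_8);

        for (String name : readPageNames(listFile)) {
            // In the begin_scraping script, this is where the FILE variable
            // would be set and the scraping session started for each name.
            System.out.println(toPagePath("C:\\pages\\", name));
        }
        Files.delete(listFile);
    }
}
```

Inside a screen-scraper script, the loop body would presumably do what the begin_scraping script above already does for a single page (set FILE, then scrape). Whether a new RunnableScrapingSession is needed per page is not something this thread confirms.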
problem resolved
I changed the following in the file and it seems to work:
out.write( session.getVariable( "product_image" ));
// Insert the line break to separate records.
out.write("\r");
I will post again if I notice any further errors. Thank you for all your help.
Working except for glitch in appending scraped data to file
OK. I followed the tutorial example. I think I am missing something simple that is needed to append the data to the text file correctly. Here is everything I have, although you may be able to skip to the bottom and just check the Write_Data_To_File script to verify that I am appending data and starting a new line for each page scraped (see the sections in bold):
My scraping session runs the Read_URLs_To_Scrape script before the session begins. The script contains the following code:
//-----------------------------------
// Create a file object that will point to the file containing
// the names of the pages to scrape.
File inputFile = new File( "Pages_To_Scrape.txt" );
// These two objects are needed to read the file.
FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );
// Read the file in line-by-line. Each line in the text file
// will contain one page name.
while( ( urlName = buffRead.readLine() )!=null)
{
// Set a session variable corresponding to the page name.
session.setVariable( "URLTOSCRAPE", urlName );
// Initialize the PAGE session variable (carried over from the tutorial),
// in case we need to iterate through multiple pages of results.
// We begin at page 1 for each page name.
session.setVariable( "PAGE", "1" );
// Scrape the file for this particular page name.
session.scrapeFile( "Scrape_Site_Files" );
}
// Close up the objects to indicate we're done reading the file.
in.close();
buffRead.close();
//-----------------------------------
The scraping session URL is set to ~#URLTOSCRAPE#~ (so that it uses the page name stored in the URLTOSCRAPE session variable, read from the Pages_To_Scrape.txt file).
The scraping session extractor pattern executes my Write_Data_To_File script "after pattern is applied".
My Pages_To_Scrape.txt file contains the following three lines:
page1.html
page2.html
page3.html
The Pages_To_Scrape.txt file, and files page1.html, page2.html and page3.html are all in the screen-scraper program directory.
[b]The Write_Data_To_File script contains: [/b]
// Output a message to the log so we know that we'll be writing the text out to a file.
session.log( "Writing data to a file." );
// Create a FileWriter object that we'll use to write out the text.
out = new FileWriter( "extracted.txt", true );
// Write out the text.
out.write( session.getVariable( "product_name" )+ "\t");
out.write( session.getVariable( "product_number" )+ "\t");
out.write( session.getVariable( "retail_price" )+ "\t");
out.write( session.getVariable( "product_price" )+ "\t");
out.write( session.getVariable( "short_description" )+ "\t");
out.write( session.getVariable( "long_description" )+ "\t");
out.write( session.getVariable( "product_availability" )+ "\t");
out.write( session.getVariable( "product_theme" )+ "\t");
out.write( session.getVariable( "included_accessories" )+ "\t");
out.write( session.getVariable( "product_image" )+ "\t");
//write out the URL of the page
out.write( session.getVariable( "~#ProductIDNumber#~" )+ "\t");
out.write( "\n" );
// Close the file.
out.close();
[b]It all seems to work, except that the extracted.txt file generated by the Write_Data_To_File script shows the page 1 data on one line, followed by the page 2 data and then the page 3 data on the second line. So, it is combining the first and second scraped pages' data on one line, then going to the next line for the third scraped page. I believed that out = new FileWriter( "extracted.txt", true ); and out.write( "\n" ); would cause each scraped page's data to be written on its own line of the extracted.txt file. Please advise if I am missing something.[/b]
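The append-and-newline behavior described above can be checked in isolation with a small standalone Java sketch. The thread does not establish the cause of the merged lines, so this only illustrates one defensive approach under assumptions: the class RecordAppender and its methods are hypothetical, embedded tabs and line breaks in a scraped value are replaced with spaces, and the platform line separator ends each record. Because append mode keeps whatever an earlier run left in the file, leftover text without a trailing line break is another possibility worth ruling out by deleting extracted.txt before a test run.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of screen-scraper.
class RecordAppender {

    // Replaces tabs and line breaks inside a value so one record stays on one line.
    static String clean(String value) {
        if (value == null) {
            return "";
        }
        return value.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    // Joins the cleaned fields with tabs; no trailing tab, no line break.
    static String toRecord(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(clean(fields[i]));
        }
        return sb.toString();
    }

    // Opens the file in append mode, writes one record plus a line separator, then closes it.
    static void appendRecord(String fileName, String... fields) throws IOException {
        try (FileWriter out = new FileWriter(fileName, true)) {
            out.write(toRecord(fields));
            out.write(System.lineSeparator());
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample values; a temporary file keeps the sketch self-contained.
        Path file = Files.createTempFile("extracted", ".txt");
        appendRecord(file.toString(), "name from page1", "price 1");
        appendRecord(file.toString(), "name\nfrom page2", "price 2");
        appendRecord(file.toString(), "name from page3", "price 3");
        for (String line : Files.readAllLines(file)) {
            System.out.println(line);
        }
        Files.delete(file);
    }
}
```

In the Write_Data_To_File script, the same idea would mean passing each session.getVariable( ... ) value through a cleaning step before writing it, and writing the line break once per record.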
scraping a list of imported filenames (from text file)
Hi,
We actually have a tutorial that addresses this very topic:
http://www.screen-scraper.com/support/tutorials/tutorial7/tutorial_overview.php
Kind regards,
Todd Wilson