2: Scrape Updates

New Piece: Iterator Script

The changes we'll be making to our Shopping Site scraping session in order to add this new functionality are actually pretty minor. First, let's deal with the trickiest part (which really isn't all that tricky): creating the script that will read in the file containing our search terms, and run each search.

Create a new script by clicking the (Add a new script) icon in the button bar. Give the script the name Read search terms. Leave the Language drop-down list with the value Interpreted Java. Paste in the following for the content of the script text:

// Create a file object that will point to the file containing
// the search terms.
File inputFile = new File( "search_terms.txt" );

// These two objects are needed to read the file.
FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );

// Read the file in line-by-line.  Each line in the text file
// will contain a search term.
while( ( searchTerm = buffRead.readLine() )!=null)
{
    // Set a session variable corresponding to the search term.
    session.setVariable( "SEARCH", searchTerm );

    // Remember we need to initialize the PAGE session variable, just
    // in case we need to iterate through multiple pages of search results.
    // We begin at page 1 for each search.
    session.setVariable( "PAGE", "1" );

    // Get search results for this particular search term.
    session.scrapeFile( "Search results" );
}

// Close the reader to indicate we're done reading the file. Closing
// the BufferedReader also closes the FileReader it wraps.
buffRead.close();

First off we create a few objects that are going to allow us to read in search terms from a file called search_terms.txt. We then read the search terms in line-by-line in a while loop. For each search term we're going to invoke the Search results scrapeable file.
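Because the script refers to search_terms.txt by a relative name, Java resolves it against the working directory of the running process. The following is a minimal standalone sketch, not part of the tutorial's script, for checking where that name would resolve. The class and method names (LocateSearchTerms, resolve) are made up for illustration.

```java
import java.io.File;

public class LocateSearchTerms {
    // Hypothetical helper: turns a relative file name into the absolute
    // location Java would use, based on the current working directory.
    public static File resolve(String name) {
        return new File(name).getAbsoluteFile();
    }

    public static void main(String[] args) {
        File termsFile = resolve("search_terms.txt");
        System.out.println("Looking for: " + termsFile.getPath());
        System.out.println("Exists and readable: "
                + (termsFile.isFile() && termsFile.canRead()));
    }
}
```

If the second line prints false, the file is either missing or sits in a different directory from the one the process was started in.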

Remember that the Search results scrapeable file is the one that handles issuing the search to the e-commerce web site, and walks through all of the product detail pages.
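The read loop can also be tried outside screen-scraper. The sketch below is plain Java rather than the Interpreted Java script itself. It assumes one search term per line, and it additionally trims whitespace and skips blank lines, which the tutorial script does not do. The class name SearchTermReader and the sample terms are hypothetical, and a print statement stands in for the calls to session.setVariable and session.scrapeFile.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SearchTermReader {
    // Reads one search term per line, trimming surrounding whitespace
    // and skipping blank lines (an addition to the tutorial's logic).
    public static List<String> readTerms(Reader source) throws IOException {
        List<String> terms = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String term = line.trim();
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file contents; a real run would pass
        // new FileReader("search_terms.txt") instead.
        String sample = "dvd\n\n  bug zapper  \n";
        for (String term : readTerms(new StringReader(sample))) {
            // Stand-in for setting SEARCH and PAGE and invoking
            // the Search results scrapeable file.
            System.out.println("SEARCH=" + term + " PAGE=1");
        }
    }
}
```

Skipping blank lines is a design choice: without it, a trailing empty line in the file would trigger a search for an empty string.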

Minor Modifications

Please do each of the following:

  1. Disable the Shopping Site--initialize session script: we'll get the search terms from our external file.

    To disable the script, click on the Shopping Site scraping session in the objects tree, then on the General tab. Un-check the box in the table under the Enabled column.

  2. Stop the Search results scrapeable file from running automatically (in sequence): we will tell it when to run instead.

    Click on the Search results scrapeable file, then check the box labeled This scrapeable file will be invoked manually from a script. You might notice that the icon loses the pound sign (#) when it is taken out of sequence, and that it is now grouped with the other scrapeable files that are not sequenced.

  3. Make sure the Write data to a file script is enabled: if you have been working through multiple tutorials, you might have disabled it.

    Click on the Details page scrapeable file, then on the Extractor Patterns tab. For the PRODUCTS extractor pattern, in its Scripts section (below the box for the pattern text), ensure that the Write data to a file script's Enabled box is checked.

  4. Add the Read search terms script to the Login scrapeable file: after logging in we want to do our searches.

    Click on the Login scrapeable file, then on the Add Script button (on the Properties tab). Under Script Name, select Read search terms, and under When to Run, make sure that After file is scraped is selected.