Tutorial 7: Scraping a Site Multiple Times Based on Search Terms

Overview

It's often the case in screen-scraping that you want to submit a form multiple times using different parameters each time. For example, you may be extracting locations from a site's store locator, and need to submit the form once for each zip code in a series. In this tutorial we'll provide an example of how to go about that. We will continue from where the second tutorial left off.

If you haven't already gone through Tutorial 2, we would encourage you to do so now. If you no longer have the scraping session you created in Tutorial 2, you can download it and import it into screen-scraper.

Tutorial Requirements

This tutorial doesn't require anything beyond an instance of screen-scraper. You can go through it with any of the editions.

Finished Project

If you'd like to see the final version of the scraping session you'll be creating in this tutorial you can download it below.

Attachment: Shopping Site (Scraping Session).sss (13.06 KB)

1: Tutorial Details

How it Works

Our Shopping Site is pretty limited in that it can only handle one search term at a time. What if we want to extract products for multiple search terms? For example, we may want to scrape various DVD titles that would fit with the other titles in our collection. We could search for the new DVDs using a series of keywords.

We're going to alter the existing Shopping Site scraping session so that it reads in a file containing search terms and performs a search for each one. Just as before, as it performs each search it will follow the details links and extract information for each product. Once the information is extracted it will be written out to a file.

2: Scrape Updates

New Piece: Iterator Script

The changes we'll be making to our Shopping Site scraping session to add this new functionality are actually pretty minor. First, let's deal with the trickiest part (which really isn't all that tricky): creating the script that reads in the file containing our search terms and runs a search for each one.

Create a new script by clicking the (Add a new script) icon in the button bar. Give the script the name Read search terms. Leave the Language drop-down list set to Interpreted Java. Paste the following in as the content of the script:

// Create a file object that will point to the file containing
// the search terms.
File inputFile = new File( "search_terms.txt" );

// These two objects are needed to read the file.
FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );

// Read the file in line-by-line.  Each line in the text file
// will contain a search term.
String searchTerm;
while( ( searchTerm = buffRead.readLine() ) != null )
{
    // Set a session variable corresponding to the search term.
    session.setVariable( "SEARCH", searchTerm );

    // Remember we need to initialize the PAGE session variable, just
    // in case we need to iterate through multiple pages of search results.
    // We begin at page 1 for each search.
    session.setVariable( "PAGE", "1" );

    // Get search results for this particular search term.
    session.scrapeFile( "Search results" );
}

// Close up the reader to indicate we're done with the file.  Closing
// the BufferedReader also closes the underlying FileReader.
buffRead.close();

First off we create a few objects that allow us to read in search terms from a file called search_terms.txt. We then read the search terms in line-by-line in a while loop, and for each one we invoke the Search results scrapeable file.

Remember that the Search results scrapeable file is the one that handles issuing the search to the e-commerce web site, and walks through all of the product detail pages.
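
Incidentally, if you're worried about stray blank lines or extra whitespace in your search terms file, a slightly more defensive version of the read loop is sketched below. It's entirely optional, and uses only the same session calls shown above, plus session.log(), which writes a message to the scraping session's log:

// A more defensive version of the read loop (optional).  It trims each
// line, skips blank ones, and logs the term about to be searched.
String searchTerm;
while( ( searchTerm = buffRead.readLine() ) != null )
{
    searchTerm = searchTerm.trim();

    // Skip blank lines so we don't issue an empty search.
    if( searchTerm.length() == 0 )
    {
        continue;
    }

    session.log( "Searching for: " + searchTerm );

    session.setVariable( "SEARCH", searchTerm );
    session.setVariable( "PAGE", "1" );
    session.scrapeFile( "Search results" );
}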

Minor Modifications

Please do each of the following:

  1. Disable the Shopping Site--initialize session script: we'll get the search terms from our external file.

    To disable the script, click on the Shopping Site scraping session in the objects tree, then on the General tab. Un-check the box in the table under the Enabled column.

  2. Stop the Search results scrapeable file from running automatically (in sequence): we will be telling it when to run instead.

    Click on the Search results scrapeable file, then check the box labeled This scrapeable file will be invoked manually from a script. You might notice that the icon loses the pound sign (#) when the file is taken out of sequence, and it gets grouped with the other scrapeable files that are not sequenced.

  3. Make sure the Write data to a file script is enabled: if you've been working through multiple tutorials, you may have disabled it.

    Click on the Details page scrapeable file, then on the Extractor Patterns tab. For the PRODUCTS extractor pattern, in its Scripts section (below the box for the pattern text), ensure that the Write data to a file script's Enabled box is checked. If you need a refresher on what that script does, see the sketch just after this list.

  4. Add the Read search terms script to the Login scrapeable file: after logging in we want to run our searches.

    Click on the Login scrapeable file, then on the Add Script button (on the Properties tab). In the Script Name column select Read search terms, and under When to Run make sure that After file is scraped is selected.
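
As a refresher, the Write data to a file script from Tutorial 2 appends each extracted record to dvds.txt. A minimal sketch of it follows; the extractor token names used here (TITLE, MODEL, SHIPPING_WEIGHT, MANUFACTURED_BY) are placeholders, so substitute the ones from your own Tutorial 2 session if they differ:

// Sketch of a Write data to a file-style script.  The dataRecord
// object holds the values matched by the extractor pattern.
FileWriter out = null;

try
{
    session.log( "Writing data to a file." );

    // Open the file in append mode so each record adds a new line.
    out = new FileWriter( "dvds.txt", true );

    // Write out the extracted values, separated by tabs.  These token
    // names are placeholders--use the ones from your own session.
    out.write( dataRecord.get( "TITLE" ) + "\t" );
    out.write( dataRecord.get( "MODEL" ) + "\t" );
    out.write( dataRecord.get( "SHIPPING_WEIGHT" ) + "\t" );
    out.write( dataRecord.get( "MANUFACTURED_BY" ) + "\n" );

    out.close();
}
catch( Exception e )
{
    session.log( "An error occurred while writing data: " + e.getMessage() );
}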

3: Running the Scrape

The Search Terms

The last item we need to take care of is creating the text file that will contain our search terms. Let's keep it simple. Fire up your favorite text editor and create a file called search_terms.txt inside of screen-scraper's installation folder (e.g., C:\Program Files\screen-scraper professional edition\search_terms.txt). Add the following three lines to the text file:

bug
speed
blade

Those search terms should yield at least a few DVDs we can add to our collection.

Run the Scrape

All right, now's the moment of truth. Run the updated scraping session by selecting it in screen-scraper and clicking the Run Scraping Session button, then watch the Log tab to see it do its thing. If all goes well, once it's done, you should have a dvds.txt file in screen-scraper's install folder containing scraped data for all of the search terms.

Scrape Process

Take a careful look through the log. If it all seems to make sense, you're done. If not, read on and we'll walk through it a bit more carefully.

The flow of events goes like this, once you hit the Run Scraping Session button:

  1. The scraping session starts up and goes through the process of logging in.
  2. The Read search terms script is invoked.
  3. The Read search terms script creates a few objects, then reads in the first line of the search_terms.txt file: "bug".
  4. The Read search terms script sets the SEARCH session variable to the value "bug".

    You'll remember from the earlier tutorial that the SEARCH session variable is used to perform each search. Check the Parameters tab of the Search results scrapeable file for a reminder of where it's used.

  5. The Read search terms script initializes the PAGE session variable to "1".

    As it turns out, this is unnecessary in this particular case, because none of our searches returns more than one page of results. You will, however, want to remember it for future projects where that isn't the case: each search term triggers a completely separate search, so we need to make sure each one starts on the first page. The sketch just after this list shows how PAGE would come into play on a multi-page site.

  6. The Read search terms script invokes the Search results scrapeable file. This is essentially the same thing as clicking the Search button on the search form with the current search term (in this case, "bug").
  7. The Search results scrapeable file makes the HTTP request, then applies the PRODUCT extractor pattern to the HTML in order to get all of the details links.
  8. For each match of the PRODUCT extractor pattern, the Scrape details page script gets invoked.
  9. At this point screen-scraper loops zero or more times, scraping the Details page scrapeable file once for each link found on the search results page.
  10. Each time the Details page scrapeable file is invoked it requests the page, extracts the data we want, then invokes the Write data to a file script, which appends the extracted data to the dvds.txt file.
  11. Once screen-scraper has finished performing the search for "bug", control flows back to our original Read search terms script, which moves on to the next search term in the file: "speed". From there the process returns to step 3, and the search begins again.
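
For completeness, here's a rough idea of how the PAGE session variable would come into play on a site whose search results do span multiple pages (see step 5). This is an illustrative sketch only, in the spirit of the paging approach from Tutorial 2; it assumes a hypothetical script that runs whenever a next-page link is matched on the search results page:

// Hypothetical next-page script (illustrative only).  Increment the
// PAGE session variable...
int nextPage = Integer.parseInt( session.getVariable( "PAGE" ).toString() ) + 1;
session.setVariable( "PAGE", String.valueOf( nextPage ) );

// ...then re-invoke the scrapeable file to request the next page of
// search results.
session.scrapeFile( "Search results" );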

Remember that the Log tab is key to understanding the flow of events in screen-scraper. If you're still a bit fuzzy on how things are working, try looking more carefully through the log to piece together how the site is being scraped.

4: Where to Go From Here

Suggestions

Once again, congratulations on completing the tutorial. At this point feel free to experiment a bit; you may want to try adding a few more search terms to the search_terms.txt file. The best way to proceed from here would probably be to try this technique on a project of your own. If you run into any glitches, don't hesitate to post to our forum so that we can lend a hand.

As always, you're welcome to continue through the Tutorials or to read the existing documentation.

Still a Little Lost?

If you don't feel comfortable with the process yet, we invite you to recreate the scrape using the tutorial only as a reference, working from the screenshots alone. If you're still struggling, you can search our forums for others in the same situation, or ask the screen-scraper community specific questions.