2: Scrape Updates

Specify Feed Extractor Pattern

If you read over the Generating RSS and Atom Feeds page you can probably guess at how we'll need to modify the scraping session. Let's start by altering the name of the extractor pattern that grabs the product details.

In screen-scraper, click on the Details page scrapeable file in the Shopping Site scraping session, then click the Extractor Patterns tab. Change the name of the extractor pattern from PRODUCTS to XML_FEED.

This pattern will extract the DataSet that will hold our entire feed.

Each DataSet will only hold the results from one Details page, but we want a DataSet with all of the movies. There are a couple of ways to create this larger DataSet, but we will use screen-scraper's built-in ability to do this for us.

Click on the Details page scrapeable file. On the XML_FEED extractor pattern, select the Advanced tab and check the box next to Automatically save the data set generated by this extractor pattern in a session variable. Now screen-scraper will build up the full DataSet of movies for us, adding a record each time the pattern matches on a Details page.
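
If you'd like to confirm that the records are accumulating, here is a minimal sketch in screen-scraper's Interpreted Java that reads the DataSet back out. It assumes the session variable carries the extractor pattern's name, XML_FEED, and that it runs in a script set to fire after the scraping session ends:

// A quick sanity check: read the accumulated DataSet back out of the
// session variable. This assumes screen-scraper saved it under the
// extractor pattern's name, XML_FEED.
DataSet movies = (DataSet)session.getVariable( "XML_FEED" );

for( int i = 0; i < movies.getNumDataRecords(); i++ )
{
    DataRecord movie = movies.getDataRecord( i );
    session.log( "Movie: " + movie.get( "TITLE" ) );
}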

Designate Feed Fields

Let's designate the fields for the individual items in the feed. To start, click on the Sub-Extractor Patterns tab for our feed.

There are several fields we're extracting, but for the sake of simplicity we'll only worry about two of them: TITLE and DESCRIPTION. For the TITLE portion of the feed we're in luck, because we already have a TITLE token. For the DESCRIPTION portion, though, we can't use a full description from the product details page because there isn't one. For the sake of providing an example, we'll use the MODEL as a substitute for a full description. Change the name of the MODEL sub-extractor token to DESCRIPTION so that it looks like this:

>Model: ~@DESCRIPTION@~<
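
For example, if a details page contained the text >Model: 1122334455< (a made-up model number), the DESCRIPTION cell for that item would hold 1122334455.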

Scripting Link and Publish Date

There are two more elements we need for our XML feed: LINK and PUBLISHED_DATE. We're obviously not extracting either of these, so let's write a quick script to set them for us.

Create a new script by clicking on the (Add a new script) icon in the button bar. Give the script the name Set URL and published date, then copy and paste the following code into the Script Text:

// Set the "LINK" element to the URL of the current product details page.
dataRecord.put( "LINK", scrapeableFile.getCurrentURL() );

// Record the current date and time as the item's published date.
dataRecord.put( "PUBLISHED_DATE", new Date() );

Once you've created the script, associate it with the XML_FEED extractor pattern by clicking on the Details page scrapeable file, then on the Extractor Patterns tab. In the Scripts section (on the Main tab of the XML_FEED extractor pattern) click on the Add Script button. Select Set URL and published date under the Script Name column, and After each pattern match under the When to Run column.

Script Description

The script is fairly straightforward. We first set the LINK element to the URL of the product details page we're currently on. You'll notice that we're setting the value via the put method on the current DataRecord object. Because this script will get invoked for each pattern application, the dataRecord object will be in scope.

Remember that the "dataRecord" object can be thought of as the current row on the spreadsheet of extracted data. Here we're simply adding a cell to the current row of the spreadsheet for the LINK element of the feed.
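
As a quick sketch of what that scope gives you, an After each pattern match script can read cells the pattern just extracted as well as add new ones. The log line below is purely illustrative and isn't part of the tutorial's script:

// Read a cell the extractor pattern just filled in (TITLE), then log it.
String title = (String)dataRecord.get( "TITLE" );
session.log( "Adding LINK and PUBLISHED_DATE for: " + title );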

The second element we set is the PUBLISHED_DATE. For those unfamiliar with Java, new Date() creates a Date object holding the current date and time, so each feed item is marked as published at the moment it was scraped.
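
If your feed template should happen to expect the published date as pre-formatted text rather than a Date object, you could store a string instead. The variant below is an assumption rather than part of the tutorial; it uses Java's SimpleDateFormat to produce the RFC 822 style date that RSS pubDate elements use:

// Hypothetical variant: store the published date as an RFC 822 string
// (the format RSS <pubDate> elements expect) instead of a Date object.
import java.text.SimpleDateFormat;

SimpleDateFormat rfc822 = new SimpleDateFormat( "EEE, dd MMM yyyy HH:mm:ss Z", Locale.US );
dataRecord.put( "PUBLISHED_DATE", rfc822.format( new Date() ) );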

Disable Initialization Script

If you haven't done so previously, disable the Shopping Site--initialize session script (on the Shopping Site scraping session). We'll be passing values in externally, and this script would otherwise overwrite those values.

To disable the script, click on the Shopping Site scraping session in the objects tree, then on the General tab. Un-check the box in the table under the Enabled column.
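
As an alternative to disabling the script outright, it could be rewritten to respect externally passed values by only supplying defaults when a variable hasn't already been set. The variable name and default below are examples for illustration, not necessarily what the script contains:

// Hypothetical alternative: only set a default when no value was
// passed in externally. "SEARCH" and "dvd" are example values.
if( session.getVariable( "SEARCH" ) == null )
{
    session.setVariable( "SEARCH", "dvd" );
}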

Moving On

That's it for setting up the scraping session. We're now going to generate the feed.

Take a minute now to save your work.