In this tutorial we will go over configuring screen-scraper to generate an RSS or Atom feed based on extracted data. We will be using the Shopping Site scraping session we generated in Tutorial 2.
If you haven't already gone through Tutorial 2, we would encourage you to do so now. If you no longer have the scraping session you created in Tutorial 2, you can download it and import it into screen-scraper.
The methods used in this tutorial require that you use an Enterprise edition of screen-scraper.
If you'd like to see the final version of the scraping session you'll be creating in this tutorial, you can download it below.
| Attachment | Size |
| --- | --- |
| Shopping Site (Scraping Session).sss | 12.37 KB |
Before going on, take a minute to read over the Generating RSS and Atom Feeds page in our documentation. It will give you a basic overview of how we will proceed. When you have read it, return here and proceed with the tutorial.
We're going to configure our Shopping Site scraping session so that it generates a feed of products based on a search parameter. That is, we'll give it a search keyword (e.g., bug or dvd), it will extract the product data, then create an XML feed out of the scraped data. For testing purposes we'll just access the XML feed from a web browser, though you could just as easily access it from an RSS/Atom reader.
If you read over the Generating RSS and Atom Feeds page you can probably guess at how we'll need to modify the scraping session. Let's start by altering the name of the extractor pattern that grabs the product details.
In screen-scraper click on the Details page scrapeable file in the Shopping Site scraping session, then click the Extractor Patterns tab. Change the name of the extractor pattern from PRODUCTS to XML_FEED.
This pattern will extract the DataSet that will hold our entire feed.
Each DataSet will only hold the results from one Details page, but we want a DataSet with all of the movies. There are a couple of ways to create this larger DataSet, but we will use screen-scraper's built-in ability to do this for us.
Click on the Details page scrapeable file. On the XML_FEED extractor pattern, select the Advanced tab and check the box next to Automatically save the data set generated by this extractor pattern in a session variable. Now screen-scraper will create our DataSet of movies.
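If you're curious about the manual route mentioned above, you could instead write a script that runs After each pattern match and accumulates the records yourself. The following is only a rough sketch, assuming an Interpreted Java script and a hypothetical ALL_MOVIES session variable name:

```java
// A manual alternative to the "Automatically save the data set" checkbox.
// Run "After each pattern match" on the XML_FEED extractor pattern.

import com.screenscraper.common.DataSet;

// Retrieve the accumulating DataSet from a session variable, creating it
// on the first Details page. "ALL_MOVIES" is just an illustrative name.
DataSet allMovies = (DataSet) session.getVariable("ALL_MOVIES");
if (allMovies == null)
{
    allMovies = new DataSet();
}

// Add the record extracted from the current Details page.
allMovies.addDataRecord(dataRecord);

// Store it back so subsequent Details pages append to the same DataSet.
session.setVariable("ALL_MOVIES", allMovies);
```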
Let's designate the fields for the individual items in the feed. To start, click on the Sub-Extractor Patterns tab for our feed.
There are several fields we're extracting, but for the sake of simplicity we'll just worry about two of them: TITLE and DESCRIPTION. For the TITLE portion of our feed we're in luck, because we already have a TITLE token. For the DESCRIPTION part of the feed item, however, we can't use a full description, since the product details page doesn't provide one. For the sake of providing an example, let's use the MODEL as a substitute for a full description. Change the name of the MODEL sub-extractor token to DESCRIPTION so that it looks like this:
There are two more elements we need for our XML feed: LINK and PUBLISHED_DATE. We're obviously not extracting either of these, so let's write a quick script to set them for us.
Create a new script by clicking on the (Add a new script) icon in the button bar. Give the script the name Set URL and published date, then copy and paste the provided code snippet into the Script Text:
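A minimal sketch of what the script might contain, written in Interpreted Java and assuming scrapeableFile.getCurrentURL() returns the URL of the page currently being scraped:

```java
// Set URL and published date
// Runs "After each pattern match" of the XML_FEED extractor pattern,
// so the dataRecord for the current product is in scope.

import java.util.Date;

// LINK: the URL of the product details page we're currently on.
dataRecord.put("LINK", scrapeableFile.getCurrentURL());

// PUBLISHED_DATE: treat the feed item as published on the current date.
dataRecord.put("PUBLISHED_DATE", new Date());
```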
Once you've created the script associate it with the XML_FEED extractor pattern by clicking on the Details page scrapeable file, then on the Extractor Patterns tab. In the Scripts section (on the Main tab of the XML_FEED extractor pattern) click on the Add Script button. Select Set URL and published date under the Script Name column, and After each pattern match under the When to Run column.
The script is fairly straightforward. We first set the LINK element to the URL of the product details page we're currently on. You'll notice that we're setting the value via the put method on the current DataRecord object. Because this script will get invoked for each pattern application the dataRecord object will be in scope.
Remember that the "dataRecord" object can be thought of as the current row on the spreadsheet of extracted data. Here we're simply adding a cell to the current row of the spreadsheet for the LINK element of the feed.
The second element we set is the PUBLISHED_DATE. For those unfamiliar with Java, passing it new Date() simply indicates that the feed item was published on the current date.
If you haven't done so previously, disable the Shopping Site--initialize session script (on the Shopping Site scraping session). We'll be passing values in externally, and this script would otherwise overwrite those values.
To disable the script, click on the Shopping Site scraping session in the objects tree, then find the Shopping Site--initialize session script in the scripts table and un-check the box under the Enabled column.
That's it for setting up the scraping session. We're now going to generate the feed.
Take a minute now to save your work.
Let's make sure the scraping session works before we add a few more bells and whistles.
Close the workbench and start up screen-scraper in server mode. Once that's up, assuming you haven't altered the default Web/SOAP Server port (which is also the web server port), and that you're running screen-scraper on your local machine, try entering this URL into your browser:
http://localhost:8779/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug
If you are running screen-scraper on a machine other than your local machine, or on a different Web/SOAP Server port, make the required changes to the URL.
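For example, if screen-scraper were running on a (hypothetical) remote machine named scrape-server with the port changed to 8080, the URL would become http://scrape-server:8080/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug.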
If all goes well the browser should take a little bit to load, then you should see an XML document appear containing the extracted information. If you got an error message or the document didn't appear as you expected it to, check screen-scraper's log.
Just as with scraping sessions run remotely, screen-scraper will create a log file in its log folder corresponding to each RSS/Atom scraping session.
Dealing with the URL directly can be a bit cryptic, what with the encoding and all. As such, let's make use of a little HTML file that will allow us to generate feeds using different search parameters and formats. You can access it at http://www.screen-scraper.com/support/tutorials/tutorial6/xml_feed_generator.htm.
This HTML file assumes that you're running screen-scraper as a server on your local machine on port 8779. If any of that isn't the case you'll want to download the HTML file to your local machine, alter it with your settings, then open it back up in your browser.
Try experimenting with the form a bit. It gives you control over most of the features that are available, including the format of the feed. Also take a close look at the URL. screen-scraper simply converts the GET parameters in the URL to session variables in the scraping session. If you'd like, you can even open the feed in your favorite RSS/Atom reader to ensure that the format is valid.
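For example, in the URL we used earlier, the SEARCH=bug parameter becomes a SEARCH session variable holding the value bug, just as if it had been set by the initialize script we disabled.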
Once again, congratulations on completing the tutorial. The ability to generate RSS/Atom feeds opens up a world of possibilities to you. The best way to proceed from here would probably be to try this on your own project. If you run into any glitches, don't hesitate to post to our forum so that we can lend a hand.
You are as always welcome to continue through the Tutorials or to read the existing documentation.
If you don't feel comfortable with the process, we invite you to recreate the scrape using the tutorial only for reference, working from just the screen-shots as you go. If you are still struggling, you can search our forums for others in the same situation and ask specific questions of the screen-scraper community.