Generating RSS and Atom Feeds

Overview

This feature is only available in the Enterprise edition of screen-scraper.

screen-scraper can automatically generate RSS and Atom feeds from extracted data. If you're unfamiliar with RSS and Atom feeds, you might take a minute to read up on the topic first.

The documentation on this page is a bit abstract. If you're interested in building RSS/Atom feeds with screen-scraper, we recommend going through our Sixth Tutorial, which walks you through the process in detail.

How it Works

A small web server runs within screen-scraper and interacts with the scraping engine. As such, you can access a URL from a browser or RSS/Atom reader that causes screen-scraper to invoke a scraping session and then return an RSS or Atom feed.

The basic syntax for the URL you'll use to generate a feed looks like this:

 http://(host:port)/ss/xmlfeed?scraping_session=(scraping session name)[&key1=value1&key2=value2...]

For example, if you were running screen-scraper on your local machine and wanted to generate a feed for the "Shopping Site" example used in our tutorials with the search term "bug", the URL would look like this:

 http://localhost/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug

As with any other URL, each of the parameters must be properly URL-encoded. Key/value pairs can also be passed in as POST parameters.

The only required parameter is "scraping_session". screen-scraper will create session variables out of any other parameters that get passed in.
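
If you build these URLs from a program rather than by hand, it helps to let a library do the URL-encoding. The following is a minimal sketch in Python, not part of screen-scraper itself; the function name build_feed_url and its arguments are made up for illustration, and it assumes the feed is served over plain HTTP at the /ss/xmlfeed path shown above.

```python
from urllib.parse import urlencode


def build_feed_url(host, scraping_session, variables=None):
    """Return an xmlfeed URL for the given scraping session.

    host is the host (and port, if any) screen-scraper is reachable on.
    variables holds extra key/value pairs, which become session variables.
    This helper is hypothetical and only assembles the URL string.
    """
    # scraping_session is the only required parameter, so it goes first.
    params = {"scraping_session": scraping_session}
    params.update(variables or {})
    # urlencode percent-encodes each key and value and writes spaces as "+".
    return "http://" + host + "/ss/xmlfeed?" + urlencode(params)


# Prints: http://localhost/ss/xmlfeed?scraping_session=Shopping+Site&SEARCH=bug
print(build_feed_url("localhost", "Shopping Site", {"SEARCH": "bug"}))
```

The same key/value pairs could be sent as POST parameters instead; in that case you would pass the encoded string as the request body rather than appending it to the URL.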

Setting Up the Scraping Session

The scraping session uses the named elements listed below to generate the feed. Only XML_FEED is required; the others are optional:

  • XML_FEED_TITLE (optional): A String session variable containing the name that will be used for the entire feed (e.g., "CNN Headlines").
  • XML_FEED_LINK (optional): A String session variable containing the link associated with the feed (e.g., "http://www.cnn.com/").
  • XML_FEED_DESCRIPTION (optional): A String session variable containing the description of the feed (e.g., "The latest news headlines from CNN.com").
  • XML_FEED_FORMAT (optional): A String session variable indicating the format of the feed. Valid values are atom_0.3, rss_0.9, rss_0.91N, rss_0.91U, rss_0.92, rss_0.93, rss_0.94, rss_1.0, and rss_2.0. If omitted, the default value is rss_1.0.
  • XML_FEED (required): This session variable should hold a DataSet consisting of DataRecords that will make up the various feed items (e.g., each news headline). Each DataRecord should contain values using the names given below.
      • TITLE: The title of the feed item.
      • LINK: The link of the feed item.
      • DESCRIPTION: The description of the feed item.
      • PUBLISHED_DATE: The published date of the feed item. This should be a Java Date object.
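
To make the rules in the list above concrete, here is a small sketch in Python that checks a set of feed variables the way the list describes them: XML_FEED must be present, and XML_FEED_FORMAT falls back to rss_1.0 when omitted. This is only an illustration; a plain dictionary stands in for the session variables, and the function name check_feed_variables is hypothetical rather than anything screen-scraper provides.

```python
# The format names are copied from the XML_FEED_FORMAT entry in the list.
VALID_FORMATS = {
    "atom_0.3", "rss_0.9", "rss_0.91N", "rss_0.91U", "rss_0.92",
    "rss_0.93", "rss_0.94", "rss_1.0", "rss_2.0",
}
DEFAULT_FORMAT = "rss_1.0"


def check_feed_variables(session_variables):
    """Return the feed format to use, or raise ValueError on a bad setup.

    session_variables is a plain dict standing in for the scraping
    session's variables (an assumption made for this sketch).
    """
    # XML_FEED is the only required element.
    if "XML_FEED" not in session_variables:
        raise ValueError("XML_FEED is required to generate a feed")
    # A missing XML_FEED_FORMAT falls back to the documented default.
    feed_format = session_variables.get("XML_FEED_FORMAT", DEFAULT_FORMAT)
    if feed_format not in VALID_FORMATS:
        raise ValueError("Unsupported feed format: " + feed_format)
    return feed_format


print(check_feed_variables({"XML_FEED": []}))  # prints rss_1.0
```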

When the XML feed is requested through your browser or reader, screen-scraper will invoke the scraping session named by the "scraping_session" parameter. Once the scraping session completes, screen-scraper will look for a DataSet called "XML_FEED", iterate over its constituent DataRecord objects, and build the feed from them.
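
If it helps to picture that last step, the sketch below shows the general idea in Python: loop over a list of records and emit one feed item per record. It is not screen-scraper's own code. The function name build_rss is hypothetical, plain dictionaries stand in for DataRecord objects, a Python datetime stands in for the Java Date, and only RSS 2.0 output is shown.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, records):
    """Build an RSS 2.0 document (as a string) from a list of records."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Feed-level values, analogous to the XML_FEED_* session variables.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    # One <item> per record, analogous to iterating over the DataSet.
    for record in records:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = record["TITLE"]
        ET.SubElement(item, "link").text = record["LINK"]
        ET.SubElement(item, "description").text = record["DESCRIPTION"]
        # RSS 2.0 dates use the RFC 822 style that format_datetime emits.
        ET.SubElement(item, "pubDate").text = format_datetime(record["PUBLISHED_DATE"])
    return ET.tostring(rss, encoding="unicode")


# Made-up sample data, for illustration only.
sample_records = [
    {
        "TITLE": "First headline",
        "LINK": "http://example.com/first",
        "DESCRIPTION": "A short summary.",
        "PUBLISHED_DATE": datetime(2020, 1, 1, tzinfo=timezone.utc),
    },
]
print(build_rss("Example Feed", "http://example.com/", "Sample items", sample_records))
```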