Tutorial 4: Scraping a Shopping Site from an External Program

Overview

This tutorial illustrates invoking screen-scraper from other programs in ways more complex than those presented in Tutorial 3. From our external program we'll pass search parameters to screen-scraper, invoke the scraping process, retrieve the scraped data from screen-scraper, then iterate over the data and output it within our application.

Before proceeding it would be a good idea to go through Tutorial 2, if you haven't done so already.

If you haven't gone through Tutorial 2, or don't still have the scraping session you created in it, you can download it and import it into screen-scraper.

Tutorial Requirements

This tutorial requires the Professional or Enterprise edition of screen-scraper. It also requires access to a server (remote or local) that can run one of the external languages that screen-scraper has drivers for: ASP, C#.NET, ColdFusion, Java, PHP, Python, Ruby, or VB.NET.

Finished Project

If you'd like to see the final version of the scraping session you'll be creating in this tutorial you can download it below.

Attachment Size
Shopping Site (Scraping Session).sss 11.63 KB

1: Scrape Updates

Scrape Process

screen-scraper can be invoked from software applications written in most modern programming languages, including Java, Active Server Pages, PHP, .NET, and anything that supports SOAP. In this tutorial we'll give some examples of applications that do just that.

Our application will pass parameters to screen-scraper corresponding to login information as well as a key phrase for which to search. As in Tutorial 2, we're going to pretend that the web site requires us to log in before we can search, for the sake of providing an example. Once we pass the parameters to screen-scraper we'll tell it to start scraping. screen-scraper will then run the scraping session using the parameters we gave it. Once it's done, we'll ask it for the extracted information, then output it for the user to see.
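The sequence just described (pass parameters, start the scrape, ask for the results, output them) can be sketched in a few lines of Python. This is only an illustration of the flow: FakeScrapingSession is a hypothetical stand-in that runs entirely in memory and is not the screen-scraper driver, and the SEARCH variable name and the TITLE and PRICE fields are assumptions made for the example. The downloadable scripts later in this tutorial show the real driver calls.

```python
# Stand-in for a driver session. This is NOT the screen-scraper API; it only
# mimics the order of operations described in the text.
class FakeScrapingSession:
    """Hypothetical in-memory substitute for a remote scraping session."""

    def __init__(self, name):
        self.name = name
        self.variables = {}

    def set_variable(self, key, value):
        # Parameters passed in become session variables.
        self.variables[key] = value

    def scrape(self):
        # A real session would request pages here. This stub only produces
        # placeholder records that echo the search phrase.
        phrase = self.variables.get("SEARCH", "")
        self.variables["PRODUCTS"] = [
            {"TITLE": f"{phrase} item {n}", "PRICE": f"{n}.00"} for n in (1, 2)
        ]

    def get_variable(self, key):
        return self.variables.get(key)


def run_search(session, email, password, phrase):
    """Pass parameters in, trigger the scrape, and return formatted rows."""
    session.set_variable("EMAIL_ADDRESS", email)
    session.set_variable("PASSWORD", password)
    session.set_variable("SEARCH", phrase)  # assumed variable name
    session.scrape()
    products = session.get_variable("PRODUCTS") or []
    return [f"{p['TITLE']}: {p['PRICE']}" for p in products]


if __name__ == "__main__":
    session = FakeScrapingSession("Shopping Site")
    for row in run_search(session, "user@example.com", "secret", "bug"):
        print(row)
```

Keeping the flow in one function makes it easy to swap the stand-in for a real driver session later without touching the output code.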

Updates

Before we begin we'll need to make a couple of minor changes to the Shopping Site scraping session from Tutorial 2. If you haven't already, start up screen-scraper.

Login Parameters

Under the Shopping Site scraping session click on the Login scrapeable file, then on the Parameters tab. We're going to alter the email_address and password POST parameters so that we can pass their values in rather than hard-coding them. For the email_address parameter change the value [email protected] to ~#EMAIL_ADDRESS#~, and for the password parameter change the value testing to ~#PASSWORD#~.

Remember that tokens surrounded by the ~# #~ delimiters indicate that the value of a session variable should be inserted. In our case we're going to create an EMAIL_ADDRESS session variable and give it the value [email protected] so that screen-scraper substitutes it for the corresponding POST parameter at runtime.
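To make the substitution concrete, here is a minimal Python sketch of the idea. It is not screen-scraper's own implementation: the substitute_tokens function, the example values, and the choice to leave unknown tokens untouched are all assumptions made for illustration.

```python
import re

# Matches tokens such as ~#EMAIL_ADDRESS#~ and captures the variable name.
TOKEN = re.compile(r"~#(\w+)#~")


def substitute_tokens(template, session_variables):
    """Replace each ~#NAME#~ token with the matching session variable.

    Tokens with no matching variable are left as they are. That choice is an
    assumption of this sketch, not documented screen-scraper behaviour.
    """
    def replace(match):
        name = match.group(1)
        return str(session_variables.get(name, match.group(0)))

    return TOKEN.sub(replace, template)


if __name__ == "__main__":
    variables = {"EMAIL_ADDRESS": "user@example.com", "PASSWORD": "secret"}
    print(substitute_tokens("email_address=~#EMAIL_ADDRESS#~", variables))
    print(substitute_tokens("password=~#PASSWORD#~", variables))
```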

Products Extractor Pattern

To simplify the process of giving an external script access to the extracted product details, we will save the data set into a session variable.

Click on the Details page scrapeable file. On the PRODUCTS extractor pattern, select the Advanced tab and check the box next to Automatically save the data set generated by this extractor pattern in a session variable.

Initialization Script

The code that we'll be writing in our external application will essentially take the place of the Shopping Site--initialize session script. Let's disable the association since it would otherwise overwrite the values we'll be passing in externally.

To do that click on the Shopping Site scraping session in the objects tree and un-check the Enabled checkbox for the Shopping Site--initialize session script.

Prepare screen-scraper for Application

Save your changes and exit screen-scraper. Then start screen-scraper running as a server so that the external scripts will be able to interact with it.

2: External Script

Choose a Language

Where you go next depends on which programming language you're interested in. Select the link below that corresponds to the language that you will be using.

2.1: Using ASP

Warning!

In order to invoke screen-scraper from ASP, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Run the Script

Right-click and download the shopping.asp file, then save it to a directory where it will be web-accessible (e.g., within your IIS web directory).

Open up your web browser and go to the URL corresponding to the shopping.asp file (e.g., "http://localhost/screen-scraper/shopping.asp"). You'll see a simple search form. Type in a product keyword, such as bug, then hit the Go button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your ASP file resides on, make sure that screen-scraper is allowing connections from the ASP machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the ASP machine. You might also try blanking that property out entirely, which will allow connections from any host. When developing, this is usually the easiest approach.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.
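The Hosts to allow to connect check from the second item above can be pictured with a short sketch. It is written in Python purely for illustration and does not reproduce screen-scraper's actual matching logic: the comma-separated format and the rule that a partial address matches on whole dotted segments are assumptions. What comes from the text above is that a full or partial IP address can be listed and that a blank value allows any host.

```python
def entry_matches(client_ip, entry):
    """True if entry is the full address or a leading part of it."""
    entry = entry.rstrip(".")
    # Match on whole dotted segments so "192.168.1" does not match 192.168.10.x.
    return client_ip == entry or client_ip.startswith(entry + ".")


def host_allowed(client_ip, hosts_to_allow):
    """Return True if client_ip passes the allow-list.

    hosts_to_allow is assumed to be a comma-separated string of full or
    partial IP addresses. A blank value allows connections from any host.
    """
    entries = [e.strip() for e in hosts_to_allow.split(",") if e.strip()]
    if not entries:
        return True
    return any(entry_matches(client_ip, entry) for entry in entries)


if __name__ == "__main__":
    print(host_allowed("192.168.1.20", "127.0.0.1, 192.168.1"))
    print(host_allowed("10.0.0.5", ""))
```

If a request from another machine never shows up in the log folder, comparing that machine's address against the configured value in this way is a quick first check.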

Understand the Script

Assuming the test worked, fire up your favorite ASP editor and open the shopping.asp file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our COM documentation or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates a log file in its log folder for each run of your scraping sessions. Find your Shopping Site log in that folder and read through it. It should look similar to what you see when you run scraping sessions in the workbench.

2.1: Using C#.NET

Warning!

In order to invoke screen-scraper from C#.NET, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow the link, then return here.

Run the Script

Right-click and download the shopping.cs file. Move it into the desired directory.

From your .NET environment compile and execute the shopping.cs file.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're using Visual Studio 2008 or later, the project Target Framework will need to be set to .NET 3.5 or later. However, do not use any .NET client frameworks since they do not have the required libraries for your project to compile.
  • If you're running screen-scraper on a different machine than the one your C# class resides on, make sure that screen-scraper is allowing connections from the C# machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the C# machine. You might also try blanking that property out entirely, which will allow connections from any host. When developing, this is usually the easiest approach.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.

Understand the Script

Assuming that test worked, take a closer look over the shopping.cs class. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our .NET documentation or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its log folder. Take a look in that folder for your Shopping Site log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

2.1: Using ColdFusion

Warning!

In order to invoke screen-scraper from ColdFusion, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow the link, then return here.

Run the Script

Download the shopping.cfm file, then save it in a directory that will be accessible from your web server. Rename the file from shopping.cfm.txt to shopping.cfm.

Open up your web browser and go to the URL corresponding to the shopping.cfm file (e.g., "http://localhost/screen-scraper/shopping.cfm"). You'll see a simple search form. Type in a product keyword, such as bug, then hit the Go button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your ColdFusion file resides on, make sure that screen-scraper is allowing connections from the ColdFusion machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the ColdFusion machine. You might also try blanking that property out entirely, which will allow connections from any host. When developing, this is usually the easiest approach.
  • Ensure that the permissions on the shopping.cfm file are such that your web server can execute it.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.

Understand the Script

Assuming that test worked, fire up your favorite ColdFusion editor and open the shopping.cfm file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing ColdFusion documentation or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its log folder. Take a look in that folder for your Shopping Site log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

2.1: Using Java

Warning!

In order to invoke screen-scraper from Java, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow the link, then return here.

Run the Script

Before we dig into the code let's review a few things related to invoking screen-scraper via Java. First, your Java code will need to have two jars in its classpath: screen-scraper.jar (found in the root screen-scraper install folder) and log4j.jar (found in screen-scraper's lib folder). For convenience we've packaged all of the files you'll need. Download the file and unzip it. You'll notice that we also include an Ant build file that you can use to compile and run the sample class.

If you're using Ant simply type ant run at a command prompt inside of the folder where the build.xml file is found.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your Java class resides on, make sure that screen-scraper is allowing connections from the Java machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the Java machine. You might also try blanking that property out entirely, which will allow connections from any host. When developing, this is usually the easiest approach.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.

Understand the Script

Assuming that test worked, fire up your favorite Java editor and open the Shopping.java file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our Java documentation or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its log folder. Take a look in that folder for your Shopping Site log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

2.1: Using PHP

Warning!

In order to invoke screen-scraper from PHP, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Run the Script

Your PHP code will need to refer to screen-scraper's PHP driver, called remote_scraping_session.php. You can find this file in the misc\php\ folder of your screen-scraper installation. You'll want to copy the file into the directory where you plan on putting the PHP file that will invoke screen-scraper.

Download the shopping.php file and then save it in the same directory where you copied the remote_scraping_session.php file. Rename the file from shopping.php.txt to shopping.php.

Open up your web browser and go to the URL corresponding to the shopping.php file (e.g., "http://localhost/screen-scraper/shopping.php"). You'll see a simple search form. Type in a product keyword, such as bug, then hit the Go button. If all goes well the page will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your PHP file resides on, make sure that screen-scraper is allowing connections from the PHP machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the PHP machine. You might also try blanking that property out entirely, which will allow connections from any host. When developing, this is usually the easiest approach.
  • Ensure that the permissions on the shopping.php and remote_scraping_session.php files are such that your web server can execute them.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.

Understand the Script

Assuming that test worked, fire up your favorite PHP editor and open the shopping.php file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing the PHP documentation or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its log folder. Take a look in that folder for your Shopping Site log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

2.1: Using Python

Warning!

In order to invoke screen-scraper from Python, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow the link, then return here.

Run the Script

Your Python code will need to refer to screen-scraper's Python driver, called remote_scraping_session.py. You can find this file in the misc\python\ folder of your screen-scraper installation. You'll want to put a copy of the file into the directory where you plan on putting the Python file that will invoke screen-scraper.

Download the shopping.py file, then save it in the same directory where you copied the remote_scraping_session.py file. Rename the file from shopping.py.txt to shopping.py.

Run the command python shopping.py in your console. You'll be asked which keyword to search. Type in a product keyword, such as bug, then press the Enter key. If all goes well the program will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your Python file resides on, make sure that screen-scraper is allowing connections from the Python machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the Python machine. You might also try blanking that property out entirely, which will allow connections from any host. When developing, this is usually the easiest approach.
  • Ensure that the permissions on the shopping.py and remote_scraping_session.py files are such that you can execute them.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.
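The first item in the list above (is the server running, and is anything blocking its port?) can be checked from Python before running shopping.py. The sketch below only attempts a TCP connection and says nothing about whether screen-scraper itself is the program answering. The port number 8778 is an assumption here; use whatever port your screen-scraper server settings show.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8778 is an assumed port; check the value configured for your server.
    if port_is_open("localhost", 8778):
        print("Something is listening; the server port is reachable.")
    else:
        print("No connection; check that the server is running and not firewalled.")
```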

Understand the Script

Assuming that test worked, fire up your favorite Python editor and open the shopping.py file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing the Python documentation or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its log folder. Take a look in that folder for your Shopping Site log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.
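If the log folder fills up, a small helper can locate the relevant files and pull out lines worth reading first. This sketch assumes that log file names contain the scraping session name and that problems are reported on lines containing the word "error"; both are assumptions, and the folder path you pass in depends on where screen-scraper is installed.

```python
from pathlib import Path


def find_session_logs(log_dir, session_name):
    """Return log files whose names contain session_name, newest first."""
    folder = Path(log_dir)
    if not folder.is_dir():
        return []
    logs = [p for p in folder.iterdir()
            if p.is_file() and session_name in p.name]
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


def error_lines(log_path):
    """Yield lines that mention an error, ignoring case."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if "error" in line.lower():
                yield line.rstrip("\n")


if __name__ == "__main__":
    # "log" is a placeholder path; point it at your own log folder.
    for log in find_session_logs("log", "Shopping Site"):
        print(log.name)
        for line in error_lines(log):
            print("   ", line)
```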

2.1: Using Ruby

Warning!

In order to invoke screen-scraper from Ruby, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow that link, then come back here.

Run the Script

Your Ruby code will need to refer to screen-scraper's Ruby driver, called remote_scraping_session.rb. You can find this file in the misc\ruby\ folder of your screen-scraper installation. You'll want to copy that file into the directory where you plan on putting the Ruby file that will invoke screen-scraper.

Download the shopping.rb.txt file, then save it in the same directory where you copied the remote_scraping_session.rb file. Rename the file from shopping.rb.txt to shopping.rb.

Run the command ruby shopping.rb in your console. You'll be asked which keyword to search. Type in a product keyword, such as bug, then press the Enter key. If all goes well the program will take a little while to load (it's waiting as screen-scraper extracts the data), then it will output the corresponding products.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your Ruby file resides on, make sure that screen-scraper is allowing connections from the Ruby machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the Ruby machine.
  • Ensure that the permissions on the shopping.rb and remote_scraping_session.rb files are such that you can execute them.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.

Understand the Script

Assuming that test worked, fire up your favorite Ruby editor and open the shopping.rb file in it. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing the Ruby documentation, or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its log folder. Take a look in that folder for your "Shopping Site" log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

2.1: Using VB.NET

Warning!

In order to invoke screen-scraper from VB.NET, screen-scraper needs to be running in server mode. If you'd like a refresher on how to start up screen-scraper in server mode go ahead and follow the link, then return here.

Run the Script

Download the shopping.vb file. Rename the file from shopping.vb.txt to shopping.vb. From your .NET environment compile and execute the file.

Troubleshoot Problems (if any arise)

If that didn't go quite as you expected here are some things to check:

  • Make sure screen-scraper is running as a server, and that nothing is blocking its ports (such as a firewall running on your machine).
  • If you're running screen-scraper on a different machine than the one your VB class resides on, make sure that screen-scraper is allowing connections from the VB machine. In the screen-scraper workbench click on the (wrench) icon, then on the Servers button, and check that the Hosts to allow to connect includes the IP address (or perhaps just the first part of the IP address) of the VB machine. You might also try blanking that property out entirely, which will allow connections from any host. When developing, this is usually the easiest approach.
  • Check screen-scraper's log folder for a Shopping Site log file. If you find one it means that screen-scraper is at least receiving the request. Open the log file in a text editor to see if you find any error messages.
  • If you still can't seem to get it to work feel free to post to our forum.

Understand the Script

Assuming that test worked, take a closer look over the shopping.vb class. The file is pretty heavily commented, so hopefully it makes sense what's going on. If not, try reviewing our .NET documentation or posting to our forum.

View the Log

When you invoke screen-scraper as a server it creates log files corresponding to your scraping session in its log folder. Take a look in that folder for your Shopping Site log file and take a look through it. It should look similar to what you see when you run scraping sessions in the workbench.

3: Where to Go From Here

Suggestions

First off, congratulations! You have made it through another tutorial and are progressing in your ability to extract information from the web. The approach outlined in this tutorial works well for relatively small sets of data. When we extract records from the shopping site we're probably not going to extract more than 25 or so. screen-scraper holds the extracted data in memory (remember that we checked the Automatically save the data set generated by this extractor pattern in a session variable checkbox for the PRODUCTS extractor pattern, which is what causes this to happen), and that works fine here because there aren't many products.

More Training/Tutorials

Where to next? Well, what would happen if we needed to extract and save large numbers of records? The simple answer is that you need to save them out as they're extracted rather than having screen-scraper keep them in memory. Usually this means either inserting the scraped records into a database or writing them out to a text file.

Tutorial 2 already illustrated how to write the data out to a file, but Tutorial 5 will walk you through saving scraped data to a database (if you're interested in this you might also find this FAQ helpful).

Just remember that if you're writing the data out to a file you'll want to uncheck the box labeled Automatically save the data set generated by this extractor pattern in a session variable for the extractor pattern that pulls out the data you want to save. If it's checked it will cause screen-scraper to store all of the data in memory, which could cause it to run out of memory while it's running.
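The difference between holding everything in memory and saving records as they arrive can be shown with a short Python sketch. It has nothing to do with screen-scraper's internals: fake_records is a hypothetical generator standing in for records that arrive one at a time, and the TITLE and PRICE field names are assumptions. The point is that each record is written and then discarded, so memory use stays roughly constant regardless of how many records there are.

```python
import csv


def write_records(records, path, fieldnames):
    """Write each record as soon as it is produced; return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for record in records:  # records may be a generator
            writer.writerow(record)
            count += 1
    return count


def fake_records(n):
    """Hypothetical stand-in for records arriving one at a time."""
    for i in range(n):
        yield {"TITLE": f"item {i}", "PRICE": f"{i}.00"}


if __name__ == "__main__":
    written = write_records(fake_records(1000), "products.csv", ["TITLE", "PRICE"])
    print(f"Wrote {written} records")
```

Because write_records never builds a list, the same function works for 25 records or for several million.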

Tutorial 6 will use screen-scraper to create an XML feed from the e-commerce site, while Tutorial 7 will go through using a file of search terms to run the search scrape multiple times and write the results to a file.

Still a Little Lost?

If you don't feel comfortable with the process, we invite you to recreate the scrape using the tutorial only for reference. This can be done using only the screen-shots while you work on it. If you are still struggling you can search our forums for others like yourself and ask specific questions to the screen-scraper community.