In screen-scraping you often need to submit a form multiple times, using different parameters each time. For example, you may be extracting locations from the store locator service on a site, and need to submit the form for a series of zip codes. In this tutorial we'll walk through an example of how to go about that. We will continue from where the second tutorial left off.
If you haven't already gone through Tutorial 2, we would encourage you to do so now. If you don't still have the scraping session you created in Tutorial 2, you can download it and import it into screen-scraper.
This tutorial doesn't require anything beyond an instance of screen-scraper. You can go through it with any of the editions.
If you'd like to see the final version of the scraping session you'll be creating in this tutorial you can download it below.
Attachment | Size |
---|---|
Shopping Site (Scraping Session).sss | 13.06 KB |
Our Shopping Site scraping session is pretty limited in that it can only handle one search term. What if we want to extract products for multiple search terms? For example, we may want to scrape various DVD titles that would fit with the other titles in our collection. We could search for the new DVDs using a series of keywords.
We're going to alter the existing Shopping Site scraping session so that it reads in a file containing search terms and performs a search for each one. Just as before, as it performs each search it will follow the details links and extract information for each product. Once the information is extracted, it will be written out to a file.
The changes we'll be making to our Shopping Site scraping session in order to add this new functionality are actually pretty minor. First, let's deal with the trickiest part (which really isn't all that tricky): creating the script that will read in the file containing our search terms, and run each search.
Create a new script by clicking the (Add a new script) icon in the button bar. Give the script the name Read search terms. Leave the Language drop-down list with the value Interpreted Java. Paste in the following for the content of the script text:
First off we create a few objects that are going to allow us to read in search terms from a file called search_terms.txt. We then read the search terms in line-by-line in a while loop. For each search term we're going to invoke the Search results scrapeable file.
Remember that the Search results scrapeable file is the one that handles issuing the search to the e-commerce web site, and walks through all of the product detail pages.
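As a rough illustration of the loop described above, here is a minimal sketch in plain Java. It is not the tutorial's original script. The `Session` interface is a hypothetical stand-in for the session object that screen-scraper makes available to a script, and its two methods are assumptions modeled on setting a session variable and invoking a scrapeable file by name; check the screen-scraper API documentation for the exact calls. Skipping blank lines is also an assumption. The `SEARCH` variable and the Search results scrapeable file come from the tutorial itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadSearchTermsSketch {

    /** Hypothetical stand-in for the session object a script works with. */
    interface Session {
        void setVariable(String name, Object value);
        void scrapeFile(String scrapeableFileName);
    }

    /**
     * Reads one search term per line and runs one search per non-blank term.
     * Returns the number of searches that were run.
     */
    static int runSearches(BufferedReader reader, Session session) {
        int searches = 0;
        try {
            String line;
            // Read the search terms line by line until the end of the input.
            while ((line = reader.readLine()) != null) {
                String term = line.trim();
                if (term.isEmpty()) {
                    continue; // assumption: blank lines are ignored
                }
                // Store the term where the Search results parameters expect it.
                session.setVariable("SEARCH", term);
                // Invoke the scrapeable file that issues the search.
                session.scrapeFile("Search results");
                searches++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return searches;
    }

    public static void main(String[] args) {
        // Placeholder terms held in memory; a real run would read search_terms.txt.
        BufferedReader reader =
                new BufferedReader(new StringReader("first term\nsecond term\n"));
        Session loggingSession = new Session() {
            public void setVariable(String name, Object value) {
                System.out.println("set " + name + " = " + value);
            }
            public void scrapeFile(String scrapeableFileName) {
                System.out.println("scrape " + scrapeableFileName);
            }
        };
        System.out.println("searches run: " + runSearches(reader, loggingSession));
    }
}
```

To read from the actual file instead of an in-memory string, the reader could be built with `new BufferedReader(new FileReader("search_terms.txt"))` from `java.io`, and should be closed once the loop finishes.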
Please do each of the following:
To disable the script, click on the Shopping Site scraping session in the objects tree, then on the General tab. Un-check the box in the table under the Enabled column.
Click on the Search results scrapeable file, then check the box labeled This scrapeable file will be invoked manually from a script. You might notice that the icon loses its pound sign (#) once the file is taken out of sequence, and that the file is now grouped with the other scrapeable files that are not sequenced.
Click on the Details page scrapeable file, then on the Extractor Patterns tab. For the PRODUCTS extractor pattern, in its Scripts section (below the box for the pattern text), ensure that the Write data to a file script's Enabled box is checked.
Click on the Login scrapeable file, then on the Add Script button (on the Properties tab). In the Script Name select Read search terms and in the When to Run make sure that After file is scraped is selected.
The last item we need to take care of is creating the text file that will contain our search terms. Let's keep it simple. Fire up your favorite text editor and create a file called search_terms.txt inside of screen-scraper's installation folder (e.g., C:\Program Files\screen-scraper professional edition\search_terms.txt). Add the following three lines to the text file:
Those search terms should yield at least a few DVDs we can add to our collection.
All right, now's the moment of truth. Run the updated scraping session by clicking on it in screen-scraper and clicking the Run Scraping Session button, then watch the Log tab to see it do its thing. If all goes well, once it's done, you should have a dvds.txt file in screen-scraper's install folder containing scraped data for all of the search terms.
Look carefully through the log. If it all seems to make sense, you're done. If not, read on so that we can walk through it in more detail.
The flow of events goes like this, once you hit the Run Scraping Session button:
You'll remember from the earlier tutorial that the SEARCH session variable is used to perform each search. Check the Parameters tab of the Search results scrapeable file for a reminder of where it's used.
It turns out that resetting to the first page is unnecessary in this particular case, because none of our searches returns more than one page of results. You will, however, want to remember it for future projects where that is not the case. For each search term we're performing a completely separate search, so we need to make sure we start on the first page.
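To see why starting each search on the first page matters, here is a minimal simulation in plain Java. It does not use screen-scraper at all. A map stands in for the session variables, the `PAGE` variable name is an assumption, and the "search" is simulated by simply advancing the page counter to the last results page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PageResetSketch {

    /**
     * Simulates one search per term and records the page each search starts on.
     * Each simulated search leaves PAGE pointing at its last results page.
     */
    static List<Integer> startingPages(List<String> terms,
                                       int pagesPerSearch,
                                       boolean resetBeforeEachSearch) {
        // Stand-in for session variables; "PAGE" is an assumed name.
        Map<String, Object> sessionVariables = new HashMap<>();
        sessionVariables.put("PAGE", 1);

        List<Integer> starts = new ArrayList<>();
        for (String term : terms) {
            if (resetBeforeEachSearch) {
                sessionVariables.put("PAGE", 1); // start this search on page one
            }
            sessionVariables.put("SEARCH", term);

            int page = (Integer) sessionVariables.get("PAGE");
            starts.add(page);

            // Walking through the results pages moves PAGE forward.
            sessionVariables.put("PAGE", page + pagesPerSearch - 1);
        }
        return starts;
    }
}
```

With three pages per search and no reset, the second and third searches would begin partway through their results. With a single page per search, as in this tutorial, the reset makes no difference, which is why it is optional here.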
Remember that the Log tab is key to understanding the flow of events in screen-scraper. If you're still a bit fuzzy on how things are working, try looking more carefully through the log to piece together how the site is being scraped.
Once again, congratulations on completing the tutorial. At this point feel free to experiment a bit. You may want to try adding a few more search terms to the search_terms.txt file. The best way to proceed from here would probably be to try this on your own project. If you run into any glitches, don't hesitate to post to our forum so that we can lend a hand.
You are as always welcome to continue through the Tutorials or to read the existing documentation.
If you don't feel comfortable with the process, we invite you to recreate the scrape using the tutorial only for reference. This can be done using only the screen-shots while you work on it. If you are still struggling you can search our forums for others like yourself and ask specific questions to the screen-scraper community.