3: Running the Scrape
The Search Terms
The last item we need to take care of is creating the text file that will contain our search terms. Let's keep it simple. Fire up your favorite text editor and create a file called search_terms.txt inside of screen-scraper's installation folder (e.g., C:\Program Files\screen-scraper professional edition\search_terms.txt). Add the following three lines to the text file:
bug
speed
blade
Those search terms should yield at least a few DVDs we can add to our collection.
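If you would rather generate the file than type it by hand, a minimal sketch along these lines would do it. This is standalone Python, not screen-scraper script code, and write_search_terms is a hypothetical helper name introduced only for illustration.

```python
from pathlib import Path


def write_search_terms(path, terms):
    """Write one search term per line and return the path that was written.

    Hypothetical helper: the tutorial itself simply uses a text editor.
    """
    path = Path(path)
    # One term per line, with a trailing newline after the last term.
    path.write_text("\n".join(terms) + "\n", encoding="utf-8")
    return path
```

Calling write_search_terms("search_terms.txt", ["bug", "speed", "blade"]) from inside screen-scraper's installation folder would produce the same file described above. Adjust the path to match wherever screen-scraper is installed on your machine.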
Run the Scrape
All right, now's the moment of truth. Run the updated scraping session by clicking on it in screen-scraper and clicking the Run Scraping Session button, then watch the Log tab to see it do its thing. If all goes well, once it's done, you should have a dvds.txt file in screen-scraper's install folder containing scraped data for all of the search terms.
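For a quick sanity check after the run, something like the following sketch can confirm that dvds.txt exists and is not empty. It is standalone Python, not part of screen-scraper. The layout of dvds.txt depends on the Write data to a file script from the earlier tutorial, so this only counts non-blank lines and makes no assumption about how records are formatted. count_records is a hypothetical name.

```python
from pathlib import Path


def count_records(path):
    """Return the number of non-blank lines in the file, or None if it is missing.

    Hypothetical helper: it treats every non-blank line as something worth
    counting and does not try to interpret the file's contents.
    """
    path = Path(path)
    if not path.exists():
        return None
    # errors="replace" keeps the check from failing on unexpected characters.
    with path.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())
```

A result of None means the file was never created, which usually points back to the Log tab; a result of 0 means the file exists but nothing was written to it.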
Scrape Process
Take a careful look through the log. If it all seems to make sense, you're done. If not, read on so that we can walk through it step by step.
The flow of events goes like this, once you hit the Run Scraping Session button:
- The scraping session starts up and goes through the process of logging in.
- The Read search terms script is invoked.
- The Read search terms script creates a few objects, then reads in the first line of the search_terms.txt file: "bug".
- The Read search terms script sets the SEARCH session variable to the value "bug".
You'll remember from the earlier tutorial that the SEARCH session variable is used to perform each search. Check the Parameters tab of the Search results scrapeable file for a reminder of where it's used.
- The Read search terms script initializes the PAGE session variable to "1".
As it turns out, none of our searches returns more than one page of results, so this step is unnecessary in this particular case. You will, however, want to remember it for future projects where a search can span several pages. For each search term we're performing a completely separate search, so we need to make sure we start on the first page.
- The Read search terms script invokes the Search results scrapeable file. This is essentially the same thing as clicking the Search button on the search form with the current search term (in this case, "bug").
- The Search results scrapeable file makes the HTTP request, then applies the PRODUCT extractor pattern to the HTML in order to get all of the details links.
- For each match of the PRODUCT extractor pattern, the Scrape details page script gets invoked.
- At this point screen-scraper will loop zero or more times. It will scrape the Details page scrapeable file for each link found on the search results page.
- Each time the Details page scrapeable file is invoked it requests the page, extracts out the data we want, then invokes the Write data to a file script, which writes out the extracted data to the dvds.txt file.
- Once screen-scraper has finished performing the search for "bug" control flows back to our original Read search terms script, where it moves on to the next search term in the file: "speed". From there you can go back to step 3, where it begins the search process again.
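The order of events above can be sketched as a pair of nested loops. The following is a standalone Python illustration of the control flow only; it does not use screen-scraper's scripting API. run_session, search_site, fetch_details, and write_record are hypothetical names, and the login step is left out. A plain dictionary stands in for the session variables.

```python
def run_session(search_terms, search_site, fetch_details, write_record):
    """Mimic the order of events described above for each search term.

    search_site(term, page) -> list of details links (Search results file)
    fetch_details(link)     -> extracted record       (Details page file)
    write_record(record)    -> None                   (Write data to a file)
    All three callables are stand-ins supplied by the caller.
    """
    session = {}  # stands in for the session variables
    for line in search_terms:  # Read search terms: one line at a time
        term = line.strip()
        if not term:
            continue  # skip blank lines in the terms file
        session["SEARCH"] = term  # used by the search request
        session["PAGE"] = "1"     # every term is a fresh search, so start at page 1
        # One search request; the PRODUCT pattern yields zero or more links.
        links = search_site(session["SEARCH"], session["PAGE"])
        for link in links:
            record = fetch_details(link)  # request the details page and extract
            write_record(record)          # append the record to the output
        # The inner loop is done, so control returns here for the next term.
    return session


if __name__ == "__main__":
    # Made-up links purely for demonstration; they are not real results.
    fake_results = {"bug": ["/details/1", "/details/2"], "speed": ["/details/3"], "blade": []}
    written = []
    run_session(
        ["bug", "speed", "blade"],
        lambda term, page: fake_results[term],
        lambda link: {"link": link},
        written.append,
    )
    print(written)
```

Passing the three steps in as functions keeps the sketch focused on sequencing: the outer loop corresponds to the Read search terms script, and everything inside the inner loop corresponds to what happens for each PRODUCT match.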
Remember that the Log tab is key to understanding the flow of events in screen-scraper. If you're still a bit fuzzy on how things are working, try looking more carefully through the log to piece together how the site is being scraped.