Welcome to the Script Repository. Here you will find a continually expanding resource for sharing scripts and ideas. The purpose of this resource is to reduce the amount of programming experience you will need to successfully use screen-scraper.
Throughout this Drupal Book you will find chapters with scripts on initializing, writing, iterating, and more! We hope that these will be a useful addition to your scraping experience.
Most of these scripts are written in Java, our development language of choice. If you have created a script yourself and would like it to be publicly available, send us an email from our contact us page.
The basic idea of initializing is discussed in the second and third tutorials. An initialize script generally serves one of two purposes: providing values (such as a starting page) to the session before the scrape begins, or looping through a set of inputs.
As you can guess, you might have both of these needs in a single script or in two different scripts. Regardless, here we present different methods for initialization scripts, with variations such as where the values of your variables come from.
This script is extremely useful because its purpose is to enable you to read inputs in from a CSV list. For example, if you wanted to use all 50 state abbreviations as input parameters for a scrape, this script would cycle through them all. Furthermore, this script truly begins to show the power of an initialize script as a looping mechanism.
This particular example uses a CSV of streets in Bristol, RI. The streets are separated by commas, with only one street per line. The "while" loop at the bottom of the example retrieves streets one by one until the buffered reader runs out of lines. Each street is stored in a session variable named STREET and used as an input later on. Each time the buffered reader brings in a new street, it overwrites the previous value of the STREET session variable.
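A minimal sketch of that loop (the file name "streets.csv" and the scrapeable file name "Search results" are placeholders, not the names from the original script):

```java
import java.io.BufferedReader;
import java.io.FileReader;

// Read one street per line and scrape the search results for each.
BufferedReader reader = new BufferedReader(new FileReader("streets.csv"));
String street;

while ((street = reader.readLine()) != null) {
    street = street.trim();
    if (street.length() == 0) continue;

    // Overwrites any previous value of STREET, as described above.
    session.setVariable("STREET", street);
    session.scrapeFile("Search results");
}

reader.close();
```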
Reading in from a CSV is incredibly powerful; however, it is not the only way to use a loop. For information on how to use an array for inputs please see the "Moderate Initialize -- Input from Array".
The next script (below) deals with input CSV files that have more than one piece of information per row (more than one column).
Sometimes a CSV file will wrap data in quotes (in case the data contains a comma that does not signify a new field). Since this is a common thing to do, a script that reads a CSV should anticipate and deal with that eventuality. The main workhorse of this script is a function that, when passed a CSV line, parses the fields into an array.
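A rough sketch of what such a quote-aware function might look like (the name splitCsvLine and the details are mine; the original script's implementation may differ):

```java
// Hypothetical quote-aware CSV line parser; doubled quotes inside a quoted
// field are treated as a literal quote character.
String[] splitCsvLine(String line) {
    java.util.List fields = new java.util.ArrayList();
    StringBuffer current = new StringBuffer();
    boolean inQuotes = false;

    for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);
        if (c == '"') {
            if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                current.append('"');
                i++;
            } else {
                inQuotes = !inQuotes;
            }
        } else if (c == ',' && !inQuotes) {
            fields.add(current.toString());
            current.setLength(0);
        } else {
            current.append(c);
        }
    }
    fields.add(current.toString());
    return (String[]) fields.toArray(new String[0]);
}
```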
Alternatively, you can read the CSV via the opencsv package that is included with screen-scraper. This may be more robust for different CSV formats.
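A sketch using opencsv (the package name below is an assumption that depends on the bundled version; newer builds use com.opencsv, and the file path and scrapeable file name are placeholders):

```java
import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;

// NOTE: the opencsv package name is an assumption; check the jar in lib.
CSVReader reader = new CSVReader(new FileReader("input.csv"));
String[] fields;

while ((fields = reader.readNext()) != null) {
    // Each element of "fields" is one column, with quoting already handled.
    session.setVariable("STREET", fields[0]);
    session.scrapeFile("Search results");
}

reader.close();
```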
The following script is really useful when you need to loop through a short series of input parameters. Using an array will allow you to rapidly set up a group of inputs you would like to use; however, you will need to know every input parameter in advance. For example, if you wanted to use the state abbreviations [UT, NY, AZ, MO] as inputs, building an array would be really quick; but if you needed all 50 states it would probably be easier to read them from a CSV (need to know how to use a CSV input? check out the post titled "Moderate Initialize -- Input from CSV").
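A minimal sketch of the idea ("Search results" is a placeholder scrapeable file name):

```java
// Loop over a short, hard-coded list of inputs.
String[] states = { "UT", "NY", "AZ", "MO" };

for (int i = 0; i < states.length; i++) {
    session.setVariable("STATE", states[i]);
    session.scrapeFile("Search results");
}
```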
Many sites require the user to input a zip code when performing a search. For example, when searching for car listings, a site will ask for the zip code where you would like to find a car (and perhaps the acceptable distance from that zip code). The following script is designed to iterate through a set of input files, each of which contains a list of zip codes for one state. The input files in this case are located within a folder named "input" in the screen-scraper directory. The files are named in the format "zips_CA", for example, which would contain California's zip codes.
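A rough sketch of the idea, assuming one zip code per line in files like input/zips_CA.csv (the folder layout and the "Search results" scrapeable file name are placeholders):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

// Walk the input folder and scrape once per zip code found.
File inputDir = new File("input");
File[] files = inputDir.listFiles();

for (int i = 0; i < files.length; i++) {
    if (!files[i].getName().startsWith("zips_")) continue;

    BufferedReader reader = new BufferedReader(new FileReader(files[i]));
    String zip;
    while ((zip = reader.readLine()) != null) {
        zip = zip.trim();
        if (zip.length() == 0) continue;
        session.setVariable("ZIP", zip);
        session.scrapeFile("Search results");
    }
    reader.close();
}
```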
When a scraping session is started, it can be a good idea to feed certain pieces of information to the session before it begins resolving URLs. This simple version of the initialize script demonstrates how you might start on a certain page. While basic, understanding when a script like this would be used is pivotal in making screen-scraper work for you.
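A minimal sketch of such a script (assuming your first scrapeable file expects a PAGE parameter):

```java
// Seed the session with the page to start on before any URLs are resolved.
session.setVariable("PAGE", "1");
```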
The above code is useful where "PAGE" is an input parameter in the first page you would like to scrape.
Occasionally a site will be structured so that, instead of page numbers, the site displays records 1-10 or 20-29. If this is the case, your initialize script could look something like this:
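A hedged sketch, seeding record-range parameters instead of a page number:

```java
// Start with the first block of records; the increment is handled elsewhere.
session.setVariable("DISPLAY_RECORD_MIN", "1");
session.setVariable("DISPLAY_RECORD_MAX", "10");
```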
Once again "DISPLAY_RECORD_MIN" and "DISPLAY_RECORD_MAX" are input parameters on the first page you would like to scrape.
If you feel you understand this one, I'd encourage you to check out the other Initialize scripts in this code repository.
The following files contain the zip codes for each state. The file "zips_US.csv" contains all US zip codes in one file. If you wish to download all of the CSVs at once, you may choose to download the file "zips_all_states.zip".
Note: If you've forgotten the state abbreviations please visit http://www.usps.com/ncsc/lookups/usps_abbreviations.html
Last updated 5/8/2008
Attachment | Size |
---|---|
zips_AL.csv | 5.73 KB |
zips_AR.csv | 4.16 KB |
zips_AZ.csv | 3.03 KB |
zips_CA.csv | 20.7 KB |
zips_CO.csv | 4.53 KB |
zips_CT.csv | 2.58 KB |
zips_DE.csv | 686 bytes |
zips_FL.csv | 10.1 KB |
zips_GA.csv | 5.92 KB |
zips_IA.csv | 6.25 KB |
zips_ID.csv | 1.94 KB |
zips_IL.csv | 9.31 KB |
zips_IN.csv | 5.79 KB |
zips_KY.csv | 6.87 KB |
zips_LA.csv | 4.21 KB |
zips_MA.csv | 4.17 KB |
zips_MD.csv | 4.23 KB |
zips_ME.csv | 2.98 KB |
zips_MI.csv | 6.84 KB |
zips_MN.csv | 6.05 KB |
zips_MO.csv | 6.98 KB |
zips_NC.csv | 7.43 KB |
zips_ND.csv | 2.41 KB |
zips_NE.csv | 3.65 KB |
zips_NH.csv | 1.65 KB |
zips_NJ.csv | 4.33 KB |
zips_NM.csv | 2.5 KB |
zips_NV.csv | 1.47 KB |
zips_NY.csv | 13.04 KB |
zips_OH.csv | 8.54 KB |
zips_OK.csv | 4.55 KB |
zips_OR.csv | 2.82 KB |
zips_PA.csv | 15.06 KB |
zips_RI.csv | 546 bytes |
zips_SC.csv | 3.68 KB |
zips_SD.csv | 2.36 KB |
zips_TN.csv | 5.43 KB |
zips_TX.csv | 18.09 KB |
zips_UT.csv | 2 KB |
zips_VA.csv | 8.51 KB |
zips_VT.csv | 1.8 KB |
zips_WA.csv | 4.21 KB |
zips_WI.csv | 5.31 KB |
zips_WV.csv | 5.89 KB |
zips_WY.csv | 1.14 KB |
zips_all_states.zip | 178.54 KB |
zips_US.csv | 295.08 KB |
The form class can be a life saver when it comes to dealing with sites that use forms for their inputs and have a lot of dynamic parameters.
There are really only two cases in which using the form class is preferable to handling the parameters any other way. Those cases are:
In general, though, it will be easier for debugging if you can stick with the regular parameters tab.
One of the most common things to need is the ability to iterate over the results of a search. This usually requires the ability to iterate over the same page with changes to the parameters that are passed. There are examples of this in the second and third tutorials.
There are different methods to use and one thing to keep in mind: memory. This is especially important on larger scrapes and for basic users, where the number of scripts on the stack needs to be watched. Below are some examples of Next Page scripts. Which one you choose will depend on what the site makes available and what your needs are.
If you're scraping a site with lots of "next page" links, you are well advised to use the following script, instead of the other two listed here.
Conceptually, the problem with calling a script at the end of a scrapeable file, which calls the same scrapeable file over and over again, is that you're stacking the scrapeable files on top of one another. They never leave memory until the last page has completed, at which point the stack quickly unwinds. This style of scraping is called "recursive".
If you can't predict how many pages there will be, then this idea should scare you :) Instead, you should use an "iterative" approach. Instead of chaining the scrapeable files onto the end of one another, you call one, let it finish and return to the script that called it, and then the script calls another. A while/for loop is well suited to this.
Here's a quick comparison so that you can properly visualize the difference: the recursive approach piles one scrapeable file on top of another until the last page completes, while the iterative approach holds only one scrapeable file at a time. Much more effective. Script code follows.
So here's how to do it. When you get to the point where you need to start iterating search results, call a script which will act as a little controller for the iteration of pages. This will handle page numbers and offset values (in the event that page iteration isn't using page numbers).
First, your search results page should match some extractor pattern which indicates that there is a next page. This removes the need to know what the next page number actually is, and reduces next-page detection to a simple boolean true or false. The pattern should match some text that signifies a next page is present. In the example code below, I've named the variable "HAS_NEXT_PAGE". Be sure to save it to a session variable. If there is no next page, then this variable should not be set at all; that will be the flag for the script to stop trying to iterate pages.
The script provides a "PAGE" session variable and an "OFFSET" session variable. Feel free to use either one, whichever your situation calls for.
OFFSET will (given the default values in the script) be 0, 20, 40, 60, etc.
PAGE will be 1, 2, 3, 4, 5, etc.
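A sketch of what that controller might look like (the scrapeable file name "Search results" and the offset step of 20 are my assumptions, not the original script's):

```java
// Iterative "next page" controller: only one scrapeable file in memory at a time.
int page = 1;
int offset = 0;

while (true) {
    session.setVariable("PAGE", String.valueOf(page));
    session.setVariable("OFFSET", String.valueOf(offset));

    // Clear the flag; the extractor pattern sets it again only if a next page exists.
    session.setVariable("HAS_NEXT_PAGE", null);
    session.scrapeFile("Search results");

    if (session.getVariable("HAS_NEXT_PAGE") == null) {
        break;
    }
    page++;
    offset += 20;
}
```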
The following script is called upon completion of scraping the first page of a site's details. This script is useful when matching the current page number in the HTML is preferable or simpler than matching the next page number. Depending on how a site is coded, the number of the next page may not even appear on the current page. In this case, we would match for the word "Next" simply to determine whether a next page exists or not; the regular expression used for it is described below.
The regular expression for the lone token ~@NEXT@~ would be the text that suggests a next page exists, such as "Next Page" or perhaps a simple ">>" link.
The only change you should have to make to the code below is to set any variable names properly (if they differ from those in your own project), and to set the correct scrapeable file name near the bottom.
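In case the original listing isn't visible here, a hedged sketch of what such a script typically does (the variable names NEXT and CURRENT_PAGE and the "Search results" scrapeable file name are placeholders):

```java
// If the "Next" text was matched, request the next page; otherwise do nothing.
if (session.getVariable("NEXT") != null) {
    int currentPage = Integer.parseInt(session.getVariable("CURRENT_PAGE").toString());
    session.setVariable("PAGE", String.valueOf(currentPage + 1));

    // Set the correct scrapeable file name here.
    session.scrapeFile("Search results");
}
```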
One of our fellow contributors to this site posted a Next Page script which can be very useful, but may be more code than you actually need. Because every site is constructed differently, iterating through pages can be one of the most difficult parts for a new screen-scraper user to master. Indeed, designing how to get from page to page typically takes some creativity and precision.
One initial word of warning about going from page to page: occasionally a site will be designed so you can get to the next page from both the top and the bottom of the current page. Everybody has seen these before. For example, you're looking through a site which sells DVDs, and at the top and the bottom of the list there is a group of numbers that shows what page you are currently viewing, the previous page, the next page, and sometimes the last page. The problem occurs when your pattern matches the next-page link before you get to the data you want extracted. If that happens, your session begins to flip through pages at a very fast rate without retrieving any information at all! Do yourself a favor and match the link at the bottom of the page.
After you have a successful match, the following script can be applied "Once if pattern matches".
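A sketch of that one-liner (assuming the search results scrapeable file is named "Search results" and the matched pattern has already saved the next-page parameter to a session variable):

```java
// Re-request the search results page; the next-page value saved by the
// extractor pattern is picked up as a parameter of the scrapeable file.
session.scrapeFile("Search results");
```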
We realize that it is only one line of code, but in many cases that is all that it needs to be.
A sub-extractor pattern can only match one element. Manual data extraction gives you the same additional context as a sub-extractor pattern, but also allows you to extract multiple data records.
This example makes use of the extractData() method.
The code and examples below demonstrate how to first isolate and extract a portion of a page's total HTML, so that a second extractor pattern may then be applied to just the extracted portion. Doing so can limit the results to only those found on a specific part of the page. This can be useful when you have 100 apples that all look the same but you really only want five of them.
The following screen shots show an example of when the script above might be used. In this example, we are only interested in the active (shown with green dots) COMPANY APPOINTMENTS, and not the LICENSE AUTHORITIES (sample HTML available at the end).
When applied to all of the HTML of the current scrapeable file, the following extractor pattern will retrieve all of the HTML that makes up the COMPANY APPOINTMENTS table above. But remember, we only want the active appointments.
Use the extractor pattern below to match against the HTML above. It will return two results: 21ST CENTURY INSURANCE COMPANY and AIG CENTENNIAL INSURANCE COMPANY, since those are the only two active company appointments. Note that the "Appointment" extractor pattern includes the word "GREEN", so that the "RED" (inactive) company appointments are excluded.
Be sure to check the box that says "This extractor pattern will be invoked manually from a script". This ensures that the extractor pattern will not run in sequence with the other extractor patterns.
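A rough sketch of the script side of this technique (the pattern name "Appointment", the variable APPOINTMENTS_HTML, and the COMPANY_NAME token are placeholders for your own names):

```java
import com.screenscraper.common.*;

// Apply the manually invoked "Appointment" pattern to just the block of HTML
// captured by the first pattern. Requires scrapeableFile to be in scope.
String block = session.getVariable("APPOINTMENTS_HTML").toString();

DataSet appointments = scrapeableFile.extractData(block, "Appointment");

for (int i = 0; i < appointments.getNumDataRecords(); i++) {
    DataRecord record = appointments.getDataRecord(i);
    session.log("Active appointment: " + record.get("COMPANY_NAME"));
}
```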
This script is designed to check how recent a post or advertisement is. If you were gathering time-sensitive information and only wanted to reach back a few days, then this script would be handy. After the date is evaluated, there is a section for calling other scripts from inside this script.
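A hedged sketch of the idea (the POST_DATE variable, the "MM/dd/yyyy" format, the three-day cutoff, and the "Scrape details" script name are all placeholder assumptions):

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

// Compare the scraped date against a cutoff a few days back.
SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy");
Date postDate = format.parse(session.getVariable("POST_DATE").toString());

Calendar cutoff = Calendar.getInstance();
cutoff.add(Calendar.DAY_OF_MONTH, -3);

if (postDate.after(cutoff.getTime())) {
    // Recent enough -- call whatever scripts handle the listing.
    session.executeScript("Scrape details");
} else {
    session.log("Listing is older than the cutoff; skipping.");
}
```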
Hopefully it is evident that the above code is useful for comparing today's date against a previous one. Depending on your needs, you might consider developing a script which moves your scraping session on once it reaches a certain date in a listing. For example, if you were scraping an auction website for many search terms, you might want to move on to the next term once you have reached a specified date in the listings. What are some other ways this script could be useful?
There are many ways to output scraped data from screen-scraper. Below are sample scripts of some common ways.
The following script contains a method that you may instead wish to call from within your "Write to CSV" script. The purpose of the script is to put phone numbers into a standard format (123-456-7890 x 1234) prior to output. Note: Be careful when using this script to work with non-U.S. phone numbers, since other countries may have more or fewer digits.
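A sketch of such a method (the name formatPhone, the PHONE variable, and the extension handling are my assumptions):

```java
// Hypothetical helper that normalizes a US phone number to 123-456-7890 x 1234.
String formatPhone(String raw) {
    if (raw == null) return "";
    String digits = raw.replaceAll("[^0-9]", "");
    if (digits.length() < 10) return raw;  // leave unusual numbers untouched

    String formatted = digits.substring(0, 3) + "-"
            + digits.substring(3, 6) + "-"
            + digits.substring(6, 10);
    if (digits.length() > 10) {
        formatted += " x " + digits.substring(10);
    }
    return formatted;
}

session.setVariable("PHONE", formatPhone(session.getVariable("PHONE").toString()));
```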
The following script proves useful in most cases when there is a need to separate a full name into first name, middle name, surname, and suffixes (if applicable). The suffixes include JR, SR, I, II, III, 3rd, IV, V, VI, VII. The script is also set up to work with names in the "LASTNAME, FIRSTNAME SUFFIX" format.
The following code is used to split zip codes from a pattern match. The code below takes a zip code and assigns the first five digits to the variable "ZIP". If the zip code is in the longer format (12345-6789), as opposed to the shorter format (12345), then the second part of the zip code, which comes after the "-" character, is assigned to the "ZIP4" variable (so named for the 4 digits following the "-" character). This script would be useful in cases where zip codes must be standardized.
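A hedged sketch (assuming the full zip code was saved to a session variable named ZIPCODE):

```java
// Split "12345-6789" (or plain "12345") into ZIP and ZIP4.
String zip = session.getVariable("ZIPCODE").toString().trim();

int dash = zip.indexOf("-");
if (dash != -1) {
    session.setVariable("ZIP", zip.substring(0, dash));
    session.setVariable("ZIP4", zip.substring(dash + 1));
} else {
    session.setVariable("ZIP", zip);
    session.setVariable("ZIP4", "");
}
```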
This is a simple script used for removing all non-numerical characters from numbers. This is particularly useful when attempting to normalize data before inserting it into a database.
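The heart of it is a one-liner; a sketch (PRICE is a placeholder variable name):

```java
// Strip everything except digits before the value is written to the database.
String raw = session.getVariable("PRICE").toString();
session.setVariable("PRICE", raw.replaceAll("[^0-9]", ""));
```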
Probably the easiest way to write to a comma-separated value (CSV) document is to use screen-scraper's included CsvWriter. If for some reason you can't or don't wish to use the CsvWriter, the following code will also accomplish the task. CSV files are very useful for viewing in spreadsheets or for inserting values into a database.
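A hedged sketch of the plain-Java approach (the output path and the NAME/PHONE column variables are placeholders for your own):

```java
import java.io.FileWriter;

// Append one line per data record to output.csv.
FileWriter writer = new FileWriter("output.csv", true);

String name = session.getVariable("NAME") == null ? "" : session.getVariable("NAME").toString();
String phone = session.getVariable("PHONE") == null ? "" : session.getVariable("PHONE").toString();

// Quote each field so embedded commas don't break the columns.
writer.write("\"" + name + "\",\"" + phone + "\"\n");
writer.close();

// Clear the variables so they don't persist into the next dataRecord.
session.setVariable("NAME", null);
session.setVariable("PHONE", null);
```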
Also, you'll notice that the session variables are cleared out at the end of the script. This is done when you don't want a session variable to persist into the next dataRecord. For more about scope and dataRecords, please go here.
Overview
Oftentimes once you've extracted data from a page you'll want to write it out to an XML file. screen-scraper contains a special XmlWriter class that makes this a snap.
This script uses objects and methods that are only available in the enterprise edition of screen-scraper.
To use the XmlWriter class you'll generally follow these steps:
The trickiest part is understanding which of the various addElement and addElements methods to call.
Examples
If you're scripting in Interpreted Java, the script in step 1 might look something like this:
In subsequent scripts, you can get a reference to that same XmlWriter object like this:
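Assuming the earlier script stored the writer in a session variable (the name "_XML_WRITER" is my placeholder, not necessarily the name the original example used):

```java
// Interpreted Java allows loose typing, so a cast isn't strictly required here.
xmlWriter = session.getVariable("_XML_WRITER");
```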
You could then add elements and such to the XML file. The following three examples demonstrate the various ways to go about that. Each of the scripts is self-contained in that it creates, adds to, and then closes the XmlWriter object. Bear in mind that this process could be spread across multiple scripts, as described above.
Example 1
This script would produce the following XML file:
Example 2
This script would produce the following XML file:
Example 3
This script would produce the following XML file:
Consider using the SqlDataManager as an alternative way to interact with your JDBC-compliant databases.
This example is designed to give you an idea of how to interact with MySQL, a JDBC-compliant database, from within screen-scraper.
You will need to have MySQL already installed and the service running.
To start, download the JDBC driver for MySQL (the Connector/J jar file) and place it in the lib/ext folder where screen-scraper is installed.
Next, create a script wherein you set the different values used to connect to your database. It is recommended that you call this script from your scraping session before the scraping session begins.
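A sketch of what that settings script might hold (every value and variable name below is a placeholder for your own):

```java
// Connection settings used by the query script that follows.
session.setVariable("DB_HOST", "localhost");
session.setVariable("DB_PORT", "3306");
session.setVariable("DB_NAME", "my_database");
session.setVariable("DB_USER", "my_user");
session.setVariable("DB_PASSWORD", "my_password");
```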
Create another script to set up your connection and perform queries on your database. Note that it is necessary to include the connection to your database within the same script as your queries.
You will be calling this script after you have extracted data. Typically this will either be after a scrapeable file runs or after an extractor pattern's matches are applied.
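A hedged sketch of that second script (the table and column names are placeholders, and the driver class name may differ for newer Connector/J versions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Build the connection from the values set in the settings script.
Class.forName("com.mysql.jdbc.Driver");
String url = "jdbc:mysql://" + session.getVariable("DB_HOST") + ":"
        + session.getVariable("DB_PORT") + "/" + session.getVariable("DB_NAME");

Connection conn = DriverManager.getConnection(url,
        session.getVariable("DB_USER").toString(),
        session.getVariable("DB_PASSWORD").toString());

// Insert the extracted values; "listings" and its columns are placeholders.
PreparedStatement stmt = conn.prepareStatement(
        "INSERT INTO listings (name, phone) VALUES (?, ?)");
stmt.setString(1, session.getVariable("NAME").toString());
stmt.setString(2, session.getVariable("PHONE").toString());
stmt.executeUpdate();

stmt.close();
conn.close();
```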
Overview
Oftentimes once you've extracted data from a page you'll want to write it to a database. Screen-scraper contains a special SqlDataManager class that makes this easy.
This script uses objects and methods that are only available in the professional and enterprise editions of screen-scraper.
To use the SqlDataManager class you'll generally follow these steps:
The trickiest part is understanding when to call the commit method when writing to related tables.
Examples
If you're scripting in Interpreted Java and using a MySQL database, the script for steps 1-3 might look something like this:
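In case the original listing isn't visible here, a hedged sketch of such a setup script (the connection values are placeholders, and the exact package names may vary slightly by screen-scraper version; the _DBMANAGER variable matches the examples below):

```java
import com.screenscraper.datamanager.sql.SqlDataManager;
import org.apache.commons.dbcp.BasicDataSource;

// Point a connection pool at the database; all connection values are placeholders.
BasicDataSource ds = new BasicDataSource();
ds.setDriverClassName("com.mysql.jdbc.Driver");
ds.setUrl("jdbc:mysql://localhost:3306/my_database");
ds.setUsername("my_user");
ds.setPassword("my_password");

// Create the data manager, read the table structure, and save it for later scripts.
SqlDataManager dm = new SqlDataManager(ds);
dm.buildSchemas();
session.setVariable("_DBMANAGER", dm);
```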
Note that if you are using a database other than MySQL, the only change to this script will be the String passed to the setUrl method of the BasicDataSource.
In subsequent scripts, you can get a reference to that same SqlDataManager object like this:
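For example (assuming the setup script saved the manager to _DBMANAGER, as below):

```java
import com.screenscraper.datamanager.sql.SqlDataManager;

// Retrieve the data manager stored by the setup script.
SqlDataManager dm = (SqlDataManager) session.getVariable("_DBMANAGER");
```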
You could then add data to the data manager. The following examples demonstrate various ways to go about that. Each of the scripts assumes you have already created an SqlDataManager object in a previous script and saved it to the session variable _DBMANAGER.
Saving to a single table using a data record
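A sketch of what this might look like (the table name "people" matches the description below):

```java
// Retrieve the data manager, then map the data record's keys onto matching columns.
dm = session.getVariable("_DBMANAGER");
dm.addData("people", dataRecord);
dm.commit("people");
dm.flush();
```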
If the data record saved above had key-value pairs:
NAME = John Doe
AGE = 37
WEIGHT = 160
and the table 'people' had columns 'name', 'age', 'weight', and 'gender', the script above would produce the following row in the people table.
Saving to a single table manually
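A sketch of adding column values one at a time (the literal values are placeholders; GENDER comes from a session variable, as described below):

```java
// Add each column value by hand instead of from a data record.
dm = session.getVariable("_DBMANAGER");
dm.addData("people", "name", "Jane Doe");
dm.addData("people", "age", "34");
dm.addData("people", "gender", session.getVariable("GENDER"));
dm.commit("people");
dm.flush();
```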
If the session variable GENDER had the value male and the table structure was the same as in the example above, this script would produce the following rows in the people table.
Note that you can mix the two methods shown above. Data can be added from multiple data records and/or manually for the same row.
Saving to multiple tables that are related.
This example assumes that you have a table in the database named people with fields 'id' (primary key/autoincrement), 'name', and 'address', and another table named phones with fields 'person_id', 'phone_number'.
Also, there is a foreign key relation between person_id in phones and id in people. This can be set up either in the database or when setting up the data manager and calling the addForeignKey method.
In order to make it easier to see inserted values, all calls to addData in this example will enter data manually. In many cases, however, adding a data record is much easier.
Also, remember that data does not have to be added and committed all at once. Usually tables with a parent/child relation will have one script called after each pattern match of an extractor pattern that adds and commits a row of child data, and then a separate script called elsewhere to add and commit the parent data.
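A hedged sketch of the single-script version (the values are placeholders; note that the child table, phones, is committed before the parent table, people, per the rule below):

```java
dm = session.getVariable("_DBMANAGER");

// Child rows first ...
dm.addData("phones", "phone_number", "555-123-4567");
dm.commit("phones");

dm.addData("phones", "phone_number", "555-765-4321");
dm.commit("phones");

// ... then the parent row.
dm.addData("people", "name", "John Doe");
dm.addData("people", "address", "123 Main St");
dm.commit("people");

// Write everything to the database.
dm.flush();
```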
Note the order in which tables were committed. All data in child tables must be committed before the data in the parent table.
This script would produce the following rows in the database:
The SqlDataManager takes care of filling in the data for the related fields. We never had to add the data for the person_id column in the phones table. Since id in people is an autoincrement field, we didn't have to add data for that field either.
Close the data manager
Once all data has been written to the database, close the data manager like this:
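For example:

```java
// Release the data manager's database resources once the scrape is finished.
dm = session.getVariable("_DBMANAGER");
dm.close();
```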
The SqlDataManager can be set to automatically link data connected in a many-to-many relation. To enable this feature, use the following code:
When this setting is enabled, the data manager will attempt to relate data across multiple tables when possible. For example, if there is a people table, an addresses table, and a person_has_address table used to relate the other two, you would only need to insert data into the people and addresses tables. The data manager would then fill in the person_has_address table, since it has foreign keys relating it to both people and addresses. See the example below.
This would produce the following result:
When extracting data that will contain many duplicate entries, it can be useful to filter values so that duplicate entries are not written to the database multiple times. The data manager can use a duplicate filter to check data being added to the database against data that has already been added, and either update or ignore duplicates. This is accomplished with an SqlDuplicateFilter object. To create a duplicate filter, call the SqlDuplicateFilter.register method, set the parent table it checks for duplicates on, and then add the constraints that indicate a duplicate. See the code below for an example of how to filter duplicates on a person table.
Duplicate filters are checked in the order they are added, so consider performance when creating them. If, for instance, most duplicates will match on the social security number, create that filter before the others. Also make sure to add indexes in your database on the columns that you are selecting by, or else performance will rapidly degrade as your database gets large.
Duplicates will be filtered by any one of the filters created. If multiple fields must all match for an entry to be a duplicate, create a single filter and add each of those fields as constraints, as shown in the third filter created above. In other words, constraints added to a single filter will be ANDed together, while separate filters will be ORed.
This script is handy when the site you are scraping separates out a lot of pieces of information that you would like to put back together. For example, let's say you were searching for apartments, and the site you are scraping separates out the number of bedrooms, bathrooms, the size of the garage, the number of living/family rooms, etc. You would like to be able to stick all of this information together into one string. To do this you need to concatenate all of the pieces from the session variables or dataRecord together, like this:
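A sketch (the variable names are placeholders for whatever your extractor patterns saved):

```java
// Glue the separately extracted pieces back into one description string.
String description = dataRecord.get("BEDROOMS") + " bed, "
        + dataRecord.get("BATHROOMS") + " bath, "
        + dataRecord.get("GARAGE") + " garage, "
        + dataRecord.get("LIVING_ROOMS") + " living/family rooms";

session.setVariable("DESCRIPTION", description);
```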
While the above code isn't rocket science, hopefully the value of putting multiple strings together is easy to see. Pulling them apart again could be a little more troubling. :)
There are times when you need to debug what is going on in your scrapes. The following can help with tracking down various issues.
If a scrape is taking a long time, using the scrape profiler can help you see which scrapeable files and/or scripts are using all the time, so you can optimize their runtimes.
Another reason to consider using the scrape profiler is that it includes a function to break when a session variable is overwritten, similar to a breakpoint on variable change. Using this you can determine when a session variable is being overwritten unexpectedly.
The session's EventCallback mechanism provides many ways for you to attach listeners to various parts of your scrape, enabling you to have even greater control over what happens, and when, in your scrapes.
Listed below are some examples of how to make use of this powerful class. In addition, you can check out the Session Profiler in the debugging section of the script repository to find more examples of using the event handler.
Frequently there are tasks that you will perform on a regular basis. While you can write separate scripts for each of these, sometimes it is more useful to create an object that can store information to be used between scripts, much like an object in Java. Below is a general utility script that contains many useful functions. The first few hundred lines list the methods and what they are used for. The script is rather large (over 6,500 lines), so please download it to view it.
The script is set up to create a Utility object when run and store it in the session variable "_GENERAL_UTILITY". Generally this script should run before anything else. Then, to use it during the scrape, it can be accessed by retrieving it from the session.
The output in the log from the above example would be something like the following, depending on the value of other variables that had been set.
The monitored-variables section tries to output common types of data sensibly. For instance, DataSet and DataRecord objects are output as shown above with the DATASET variable. Other classes that get similar output are List, Set, Map, and Exception. Also, in the Enterprise edition, a monitored ScrapeableFile will be output in the web interface with a clickable link to view the URL with the same POST request as the file used. This will not set cookies, so the page may or may not display as expected.
This script will periodically be updated with new functionality. It was recently converted to a .jar file to increase execution speed. Because of this, if the jar version is not in the lib/ext directory of screen-scraper, an error will be logged when the script is run, but everything should still work. The error simply informs you that the script version is being used, so it will not be as fast and may be missing a few features that could not be put in script form.
Attachment | Size |
---|---|
GeneralUtility.jar | 236 KB |
On occasion rather than downloading an entire web page you may only want to know when it was last updated, or perhaps its content type. This can be done via an HTTP HEAD request (as opposed to a GET or POST). This script shows you how to go about that.
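A hedged sketch using plain Java (the URL is a placeholder):

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Send a HEAD request so only the headers come back, not the page body.
URL url = new URL("http://www.example.com/somepage.html");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("HEAD");
connection.connect();

session.log("Last-Modified: " + connection.getHeaderField("Last-Modified"));
session.log("Content-Type: " + connection.getContentType());

connection.disconnect();
```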
The following script is only one line of code. You may be thinking, "Why would this script deserve a place in the repository?" and I'd answer, "I'll show you."
This code invokes the breakpoint. When a script is being developed it is common to run it from inside of screen-scraper. In fact, it is a best practice to run scraping sessions often and check the log to ensure that you are getting the results you want. It is during development that you might want to consider using this script.
First create a new script and label it breakpoint.
Then add this single line of code to it.
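That single line opens screen-scraper's breakpoint window:

```java
session.breakpoint();
```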
Now, when you want to check which variables are in scope you can include this script to run after a pattern is matched. This will come in very handy when you want to see what is in a dataRecord and what is saved as a session variable.
Then, when your testing is done, simply disable the script by removing the check mark in the "Enabled" box wherever you have placed it.
If you need your scraping session to run multiple times in succession, consider this script, which will repeat until it either hits the "quitTime" specified (24-hour clock) or reaches the "maxRuns" allowed. To quickly (and dirtily) disable the "maxRuns" factor, set it to 0 or something negative. To disable the time restraint in "quitTime", just make sure it starts with something greater than or equal to 24 (for example, "24:00" or "123412:43").
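In case the original listing isn't shown here, a rough sketch of the idea (the scrapeable file name and the simplified hour-based quit time are my assumptions, not the original script's):

```java
import java.util.Calendar;

// Simplified controller: repeat the scrape until maxRuns or quitHour is hit.
int maxRuns = 5;      // 0 or negative disables the run limit
int quitHour = 23;    // 24 or greater disables the time limit

int runs = 0;
while (true) {
    session.scrapeFile("Search results");   // placeholder scrapeable file name
    runs++;

    int hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY);
    if (maxRuns > 0 && runs >= maxRuns) break;
    if (quitHour < 24 && hour >= quitHour) break;
}
```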
The following script is useful in cases where you would like to restart a scrape from a specific point. It will generally be called from your "Search Results" page. This may come in handy if for some reason your scrape stops or breaks. Rather than starting your scrape over from the beginning, you may use this script to start scraping the "Details" page only after a certain value has been reached. This script may also be useful when you wish to skip to a point in the search results before proceeding to the "Details" page. The example below assumes a scrape that stopped on Georgia while scraping information from all 50 states. With this script in place, details will be scraped for every state after (and including) Georgia.
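A hedged sketch of the idea (the STATE variable, the "GA" restart value, the REACHED_RESTART_POINT flag, and the "Details page" name are all placeholders):

```java
// Skip everything until the restart value is reached, then scrape normally.
String restartAt = "GA";
String current = session.getVariable("STATE").toString();

if (session.getVariable("REACHED_RESTART_POINT") == null && restartAt.equals(current)) {
    session.setVariable("REACHED_RESTART_POINT", "true");
}

if (session.getVariable("REACHED_RESTART_POINT") != null) {
    session.scrapeFile("Details page");
} else {
    session.log("Skipping " + current + "; restart point not yet reached.");
}
```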
Note: If you are writing to a .csv file (say, using one of the "Write to File" scripts here in the script repository), the new values will be appended to the file.
This script was designed while working for a client requesting building information; we needed to grab data about available square footage. Some target sites had such sporadically formatted data that it was sometimes impossible to retrieve without a gauntlet of extractor patterns to catch every possible formatting case. In some cases the input was probably just a text box, so the user making the listing could have formatted the information however s/he wished, making it impossible to guarantee that a pattern would match future listings.
So, although this script is huge, don't let it scare you. The point is that you save to a session variable (or to an in-scope dataRecord) the general region of a page that should predictably contain the square footage information, regardless of how it's formatted. There are many optional variables that you may set to tweak the behavior of this script; read about them in the header.
The idea here is to be able to pass a block of text/html from a page, and for this script to make heads or tails of it, and to save two variables: LISTING_MAX_SF and LISTING_MIN_SF.
(Sorry for the ugly formatting. The file is attached at the bottom of this post in a ".sss" format which you can import to screen-scraper, preserving the formatting.)
If you encounter any errors or problems, post comments here or on the forum for help. There could very well be cases that have gone untested in this script. We're looking to make it as robust as possible.
Attachment | Size |
---|---|
SF (Script).sss | 22.06 KB |
The content of the following script is very similar to some other scripts in the repository. The tokenizer takes a string and breaks it into smaller strings at every space. So if I had a sentence like: "the answer is 42" the tokenizer would give me an array of strings like this:
the
answer
is
42
Broken on every space.
In this example, a state is separated from a zip code by only a space.
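A sketch of that case (the STATE_ZIP variable name and the "RI 02809" value are placeholders):

```java
import java.util.StringTokenizer;

// Split something like "RI 02809" into its state and zip code parts.
StringTokenizer tokenizer = new StringTokenizer(session.getVariable("STATE_ZIP").toString(), " ");

if (tokenizer.countTokens() >= 2) {
    session.setVariable("STATE", tokenizer.nextToken());
    session.setVariable("ZIP", tokenizer.nextToken());
}
```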
6.0.63a or newer
Invoke the script at the beginning of the scrape to use the async client for HTTP connections.