The basic idea of initializing is discussed in the second and third tutorials and serves one of two purposes:
As you can guess, you might have both of these needs in a single script or in two different scripts. Regardless, here we present different methods for initialization scripts, which differ mainly in where the values of your variables come from.
This script is extremely useful because its purpose is to enable you to read inputs from a CSV list. For example, if you wanted to use all 50 state abbreviations as input parameters for a scrape, this script would cycle through them all. Furthermore, this script begins to show the power of an Initialize script as a looping mechanism.
This particular example uses a CSV of streets in Bristol, RI. The file lists only one street per line, with commas as separators. The "while" loop at the bottom of the example retrieves streets one by one until the buffered reader runs out of lines. Each street is stored in a session variable named STREET and used as an input later on. Each time the buffered reader brings in a new street, it overwrites the previous value of the STREET session variable.
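The loop described above can be sketched roughly as follows. This is not the original script: the `sessionVariables` map is a hypothetical stand-in for screen-scraper's session, the `scrapeStep` callback stands in for whatever scrapeable file you would run, and stripping a trailing comma is an assumption about the file layout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class StreetLoopSketch {
    // Hypothetical stand-in for screen-scraper's session variables.
    static final Map<String, Object> sessionVariables = new HashMap<>();

    // Reads one street per line; returns how many streets were processed.
    static int runForEachStreet(BufferedReader reader, Consumer<String> scrapeStep) {
        int count = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String street = line.trim();
                // Assumption: a trailing comma is only a separator, so drop it.
                if (street.endsWith(",")) {
                    street = street.substring(0, street.length() - 1).trim();
                }
                if (street.isEmpty()) {
                    continue; // skip blank lines
                }
                // In a real script this would be session.setVariable("STREET", street).
                sessionVariables.put("STREET", street); // replaces the previous street
                scrapeStep.accept(street);              // stands in for the scrape itself
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

In an actual Initialize script the reader would typically be a `BufferedReader` wrapped around a `FileReader` for your CSV file, and the callback would be a call that scrapes the page using STREET as its input.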
Reading in from a CSV is incredibly powerful; however, it is not the only way to drive a loop. For information on how to use an array for inputs, please see the "Moderate Initialize -- Input from Array" script.
The next script (below) deals with input CSV files that have more than one piece of information per row (more than one column).
Sometimes a CSV file will use quotes to wrap data (in case that data contains a comma that does not signify a new field). Since it's a common thing to do, a script that reads a CSV should anticipate and deal with that eventuality. The main workhorse of this script is its parsing function: pass it a CSV line and it will parse the fields into an array.
Alternatively, you can read the CSV with the opencsv package that is included with screen-scraper. This may be more robust across different CSV formats.
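A quote-aware line parser along these lines might look like the sketch below. The class and method names are hypothetical, and treating a doubled quote inside a quoted field as a literal quote is an assumption based on common CSV conventions, not something taken from the original script.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineSketch {
    // Splits one CSV line into fields, honoring double-quoted fields.
    static String[] parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside quotes = literal quote
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // comma outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields.toArray(new String[0]);
    }
}
```

For example, the line `Bristol,"Hope Street, North",RI` yields three fields, with the comma inside the quotes kept as part of the second field.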
The following script is really useful when you need to loop through a short series of input parameters. Using an array will allow you to rapidly develop a group of inputs that you would like to use; however, you will need to know every input parameter. For example, if you wanted to use the state abbreviations [UT, NY, AZ, MO] as inputs, then building an array would be really quick, but if you needed all 50 states it would probably be easier to read them from a CSV (need to know how to use a CSV input? Check out my other post titled "Moderate Initialize -- Input from CSV").
Many sites require the user to input a zip code when performing a search. For example, when searching for car listings, a site will ask for the zip code where you would like to find a car (and perhaps the distance from that zip code that would be acceptable). The following script is designed to iterate through a set of input files, each of which contains the list of zip codes for one state. The input files in this case are located within a folder named "input" in the screen-scraper directory. The files are named in the format "zips_CA", for example, which would contain California's zip codes.
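As a rough sketch of the array approach (not the original script), the loop below walks a hard-coded array and stores each value before running a scrape step. The `sessionVariables` map and the `scrapeStep` callback are hypothetical stand-ins for screen-scraper's session and for the scrapeable file you would run; the variable name passed in is up to you.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class ArrayInputSketch {
    // Hypothetical stand-in for screen-scraper's session variables.
    static final Map<String, Object> sessionVariables = new HashMap<>();

    // Stores each input under variableName, then runs the scrape step for it.
    static void runForEach(String[] inputs, String variableName, Consumer<String> scrapeStep) {
        for (String input : inputs) {
            // In a real script this would be session.setVariable(variableName, input).
            sessionVariables.put(variableName, input);
            scrapeStep.accept(input);
        }
    }

    public static void main(String[] args) {
        String[] states = {"UT", "NY", "AZ", "MO"};
        runForEach(states, "STATE", state -> System.out.println("Scraping " + state));
    }
}
```

The trade-off is the one described above: the array is quick to type for four values, but every value has to be known and written out in the script itself.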
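One way to sketch that iteration is shown below. It assumes each file is named like the attachments (for example "zips_CA.csv") and holds one zip code per line; both the line layout and the `scrapeStep` callback, which stands in for setting session variables and running the search, are assumptions rather than details from the original script.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.BiConsumer;

class ZipFileLoopSketch {
    // Builds the file name for one state, e.g. "CA" becomes "zips_CA.csv".
    static String fileNameFor(String state) {
        return "zips_" + state + ".csv";
    }

    // Visits every zip code of every listed state; returns how many were processed.
    static int runForStates(Path inputDir, String[] states, BiConsumer<String, String> scrapeStep) {
        int count = 0;
        for (String state : states) {
            Path file = inputDir.resolve(fileNameFor(state));
            try {
                // Assumption: one zip code per line.
                for (String line : Files.readAllLines(file)) {
                    String zip = line.trim();
                    if (zip.isEmpty()) {
                        continue; // skip blank lines
                    }
                    scrapeStep.accept(state, zip); // stands in for the actual search
                    count++;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return count;
    }
}
```

Here `inputDir` would point at the "input" folder in the screen-scraper directory, and the array of states controls which files are read and in what order.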
Attachment | Size |
---|---|
zips_AL.csv | 5.73 KB |
zips_AR.csv | 4.16 KB |
zips_AZ.csv | 3.03 KB |
zips_CA.csv | 20.7 KB |
zips_CO.csv | 4.53 KB |
When a scraping session is started, it can be a good idea to feed certain pieces of information to the session before it begins resolving URLs. This simple version of the Initialize script demonstrates how you might start on a certain page. While basic, understanding when a script like this would be used is pivotal in making screen-scraper work for you.
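A minimal sketch of such a script is shown here. The `sessionVariables` map is a hypothetical stand-in for the session, and `fillTokens` is only an illustration of how a `~#PAGE#~` token in a URL would pick up the value; the example URL is made up.

```java
import java.util.HashMap;
import java.util.Map;

class StartPageSketch {
    // Hypothetical stand-in for screen-scraper's session variables.
    static final Map<String, String> sessionVariables = new HashMap<>();

    // Sets the page the scrape should start on.
    static void initialize(int startPage) {
        // In a real script this would be a session.setVariable("PAGE", ...) call.
        sessionVariables.put("PAGE", String.valueOf(startPage));
    }

    // Illustration only: replaces ~#NAME#~ tokens with the stored values.
    static String fillTokens(String template) {
        String result = template;
        for (Map.Entry<String, String> entry : sessionVariables.entrySet()) {
            result = result.replace("~#" + entry.getKey() + "#~", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        initialize(1);
        System.out.println(fillTokens("http://www.example.com/search?page=~#PAGE#~"));
    }
}
```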
The above code is useful where "PAGE" is an input parameter in the first page you would like to scrape.
Occasionally a site will be structured so that, instead of page numbers, the site displays records 1-10 or 20-29. If this is the case, your Initialize script could look something like this:
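The sketch below is an approximation under stated assumptions: the `sessionVariables` map is a hypothetical stand-in for the session, and deriving the maximum from a first record and a page size is an illustrative choice, not necessarily how the original script set the two values.

```java
import java.util.HashMap;
import java.util.Map;

class RecordWindowSketch {
    // Hypothetical stand-in for screen-scraper's session variables.
    static final Map<String, String> sessionVariables = new HashMap<>();

    // Sets the first window of records to request, e.g. records 1 through 10.
    static void initialize(int firstRecord, int pageSize) {
        int lastRecord = firstRecord + pageSize - 1; // inclusive upper bound
        // In a real script these would be session.setVariable(...) calls.
        sessionVariables.put("DISPLAY_RECORD_MIN", String.valueOf(firstRecord));
        sessionVariables.put("DISPLAY_RECORD_MAX", String.valueOf(lastRecord));
    }
}
```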
Once again "DISPLAY_RECORD_MIN" and "DISPLAY_RECORD_MAX" are input parameters on the first page you would like to scrape.
If you feel you understand this one, I'd encourage you to check out the other Initialize scripts in this code repository.
Each of the following files contains the zip codes for one state. The file "zips_US.csv" contains all US zip codes in a single file. If you wish to download all of the CSVs at once, you may choose to download the file "zips_all_states.zip".
Note: If you've forgotten the state abbreviations please visit http://www.usps.com/ncsc/lookups/usps_abbreviations.html
Last updated 5/8/2008
Attachment | Size |
---|---|
zips_AL.csv | 5.73 KB |
zips_AR.csv | 4.16 KB |
zips_AZ.csv | 3.03 KB |
zips_CA.csv | 20.7 KB |
zips_CO.csv | 4.53 KB |
zips_CT.csv | 2.58 KB |
zips_DE.csv | 686 bytes |
zips_FL.csv | 10.1 KB |
zips_GA.csv | 5.92 KB |
zips_IA.csv | 6.25 KB |
zips_ID.csv | 1.94 KB |
zips_IL.csv | 9.31 KB |
zips_IN.csv | 5.79 KB |
zips_KY.csv | 6.87 KB |
zips_LA.csv | 4.21 KB |
zips_MA.csv | 4.17 KB |
zips_MD.csv | 4.23 KB |
zips_ME.csv | 2.98 KB |
zips_MI.csv | 6.84 KB |
zips_MN.csv | 6.05 KB |
zips_MO.csv | 6.98 KB |
zips_NC.csv | 7.43 KB |
zips_ND.csv | 2.41 KB |
zips_NE.csv | 3.65 KB |
zips_NH.csv | 1.65 KB |
zips_NJ.csv | 4.33 KB |
zips_NM.csv | 2.5 KB |
zips_NV.csv | 1.47 KB |
zips_NY.csv | 13.04 KB |
zips_OH.csv | 8.54 KB |
zips_OK.csv | 4.55 KB |
zips_OR.csv | 2.82 KB |
zips_PA.csv | 15.06 KB |
zips_RI.csv | 546 bytes |
zips_SC.csv | 3.68 KB |
zips_SD.csv | 2.36 KB |
zips_TN.csv | 5.43 KB |
zips_TX.csv | 18.09 KB |
zips_UT.csv | 2 KB |
zips_VA.csv | 8.51 KB |
zips_VT.csv | 1.8 KB |
zips_WA.csv | 4.21 KB |
zips_WI.csv | 5.31 KB |
zips_WV.csv | 5.89 KB |
zips_WY.csv | 1.14 KB |
zips_all_states.zip | 178.54 KB |
zips_US.csv | 295.08 KB |
The form class can be a life saver when it comes to dealing with sites that use forms for their inputs and have a lot of dynamic parameters.
There are really only two cases in which using the form class is preferable to handling the parameters any other way. Those cases are:
In general, though, it'll be easier for debugging if you can stick with the regular parameters tab.