Writing extracted data to a database
Overview
Oftentimes once you've extracted data from a page you'll want to write it to a database. Screen-scraper contains a special SqlDataManager class that makes this easy.
This script uses objects and methods that are only available in the professional and enterprise editions of screen-scraper.
To use the SqlDataManager class you'll generally follow these steps:
- To start, download the appropriate JDBC Driver connector Jar file for your particular database and place it in the lib/ext folder where screen-scraper is installed.
- Create an instance of SqlDataManager in a script.
- Build the database schema and any foreign key relations.
- Store the SqlDataManager in a session variable.
- Extract data.
- In a script, get a reference to the SqlDataManager object stored in step 3, then call addData to build the rows in your database tables.
- Once all data for a row has been added to the SqlDataManager object, call the commit method.
- After committing all data related to a record - which can include multiple rows across multiple tables, call the flush method to write the record to the database.
- Repeat steps 4 - 7 as many times as you'd like.
- In a script, get a reference to the SqlDataManager object, then call the close method on it.
The trickiest part is understanding when to call the commit method when writing to related tables.
Examples
If you're scripting in Interpreted Java and using a MySQL database, the script for steps 1-3 might look something like this:
import com.screenscraper.datamanager.sql.*;
import org.apache.commons.dbcp.BasicDataSource;
String hostpath = "localhost"; //Location of the database
String database = "database_name"; //The name of the database
String username = "your_username"; //Put your database username here
String password = "your_password"; //Put the password here
String port = "3306";
String dbparams = "autoReconnect=true&useCompression=true";
SqlDataManager dm;
try
{
// Connect to database using a BasicDataSource
BasicDataSource ds = new BasicDataSource();
ds.setDriverClassName( "com.mysql.jdbc.Driver" );
ds.setUsername( username );
ds.setPassword( password );
ds.setUrl( "jdbc:mysql://" + hostpath + ":" + port + "/" + database + "?" + dbparams );
ds.setMaxActive( 100 );
// Create Data Manager
dm = new SqlDataManager( ds, session );
dm.setLoggingLevel( org.apache.log4j.Level.DEBUG );
// Call this to have the data manager read information about the database, such as what tables exist,
// what fields they have, and how they relate to other tables in the database.
dm.buildSchemas();
// Setup the foreign key relationships, if needed
// If the database had this relationship built-in (InnoDB engine only), we wouldn't have to add this here.
// buildSchemas() would have added it when it read the database structure.
dm.addForeignKey( "child_table", "child_field", "parent_table", "parent_field" );
// Set the data manager to overwrite values in the table but not write NULL over values that are already there.
dm.setGlobalUpdateEnabled( true );
dm.setGlobalMergeEnabled( true );
// Save the SqlDataManager object in a session variable
session.setVariable( "_DBMANAGER", dm );
}
catch (Exception e)
{
session.logError( "Database initialization error" );
session.stopScraping();
}
Note that if you are using a database other the MySQL, the only change to this script will be the String passed to the setUrl method of the BasicDataSource.
In subsequent scripts, you can get a reference to that same SqlDataManager object like this:
You could then add data to the data manager. The following examples demonstrate various ways to go about that. Each of the scripts assume you already created an SqlDataManager object in a previous script and saved it to the session variable _DBMANAGER.
Saving to a single table using a data record
import com.screenscraper.datamanager.sql.SqlDataManager;
// Get the data manager from the session variable it is stored in
SqlDataManager dm = session.getVariable( "_DBMANAGER" );
// Add the current data record to the table.
// All values whose key in the data record match (case in-sensitive) a column in the table will be saved.
dm.addData( "people", dataRecord );
// Once all data has been added to the table, it should be committed
dm.commit( "people" );
// Flush the data so it is written to the database
dm.flush();
If the data record saved above had key-value pairs:
NAME = John Doe
AGE = 37
WEIGHT = 160
and the table 'people' had columns 'name', 'age', and 'gender', the script above would produce the following row in the people table.
| name | age | gender |
+----------+-----+--------+
| John Doe | 37 | NULL |
+----------+-----+--------+
Saving to a single table manually
import com.screenscraper.datamanager.sql.SqlDataManager;
// Get the data manager from the session variable it is stored in
SqlDataManager dm = session.getVariable( "_DBMANAGER" );
// Add data to the table.
dm.addData( "people", "name", "John Doe" );
dm.addData( "people", "age", "37" );
dm.addData( "people", "gender", session.getVariable("GENDER") );
// Once all data has been added to the table, it should be committed
dm.commit( "people" );
// Add another row of data to the table.
// As long as the previous data has been committed, this data will be added to a new row instead of overwriting previous data
dm.addData( "people", "name", "Sally Doe" );
dm.addData( "people", "gender", "female" );
// Flush the data so it is written to the database
dm.flush();
If the session variable GENDER had the value male and the table structure was the same as in the example above, this script would produce the following rows in the people table.
| name | age | gender |
+-----------+------+--------+
| John Doe | 37 | male |
+-----------+------+--------+
| Sally Doe | NULL | female |
+-----------+------+--------+
Note that you can mix the two methods shown above. Data can be added from multiple data records and/or manually for the same row.
Saving to multiple tables that are related.
This example assumes that you have a table in the database named people with fields 'id' (primary key/autoincrement), 'name', and 'address', and another table named phones with fields 'person_id', 'phone_number'.
Also, there is a foreign key relation between person_id in phones and id in people. This can be setup either in the database or when setting up the datamanger and calling the addForeignKey method.
In order to make it easier to see inserted values, all calls to addData in this example will enter data manually. In many cases, however, adding a data record is much easier.
Also, remember that data does not have to be added and committed all at once. Usually tables with a parent/child relation will have one script called after each pattern match of an extractor pattern that adds and commits a row of child data, and then a separate script called elsewhere to add and commit the parent data.
import com.screenscraper.datamanger.sql.SqlDataManager;
// Get the data manager from the session variable it is stored in
SqlDataManager dm = session.getVariable( "_DBMANAGER" );
// Add multiple phone numbers that will relate to John Doe
// Maybe he has a cell phone, home phone, and work phone
// After adding each row of data (simply a phone number in this case),
// commit the data so we can begin adding a new row.
dm.addData( "phones", "phone_number", "(123) 456-7890" );
dm.commit( "phones" );
dm.addData( "phones", "phone_number", "(800) 555-7777" );
dm.commit( "phones" );
dm.addData( "phones", "phone_number", "(333) 987-6543" );
dm.commit( "phones" );
// Now add the parent table's data and commit it
dm.addData( "people", "name", "John Doe" );
dm.addData( "people", "address", "123 Someplace Drv, Cityville, WY 12345" );
dm.commit( "people" );
// Add multiple phone numbers that will relate to Sally Doe
dm.addData( "phones", "phone_number", "(321) 654-0987" );
dm.commit( "phones" );
dm.addData( "phones", "phone_number", "(333) 987-6543" );
dm.commit( "phones" );
// Now add the parent table's data and commit it
dm.addData( "people", "name", "Sally Doe" );
dm.addData( "people", "address", "123 Someplace Drv, Cityville, WY 12345" );
dm.commit( "people" );
// Flush the data so it is written to the database
dm.flush();
Note the order in which tables were committed. All data in child tables must be committed before the data in the parent table.
This script would produce the following rows in the database:
| people |
+----+-----------+----------------------------------------+
| id | name | address |
+----+-----------+----------------------------------------+
| 1 | John Doe | 123 Someplace Drv, Cityville, WY 12345 |
+----+-----------+----------------------------------------+
| 2 | Sally Doe | 123 Someplace Drv, Cityville, WY 12345 |
+----+-----------+----------------------------------------+
+----------------------------+
| phones |
+-----------+----------------+
| person_id | phone_number |
+-----------+----------------+
| 1 | (123) 456-7890 |
+-----------+----------------+
| 1 | (800) 555-7777 |
+-----------+----------------+
| 1 | (333) 987-6543 |
+-----------+----------------+
| 2 | (321) 654-0987 |
+-----------+----------------+
| 2 | (333) 987-6543 |
+-----------+----------------+
The SqlDataManager takes care of filling in the data for the related fields. We never had to add the data for the person_id column in the phones table. Since id in people is an autoincrement field, we didn't have to add data for that field either.
Close the data manager
Once all data has been written to the database, close the data manager like this:
dm = session.getVariable( "_DBMANAGER" );
// Flushing the data here is optional, but if any data hasn't been written yet
// it will not be written when close() is called, and will be lost.
dm.flush();
// Close the datamanager
dm.close();
Automatically link many to many relations (Advanced)
The SqlDataManager can be set to automatically link data connected in a many-to-many relation. To enable this feature, use the following code:
When this setting is enabled, the data manager will attempt to relate data across multiple tables when possible. For example, if there is a people table, an address table, and a person_has_address table used to relate the other two tables, you would only need to insert data into the people and addresses tables. The data manager would then link the person_has_address table in since it has foreign keys relating it to both people and addresses. See the example below.
/*
Perform the setup of the SqlDataManager, as shown previously, and name the variable dm.
Also use a duplicate filter (see example below) to check for duplicate addresses
*/
// The setAutoManyToMany method must be called before any data is added to the data manager for the first time.
dm.setAutoManyToMany( true );
// Everything beyond this point might appear in a script other than the initialization script
dm.addData( "people", "name", "John" );
dm.addData( "addresses", "address", "123 Street" );
dm.commit( "addresses" );
dm.addData( "addresses", "address", "456 Drive" );
dm.commit( "addresses" );
dm.commit( "people" );
dm.addData( "people", "name", "Sally" );
dm.addData( "addresses", "address", "123 Street" );
dm.commit( "addresses" );
dm.commit( "people" );
This would produce the following result:
| people |
+-----------+-------+
| person_id | name |
+-----------+-------+
| 1 | John |
+-----------+-------+
| 2 | Sally |
+-----------+-------+
+-------------------------+
| addresses |
+------------+------------+
| address_id | address |
+------------+------------+
| 1 | 123 Street |
+------------+------------+
| 2 | 456 Drive |
+------------+------------+
+------------------------+
| person_has_address |
+-----------+------------+
| person_id | address_id |
+-----------+------------+
| 1 | 1 |
+-----------+------------+
| 1 | 2 |
+-----------+------------+
| 2 | 1 |
+-----------+------------+
Filtering Duplicate Entries (Advanced)
When extracting data that will contain many duplicate entries, it can be useful to filter values so that duplicate entries are not written to the database multiple times. The data manager can use a duplicate filter to check data being added to the database against data that is added, and either update or ignore duplicates. This is accomplished with an SqlDuplicateFilter object. To create a duplicate filter, call the SqlDuplicateFilter.register method, set the parent table it checks for duplicates on, and then add the constraints that indicate a duplicate. See the code below for an example of how to filter duplicates on a person table.
/*
Perform the setup of the SqlDataManager, as shown previously, and name the variable dm.
*/
//register an SqlDuplicateFilter with the DataManager for the social security number
SqlDuplicateFilter ssnDuplicate = SqlDuplicateFilter.register( "person", dm );
ssnDuplicate.addConstraint( "person", "ssn" );
//register an SqlDuplicateFilter with the DataManager for the drivers license number
SqlDuplicateFilter licenseDuplicate = SqlDuplicateFilter.register( "person", dm );
licenseDuplicate.addConstraint( "person", "drivers_license" );
//register an SqlDuplicateFilter with the DataManager for the name/phone number
//where the person table has a child table named phone.
SqlDuplicateFilter namePhoneDuplicate = SqlDuplicateFilter.register( "person", dm );
namePhoneDuplicate.addConstraint( "person", "first_name" );
namePhoneDuplicate.addConstraint( "person", "last_name" );
namePhoneDuplicate.addConstraint( "phone", "phone_number" );
Duplicate filters are checked in the order they are added, so consider perfomance when creating duplicate filters. If, for instance, most duplicates will match on the social security number, create that filter before the others. Also make sure to add indexes into your database on those columns that you are selecting by or else performance will rapidly degrade as your database gets large.
Duplicates will be filtered by any one of the filters created. If multiple fields must all match for an entry to be a duplicate, create a single filter and add each of those fields as constraints, as shown in the third filter created above. In other words, constraints added to a single filter will be ANDed together, while seperate filters will be ORed.
- Printer-friendly version
- Login or register to post comments