How do I insert the data screen-scraper extracts into a database?
There are several ways this can be done:
- Create a scrapeable file that POSTs extracted data to a local web-enabled script (e.g., one written in PHP or ASP.NET) that accepts the data and inserts it into a database. We provide an example of this in our Fifth Tutorial.
- Write the data out to a delimited file, then have a separate program read the data in from the file and insert it into the database. For example, this could be done with the file generated in our second tutorial. You might write a PHP script or Visual Basic application that reads in the file and inserts it into a database. This technique is easy to implement, and also allows you to alter or clean up the data in your own code before inserting it into your database.
- Write the extracted data to a file as SQL statements. Most databases have some kind of import feature that allows you to specify a file containing SQL statements. The database reads in the file and executes each of the SQL statements. One of the primary advantages of this approach is that it's simple to implement. It also doesn't require writing any further code to get the data into the database.
- Insert the data directly into a database via screen-scraper scripts. This would be done either via JDBC (for scripts written in Interpreted Java, Python, or JavaScript) or ADO (for scripts written in VBScript or JScript). The advantage to this approach is that it's relatively simple to implement and debug, and doesn't require going through the intermediate step of writing the data to a file.
- Pass the extracted data to compiled code and have that code insert it into the database. For example, you might create Java classes that can insert the data into a database. You would then jar up these classes and place them into screen-scraper's "lib\ext" folder, and screen-scraper would add them to its classpath. Once that's done you can import your classes into screen-scraper scripts and make use of the objects by passing them the extracted data. This could also be done with COM DLLs registered on your system. Of all the approaches suggested here this is probably the fastest and most robust, but it can be a bit trickier to debug.
- Invoke screen-scraper from an external application, retrieve the extracted data from screen-scraper, then have the external application handle the database interaction. For example, you might create a PHP script that invokes screen-scraper, tells it to extract product information (as in our third tutorial), requests the extracted product information from screen-scraper, then inserts it into a database (that is, all of the SQL statements and such would be in the PHP code). If you're using the Enterprise Edition, the best way to do this is to handle the data in real time (i.e., as it is being scraped). Documentation on doing that can be found under the "Handling Scraped Data in Real Time" section on this page. In the Professional Edition the data will need to be accumulated in a session variable as a data set, then requested at the end by the calling application. This technique works well for smaller data sets, but it should be avoided if the data sets will be large: screen-scraper has to hold the data in memory (in session variables) as it's being extracted so that it can later pass it along to the external application, and a large amount of extracted data could cause screen-scraper to run out of memory.
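To illustrate the delimited-file approach described above, here is a minimal sketch in Python of a separate program that reads the file, cleans up each row, and inserts it into a database. Everything specific in it is an assumption made for the example: the tab delimiter, the two-column layout (product name, then price), the "products" table, and the use of SQLite (chosen only because it ships with Python). Your own file layout and database will likely differ.

```python
import csv
import sqlite3


def clean_row(row):
    """Trim whitespace and turn price text such as "$1,299.00" into a float.

    Assumes (hypothetically) that column 0 is a name and column 1 is a price.
    """
    name = row[0].strip()
    price = float(row[1].replace("$", "").replace(",", "").strip())
    return name, price


def load_delimited_file(path, conn, delimiter="\t"):
    """Read a delimited file and insert each row into a "products" table.

    Returns the number of rows inserted. The table name and columns are
    hypothetical; adjust them to match your own schema.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")

    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # Skip blank or incomplete lines, and clean up the rest.
        rows = [clean_row(row) for row in reader if len(row) >= 2]

    # Parameterized statements let the driver handle quoting, so scraped
    # text containing apostrophes cannot break the SQL.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO products (name, price) VALUES (?, ?)", rows
        )
    return len(rows)
```

You might call it as `load_delimited_file("products.txt", sqlite3.connect("scraped.db"))`, where both file names are placeholders. The cleanup step lives in its own function so that it can be changed without touching the database code.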
As a side note, it is by design that screen-scraper doesn't automatically insert information into a database for you. The approach we've taken to the design of screen-scraper is to ensure that it does one thing very well: extract information from web sites. That process is usually followed, however, by steps that involve manipulating and cleaning up the information, as well as storing it in some persistent mechanism (such as a database or text file). All of those things can be done by screen-scraper, but we've designed screen-scraper primarily to handle data extraction.