Manual Data Extraction

A sub-extractor pattern can match only one element. Manual data extraction gives you the same additional context that a sub-extractor pattern provides, but it also lets you extract multiple data records.

This example makes use of the extractData() method.

The code and examples below demonstrate how to first isolate and extract a portion of a page's total HTML, so that a second extractor pattern may then be applied to just the extracted portion. Doing so can limit the results to only those found on a specific part of the page. This can be useful when you have 100 apples that all look the same but you really only want five of them.
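The two-stage idea can be illustrated outside of screen-scraper with plain Java regular expressions. This is only a minimal sketch, not how screen-scraper applies extractor patterns internally: the class name, the method names, and both regular expressions are hypothetical stand-ins for the two extractor patterns, and the sample page is a trimmed version of the HTML shown at the end of this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageExtraction {

    // Stage 1 (hypothetical pattern): capture everything between the two section headings.
    static final Pattern SECTION = Pattern.compile(
            "COMPANY APPOINTMENTS(.*?)CONTINUING EDUCATION", Pattern.DOTALL);

    // Stage 2 (hypothetical pattern): a text cell directly followed by a GREEN status cell.
    static final Pattern ACTIVE_ROW = Pattern.compile(
            "<td[^>]*>([^<&]+)&nbsp;</td>\\s*<td[^>]*style=\"color: GREEN\"");

    // Returns the isolated section, or an empty string when the headings are not found.
    static String isolateSection(String html) {
        Matcher m = SECTION.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Returns the captured cell text of every ACTIVE_ROW match in the given text.
    static List<String> activeNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = ACTIVE_ROW.matcher(text);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String page =
              "LICENSE AUTHORITIES\n"
            + "<tr><td>Agent - Property&nbsp;</td>\n"
            + "<td style=\"color: GREEN\"><b>ACTIVE</b>&nbsp;</td></tr>\n"
            + "COMPANY APPOINTMENTS\n"
            + "<tr><td class=\"small\">21ST CENTURY INSURANCE COMPANY&nbsp;</td>\n"
            + "<td class=\"small\" style=\"color: GREEN\"><b>ACTIVE</b>&nbsp;</td></tr>\n"
            + "<tr><td class=\"small\">BALBOA INSURANCE COMPANY&nbsp;</td>\n"
            + "<td class=\"small\" style=\"color: RED\"><b>INACTIVE</b>&nbsp;</td></tr>\n"
            + "CONTINUING EDUCATION";

        // Applied to the whole page, the row pattern also picks up the license row.
        System.out.println(activeNames(page));
        // prints: [Agent - Property, 21ST CENTURY INSURANCE COMPANY]

        // Applied only to the isolated section, it returns just the active appointment.
        System.out.println(activeNames(isolateSection(page)));
        // prints: [21ST CENTURY INSURANCE COMPANY]
    }
}
```

Restricting the second pattern to the isolated text is what removes the look-alike row from the other table; the row pattern itself does not change.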

The following screenshots show an example of when the script below might be used. In this example, we are only interested in the active COMPANY APPOINTMENTS (shown with green dots), and not the LICENSE AUTHORITIES (sample HTML is available at the end).

When applied to all of the HTML of the current scrapeable file, the following extractor pattern will retrieve ALL of the HTML that makes up the COMPANY APPOINTMENTS table above. But remember, we only want the active appointments.

As indicated, call the following script "after each pattern match" (there will only be one match):

import com.screenscraper.common.*;

//Create a local variable called appointments to store the dataset that is generated when you
//MANUALLY apply the "Appointments" extractor pattern to the already extracted data that
//resulted from the application of the COMPANY_APPOINTMENTS extractor pattern.
DataSet appointments = scrapeableFile.extractData(dataRecord.get("COMPANY_APPOINTMENTS"), "Appointments");
//                                                                  ^^token name^^      ^^extractor id^^

// Initialize the local variable allAppointments, to which we will append, one by one,
// the value of each matching appointment, separated by the pipe character "|".
allAppointments = "";

// Take the appointments dataSet generated from above and loop through
//each of the successful matches that are stored as records.
for (i=0; i < appointments.getNumDataRecords(); i++)
{
     // Grab the current dataRecord from the looping dataSet
     appointmentRecord = appointments.getDataRecord(i);

     // Grab the result of the applied ~@APPOINTMENT@~ token,
     // referencing it by name.
     // Note: it's possible to reference more than one token here
     appointment = appointmentRecord.get("APPOINTMENT");

     // Append the current appointment to the growing list of matches
     allAppointments += appointment + " | ";
}

// When the loop is done, store the results in a session variable
session.setVariable("APPOINTMENTS", allAppointments);

// Write them out to log to see if they look right
session.log("The appointments are: " + allAppointments);
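Note that the loop above appends " | " after every value, so the stored string ends with a trailing separator. If that is unwanted, one option is to collect the values in a list and join them at the end. The following is a minimal standalone sketch in plain Java: the class and method names are hypothetical, `String.join` requires Java 8 or later, and whether your screen-scraper version's script interpreter supports it is an assumption you should verify.

```java
import java.util.ArrayList;
import java.util.List;

class AppointmentJoiner {

    // Joins the values with " | " between them and no separator after the last one.
    static String joinAppointments(List<String> appointments) {
        return String.join(" | ", appointments);
    }

    public static void main(String[] args) {
        // Stand-ins for the values read from each data record in the loop.
        List<String> appointments = new ArrayList<>();
        appointments.add("21ST CENTURY INSURANCE COMPANY");
        appointments.add("AIG CENTENNIAL INSURANCE COMPANY");

        System.out.println(joinAppointments(appointments));
        // prints: 21ST CENTURY INSURANCE COMPANY | AIG CENTENNIAL INSURANCE COMPANY
    }
}
```

Inside the script, the same approach would mean adding each appointment to a list within the loop and storing the joined string in the session variable afterwards.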

Results of applying the COMPANY_APPOINTMENTS extractor pattern above

</b></blockquote>
<div id="Level3" style="Display: Block; position: relative; text-align: center">
<table class="verysmalltext" width="90%" border="1" cellpadding="1" cellspacing="0" bordercolor="#BBBBBB">
<tr bgcolor="#CCCCCC">
<th class="bold">COMPANY</th>
<th class="bold">APPOINTMENT STATUS</th>
<th class="bold">ISSUE DATE</th>
<th class="bold">CANCEL DATE</th>
</tr>
<tr bgcolor="#CDDEFF">
<td class="small">21ST CENTURY INSURANCE COMPANY&nbsp;</td>
<td class="small" style="color: GREEN"><b>ACTIVE</b>&nbsp;</td>
<td class="small">05/05/2006&nbsp;</td>
<td class="small">&nbsp;</td>
</tr>
<tr bgcolor="#EFEFEF">
<td class="small">AIG CENTENNIAL INSURANCE COMPANY&nbsp;</td>
<td class="small" style="color: GREEN"><b>ACTIVE</b>&nbsp;</td>
<td class="small">01/30/2008&nbsp;</td>
<td class="small">&nbsp;</td>
</tr>
<tr bgcolor="#CDDEFF">
<td class="small">BALBOA INSURANCE COMPANY&nbsp;</td>
<td class="small" style="color: RED"><b>INACTIVE</b>&nbsp;</td>
<td class="small">05/15/2006&nbsp;</td>
<td class="small">04/23/2008&nbsp;</td>
</tr>
</table>
</div>

<blockquote><img name="Image4" class="mouseover" onmouseover="this.style.cursor=" src="/MEDIA/images/gifs/squareminus.gif" onclick="visAction('Level4')" />&nbsp;&nbsp;&nbsp;<b>

Use the extractor pattern below to match against the HTML above. It will return two results, 21ST CENTURY INSURANCE COMPANY and AIG CENTENNIAL INSURANCE COMPANY, since those are the only two active company appointments. Note that the "Appointments" extractor pattern includes the word "GREEN", so the "RED" (inactive) company appointments are excluded.

Be sure to check the box that says "This extractor pattern will be invoked manually from a script". This ensures that the extractor pattern will not run in sequence with the other extractor patterns.

HTML from the first web page screenshot in the example above, which contains the LICENSE AUTHORITIES and COMPANY APPOINTMENTS tables

LICENSE AUTHORITIES</b></blockquote>

<div id="Level2" style="Display: Block; position: relative; text-align: center">
<table class="verysmalltext" width="90%" border="1" cellpadding="1" cellspacing="0" bordercolor="#BBBBBB">
<tr bgcolor="#CCCCCC">
<th class="bold">ORIGINAL ISSUE DATE</th>
<th class="bold">DESCRIPTION</th>
<th class="bold">STATUS</th>
<th class="bold">EXPIRATION DATE</th>
<th class="bold">EXPIRATION REASON</th>
</tr>
<tr bgcolor="#CDDEFF">
<td>01/31/2006&nbsp;</td>
<td>Agent - Property&nbsp;</td>
<td style="color: GREEN"><b>ACTIVE</b>&nbsp;</td>
<td>&nbsp;</td>
<td style='cursor:hand' onmouseover="this.style.cursor='pointer'" title='no information'><b style="color: #CA6C04">&nbsp;</b></td>
</tr>
<tr bgcolor="#EFEFEF">
<td>01/31/2006&nbsp;</td>
<td>Agent - Casualty&nbsp;</td>
<td style="color: GREEN"><b>ACTIVE</b>&nbsp;</td>
<td>&nbsp;</td>
<td style='cursor:hand' onmouseover="this.style.cursor='pointer'" title='no information'><b style="color: #CA6C04">&nbsp;</b></td>
</tr>
</table>
</div>

<blockquote><img name="Image3" class="mouseover" onmouseover="this.style.cursor=" src="/MEDIA/images/gifs/squareminus.gif" onclick="visAction('Level3')" />&nbsp;&nbsp;&nbsp;<b>COMPANY APPOINTMENTS</b></blockquote>

<div id="Level3" style="Display: Block; position: relative; text-align: center">
<table class="verysmalltext" width="90%" border="1" cellpadding="1" cellspacing="0" bordercolor="#BBBBBB">
<tr bgcolor="#CCCCCC">
<th class="bold">COMPANY</th>
<th class="bold">APPOINTMENT STATUS</th>
<th class="bold">ISSUE DATE</th>
<th class="bold">CANCEL DATE</th>
</tr>
<tr bgcolor="#CDDEFF">
<td class="small">21ST CENTURY INSURANCE COMPANY&nbsp;</td>
<td class="small" style="color: GREEN"><b>ACTIVE</b>&nbsp;</td>
<td class="small">05/05/2006&nbsp;</td>
<td class="small">&nbsp;</td>
</tr>
<tr bgcolor="#EFEFEF">
<td class="small">AIG CENTENNIAL INSURANCE COMPANY&nbsp;</td>
<td class="small" style="color: GREEN"><b>ACTIVE</b>&nbsp;</td>
<td class="small">01/30/2008&nbsp;</td>
<td class="small">&nbsp;</td>
</tr>
<tr bgcolor="#CDDEFF">
<td class="small">BALBOA INSURANCE COMPANY&nbsp;</td>
<td class="small" style="color: RED"><b>INACTIVE</b>&nbsp;</td>
<td class="small">05/15/2006&nbsp;</td>
<td class="small">04/23/2008&nbsp;</td>
</tr>
</table>
</div>

<blockquote><img name="Image4" class="mouseover" onmouseover="this.style.cursor=" src="/MEDIA/images/gifs/squareminus.gif" onclick="visAction('Level4')" />&nbsp;&nbsp;&nbsp;<b>CONTINUING EDUCATION