Square Footage Catcher

This script came out of work for a client who needed building information, specifically data about available square footage. Some target sites had such sporadically formatted data that it was sometimes impossible to retrieve without a gauntlet of extractor patterns to catch every possible formatting case. At times the input was probably just a free-text box, so whoever made the listing could format the information however they wished, which made it impossible to guarantee that a pattern would match future listings.

So, although this script is huge, don't let it scare you. The point is that you save the general region of a page to a session variable (or to an in-scope dataRecord). This region should predictably contain the square footage information, regardless of how it's formatted. There are many optional variables that you may set to tweak the behavior of this script. Read about them in the header.

The idea is to pass in a block of text/HTML from a page and have this script make heads or tails of it, saving two variables: LISTING_MAX_SF and LISTING_MIN_SF.
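As a rough illustration of that idea, here is a minimal standalone Java sketch (not the script itself) that pulls the smallest and largest number followed by a unit out of a messy block of text. The class name SfSketch, the method minMax, and the sample text are hypothetical; the real script runs inside screen-scraper and reads its input from session variables or a dataRecord.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SfSketch {
    /**
     * Returns {min, max} of all numbers directly followed by the unit regex,
     * or null when no such number is found. Commas inside numbers are tolerated.
     */
    public static int[] minMax(String text, String unitRegex) {
        // A run of digits/commas starting with a digit, followed (lookahead) by the unit.
        Pattern p = Pattern.compile("(\\d[\\d,]*)(?=" + unitRegex + ")");
        Matcher m = p.matcher(text);
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        boolean found = false;
        while (m.find()) {
            int value = Integer.parseInt(m.group(1).replace(",", ""));
            min = Math.min(min, value);
            max = Math.max(max, value);
            found = true;
        }
        return found ? new int[] { min, max } : null;
    }

    public static void main(String[] args) {
        // Suite numbers (435, 210) are ignored because no unit follows them.
        String text = "Suite 435: 800 SF<br>Suite 210: 1,250 SF";
        int[] result = minMax(text, "\\s*SF");
        System.out.println(result[0] + " to " + result[1]); // prints "800 to 1250"
    }
}
```

The lookahead keeps the unit out of the match, so only the number itself is captured. This is the same trick the script relies on when SF_UNIT is set.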

(Sorry for the ugly formatting. The file is attached at the bottom of this post in a ".sss" format which you can import to screen-scraper, preserving the formatting.)

If you encounter any errors or problems, post comments here or on the forum for help. There could very well be cases that have gone untested in this script. We're looking to make it as robust as possible.

/*//// Notes and Information //////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////
Retrieves text in the sessionVariable/dataRecord called "LISTING_MAX_SF".  This will be processed and finally altered by the end of the script to
reflect the parsed data. "LISTING_MIN_SF" will also be set.  By default, the script will return the data to the source type from which it found
"LISTING_MAX_SF".  For example, if this script finds LISTING_MAX_SF in a dataRecord, it will overwrite the value in that dataRecord, and will
create a new entry in the dataRecord called "LISTING_MIN_SF".
Source priority: sessionVariable, dataRecord  (again, by default, values will be returned automatically to the location from which the data was found.)

This script depends on:
* dataRecord / session variable "LISTING_MAX_SF" -- Contains a String of an entire body of text to parse.  This variable is overwritten at the end of
each call to this script.
* session variable "SF_SPLIT_DELIMITER" -- see below.

This script can optionally accept values from:
* SF_IS_ACRES (anything) -- If this variable is set to anything other than null, the script will assume that you are working in acres,
and that you will need to convert your final numbers into SquareFootage for the BuildingSearch database.  SF:Acres ratio is 43,560:1
* SF_RANGE_MARKER (String) -- A string of characters that will inform the script that a range is being encountered.  This token
may contain a regular expression, as it is simply put into a Java "replaceAll" call.  Thus "(abcd|78|a|\\-)" would make the script
interpret all four terms as ways to notate a range (i.e., "abcd", "78", "a", and "-" would all make the script try to find the
proposed range).  The default rangeMarker "-" will be used if this variable is left undefined.
* SF_FORCE_NO_RANGE (anything) -- If this variable is set to anything other than null, range handling will be disabled.  This
may be useful if the default rangeMarker "-" is not desired at all.
* SF_SPLIT_DELIMITER (String) -- Same as "SF_RANGE_MARKER", except that this variable is the token on which the text passed in
"LISTING_MAX_SF" will be split into an array.  If this is left undefined, the splitting feature will be
totally disabled, and the script will parse the body of text as a single line.  If SF_UNIT is left undefined in addition, then
the split delimiter will be forced to " " (single space), as each number in the text will need to be parsed.  Treat this as a plain
token rather than a pattern; note, however, that the script passes it to String.split, which interprets regex metacharacters.
* SF_UNIT (String) -- If there are extraneous numbers in the text, such that they are not followed by some unit that you would
like to limit results to, you may specify a regular expression that will predictably postfix the numbers that ARE in fact relevant.
It will be used for regex "lookahead".  You must NOT include the digits that you are interested in matching.
Be sure to include potential whitespace between the number and the unit you would like to watch for.
Ex: a text containing " Suite 435: 800 SF" may find that setting "SF_UNIT" to "\\sSF" will be useful, as the script will
now ignore any numbers in the text that are not postfixed with the String found in "SF_UNIT".  If SF_UNIT and SF_SPLIT_DELIMITER
are both left blank, SF_SPLIT_DELIMITER will be forced to " ".
* SF_NON_UNIT (String) -- Much like "SF_UNIT", except that this token will instead cause the script to ignore any numbers
postfixed by the String found in "SF_NON_UNIT".  You must match the digits involved with the postfix, so include the "\\d" (or similar)
in the expression for the script to properly dispose of them.  This is simply done via a String.replaceAll(nonUnit, "") call.
Ex: a text containing "Parcel 4A - 250 acres" can be usefully parsed if "SF_NON_UNIT" is set to "\\d+[A-Z]" or
"[pP]arcel\\s\\d+[A-Z]".  The script will ignore matches found by the regular expression in this variable.
* SF_LITTER (String) -- A String that will define individual characters that are acceptably littering the numbers you would like to
preserve.  For instance, numbers that contain "," or "." would require this variable to be set to ".," to tolerate numbers
that have commas and periods littered throughout the number.  If left undefined, the script will automatically tolerate "," as a
valid littering character.  Honestly, this doesn't really need to be manually defined very often.  Only single characters are allowed.
If you write ".,moo" in this variable, the script will tolerate "." "," "m" "o" and "o", all separately.  The effect would be achieved,
however the regex engine will not be matching "moo" as a single token.
* SF_DATA_PUTBACK (String) -- Must contain either "datarecord" or "sessionvariable".  The script will auto-lowercase this String to
check it.  Depending on the value thus contained, the script will put its final answers into the corresponding object.  If it contains
anything other than the above specified values, the script will try to return the data to two session variables named as the contents of
this SF_DATA_PUTBACK variable, with a "_MIN_SF" and "_MAX_SF" postfix.  For example, "TEMP_SF" will produce two variables called
"TEMP_SF_MIN_SF" and "TEMP_SF_MAX_SF".
* SF_DATA_GET (String) -- Must contain either "datarecord", "sessionvariable", or an Integer (i.e., 0, 24, etc., as a String or Integer)
for where to look to get the data we want to process.  If an Integer, the script will look in the current dataSet at the index thus
supplied.  If it contains anything other than the above specified values, the script will try to retrieve the data from a session variable
named as the contents of this SF_DATA_GET variable.
* SF_CALL_SCRIPT_A (String) -- The name of a script that you would like to execute before the script attempts to replace or split
anything in the variable "LISTING_MAX_SF".  When this optional script is called, the variable itself has not yet been retrieved from
its source, so you may access and alter the "LISTING_MAX_SF" variable from the same source that you expect the variable to be retrieved
later in this script.  Be sure to save any changes to the correct location (dataRecord or sessionVariable, etc)
* SF_CALL_SCRIPT_B (String) -- The name of a script that you would like to execute after the script has done basic splitting and
replaceAll calls.  The data will be available in the "LISTING_MAX_SF" variable, and will now be an array, even if splitting did
not occur (ie, 'session.getVariable("LISTING_MAX_SF").length >= 1' at all times).  You must place the postprocessed data back into
the sessionVariable "LISTING_MAX_SF" in order for the changes to be persistent.

*/


import java.util.regex.*;
import java.util.Hashtable;
int putbackToDataSet = -1; // a variable used only when putting back to the dataSet


String body = null;

//\_/\_/\_/\// ERROR CHECKING FROM PUTBACK TYPE GIVEN IN "SF_DATA_PUTBACK"
// There's no need to error check if "SF_DATA_PUTBACK" wants to putback to a session variable

session.log("//\\_/\\_/\\_/\\// ============================");

String dataPutback = session.getVariable("SF_DATA_PUTBACK");
if (dataPutback != null) // if SF_DATA_PUTBACK was defined by the user
{
 dataPutback = dataPutback.toLowerCase().replaceAll("[^a-z_]", "");
}


boolean noRange = false;
temp = session.getVariable("SF_FORCE_NO_RANGE");
if (temp != null)
 noRange = true;


//\_/\_/\_/\// Optional script call to preprocess the data in LISTING_MAX_SF
if (session.getVariable("SF_CALL_SCRIPT_A") != null)
{
 session.log("//\\_/\\_/\\_/\\// Executing variably called script: \"" + session.getVariable("SF_CALL_SCRIPT_A") + "\".");
 session.executeScript(session.getVariable("SF_CALL_SCRIPT_A"));
 session.log("//\\_/\\_/\\_/\\// Finished executing variably called script: \"" + session.getVariable("SF_CALL_SCRIPT_A") + "\".");
}


//\_/\_/\_/\// ERROR CHECKING FROM GET TYPE GIVEN IN "SF_DATA_GET"

String dataGet = session.getVariable("SF_DATA_GET"); // the source instructions, not the actual string to parse
if (dataGet != null) // if SF_DATA_GET was defined by the user
{
 dataGet = dataGet.toLowerCase().replaceAll("[^a-z0-9_]", ""); // normalize the String

 if (dataGet.equals("datarecord")) // if SF_DATA_GET wants to get from the dataRecord
 {
 body = dataRecord.get("LISTING_MAX_SF");
 if (dataPutback == null) // if the putback variable was left undefined, then set it here
 dataPutback = "dataRecord";
 }
 else if (dataGet.equals("sessionvariable"))
 {
 body = session.getVariable("LISTING_MAX_SF");
 if (dataPutback == null) // if the putback variable was left undefined, then set it here
 dataPutback = "sessionvariable";
 }
 else if (!dataGet.replaceAll("\\D", "").equals("")) // if SF_DATA_GET contained some digits
 {
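 int getFromDataSet = Integer.parseInt(dataGet.replaceAll("\\D", ""));
 int numDataRecords = dataSet.getNumDataRecords(); // assumes a dataSet is in scope
 if (getFromDataSet >= numDataRecords) // if the user set SF_DATA_GET to an index that is too large for the in-scope dataSet
 {
 session.log("//\\_/\\_/\\_/\\// You've set SF_DATA_GET to retrieve its data from a dataRecord that is indexed too high (" + getFromDataSet + " when only " + numDataRecords + " exist).  SF_DATA_GET begins its index at 0 and should be strictly less than the total number of dataRecords in the dataSet.");
 session.log("//\\_/\\_/\\_/\\// ============================");
 return;
 }
 body = dataSet.get(getFromDataSet, "LISTING_MAX_SF"); // read the text from the requested dataRecord
 if (dataPutback == null) // assumption: default to session variables, since putting back into a dataSet is not implemented below
 dataPutback = "sessionvariable";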
 }
 else // else, we'll assume that the user wanted to pull from a session variable whose name is given by the string
 {
 body = session.getVariable(dataGet);
 if (dataPutback == null)
 dataPutback = dataGet; // if the dataPutback variable was left undefined, then track the "get" session variable name
 }
}
else // if the user did not give a value for "SF_DATA_GET"
{
 session.log("//\\_/\\_/\\_/\\// Defaulting to sessionVariable \"LISTING_MAX_SF\" for input source.  (See header of this script for notes on sessionVariable \"SF_DATA_GET\" if you wish to force the source.)");
 if (session.getVariable("LISTING_MAX_SF") == null) // if no session variable is available...
 {
 session.log("//\\_/\\_/\\_/\\// sessionVariable \"LISTING_MAX_SF\" is null.  Checking the dataRecord... (This will cause a script problem at line 130 if a dataRecord is not in scope.)");
 body = dataRecord.get("LISTING_MAX_SF"); // ...then get it from the dataRecord (hopefully)
 dataGet = "datarecord";
 if (dataPutback == null) // ...and set the return type to also be dataRecord if it was also not specified
 dataPutback = "datarecord";
 }
 else // if there is a valid session variable to read from...
 {
 body = session.getVariable("LISTING_MAX_SF"); // ...then get it from the session variable
 dataGet = "sessionvariable";
 if (dataPutback == null) // and set the return type to also be sessionVariable if it was also not specified
 dataPutback = "sessionvariable";
 }

 }


//\_/\_/\_/\// Make sure that we have some text to parse, now that we have read from the source wanted in the user specification
if (body == null)
{ session.log("//\\_/\\_/\\_/\\// Error: No text was found in the specified parsing source. \"" + dataGet.toUpperCase() + "\".  SF_DATA_GET might be set wrong, or not at all.");
 session.log("//\\_/\\_/\\_/\\// ============================");
 return;
}

String message = "";
if (session.getVariable("SF_DATA_PUTBACK") == null)
 message = ", the source from which it was taken";
session.log("//\\_/\\_/\\_/\\// This execution of the script is set to return its parsed data into the " + dataPutback.toUpperCase() + message + ".");


//\_/\_/\_/\// Check in with the log
session.log("//\\_/\\_/\\_/\\// The text retrieved was \"" + body + "\".");



String[] bodySplit = null; // the array we'll split stuff into


//\_/\_/\_/\// prep for splitting
String splitDelimiter = session.getVariable("SF_SPLIT_DELIMITER");
if (splitDelimiter == null || splitDelimiter.equals(""))
 splitDelimiter = "";


//\_/\_/\_/\// Prepare for possible SF_UNIT and SF_NON_UNIT usage
String unit = session.getVariable("SF_UNIT"); // things to watch for
String nonUnit = session.getVariable("SF_NON_UNIT"); // things to exclude
if (unit == null)
{
 unit = ""; // if there's no unit supplied, we'll need to parse every number, so split on spaces
 session.log("//\\_/\\_/\\_/\\// Warning: There was no unit supplied in \"SF_UNIT\", which will require that every number in the text is broken up for parsing.");
 if (!splitDelimiter.equals(""))
 session.log("//\\_/\\_/\\_/\\// Warning: Overriding the current split delimiter (\"" + splitDelimiter + "\") with a single space \" \"");
 else
 session.log("//\\_/\\_/\\_/\\// The split delimiter in \"SF_SPLIT_DELIMITER\" was blank, however, by circumstance, it must be set to \" \".  The change will be made automatically, for this execution of the script only.");
 splitDelimiter = " ";
}
if (nonUnit == null)
 nonUnit = "";


//\_/\_/\_/\// Now we finally split, based on the splitting token possibly specified in "SF_RANGE_MARKER" and "SF_SPLIT_DELIMITER"
String rangeMarker = session.getVariable("SF_RANGE_MARKER");
if (!noRange) // If range handling has not been disabled via SF_FORCE_NO_RANGE
{
 if (rangeMarker == null) // If the user left the rangeMarker undefined...
 rangeMarker = "-"; // ...then set the default

 //\_/\_/\_/\// If we're going to split up the numbers to be detected as a range, we need to append the specified unit, if applicable.
 // Replaces all range markers with the unit and splitDelimiter, so that it all gets split up once the call to body.split actually happens.
 // This also excludes cases where there is a rangeMarker, yet no unit to properly accompany it, as in "666-55SF" when unit = "\\s+SF".
 if (!unit.equals(""))
 {
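 // The quantifier after "\s" is optional, so that a unit such as "\\sSF" also becomes a literal " SF" in the replacement text
 body = body.replaceAll("(?<=\\d)\\s*" + rangeMarker + "\\s*(?=\\d+" + unit + ")", unit.replaceAll("\\\\s[+*?]?", " "));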
 session.log("//\\_/\\_/\\_/\\// After splitting up the range and appending the unit (regex definition: \"" + unit + "\"): " + body);
 }
 else
 {
 body = body.replaceAll(rangeMarker, splitDelimiter);
 session.log("//\\_/\\_/\\_/\\// There was no unit supplied in \"SF_UNIT\", so splitting will occur over spaces and range markers.  After splitting up ranges: " + body);
 }
}


if (!unit.equals(""))
{
 session.log("//\\_/\\_/\\_/\\// Set to find ranges around \"" + rangeMarker + "\".");
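 // The quantifier after "\s" or "\b" is optional, so that a unit such as "\\sSF" also becomes a literal " SF" in the replacement text
 body = body.replaceAll(unit, unit.replaceAll("\\\\[sb][+*?]?", " ") + splitDelimiter);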
}


if (splitDelimiter.equals("") && !noRange) // happens when there IS a unit, but no split delimiter was supplied
{
 bodySplit = body.split(rangeMarker);
}
else
{
 session.log("//\\_/\\_/\\_/\\// Set to split on \"" + splitDelimiter + "\".");
 bodySplit = body.split(splitDelimiter);
}


//\_/\_/\_/\// Place the new array back into the session variable (we're ignoring dataPutback here.. it doesn't matter for now), for optionally postprocessing the array
session.setVariable("LISTING_MAX_SF", bodySplit);

//\_/\_/\_/\// Optional script call to postprocess the data in LISTING_MAX_SF
if (session.getVariable("SF_CALL_SCRIPT_B") != null)
{
 session.log("//\\_/\\_/\\_/\\// Executing variably called script: \"" + session.getVariable("SF_CALL_SCRIPT_B") + "\".");
 session.executeScript(session.getVariable("SF_CALL_SCRIPT_B"));
 session.log("//\\_/\\_/\\_/\\// Finished executing variably called script: \"" + session.getVariable("SF_CALL_SCRIPT_B") + "\".");
 bodySplit = session.getVariable("LISTING_MAX_SF"); // this actually creates a reference to the changed array.  This way, changes
 // in the array.length are permitted, yet we can still use the same alias "bodySplit"
 // later in the code.
}
// NOTE: we can't set that temp sessionVariable "LISTING_MAX_SF" storage to null yet, since it might be the data referred to by bodySplit.
// We'll clear it just before writing out to dataRecord, dataSet, or some other specified session variable other than "LISTING_MAX_SF"


//\_/\_/\_/\// Prepare for litter characters
String basicLitter = session.getVariable("SF_LITTER");
String litter = "";
if (basicLitter == null)
 basicLitter = ",";
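for (int j = 0; j < basicLitter.length(); j++)
 litter += "|" + Pattern.quote(String.valueOf(basicLitter.charAt(j))); // quote each character so that "." and similar are matched literally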
basicLitter = null;


//\_/\_/\_/\// Function declaration for use in the parsing loop section

// Strips the line down to only digits, and updates the min/max SF values
void finishAndUpdateMinMax(Hashtable SF, String line)
{
 line = line.replaceAll("\\D", ""); // Destroys all remaining non-digits, leaving only the number(s) we're interested in
 session.log("//\\_/\\_/\\_/\\// After eliminating all non-digits: " + line);
 if (line.equals("") || Pattern.matches("\\s*", line))
 {
 session.log("//\\_/\\_/\\_/\\// No digits were found on this line.");
 return;
 }
 float sfToken = Float.parseFloat(line);
 if (SF.get("min") == 0 || sfToken < SF.get("min"))
 SF.put("min", sfToken);
 if (SF.get("max") == 0 || sfToken > SF.get("max"))
 SF.put("max", sfToken);
}


//\_/\_/\_/\// Begin the actual parsing

// To hold our tracked min and max.  A Hashtable is used so that it can be passed to functions and altered there (i.e., the reference is
// shared with the callee, whereas primitives are always passed by value).
Hashtable SF = new Hashtable();
SF.put("min", new Float(0)); // to track the local SF min
SF.put("max", new Float(0)); // to track the local SF max

for (int i = 0; i < bodySplit.length; i++)
{
 String line = bodySplit[i];
 if (!line.equals(""))
 {
 session.log("//\\_/\\_/\\_/\\// ----------------------------------------");
 session.log("//\\_/\\_/\\_/\\// Processing: " + line);

 if (!nonUnit.equals("")) // if the user specified a nonUnit that we should ignore, then zap it
 {
 line = line.replaceAll(nonUnit, "");
 session.log("//\\_/\\_/\\_/\\// After ignoring non-units: " + line);
 }

 if (!unit.equals("")) // if we were given a unit to watch for, and if we found it in this line
 {
 Pattern p = Pattern.compile(unit); // Get a pattern going
 Matcher m = p.matcher(line); // Link it with the line
 if (m.find()) // Run it against the line
 {
 // This is magic.  :)   We match [digits or litters] that are NOT followed by our desired [digits or litters and then the unit]
 // By doing this, we destroy all numbers that are not important to us, leaving only good numbers and other text
 line = line.replaceAll("(\\d" + litter + ")(?!(\\d" + litter + ")*" + unit + ")", "");
 session.log("//\\_/\\_/\\_/\\// After allowing only numbers with specified unit \"" + unit + "\": " + line);

 // Test for more matches in the line
 if (m.find())
 session.log("found another one.");

 finishAndUpdateMinMax(SF, line);
 }
 else
 {
 session.log("//\\_/\\_/\\_/\\// This line does not contain the specified unit \"" + unit + "\".");
 }
 }
 else
 {
 finishAndUpdateMinMax(SF, line);
 }
 }
}

session.log("//\\_/\\_/\\_/\\// ============================");


if (SF.get("max") == 0) // If the parse yielded no results
{
 SF.put("min", 0);
 session.setVariable("LISTING_MAX_SF", null);
 session.setVariable("_LISTING_MODIFIABLE", "FALSE"); // If there's no available room, then don't insert it into the database.
 session.log("//\\_/\\_/\\_/\\// Warning: A zero was determined to be the largest number in the text.  This listing will not be inserted.");
 session.log("//\\_/\\_/\\_/\\// ============================");
 return;
}


//\_/\_/\_/\// Convert from acres to SF if needed
if (session.getVariable("SF_IS_ACRES") != null)
{
 session.log("//\\_/\\_/\\_/\\// Variable \"SF_IS_ACRES\" is set.  Numbers will now be converted to square feet from acres.");
 SF.put("min", SF.get("min") * 43560);
 SF.put("max", SF.get("max") * 43560);
}


//\_/\_/\_/\// Put the data back where the user wants it
String varName = "LISTING";

if (dataPutback.equalsIgnoreCase("datarecord")) // If we want to putback to the dataRecord in scope (the value may be "dataRecord" or the lowercased "datarecord")
{
 session.log("//\\_/\\_/\\_/\\// Putting the data into the current DATARECORD as:");
 dataRecord.put(varName + "_MIN_SF", SF.get("min").intValue().toString());
 dataRecord.put(varName + "_MAX_SF", SF.get("max").intValue().toString());
 session.setVariable("LISTING_MAX_SF", null); // we used this as a temp variable earlier.  If the user wants to putback to the dataRecord,
 // then we don't want this temp value to persist.
}
else if (dataPutback.equals("sessionvariable")) // If we want to putback to the "LISTING_MIN_SF" and "LISTING_MAX_SF" session variables
{
 session.log("//\\_/\\_/\\_/\\// Putting the data into SESSIONVARIABLES as:");
 session.setVariable(varName + "_MIN_SF", SF.get("min").intValue().toString());
 session.setVariable(varName + "_MAX_SF", SF.get("max").intValue().toString());
}
else // If we want to putback to custom sessionVariable names + "_MIN_SF"/"_MAX_SF"
{
 varName = dataPutback;
 session.log("//\\_/\\_/\\_/\\// Putting the data into SESSIONVARIABLES as: \"" + dataPutback + "_MIN_SF\" and \"" + dataPutback + "_MAX_SF\".");
 session.setVariable(varName + "_MIN_SF", SF.get("min").intValue().toString());
 session.setVariable(varName + "_MAX_SF", SF.get("max").intValue().toString());
 session.setVariable("LISTING_MAX_SF", null); // we used this as a temp variable earlier.  If the user wants to putback to custom session
 // variables, then we don't want this temp value to persist.
}


session.log("//\\_/\\_/\\_/\\// " + varName + "_MIN_SF: " + SF.get("min").intValue().toString());
session.log("//\\_/\\_/\\_/\\// " + varName + "_MAX_SF: " + SF.get("max").intValue().toString());
session.log("//\\_/\\_/\\_/\\// ============================");
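
The range handling and acre conversion in the script can be tried in isolation. The following is a simplified standalone Java sketch, not a drop-in replacement: the class RangeSketch and the method parseRange are hypothetical names, the range marker is treated as a literal string rather than a regular expression, and every non-digit is simply discarded. The 43,560:1 ratio is the one given in the script header.

```java
import java.util.regex.Pattern;

class RangeSketch {
    static final int SF_PER_ACRE = 43560; // ratio stated in the script header

    /** Splits text on the range marker and returns {min, max}, converted from acres if asked. */
    public static long[] parseRange(String text, String rangeMarker, boolean isAcres) {
        long min = 0; // 0 means "not set yet", mirroring the script's convention
        long max = 0;
        // Quote the marker so characters like "-" or "." are taken literally.
        for (String piece : text.split(Pattern.quote(rangeMarker))) {
            String digits = piece.replaceAll("\\D", ""); // drop commas, units, whitespace
            if (digits.isEmpty()) {
                continue; // nothing numeric in this piece
            }
            long value = Long.parseLong(digits);
            if (min == 0 || value < min) min = value;
            if (max == 0 || value > max) max = value;
        }
        if (isAcres) {
            min *= SF_PER_ACRE;
            max *= SF_PER_ACRE;
        }
        return new long[] { min, max };
    }

    public static void main(String[] args) {
        long[] sf = parseRange("1,200 - 4,500 SF", "-", false);
        System.out.println(sf[0] + " to " + sf[1]); // prints "1200 to 4500"
        long[] acres = parseRange("2 - 5 acres", "-", true);
        System.out.println(acres[0] + " to " + acres[1]); // prints "87120 to 217800"
    }
}
```

Because every non-digit is thrown away, a piece that holds two numbers would have them fused into one. That is why the text is split on the range marker before the digits are extracted.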

Attachment Size
SF (Script).sss 22.06 KB