Regular Expression Help

Introduction

Regular Expressions, often abbreviated to simply "Regex", are the power and flexibility behind a scraping session. While we won't go into the details of how they work (information that is readily available on the web, for instance at regular-expressions.info), we do want to give some useful pointers about using them.

There are two places where you will use regular expressions in screen-scraper: on extractor tokens and in scripts. The two differ slightly, so we will discuss each in more detail according to type.

Extractor Tokens

On your extractor tokens, regular expressions help you gather only the information that you want. screen-scraper ships with the most common regular expressions for screen scraping already added to the system. They can be selected in the general tab of the extractor token editor.

You may edit screen-scraper's regular expressions at any time by clicking Edit regular expressions in the Options menu.

For a detailed list and explanation of the built-in regular expressions for extractor tokens as well as some other helpful expressions see our page on helpful regular expressions.

The regular expression parser that screen-scraper uses internally is a PERL-compatible parser. This can be important to those writing their own expressions.

Scripts

Scripts are parsed by the scripting language before the regular expression is used, and this affects how expressions have to be formatted. The details depend on the language that you are using in screen-scraper. Examples of the changes that are necessary in Java are available in our Java regular expression help.

Helpful Regular Expressions

Built In Regular Expressions

screen-scraper comes with a number of pre-built regular expressions that can help you extract information. While many of these built-in expressions are pretty self-explanatory, here are some extra notes about their more subtle effects.

These expressions are not perfect: some will match things that are not what you want, and others will not match every possible variation. Their goal is not to work in all cases but to work correctly in the most common ones. They have been used extensively in-house and have proven effective enough to earn a place in the software install. Explanations of how they work are provided so that you can adjust them as you see fit for your projects.

General

  • Number [\d,]+

    Matches multiple (+) characters that are ([]) either digits (\d) or commas (,).

    Matches whole numbers, with or without comma separators.

    Examples

    <a href="search_results.php?page=~@NEXT_PAGE@~

  • Floating-point number [+-]?\s?\d*\.\d+

    Matches a number with an optional (?) positive/negative prefix ([+-]), followed by an optional (?) space (\s), then as many (*) digits (\d) as are present before a dot/period (\.), followed by at least one (+) digit (\d). That last part, the digits with the period in the middle, is particularly flexible: it will match a number less than 1 written without a leading zero, such as '.1337', yet will still match more ordinary numbers like '6.02214179' or even '-234.991'.

    A floating-point number is any number containing a decimal point (if it has more than one decimal point it would usually be referred to as a reference and not a number). They are particularly common with percentages.

    Examples

    Sales growth for June: ~@POSITIVE_OR_NEGATIVE_FLOATING_POINT_NUMBER@~%

  • Dollar Amount [\d,]+\.\d{2}|[\d,]+

    Matches at least one (+) digit (\d) and/or comma (,) followed by a dot/period (\.) and two ({2}) digits (\d) or (|) one or more (+) digits (\d) and commas (,).

    This can be used to match a US dollar amount with or without cents listed. If you are using it for a country that swaps the roles of the period and comma, swap them in the expression and it will work.

    Examples

    Cost: $~@DOLLAR_AMOUNT@~

  • Email address [\w.-]+@[\w.-]+\w+

    While this expression doesn't look complicated, it's quite powerful; it will match addresses between single or double quotes, parentheses, spaces, etc.

    Examples

    <a href="mailto:~@EMAIL_ADDRESS_IN_LINK@~">

    ... by email at ~@EMAIL_ADDRESS_IN_PARAGRAPH_WITHOUT_A_LINK@~.

  • U.S. date \d{1,2}[-/. ]+\d{1,2}[-/. ]+\d{2,4}

    Matches one or two ({1,2}) digits (\d); at least one (+) separator character that is a hyphen (-), forward slash (/), dot/period (.), or space ( ); one or two more digits followed by another run of separators; and finally two to four ({2,4}) digits (\d).

    Matches full, numeric US dates. It does not handle textual months or days with suffixes, but it is a good general-purpose pattern.

    Examples

    Last Updated: ~@PUBLISHED_DATE@~
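
As a rough way to try the General expressions outside of screen-scraper, the sketch below runs each one through java.util.regex against sample text modeled on the examples above. The class name GeneralExpressions, the helper firstMatch, and the sample strings are made up for illustration, and the floating-point expression is written with the dot escaped (\.), as its description specifies. An extractor token is matched in the context of the pattern around it, so a bare search like this only approximates what a token would capture.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GeneralExpressions {
    // The built-in expressions, with backslashes doubled for Java string literals.
    static final String NUMBER = "[\\d,]+";
    static final String FLOAT = "[+-]?\\s?\\d*\\.\\d+";
    static final String DOLLAR = "[\\d,]+\\.\\d{2}|[\\d,]+";
    static final String EMAIL = "[\\w.-]+@[\\w.-]+\\w+";
    static final String US_DATE = "\\d{1,2}[-/. ]+\\d{1,2}[-/. ]+\\d{2,4}";

    // Hypothetical helper: returns the first substring of text that regex matches, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(NUMBER, "search_results.php?page=12"));     // 12
        System.out.println(firstMatch(FLOAT, "Sales growth for June: -3.5%"));    // -3.5
        System.out.println(firstMatch(DOLLAR, "Cost: $1,299.99"));                // 1,299.99
        // The trailing \w+ keeps the sentence-ending period out of the match.
        System.out.println(firstMatch(EMAIL, "Reach us at jane@example.com."));   // jane@example.com
        System.out.println(firstMatch(US_DATE, "Last Updated: 7/4/2011"));        // 7/4/2011
    }
}
```

The email case shows why the expression ends in \w+: without it, the period that closes the sentence would be captured as part of the address.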

HTML

  • HTML whitespace [(&nbsp;)\t\s]*

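    This differs from the expression that went out with the 5.0 release, which contained an error. If you have that release and would like to correct it, change the expression in the regular expression editor.

    Matches as many (*) characters as are available from a set ([]) that covers the characters of the HTML non-breaking space entity (&nbsp;), tab (\t), and whitespace (\s). Because a character class matches one character at a time, the parentheses and the individual letters n, b, s, and p are also matched on their own, so the token can consume those characters when they directly follow the whitespace.

    Some sites add space around words in strange and inconsistent ways; this expression helps you cut through the inconsistencies of whitespace.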

    Examples

    Name:~@whitespace@~~@NAME@~

  • HTML tag parameters [^>]*

    The pattern will match any number (*) of characters that are not ([^]) a greater than (>).

    This is used primarily to make extractor patterns that reference tags more stable if attributes are added, changed, or removed. Place this on a token, put a greater than (>) after the token, and you will match all of the characters between the token and the end of the tag. If you're using a parameter as a hook for the extractor pattern, you can add a token with this expression before and after it to get the same result.

    Examples

    <h2~@unneeded_parameters@~>

    <a~@unneeded_parameters@~href="somelink.php"~@unneeded_parameters@~>

  • Non-HTML tags [^<>]*

    Matches as many (*) characters as it can that are not ([^]) a less than (<) or greater than (>) sign.

    Grabs all the text from a starting point until it reaches an HTML tag. This is helpful when you don't want to specify whether you are inside or outside a tag. The HTML tag parameters expression is specifically for use inside a tag.

    Examples

    <h1~@unneeded_parameters@~>~@TITLE@~<

  • Non-double quotes [^"]*

    Matches as many (*) characters as are available that are not ([^]) double-quotes (").

    Great for extracting attribute values from tags.

    Examples

    <a href="~@LINK_URL@~">

  • Non-single quotes [^']*

    Matches as many (*) characters as are available that are not ([^]) single-quotes/apostrophes (').

    For extracting attribute values that are in single quotes instead of double.

    Examples

    <a href='~@LINK_URL@~'>

  • URL GET parameter [^&"]*

    Similar to the example just above, this pattern matches any number (*) of characters that are not ([^]) either an ampersand (&) or a double quote (").

    Extracts individual GET parameters from a link on a page without having to parse it manually. The ampersand (&) delimits parameters and a double quote ends the href attribute.

    Examples

    href="somepage.asp?passedparameter=2&another=~@PARAMETER_VALUE@~&evenmore=mary%20poppins"
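
The HTML expressions can be tried the same way. In the sketch below each token is stood in for by a plain regular expression, with a capture group in place of the token whose value is wanted; this illustrates the expressions themselves and is not a description of how screen-scraper evaluates extractor patterns. The class name HtmlExpressions, the helper capture, and the sample markup are assumptions. The last lines also show a side effect of the HTML whitespace character class, together with a grouped alternative, (?:&nbsp;|\s)*, which is a suggestion here and not one of the built-in expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlExpressions {
    // Built-in HTML whitespace expression and a suggested (not built-in) alternative.
    static final String HTML_WS = "[(&nbsp;)\\t\\s]*";
    static final String GROUPED_WS = "(?:&nbsp;|\\s)*";

    // Hypothetical helper: returns capture group 1 of the first match, or null.
    static String capture(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String link = "<a class=\"nav\" href=\"somepage.asp?passedparameter=2&another=42\">Next</a>";

        // <a~@unneeded_parameters@~href="~@LINK_URL@~">
        System.out.println(capture("<a[^>]*href=\"([^\"]*)\">", link));
        // prints somepage.asp?passedparameter=2&another=42

        // another=~@PARAMETER_VALUE@~
        System.out.println(capture("another=([^&\"]*)", link)); // 42

        // <h1~@unneeded_parameters@~>~@TITLE@~<
        System.out.println(capture("<h1[^>]*>([^<>]*)<", "<h1 id=\"top\">Widget Catalog</h1>"));
        // prints Widget Catalog

        // Name:~@whitespace@~~@NAME@~
        System.out.println(capture("Name:" + HTML_WS + "([^<>]*)", "Name:&nbsp; Alice<br>")); // Alice
        // The class also matches a lone "b", so the first letter of the name is lost.
        System.out.println(capture("Name:" + HTML_WS + "([^<>]*)", "Name: ben<br>"));    // en
        System.out.println(capture("Name:" + GROUPED_WS + "([^<>]*)", "Name: ben<br>")); // ben
    }
}
```

Whether the grouped form is worth swapping in depends on your data; it only matters when the text after the whitespace can begin with one of the characters (, ), &, n, b, s, p, or ;.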

Phone Numbers

  • 7-digit phone number \d{3}[. -]*\d{4}

    Matches three ({3}) digits (\d) followed by as many (*) dots/periods (.), spaces ( ), and hyphens (-) as are present then ending with four ({4}) digits (\d).

    The flexibility of this expression makes it so that it can match 7-digit phone numbers in a wide variety of formats including such variations as 555-5236, 555 - 5236, 555.5236, 555 5236, and 5555236 or any combination of these. On the internet 7-digit phone numbers tend to pop up less often than 10-digit ones, but regionalized sites will use them sometimes.

    Examples

    ... for more information call ~@PHONE_NUMBER@~.

  • 10-digit phone number \(?\s*\d{3}[). -]*\d{3}[. -]*\d{4}

    Matches an optional (?) left parenthesis (\(); as many (*) following spaces (\s) as are present; three ({3}) digits (\d); as many (*) right parentheses ()), dots/periods (.), spaces ( ), and hyphens (-) as are present; three ({3}) digits (\d); as many (*) dots/periods (.), spaces ( ), and hyphens (-) as are present; and finally four ({4}) digits (\d).

    The flexibility of this expression makes it so that it can match 10-digit phone numbers in a wide variety of formats including such variations as (555) 555-5236, ( 555 ) 555-5236, 555.555.5236, (555) 555 - 5236, 555-555-5236, 555 555 5236, and 5555555236 or any combination of these.

    Examples

    ... for more information call ~@PHONE_NUMBER@~.
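
The format variations listed for both phone expressions can be checked with a few lines of Java. This is a minimal sketch: the class name PhoneNumbers and its methods are hypothetical, and digitsOnly is an optional clean-up step of the kind you might run in a script after extraction, not something the expressions do themselves.

```java
import java.util.regex.Pattern;

public class PhoneNumbers {
    static final Pattern SEVEN_DIGIT = Pattern.compile("\\d{3}[. -]*\\d{4}");
    static final Pattern TEN_DIGIT =
            Pattern.compile("\\(?\\s*\\d{3}[). -]*\\d{3}[. -]*\\d{4}");

    // True when the whole string is a 7-digit phone number.
    static boolean isSevenDigit(String s) {
        return SEVEN_DIGIT.matcher(s).matches();
    }

    // True when the whole string is a 10-digit phone number.
    static boolean isTenDigit(String s) {
        return TEN_DIGIT.matcher(s).matches();
    }

    // Hypothetical clean-up step: strips every non-digit character.
    static String digitsOnly(String phone) {
        return phone.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        String[] seven = { "555-5236", "555 - 5236", "555.5236", "555 5236", "5555236" };
        for (String s : seven) {
            System.out.println(s + " -> " + isSevenDigit(s)); // true for each
        }
        String[] ten = { "(555) 555-5236", "( 555 ) 555-5236", "555.555.5236",
                         "(555) 555 - 5236", "555-555-5236", "555 555 5236", "5555555236" };
        for (String s : ten) {
            System.out.println(s + " -> " + isTenDigit(s)); // true for each
        }
        System.out.println(digitsOnly("( 555 ) 555-5236")); // 5555555236
    }
}
```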

Addresses

  • State abbreviation [A-Z]{2}

    Matches two ({2}) characters that are ([]) capital/uppercase letters (A-Z).

    Usually when working with an address it is easier to take it out in parts or remove it all and parse it. This helps with the first method.

    Examples

    36 Mulberry Ln. Salt Lake City, ~@STATE@~ 84101

  • 5-digit U.S. zip code \d{5}

    Matches five ({5}) digits (\d).

    Usually when working with an address it is easier to take it out in parts or remove it all and parse it. This helps with the first method.

    Examples

    36 Mulberry Ln. Salt Lake City, UT ~@ZIP@~

  • 5/9-digit U.S. zip code \d{5}[-\d]{5}|\d{5}

    Matches five ({5}) digits (\d) followed by five ({5}) characters that are ([]) hyphens (-) or digits (\d), or (|) just five ({5}) digits (\d).

    When zip codes are not consistently five or nine digits this pattern will match either.

    Examples

    36 Mulberry Ln. Salt Lake City, UT ~@ZIP@~
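
Taking an address out in parts, as described above, might look like the following in plain Java. The sketch puts the state and zip expressions into one pattern with a capture group for each; the class name AddressTail and the method stateAndZip are hypothetical. It also includes a stricter zip expression, \d{5}(?:-\d{4})?, as an assumption of what you might prefer if the looseness of [-\d]{5} (which accepts any mix of five hyphens and digits) causes trouble; it is not one of the built-in expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressTail {
    // ", ~@STATE@~ ~@ZIP@~" with the built-in state and 5/9-digit zip expressions.
    static final Pattern STATE_ZIP =
            Pattern.compile(", ([A-Z]{2}) (\\d{5}[-\\d]{5}|\\d{5})");
    static final Pattern BUILT_IN_ZIP = Pattern.compile("\\d{5}[-\\d]{5}|\\d{5}");
    // Suggested stricter form (not built in): five digits, optionally "-" and four more.
    static final Pattern STRICT_ZIP = Pattern.compile("\\d{5}(?:-\\d{4})?");

    // Returns {state, zip}, or null when the address has no recognizable tail.
    static String[] stateAndZip(String address) {
        Matcher m = STATE_ZIP.matcher(address);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] a = stateAndZip("36 Mulberry Ln. Salt Lake City, UT 84101");
        System.out.println(a[0] + " / " + a[1]); // UT / 84101
        String[] b = stateAndZip("36 Mulberry Ln. Salt Lake City, UT 84101-1234");
        System.out.println(b[0] + " / " + b[1]); // UT / 84101-1234

        // The built-in expression accepts this malformed value; the strict one does not.
        System.out.println(BUILT_IN_ZIP.matcher("84101-----").matches()); // true
        System.out.println(STRICT_ZIP.matcher("84101-----").matches());   // false
    }
}
```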

Other Helpful Expressions that are not Built-in

  • HTML Hexadecimal color [\da-fA-F]{3,6}

    Matches three to six ({3,6}) hex characters ([\da-fA-F]). The range is for HTML, where the browser will expand a three-character code such as 4aF to 44aaFF. Though a four- or five-digit hex value is not valid in any format, the expression accepts them as the price of conveniently matching both three- and six-digit values. For those not familiar with hex numbers, they are base 16 numbers and so use our base 10 digits (0-9) and then the first six letters (a-f) as their digits.

    If you only wanted to allow combinations of 3 and 6 characters for the HEX value you could use [\da-fA-F]{3}([\da-fA-F]{3})?

    Often you'll come across tables in your scrapes that use an alternating color scheme, so that every other row has a different color than the rest. While you could use a simple 'Non-double quotes' pattern to match it, you sometimes need to be more specific to keep from matching extraneous data on the page. It's still possible that a table uses a color keyword, like "black" or "mintcream", so this won't be a fix-all solution. But if you know the color will be a hex number, you might as well use this pattern.

    Examples

    <table bgcolor="#~@HEX_NUMBER@~" width="600px">

  • Match anything EXCEPT a given word between HTML tags. (?:(?!Foo).)[^><]*

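    Matches text between HTML tags as long as it does not begin with the word Foo. The negative lookahead ((?!Foo)) is checked only at the first character; the rest of the text up to the next tag is matched by [^><]*.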
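
The behavior of these two expressions is easiest to see by running them. In this sketch the class name OtherExpressions, its method names, and the table-cell markup are made up for illustration; the expressions themselves are the ones given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OtherExpressions {
    static final Pattern LOOSE_HEX = Pattern.compile("[\\da-fA-F]{3,6}");
    static final Pattern STRICT_HEX = Pattern.compile("[\\da-fA-F]{3}([\\da-fA-F]{3})?");
    // <td>~@TOKEN@~</td> with the "anything except Foo" expression on the token.
    static final Pattern NOT_FOO_CELL = Pattern.compile("<td>((?:(?!Foo).)[^><]*)</td>");

    static boolean looseHex(String s) {
        return LOOSE_HEX.matcher(s).matches();
    }

    static boolean strictHex(String s) {
        return STRICT_HEX.matcher(s).matches();
    }

    // Returns the text of the first cell whose content does not begin with "Foo", or null.
    static String firstCellNotStartingWithFoo(String html) {
        Matcher m = NOT_FOO_CELL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(looseHex("4aF") + " " + strictHex("4aF"));       // true true
        System.out.println(looseHex("44aaFF") + " " + strictHex("44aaFF")); // true true
        System.out.println(looseHex("4aF1") + " " + strictHex("4aF1"));     // true false

        // The first cell is skipped; "Foo" later in the text does not block a match.
        System.out.println(firstCellNotStartingWithFoo(
                "<td>Foo Fighters</td><td>Bar Foo</td>")); // Bar Foo
    }
}
```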

Java Regular Expression Help

Escaping Characters

Java string literals use the same escape character (\) that PERL regular expressions do, so a character that must be escaped in the regular expression has to be escaped a second time for Java. This can be a little confusing, so here are some examples.

The replaceAll method is a String method available in Java that uses a PERL-style regex to match characters. The second parameter is the text that each match is replaced with.

Unless noted otherwise, these examples replace a character with itself; the purpose is only to show what the regex would look like.

// match a single \ (not an escape character)
// it has to be escaped in PERL so \\
// then both have to be escaped in Java so \\\\
// the replacement also treats \ as an escape character, so a literal \
// must be written as \\ there and then escaped for Java, giving \\\\
value = value.replaceAll("\\\\", "\\\\");

// match a * (not a quantity definition)
value = value.replaceAll("\\*", "*");

// match a ? (not a quantity definition)
value = value.replaceAll("\\?", "?");

// match a " without causing issue with the regex
// representation as a string in Java
// This one replaces it with a single quote
value = value.replaceAll("\"", "\'");

// match a | (not an or qualifier)
value = value.replaceAll("\\|", "|");
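
When the text to find or the replacement is meant literally, the standard library can do the escaping for you. The sketch below assumes plain Java rather than any screen-scraper-specific API: Pattern.quote and Matcher.quoteReplacement are part of java.util.regex, and String.replace performs a literal replacement with no regular expression at all. The class name LiteralReplace and the helper replaceLiteral are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralReplace {
    // Hypothetical helper: replaces every literal occurrence of target with replacement.
    // Pattern.quote neutralizes regex metacharacters in the target, and
    // Matcher.quoteReplacement neutralizes \ and $ in the replacement.
    static String replaceLiteral(String value, String target, String replacement) {
        return value.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("C:\\temp\\logs", "\\", "/")); // C:/temp/logs
        System.out.println(replaceLiteral("2*3", "*", "\\"));            // 2\3
        System.out.println(replaceLiteral("how much?", "?", "$"));       // how much$

        // String.replace never interprets its arguments as a regex.
        System.out.println("a|b".replace("|", ","));                     // a,b
    }
}
```

Hand-written escapes, as in the examples above, remain the better fit when only part of the expression is literal.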

Using Groups

When extracting a complex data set like an address it is sometimes easier to extract the whole group and do the breakdown using regular expressions in your scripts. This allows you to harness the power of some of the finer features of regular expressions. In this example we will show how to take an extracted address and break it into its parts.

// Import Java regex
import java.util.regex.*;

String address = "";
String apartment = "";

// Backslashes must be doubled for the Java regex to receive them.
// In this pattern, we're making use of both grouping and the OR bar "|".
// The \\s* after the optional comma allows for input such as "12 Main St, Apt 4".
Pattern p = Pattern.compile("(\\d+[\\w\\s]+),?\\s*(Apt|#|Suite)\\s(\\d+)");
Matcher m = p.matcher(dataRecord.get("ADDRESS_LINE"));

// Begins the matching process, and tests to see if any matches were made
if (m.find()) {
    address = m.group(1).trim(); // # and street name, without trailing whitespace
    apartment = m.group(3); // Apartment or suite number

    // We skipped 'm.group(2)' because group(2) refers to the '(Apt|#|Suite)' part, which isn't as relevant.
    // If you want to keep the 'Apt' or 'Suite' prefix, use the following line instead:
    // apartment = m.group(2) + " " + m.group(3);
}

// Places the modified values back into the dataRecord
dataRecord.put("ADDRESS", address);
dataRecord.put("APARTMENT_NUMBER", apartment);

You can play with the pattern. The basic idea is that each group, defined by a pair of parentheses, can be selected using the group method, which lets you easily get at one part of what was matched instead of the all or nothing that extractor tokens, by their very nature, have to work with.
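
As one way of playing with the pattern, the sketch below uses named groups, which java.util.regex supports from Java 7 on (whether your installation runs on such a version is an assumption here). It is a standalone variation, not the script above: a LinkedHashMap stands in for the dataRecord, the street group is made lazy and the whitespace around the unit type optional, and the names AddressParts, parse, street, unitType, and unit are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressParts {
    // Named groups let the code refer to parts by name instead of by position.
    private static final Pattern ADDRESS = Pattern.compile(
            "(?<street>\\d+[\\w\\s]+?),?\\s*(?<unitType>Apt|#|Suite)\\s*(?<unit>\\d+)");

    // Returns the address parts, or an empty map when the line does not match.
    static Map<String, String> parse(String addressLine) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = ADDRESS.matcher(addressLine);
        if (m.find()) {
            parts.put("ADDRESS", m.group("street").trim());
            parts.put("UNIT_TYPE", m.group("unitType"));
            parts.put("APARTMENT_NUMBER", m.group("unit"));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("123 Main St, Apt 4"));
        // {ADDRESS=123 Main St, UNIT_TYPE=Apt, APARTMENT_NUMBER=4}
        System.out.println(parse("77 Elm Road #12"));
        // {ADDRESS=77 Elm Road, UNIT_TYPE=#, APARTMENT_NUMBER=12}
    }
}
```

Named groups keep the script readable when a group is later added to or removed from the pattern, since the remaining names do not shift the way numeric indexes do.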