Helpful Regular Expressions

Built In Regular Expressions

screen-scraper comes with a number of pre-built regular expressions that can help you extract information. While many of these built-in expressions are fairly self-explanatory, here are some extra notes on their subtler effects.

These expressions are not perfect. Some will match things you do not want, and others will not match every possible variation. The goal is not to work in all cases but to work correctly in the most common ones. They have been used extensively in-house, and they earned their place in the software install by getting the job done correctly. Explanations of how they work are provided so that you can adjust them as you see fit for your projects.

General

Number [\d,]+

Matches multiple (+) characters that are ([]) either digits (\d) or commas (,).

Matches whole numbers, including ones written with comma separators such as 1,204.

Examples

<a href="search_results.php?page=~@NEXT_PAGE@~
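As a rough illustration outside of screen-scraper, the sketch below tries the Number expression with Python's re module. The sample HTML and the variable names are made up for the example, and Python is only standing in for screen-scraper's own regex engine.

```python
import re

# The built-in "Number" expression.
NUMBER = re.compile(r"[\d,]+")

# Hypothetical page text; "12" sits where ~@NEXT_PAGE@~ would be.
html = '<a href="search_results.php?page=12">Next</a> (1,204 results)'

matches = NUMBER.findall(html)
print(matches)  # ['12', '1,204']

# Commas are part of the match, so remove them before converting.
values = [int(m.replace(",", "")) for m in matches]
print(values)  # [12, 1204]

# The expression is deliberately loose: a lone comma also matches.
print(NUMBER.findall("a, b"))  # [',']
```

The last line shows the kind of imperfect match mentioned in the introduction, so it can be worth checking that a match contains at least one digit before converting it.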
Floating-point number [+-]?\s?\d*\.\d+

Matches a number with an optional (?) positive/negative prefix ([+-]), followed by an optional (?) space (\s), then as many (*) digits (\d) as are present before a dot/period (\.), followed by at least one (+) digit (\d). That last part, the digits with the period in the middle, is particularly flexible: it will match a number less than 1 written without a leading zero, such as .1337, yet will still match more typical numbers like 6.02214179 or even -234.991.

A floating-point number is any number containing a decimal point (a value with more than one decimal point would usually be considered a reference of some kind rather than a number). They are particularly common with percentages.

Examples

Sales growth for June: ~@POSITIVE_OR_NEGATIVE_FLOATING_POINT_NUMBER@~%
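The following sketch tries the floating-point expression in Python's re module, using the escaped-dot form (\.) that the explanation describes. The helper name first_float and the sample strings are assumptions made for illustration only.

```python
import re

# Floating-point expression with the period escaped, as described in the text.
FLOAT = re.compile(r"[+-]?\s?\d*\.\d+")

def first_float(text):
    """Return the first floating-point number in text, or None."""
    m = FLOAT.search(text)
    if m is None:
        return None
    # \s? can pull a whitespace character into the match; drop it before converting.
    return float("".join(m.group(0).split()))

print(first_float("Sales growth for June: -3.5%"))  # -3.5
print(first_float(".1337"))                         # 0.1337
print(first_float("42"))                            # None (no decimal point)
```

Because the expression requires a period, plain whole numbers are skipped, which is why the last call returns None.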
Dollar Amount [\d,]+\.\d{2}|[\d,]+

Matches one or more (+) digits (\d) or commas (,) followed by a dot/period (\.) and exactly two ({2}) digits (\d), or (|) just one or more (+) digits (\d) or commas (,).

This can be used to match a US dollar amount with or without cents listed. If you are using it for a country that swaps the roles of the period and comma, swap them in the expression and it will work.

Examples

Cost: $~@DOLLAR_AMOUNT@~
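A minimal sketch of the dollar amount expression in Python follows. The parse_dollars helper and the sample strings are hypothetical, and the code assumes the matched text contains at least one digit.

```python
import re
from decimal import Decimal

# The built-in "Dollar Amount" expression.
DOLLAR = r"[\d,]+\.\d{2}|[\d,]+"

def parse_dollars(text):
    """Return the first amount that follows a '$' as a Decimal, or None."""
    # The parentheses keep the alternation (|) from splitting the whole pattern.
    m = re.search(r"\$(" + DOLLAR + r")", text)
    if m is None:
        return None
    return Decimal(m.group(1).replace(",", ""))

print(parse_dollars("Cost: $1,299.99"))  # 1299.99
print(parse_dollars("Cost: $45"))        # 45
```

The branch with cents is listed first because the regex engine takes the first alternative that succeeds. With the branches reversed, $1,299.99 would be matched only as 1,299.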
Email address [\w.-]+@[\w.-]+\w+

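Matches one or more (+) word characters (\w), dots/periods (.), or hyphens (-), followed by an at sign (@), followed by more of the same, and ending with at least one (+) word character (\w) so that a trailing period is not included.

While this expression doesn't look complicated, it's quite powerful; it will match addresses between single or double quotes, parentheses, spaces, etc.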

Examples

<a href="mailto:~@EMAIL_ADDRESS_IN_LINK@~">

... by email at ~@EMAIL_ADDRESS_IN_PARAGRAPH_WITHOUT_A_LINK@~.
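To see how the email expression behaves in both of the situations above, here is a small Python sketch. The sample HTML and addresses are invented for the example.

```python
import re

# The built-in "Email address" expression.
EMAIL = re.compile(r"[\w.-]+@[\w.-]+\w+")

# Hypothetical page text: one address in a mailto link, one at the end of a sentence.
html = ('<a href="mailto:sales@example.com">Sales</a> '
        'or by email at support@example.org.')

addresses = EMAIL.findall(html)
print(addresses)  # ['sales@example.com', 'support@example.org']
```

The trailing \w+ is what keeps the sentence-ending period out of the second address, since the match has to end on a word character.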
U.S. date \d{1,2}[-/. ]+\d{1,2}[-/. ]+\d{2,4}

Matches one or two ({1,2}) digits (\d); at least one (+) separator character that is a hyphen (-), forward slash (/), dot/period (.), or space ( ); one or two more digits and another run of separators; and finally two to four ({2,4}) digits (\d).

Matches full, numeric US dates. It does not handle textual months or days with suffixes (such as July 16th), but it is a good standard.

Examples

Last Updated: ~@PUBLISHED_DATE@~
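Once a date has been extracted, it usually needs to be turned into a real date value. The sketch below shows one way to do that in Python. The parse_us_date helper is hypothetical, and treating two-digit years as 20xx is an assumption you may need to change.

```python
import re
from datetime import datetime

# The built-in "U.S. date" expression.
US_DATE = re.compile(r"\d{1,2}[-/. ]+\d{1,2}[-/. ]+\d{2,4}")

def parse_us_date(text):
    """Return the first month/day/year date in text as a date, or None."""
    m = US_DATE.search(text)
    if m is None:
        return None
    month, day, year = re.split(r"[-/. ]+", m.group(0))
    year = int(year)
    if year < 100:
        year += 2000  # assumption: two-digit years belong to the 2000s
    # datetime raises ValueError for impossible dates such as 13/45/2010.
    return datetime(year, int(month), int(day)).date()

print(parse_us_date("Last Updated: 07/16/2010"))  # 2010-07-16
print(parse_us_date("Last Updated: 7-4-09"))      # 2009-07-04
```

The expression only checks the shape of the text, so the range checking is left to datetime.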

HTML

HTML whitespace [(&nbsp;)\t\s]*

This differs from the expression that shipped with the 5.0 release, which contained an error. If you would like to correct it in your installation, change it in the regex editor.

Matches as many (*) characters ([]) as are available that are either an HTML non-breaking space entity (&nbsp;), tab (\t) or space (\s).

Some sites add space around words in strange and inconsistent ways. This expression helps you cut through those whitespace inconsistencies.

Examples

Name:~@whitespace@~~@NAME@~
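If you want to experiment with whitespace matching outside of screen-scraper, the Python sketch below uses a variant that is an assumption and not the built-in expression: a non-capturing group that treats the &nbsp; entity as a single unit alongside ordinary whitespace. The sample strings are made up.

```python
import re

# Assumed variant (not the built-in expression): match the whole &nbsp;
# entity or any whitespace character, as many times as they appear.
HTML_WS = r"(?:&nbsp;|\s)*"

# Rough equivalent of the extractor pattern Name:~@whitespace@~~@NAME@~
pattern = re.compile(r"Name:" + HTML_WS + r"([^<>]*)")

samples = ["Name:John", "Name: &nbsp;John", "Name:&nbsp;\t\n John"]
names = [pattern.search(s).group(1) for s in samples]
print(names)  # ['John', 'John', 'John']
```

A character class matches one character at a time, so alternation is the easier way to match a multi-character entity as a unit. Note that \s already covers tabs.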
HTML tag parameters [^>]*

The pattern will match any number (*) of characters that are not ([^]) a greater than (>).

This is used primarily to make extractor patterns that reference tags more stable when attributes are added, changed, or removed. Place this expression on a token, put a greater-than sign (>) after the token, and you will match all of the characters between the token and the end of the tag. If you're using a parameter as a hook for the extractor pattern, add a token with this expression both before and after it to get the same result.

Examples

<h2~@unneeded_parameters@~>

<a~@unneeded_parameters@~href="somelink.php"~@unneeded_parameters@~>
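The second example can be approximated in Python as shown below, which may help when testing how tolerant the pattern is. The sample tags are hypothetical.

```python
import re

# The built-in "HTML tag parameters" expression.
TAG_PARAMS = r"[^>]*"

# Rough equivalent of <a~@unneeded_parameters@~href="somelink.php"~@unneeded_parameters@~>
pattern = re.compile(
    r"<a" + TAG_PARAMS + r'href="somelink\.php"' + TAG_PARAMS + r">"
)

tags = [
    '<a href="somelink.php">',
    '<a class="nav" href="somelink.php" target="_blank">',
    '<a id="x" href="other.php">',
]
results = [bool(pattern.search(tag)) for tag in tags]
print(results)  # [True, True, False]
```

The first two tags match even though their attributes differ, while the third is rejected because the hook (the href value) is different.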
Non-HTML tags [^<>]*

Matches as many (*) characters as it can that are not ([^]) a less than (<) or greater than (>) sign.

Grabs all the text from a starting point until it reaches an HTML tag. This is helpful when you don't want to specify whether you are inside or outside a tag. The HTML tag parameters expression is specifically for the inside of a tag.

Examples

<h1~@unneeded_parameters@~>~@TITLE@~<
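Here is a short Python sketch of the example above, combining the tag parameters expression with the non-HTML tags expression. The sample headings are invented.

```python
import re

# Rough equivalent of <h1~@unneeded_parameters@~>~@TITLE@~<
TITLE = re.compile(r"<h1[^>]*>([^<>]*)<")

plain = '<h1 class="title" id="main">Helpful Regular Expressions</h1>'
nested = "<h1>Big <em>news</em></h1>"

print(TITLE.search(plain).group(1))   # Helpful Regular Expressions
print(repr(TITLE.search(nested).group(1)))  # 'Big '
```

The second result shows the limit of this expression: it stops at the first tag it meets, so text inside nested tags is not included.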
Non-double quotes [^"]*

Matches as many (*) characters as are available that are not ([^]) double-quotes (").

Great for extracting attribute values from tags.

Examples

<a href="~@LINK_URL@~">
Non-single quotes [^']*

Matches as many (*) characters as are available that are not ([^]) single-quotes/apostrophes (').

For extracting attribute values that are in single quotes instead of double.

Examples

<a href='~@LINK_URL@~'>
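When a page mixes both quoting styles, the two quote expressions can be used side by side. The Python sketch below assumes a made-up snippet containing one link of each kind.

```python
import re

# Rough equivalents of <a href="~@LINK_URL@~"> and <a href='~@LINK_URL@~'>
DOUBLE_QUOTED = re.compile(r'<a href="([^"]*)">')
SINGLE_QUOTED = re.compile(r"<a href='([^']*)'>")

html = """<a href="page.php?id=1">one</a> <a href='other.php'>two</a>"""

links = DOUBLE_QUOTED.findall(html) + SINGLE_QUOTED.findall(html)
print(links)  # ['page.php?id=1', 'other.php']
```

Each expression stops at its own closing quote, so a value may safely contain the other kind of quote.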
URL GET parameter [^&"]*

Similar to the expression just above, this pattern matches any number (*) of characters that are not ([^]) either an ampersand (&) or a double quote (").

Extracts individual GET parameters from a link on a page without having to parse the URL manually. The ampersand (&) delimits parameters, and a double quote ends the href attribute.

Examples

href="somepage.asp?passedparameter=2&another=~@PARAMETER_VALUE@~&evenmore=mary%20poppins"
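A minimal Python sketch of pulling single parameters out of the link above follows. The get_param helper is hypothetical, and decoding the value with urllib.parse.unquote is an extra step that the expression itself does not perform.

```python
import re
from urllib.parse import unquote

def get_param(html, name):
    """Return the decoded value of one GET parameter, or None if absent."""
    # [^&"]* stops at the next parameter (&) or at the end of the href (").
    m = re.search(re.escape(name) + r'=([^&"]*)', html)
    return unquote(m.group(1)) if m else None

html = ('<a href="somepage.asp?passedparameter=2&another=42'
        '&evenmore=mary%20poppins">')

print(get_param(html, "another"))   # 42
print(get_param(html, "evenmore"))  # mary poppins
```

The last parameter in the link works too, because the closing double quote ends the match where an ampersand otherwise would.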

Phone Numbers

7-digit phone number \d{3}[. -]*\d{4}

Matches three ({3}) digits (\d) followed by as many (*) dots/periods (.), spaces ( ), and hyphens (-) as are present then ending with four ({4}) digits (\d).

The flexibility of this expression makes it so that it can match 7-digit phone numbers in a wide variety of formats including such variations as 555-5236, 555 - 5236, 555.5236, 555 5236, and 5555236, or any combination of these. On the internet, 7-digit phone numbers tend to pop up less often than 10-digit ones, but regionalized sites will use them sometimes.

Examples

... for more information call ~@PHONE_NUMBER@~.
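The formats listed above can be checked quickly in Python, as sketched below. The digits_only helper is an assumption added to show one way of normalizing whatever format was matched.

```python
import re

# The built-in "7-digit phone number" expression.
PHONE7 = re.compile(r"\d{3}[. -]*\d{4}")

def digits_only(text):
    """Return the first 7-digit phone number in text as bare digits, or None."""
    m = PHONE7.search(text)
    return re.sub(r"\D", "", m.group(0)) if m else None

samples = ["555-5236", "555 - 5236", "555.5236", "555 5236", "5555236"]
print([digits_only(s) for s in samples])  # every entry is '5555236'

# The expression is loose: it also matches inside a longer run of digits.
print(digits_only("call 1234567890"))  # 1234567
```

The last call is a reminder that this expression will happily match the first seven digits of a longer number, so the surrounding text in your extractor pattern matters.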
10-digit phone number \(?\s*\d{3}[). -]*\d{3}[. -]*\d{4}

Matches an optional (?) left parenthesis (\(); as many (*) following spaces (\s) as are present; three ({3}) digits (\d); as many (*) right parentheses ()), dots/periods (.), spaces ( ), and hyphens (-) as are present; three ({3}) digits (\d); as many (*) dots/periods (.), spaces ( ), and hyphens (-) as are present; and finally four ({4}) digits (\d).

The flexibility of this expression makes it so that it can match 10-digit phone numbers in a wide variety of formats including such variations as (555) 555-5236, ( 555 ) 555-5236, 555.555.5236, (555) 555 - 5236, 555-555-5236, 555 555 5236, and 5555555236 or any combination of these.

Examples

... for more information call ~@PHONE_NUMBER@~.
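Since the expression accepts so many formats, it is often useful to normalize the result afterward. The Python sketch below shows one possible approach. The normalize helper and its output format are assumptions, not part of screen-scraper.

```python
import re

# The built-in "10-digit phone number" expression.
PHONE10 = re.compile(r"\(?\s*\d{3}[). -]*\d{3}[. -]*\d{4}")

def normalize(text):
    """Return the first 10-digit number in text as (555) 555-5236, or None."""
    m = PHONE10.search(text)
    if m is None:
        return None
    d = re.sub(r"\D", "", m.group(0))  # keep only the ten digits
    return f"({d[:3]}) {d[3:6]}-{d[6:]}"

samples = [
    "(555) 555-5236", "( 555 ) 555-5236", "555.555.5236",
    "(555) 555 - 5236", "555-555-5236", "555 555 5236", "5555555236",
]
print({normalize(s) for s in samples})  # {'(555) 555-5236'}
print(normalize("call 555-5236"))       # None (only seven digits)
```

Stripping the non-digits also takes care of any leading space that the \s* part may have pulled into the match.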

Addresses

State abbreviation [A-Z]{2}

Matches two ({2}) characters that are ([]) capital/uppercase letters (A-Z).

Usually when working with an address it is easier either to extract it in parts or to extract the whole thing and parse it afterward. This expression helps with the first method.

Examples

36 Mulberry Ln. Salt Lake City, ~@STATE@~ 84101
5-digit U.S. zip code \d{5}

Matches five ({5}) digits (\d).

Usually when working with an address it is easier either to extract it in parts or to extract the whole thing and parse it afterward. This expression helps with the first method.

Examples

36 Mulberry Ln. Salt Lake City, UT ~@ZIP@~
5/9-digit U.S. zip code \d{5}[-\d]{5}|\d{5}

Matches five ({5}) digits (\d) followed by five ({5}) characters that are each a hyphen (-) or a digit (\d), or (|) just five ({5}) digits (\d).

When zip codes are not consistently five or nine digits, this pattern will match either.

Examples

36 Mulberry Ln. Salt Lake City, UT ~@ZIP@~
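The state and zip code expressions are often used together on one address line, which the Python sketch below imitates. The state_and_zip helper and the STRICT_ZIP variant are assumptions added for comparison and are not built-in expressions.

```python
import re

STATE = r"[A-Z]{2}"              # built-in "State abbreviation"
ZIP = r"\d{5}[-\d]{5}|\d{5}"     # built-in "5/9-digit U.S. zip code"
STRICT_ZIP = r"\d{5}(?:-\d{4})?"  # assumed stricter variant, for comparison

# Parentheses around ZIP keep its alternation (|) from splitting the pattern.
ADDRESS = re.compile(r", (" + STATE + r") (" + ZIP + r")")

def state_and_zip(text):
    """Return (state, zip) from an address line, or None."""
    m = ADDRESS.search(text)
    return m.groups() if m else None

print(state_and_zip("36 Mulberry Ln. Salt Lake City, UT 84101"))
# ('UT', '84101')
print(state_and_zip("36 Mulberry Ln. Salt Lake City, UT 84101-1234"))
# ('UT', '84101-1234')
```

The built-in expression also accepts ten digits in a row, such as 8410112345, because any of the last five characters may be a digit. The stricter variant rejects that input, at the cost of being less forgiving.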

Other Helpful Expressions that are not Built-in

HTML Hexadecimal color [\da-fA-F]{3,6}

Matches three to six ({3,6}) hex characters ([\da-fA-F]). The range exists because, in HTML, the browser will expand a three-character code such as 4aF to 44aaFF. Although a four- or five-digit hex code is not valid in any format, the expression accepts them as a convenient way to catch both three- and six-digit codes. For those not familiar with hex numbers, they are base 16 numbers, so they use the ten base 10 digits (0-9) plus the first six letters (a-f) as their digits.

If you only want to allow 3- or 6-character hex values, you could use [\da-fA-F]{3}([\da-fA-F]{3})? instead.

Often you'll come across tables in your scrapes that use an alternating color scheme, so that every other row has a different color than the rest. While you could use a simple 'Non-double quotes' pattern to match it, you sometimes need to be more specific to keep from matching extraneous data on the page. A table may still use a color keyword, like "black" or "mintcream", so this won't be a fix-all solution. But if you know the color will be a hex number, you might as well use this pattern.

Examples

<table bgcolor="#~@HEX_NUMBER@~" width="600px">
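The difference between the loose and the strict hex expressions is easy to see in a short Python sketch. The expand helper, which doubles each character of a three-character code, is an assumption added to mirror what the browser does.

```python
import re

LOOSE = re.compile(r"[\da-fA-F]{3,6}")
STRICT = re.compile(r"[\da-fA-F]{3}([\da-fA-F]{3})?")

def expand(hex_color):
    """Expand shorthand such as 4aF to 44aaFF; leave other lengths alone."""
    if len(hex_color) == 3:
        return "".join(ch * 2 for ch in hex_color)
    return hex_color

print(bool(LOOSE.fullmatch("4aF1")))   # True  (four digits slip through)
print(bool(STRICT.fullmatch("4aF1")))  # False
print(expand("4aF"))                   # 44aaFF

# Rough equivalent of <table bgcolor="#~@HEX_NUMBER@~" ...> with the strict form.
html = '<table bgcolor="#ccddee" width="600px">'
m = re.search(r'<table bgcolor="#(' + STRICT.pattern + r')"', html)
print(m.group(1))                      # ccddee
```

Anchoring the pattern on the closing double quote is what makes the strict form reject four- and five-character values inside a real attribute.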
Match anything EXCEPT a given word between HTML tags. (?:(?!Foo).)[^><]*

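Matches text between HTML tags as long as it does not begin with the word Foo. The negative lookahead ((?!Foo)) is checked only at the first character, so text that contains Foo later on is still matched.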
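The behavior of the negative lookahead is easiest to understand by trying a few inputs. The Python sketch below wraps the expression in hypothetical <td> tags. The cell_text helper and the sample cells are made up for the example.

```python
import re

# The expression placed between <td> and </td>, as a token would be.
CELL = re.compile(r"<td>((?:(?!Foo).)[^><]*)</td>")

def cell_text(html):
    """Return the cell text if the expression matches, else None."""
    m = CELL.search(html)
    return m.group(1) if m else None

print(cell_text("<td>Bar</td>"))     # Bar
print(cell_text("<td>Foo</td>"))     # None
print(cell_text("<td>Food</td>"))    # None (also begins with Foo)
print(cell_text("<td>My Foo</td>"))  # My Foo (Foo is not at the start)
```

The lookahead is tested only once, at the first character, which is why the last cell still matches. An empty cell does not match either, because the expression requires at least one character.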

scraper on 07/16/2010 at 4:57 pm
