Java Regular Expression Help

Escaping Characters

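Java regular expressions use the same escape character as Perl regular expressions, the backslash. Because the backslash is also the escape character in Java string literals, a special character has to be escaped once for the regex and again for the Java string. This can be a little confusing, so here are some examples.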

The replaceAll method is a String method in Java that uses a Perl-style regex to find matches. The first parameter is the regex, and the second parameter is the text that each match is replaced with.

Unless noted otherwise, these examples replace a character with itself. The purpose is only to show what the regex would look like.

// match a single \ (as a literal character, not as an escape character)
// it has to be escaped for the regex, so \\
// then both backslashes have to be escaped in the Java string, so \\\\
// the replacement string also treats \ as an escape character, so it needs \\\\ as well
value = value.replaceAll("\\\\", "\\\\");

// match a * (as a literal character, not as a quantifier)
value = value.replaceAll("\\*", "*");

// match a ? (as a literal character, not as a quantifier)
value = value.replaceAll("\\?", "?");

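// match a " without ending the Java string literal that holds the regex
// (the backslash here is for Java only, the regex needs no escape for a quote)
// This one replaces it with a single quote, which needs no escape in a Java string
value = value.replaceAll("\"", "'");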

// match a | (as a literal character, not as the OR operator)
value = value.replaceAll("\\|", "|");
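When the text to match or the replacement text comes from a variable, escaping each special character by hand is easy to get wrong. The sketch below is not part of the original examples. It uses two standard helpers, Pattern.quote and Matcher.quoteReplacement, so that both strings are treated literally. The class name LiteralReplace and the method replaceLiteral are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are for illustration only.
class LiteralReplace {

    // Replaces every occurrence of 'target' with 'replacement', treating both
    // as plain text rather than as a regex and a replacement template.
    static String replaceLiteral(String value, String target, String replacement) {
        return value.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // The Java literal "C:\\temp\\file" is the text C:\temp\file
        String path = "C:\\temp\\file";
        System.out.println(replaceLiteral(path, "\\", "/"));   // C:/temp/file
        System.out.println(replaceLiteral("2*3", "*", "\\*")); // 2\*3
    }
}
```

If no regex features are needed at all, String.replace(CharSequence, CharSequence) performs the same kind of literal replacement without any escaping.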

Using Groups

When extracting a complex piece of data such as an address, it is sometimes easier to extract the whole thing and then break it down with regular expressions in your scripts. This lets you take advantage of some of the finer features of regular expressions. This example shows how to take an extracted address and break it into its parts.

// Import Java regex
import java.util.regex.*;

String address = "";
String apartment = "";

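// Backslashes must be doubled for the Java regex to receive them.
// In this pattern, we're making use of both grouping and the OR bar "|".
// The lazy "+?" and the "\\s*" let it handle a comma followed by a space, as in "12 Main St, Apt 4".
Pattern p = Pattern.compile("(\\d+[\\w\\s]+?),?\\s*(Apt|#|Suite)\\s*(\\d+)");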
Matcher m = p.matcher(dataRecord.get("ADDRESS_LINE"));

// Begins the matching process and tests whether a match was found
if (m.find()) {
    address = m.group(1); // Street number and street name
    apartment = m.group(3); // Apartment or suite number

    // We skipped 'm.group(2)' because it refers to the '(Apt|#|Suite)' part, which isn't as relevant.
    // If you want to keep the 'Apt' or 'Suite' prefix, use the following line instead:
    // apartment = m.group(2) + " " + m.group(3);
}

// Places the modified values back into the dataRecord
dataRecord.put("ADDRESS", address);
dataRecord.put("APARTMENT_NUMBER", apartment);

You can play with the pattern. The basic idea is that each group, defined by a pair of parentheses, can be selected with the group method. This lets you get at one part of what was matched, instead of the all-or-nothing result that extractor tokens produce by their very nature.
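To experiment with grouping outside of a script, a standalone sketch like the following may help. It takes a plain String in place of dataRecord, and the class AddressSplitter and its split method are hypothetical names. The pattern is a variant that uses named groups (available since Java 7) and assumes the street and unit may be separated by an optional comma and optional whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone class for trying the pattern outside a script.
class AddressSplitter {

    // Named groups let the code refer to each part by name instead of by number.
    // The optional comma and whitespace between street and unit are an
    // assumption about the input format.
    private static final Pattern ADDRESS = Pattern.compile(
        "(?<street>\\d+[\\w\\s]+?),?\\s*(?<unit>Apt|#|Suite)\\s*(?<number>\\d+)");

    // Returns {street, unit type, unit number}, or null when nothing matches.
    static String[] split(String line) {
        Matcher m = ADDRESS.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("street"), m.group("unit"), m.group("number") };
    }

    public static void main(String[] args) {
        String[] parts = split("123 Main St, Apt 4");
        System.out.println(parts[0]);                  // 123 Main St
        System.out.println(parts[1] + " " + parts[2]); // Apt 4
    }
}
```

Returning null for a non-match keeps the sketch short. In a script you might prefer to leave the original values untouched when find() returns false.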