Regex Character Escapes

I have been using more and more regular expressions in scripts to alter scraped data. I have noticed that escaped characters are processed correctly in tokens in extractor patterns, but they do not seem to work when used in Interpreted Java scripts. Here is an example of what I mean:

If you double-click a token in a pattern, go to the Regular Expression tab, and enter "\d*", it will look for a sequence of digits in the pattern.

However, if you use the same regex in a script like this:
value = value.replaceAll("\d*", "0");

You will get an error in the log:
Token Parsing Error: Lexical error at line 18, column 44. Encountered: "d" (100), after : "\"\\".

This can be avoided by writing the character class in longhand:
value = value.replaceAll("[0-9]+", "0");

But there are many times when you need to use escaped characters like \d in a script. What is the regex engine that processes the script built on, and how can I make these work correctly?

Thanks,

Joel

jgardner on 02/18/2008 at 9:32 pm


Thanks Tim,

Shortly after posting I came to the same realization. It looks like regular expressions are evaluated first by screen-scraper and then by Java, so any time you use a backslash it has to be doubled. Screen-scraper will look at the string "\\d" and then pass "\d" on to Java.
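
For anyone who finds this thread later, here is a minimal sketch of the doubling in plain Java. It assumes nothing about screen-scraper's internals: in ordinary Java source the string literal is unescaped first, so "\\d+" reaches the regex engine as \d+. The class and method names are made up for illustration.

```java
class RegexEscapeDemo {
    // In Java source, the literal "\\d+" is the three-character regex \d+
    // (backslash, d, plus). Each run of digits is replaced with a single "0".
    static String replaceDigitRuns(String value) {
        return value.replaceAll("\\d+", "0");
    }

    public static void main(String[] args) {
        // The literal holds three characters once the compiler has unescaped it.
        System.out.println("\\d+".length());                  // prints 3
        System.out.println(replaceDigitRuns("abc123def45"));  // prints abc0def0
    }
}
```

Writing "\d+" with a single backslash is rejected before the regex engine ever sees it, because \d is not a valid escape inside a Java string literal.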

I appreciate the response,

Joel

jgardner on 02/22/2008 at 5:11 pm

Joel --

I'm new to screen-scraper, but while you're waiting for an official response you might try doubling the backslash preceding the 'd' so that it reads "\\d" rather than "\d". 100 is the decimal character code for lowercase 'd', so the error message appears to be simply annotating the character it encountered.

I don't know what platform you're working on, but in other languages I have needed as many as four backslashes to get the desired result on Windows. Generally two backslashes work on Unix.
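
One case where four backslashes do come up in Java is matching a literal backslash, for example in a Windows-style path. This is only a sketch of that case and an assumption about where the four-backslash experience comes from. The class name, method name, and sample path are hypothetical.

```java
class BackslashDemo {
    // The regex \\ matches one literal backslash. Written as a Java string
    // literal, each of those two backslashes is doubled again: "\\\\".
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        String windowsPath = "C:\\data\\out.txt";  // the text C:\data\out.txt
        System.out.println(toForwardSlashes(windowsPath));  // prints C:/data/out.txt
    }
}
```

The count depends on how many layers unescape the string before the regex engine sees it, rather than on the operating system as such.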

Hope this helps.

timz on 02/19/2008 at 12:01 pm
