RegEx for UK Phone Numbers

Hi,

I need to extract a UK phone number from a random place in the HTML response (there is no real pattern I can use around the number). I've managed to do this with email addresses, but I'm having no luck with UK phone numbers.

I'm basically using

region>~@IGNORE@~ ~@PHONE_NUMBER@~ ~@IGNORE@~

then trying different RegEx formats for the number but with little success.

e.g.

^0\d{2,4}[ -]{1}[\d]{3}[\d -]{1}[\d -]{1}[\d]{1,4}$

or

^(((\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3})|((\+44\s?\d{3}|\(?0\d{3}\)?)\s?\d{3}\s?\d{4})|((\+44\s?\d{2}|\(?0\d{2}\)?)\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$

from something like

and holidays abroad. This is just a small selection of the activities on offer. Contact Mark for details 020 3361 4827 some more text

or phone numbers like 01570 834971

or 07469 367 483

Where am I going wrong?

Thanks

RegEx for UK Phone Numbers

The reason ~@IGNORE@~ has been deprecated is that it tends to match too much. It works like magic sometimes and causes problems at other times, and the trouble is that you don't always know it's the cause of the problem.

As a quick test, I made a scraping session that scraped this forum thread, using a single extractor pattern with one token:

~@UK_PHONE_REGEX@~

regular expression:

[0-9 ]{11,13}
This says: match any string of digits or spaces that is at least 11 characters long and no more than 13 characters long. This assumes that you won't encounter any occasions where someone uses three spaces in the number, like:

01 570 834 971

And if they do, just up the 13 to 14.
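
For anyone trying the same idea outside of screen-scraper, here is a minimal sketch in plain Java. One adjustment is my own assumption and not part of the test described above: the match must start with 0 and end with a digit. With plain java.util.regex, the bare [0-9 ]{11,13} counts a leading space toward the 13 characters, so on text like "details 020 3361 4827 some" it would stop one digit short. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UkPhoneFinder {
    // Loose pattern: a leading 0, then 9 to 11 digits or spaces, then a final
    // digit, giving 11 to 13 characters in total. This is an assumed variant
    // of [0-9 ]{11,13}, not a validator for real UK numbering rules.
    private static final Pattern LOOSE_PHONE =
            Pattern.compile("0[0-9 ]{9,11}[0-9]");

    // Returns every substring of the text that looks like a phone number.
    static List<String> findCandidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LOOSE_PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Contact Mark for details 020 3361 4827 some more text";
        System.out.println(findCandidates(text));
    }
}
```

Changing {9,11} to {9,12} would allow a 14-character number such as 01 570 834 971.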

The [0-9 ]{11,13} expression matched the three examples you gave above. You'll need to clean up the result when you're done, using the Java methods:

replaceAll()
[url]http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#replaceAll(java.lang.String,%20java.lang.String)[/url]

trim()
[url]http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#trim()[/url]
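
A minimal sketch of that cleanup, assuming the extracted value arrives as a plain Java String with stray spaces around or inside it. The class and method names are hypothetical; only String.replaceAll() and String.trim() come from the links above.

```java
class PhoneCleaner {
    // Collapse any run of whitespace into one space, then strip the ends.
    static String tidy(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    // Drop everything that is not a digit, leaving a bare number string.
    static String digitsOnly(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    public static void main(String[] args) {
        String raw = " 020  3361 4827 ";
        System.out.println(tidy(raw));       // 020 3361 4827
        System.out.println(digitsOnly(raw)); // 02033614827
    }
}
```

Which form you keep depends on whether you want the number as displayed or in a normalized form for storage.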

-Scott

RegEx for UK Phone Numbers

Ok. One of the forum posts is what I figured. Sorry for stalling.

So, the first thing you'll want to do is replace the ~@IGNORE@~ tags with something else and use an appropriate regular expression. One technique I like to use is, for example, ~@non_html@~ where the regular expression is non-html tags.
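
To illustrate what a "non-html" expression might look like, here is a rough sketch in plain Java. The expression [^<>]* is my assumption of a typical choice: it matches any run of characters that contains no angle brackets, so it cannot run past a tag boundary the way a match-anything token can. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonHtmlDemo {
    // Assumed stand-in for a ~@non_html@~ token: text between a closing '>'
    // and the next '<', with no angle brackets inside it.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">([^<>]*)<");

    // Returns the first run of text between two tags, or null if none exists.
    static String firstTextRun(String html) {
        Matcher m = BETWEEN_TAGS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p>Contact Mark 020 3361 4827</p><p>more</p>";
        System.out.println(firstTextRun(html)); // Contact Mark 020 3361 4827
    }
}
```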

Try it now with the ~@IGNORE@~ tags gone. I would experiment with you, but I'd need to know the HTML that's behind the former ~@IGNORE@~ tags.

-Scott

IGNORE

It may be deprecated but it is very handy.

I think I found it in one of the forum posts or in the tutorials; sorry I cannot be more specific.

RegEx for UK Phone Numbers

Alex,

I'm sorry. Before I can answer your question I've got to know where you learned to use the ~@IGNORE@~ tag. We're trying to weed it out since it's been deprecated.

-Scott