Regular Expression - How to stop on punctuation
Hi,
First, the more I use this product, the better it seems. I'm now working on a site that has a tree four levels deep, and things are going great.
The only issue I'm having now is with my regular expression skills. I'm trying to extract the venue name from an address string, but different characters are used to either separate or terminate the part of the string I want. Any suggestions are welcome. I assume I need to capture anything that is an alphanumeric character or a space, then stop on anything else.
Something like? (I am guessing here)
The data can have different punctuation, or none at all, after the venue name
e.g.
I'm trying to capture 'the venue name'
Thank you
I'll try this in the morning. Don't worry about the postcode, this appears elsewhere on the page.
Thanks for the help
Alex
Regular Expression - How to stop on punctuation
Alex,
Regex can be tough but it's the key to making it work. So, I tested this scenario on the two lines you provided and it worked.
Main extractor Pattern text field:
<p class="ev3"><span id="lblVenue"~@DATARECORD@~/span></p>
Sub-extractor pattern:
>~@ELEMENT_ONE@~~@VARIOUS_SEPARATORS@~~@SPACE@~~@ELEMENT_TWO@~<
Sub-extractor pattern token's regular expressions:
ELEMENT_ONE and ELEMENT_TWO:
[A-Za-z0-9 ]*
VARIOUS_SEPARATORS:
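[,;]*
(add more characters inside the brackets if needed - inside [ ] the characters are simply listed with no separator; "|" only means "or" outside the brackets, and inside them it would match a literal "|")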
SPACE:
 *
(note the leading space: the expression is a single space character followed by *)
You should be able to add a third and fourth combination if needed. Notice the * at the end of each expression. It means "match zero or more times", so the pattern still matches when that part of the string is missing.
I included the closing and opening angle brackets (> and <) in the sub-extractor pattern just to give it something other than an alphanumeric character to anchor on, since alphanumerics are what we're trying to match in that first token.
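For anyone who wants to try the same idea outside the extractor, here is a rough Python sketch of the sub-extractor pattern above. It is only an illustration: the named groups stand in for the tokens, the function name split_venue is made up for this example, and it assumes the input is the captured DATARECORD text, beginning with ">" and ending with "<".

```python
import re

# Rough Python equivalent of the sub-extractor pattern.
# Named groups stand in for the ~@TOKEN@~ placeholders.
# Inside [...] characters are listed directly; a "|" there would be a literal pipe.
PATTERN = re.compile(
    r">(?P<element_one>[A-Za-z0-9 ]*)"   # ELEMENT_ONE: letters, digits, spaces
    r"(?P<separators>[,;]*)"             # VARIOUS_SEPARATORS: zero or more , or ;
    r" *"                                # SPACE: zero or more spaces
    r"(?P<element_two>[A-Za-z0-9 ]*)<"   # ELEMENT_TWO, then the closing "<"
)

def split_venue(fragment):
    """Return (element_one, element_two), or None if the fragment does not fit."""
    m = PATTERN.fullmatch(fragment)
    if m is None:
        return None
    return m.group("element_one").strip(), m.group("element_two").strip()

if __name__ == "__main__":
    print(split_venue(">Horniman Museum, 100 London Road Forest Hill London London SE23 3PQ<"))
    print(split_venue(">153 Great Titchfield Street; London W1W 5BD<"))
```

Note that when there is no separator at all, the first token is greedy and takes the whole string, leaving the second token empty.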
Now, how you're going to go about culling out the postal code is another story. US zips are nice and consistent but UK postal codes don't follow a very reliable pattern.
I hope this helps.
-Scott
Examples
Here are a couple of examples of the sort of data I'm looking to scrape. I can get most of the venue names by breaking at a comma (,) as per the first example, but some addresses break on another punctuation mark (; in this example).
Horniman Museum, 100 London Road Forest Hill London London SE23 3PQ
153 Great Titchfield Street; London W1W 5BD
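The "capture letters, digits and spaces, then stop on anything else" idea can be checked against these two lines with a few lines of Python. This is a minimal sketch, not the extractor's own syntax, and the function name venue_name is just a placeholder for the example.

```python
import re

# One or more letters, digits or spaces, anchored at the start of the string.
# Matching stops at the first character outside that set (",", ";", etc.).
VENUE = re.compile(r"[A-Za-z0-9 ]+")

def venue_name(address):
    """Return the leading run of alphanumerics/spaces, or None if there is none."""
    m = VENUE.match(address)
    return m.group(0).strip() if m else None

if __name__ == "__main__":
    for line in (
        "Horniman Museum, 100 London Road Forest Hill London London SE23 3PQ",
        "153 Great Titchfield Street; London W1W 5BD",
    ):
        print(venue_name(line))
```

A venue name that itself contains punctuation (an apostrophe or a full stop, say) would be cut short, so the character set may need extending for real data.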
Regular Expression - How to stop on punctuation
Alex,
It looks like you've got the right idea. The one possible problem I see is that when there is no punctuation, you'll need something else to key off of. Without a real-world example it's hard to know what your options are.
Is there anything consistent and unique about how every address begins? For example, would every address start with a number? If so, that would come close to providing a solution. The only exception would be when the venue name ends in a number, there is no punctuation separating the venue and the address, and the address also starts with a number.
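As a sketch of that fallback, assuming every address begins with a street number and there is no punctuation to split on, something like the following Python could work. The function name and the sample strings are invented for illustration, and the last test string shows the exceptional case described above.

```python
import re

# Venue: the shortest run of letters/digits/spaces (lazy +?),
# followed by spaces and then an address that starts with a digit.
NO_PUNCT = re.compile(r"(?P<venue>[A-Za-z0-9 ]+?) +(?P<address>[0-9].*)")

def split_on_street_number(text):
    """Split 'venue address' at the first word that starts with a digit."""
    m = NO_PUNCT.fullmatch(text)
    if m is None:
        return None
    return m.group("venue"), m.group("address")

if __name__ == "__main__":
    print(split_on_street_number("Horniman Museum 100 London Road"))
    # Exceptional case: the venue name itself ends in a number,
    # so the split lands one word too early.
    print(split_on_street_number("Studio 54 12 High Street"))
```

This only helps when the venue name contains no word beginning with a digit, which is why a second source to cross-check against is still worth having.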
Two possible alternatives to taking this approach would be...
1. Looking for another place on the site (preferably on this same page) where either the venue or the address are referred to that you could scrape and use as a comparison to test to see if your extracted elements are the right ones (and perhaps only use this test method for the one exceptional case above).
2. Use an outside resource to do the same thing as in #1. That is, take either the venue name or the address that you've scraped and use it as a search query on, say, an online Yellow Pages directory, which will effectively correct your query and give you back a string that you can use to work out the correct portions of a flawed extraction.
Scraping unstructured data is not an easy thing no matter how you slice it.
-Scott