To Ignore particular strings using RegEX

Hi everybody,
I have a site to scrape where I am using a token to extract the required ads.
In some records the data present is

"&nbsp;" or "<hr />"

so I need a regular expression that ignores the token when its value is

"&nbsp;" or "<hr />"

Please help me write a regular expression that gets all the required data except

"&nbsp;" or "<hr />"

Thank you,
Vivek.

vivek on 07/17/2008 at 1:32 am

screen-scraper public support

I feel your pain!

I recently tackled a project where the HTML was formatted extremely sporadically, and there were literally dozens of irregularities sprinkled through it.

To understand a little about WHY this is hard to do with extractor patterns, look at my recent post on this old thread: MORE REGEX PLEASE IM BRITISH (I didn't pick the title of the post :P It's found in the "suggestions" area of our forum.)

Assuming now that you've read my reply to that post, I can explain what you might do instead...

You might have to simply capture the region where the ads appear (i.e., broaden your extractor pattern so that it matches more of what surrounds your target, instead of just the target itself), and then call a script to process that token manually, using Java's String "replaceAll()" or the java.util.regex package, for instance.

At that point, you could do something like the following (which will likely need to be adapted to fit your situation... you understand it better than I):

import java.util.regex.*;

String token = dataRecord.get("REGION_AROUND_AD");

// If you choose to try the "replaceAll()" idea, you might do something like this, and then use some Java regex (similar to the next part of the example) to yank out your desired data
token = token.replaceAll("&nbsp;|<hr />", "");

// However, if the problem is more complex than that, you may have to implement your solution completely in Java regex...

// set up the regex pattern... notice the "(\\s|&nbsp;)*" and "(\\s|&nbsp;)?" parts.
// That matches for possible spaces, whether in literal " " form, or in "&nbsp;" form.
// Also notice the parentheses around the "(someURL)" part (a placeholder for your own URL sub-pattern). That's important.
// Also, notice the double-backslashes before an "s" (any literal quote inside the pattern needs a backslash before it too)...
// it's all part of the annoying fact that you need to escape the escape character. (Good luck there.)

Pattern p = Pattern.compile("(\\s|&nbsp;)*(someURL)(\\s|&nbsp;)?");

// associate the "token" variable with that Pattern, and store the object in a Matcher object called "m"
Matcher m = p.matcher(token);

// launch the matcher. If TRUE (it found a match), then process the data
if (m.find())
{
    // using "m.group(2)" here because "group(2)" represents the contents of the second group of parentheses that I used in the Pattern "p"
    String adURL = m.group(2);
    dataRecord.put("adURL", adURL); // put the data back into the dataRecord as "adURL"
}
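
If you want to try both ideas outside of screen-scraper first, here is a minimal, self-contained sketch. The class name, the sample HTML, and the href-based pattern are all made up for illustration; the real pattern depends on the markup around your ads, so treat this as a starting point rather than the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AdRegionDemo {
    // Idea 1: strip the two filler strings from the question ("&nbsp;" and "<hr />").
    static String stripFiller(String region) {
        return region.replaceAll("&nbsp;|<hr\\s*/?>", "").trim();
    }

    // Idea 2: pull a URL out of the region with Pattern/Matcher.
    // Assumption: the URL sits in an href attribute, possibly padded with
    // whitespace or "&nbsp;". Returns null when nothing matches.
    static String extractUrl(String region) {
        Pattern p = Pattern.compile("href=\"(\\s|&nbsp;)*([^\"]+?)(\\s|&nbsp;)*\"");
        Matcher m = p.matcher(region);
        return m.find() ? m.group(2) : null; // group(2) is the second pair of parentheses
    }

    public static void main(String[] args) {
        String region = "&nbsp;<hr /><a href=\"&nbsp;http://example.com/ad/1 \">Ad</a>&nbsp;";
        System.out.println(stripFiller("&nbsp;Blue bicycle<hr />"));
        System.out.println(extractUrl(region));
    }
}
```

Inside a screen-scraper script you would feed these the value from dataRecord.get(...) instead of the hard-coded sample string.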

Admittedly this is more complex: the Java regex process of using the Pattern and Matcher classes, and then the ".find()" and ".group(n)" methods, is simply harder to work with.
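
One thing that can make the group numbering less error-prone: Java regex also supports non-capturing groups, written "(?:...)", which do not get a group number. The sketch below compares the two styles on a made-up URL shape (the "http://..." sub-pattern and the class and method names are assumptions for illustration only).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Capturing filler groups: the URL is the second pair of parentheses, so group(2).
    static final Pattern CAPTURING =
        Pattern.compile("(\\s|&nbsp;)*(http://[^\\s&\"]+)(\\s|&nbsp;)?");

    // Same pattern with non-capturing filler groups: the URL becomes group(1).
    static final Pattern NON_CAPTURING =
        Pattern.compile("(?:\\s|&nbsp;)*(http://[^\\s&\"]+)(?:\\s|&nbsp;)?");

    // Runs find() once and returns the requested group, or null if there is no match.
    static String firstGroup(Pattern p, String input, int group) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(group) : null;
    }
}
```

Either way, group(0) is always the entire matched text, filler included, which is why you ask for a numbered group instead.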

You'll have to read up on the regex part of the Java API if you need more specific help with the regex syntax, but what I showed you covers the basics.
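
If all you need is to skip the records whose token is nothing but "&nbsp;" or "<hr />", a simpler route may be to test each value and drop the filler ones. This is only a sketch under that assumption: the class and method names are hypothetical, and the List stands in for however you collect your token values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FillerFilter {
    // Matches a value made up only of whitespace, "&nbsp;" and "<hr />", in any mix.
    private static final Pattern FILLER_ONLY =
        Pattern.compile("(?:\\s|&nbsp;|<hr\\s*/?>)*", Pattern.CASE_INSENSITIVE);

    // True when the value carries no real data (null, empty, or filler only).
    static boolean isFiller(String value) {
        return value == null || FILLER_ONLY.matcher(value).matches();
    }

    // Returns a new list containing only the values that hold real data.
    static List<String> keepRealValues(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (!isFiller(v)) {
                kept.add(v);
            }
        }
        return kept;
    }
}
```

Note the use of matches() rather than find(): matches() only succeeds when the whole value is filler, so a real ad that merely contains an "&nbsp;" somewhere is still kept.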

timv on 07/17/2008 at 11:14 am
