problem with unwanted tags inside text extracted

Hi. I am scraping the posts in a discussion forum. Not the content, but the post titles, dates, users, etc. My extractor pattern looks roughly like this:

#~@POSTID@~
 
</td><td>
<a id="~@DUMMY@~" href="~@DUMMY@~">~@POSTTITLE@~</a>
</td><td style="white-space:nowrap;">
<a id="~@DUMMY@~" href="/boards/profilea.aspx?user=~@USERID@~">~@DUMMY@~</a>
</td><td style="white-space:nowrap;">~@DATEOFPOST@~</td>
</tr>

I am having trouble with the ~@POSTTITLE@~ token, because some post titles contain bold tags, and for those posts the extractor pattern doesn't produce a match. My post titles look like this, for example:

>This is one post title that you can find</a>

>This is another post title that you can find with something that is a must read in it</a>

So my problem is that I don't know what regexp to use so that I get the content of the post title whether or not it contains bold or other formatting tags.
None of the supplied regexes is suitable for this.
Which regex would be best here? I have tried (.*) but it doesn't work either :-(

cheers,
boga

bogavante on 10/30/2012 at 8:31 am

screen-scraper support for licensed users

In some cases it's okay to leave the regex blank. You just need to be sure the extractor can't get too greedy.
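To illustrate what "too greedy" means here, the following minimal sketch uses Python's re module rather than screen-scraper itself. The sample row and the variable names are made up for the illustration. A greedy wildcard runs on to the last closing anchor tag in the row, while a lazy one stops at the first.

```python
import re

# Hypothetical row, loosely modelled on the forum markup in the question.
row = (
    '<a id="t1" href="/boards/post.aspx?id=1">A <b>must read</b> title</a>'
    '</td><td style="white-space:nowrap;">'
    '<a id="u1" href="/boards/profilea.aspx?user=42">boga</a>'
)

# Greedy: .* grabs as much as possible, so it ends at the LAST </a>.
greedy = re.search(r'<a id="[^"]*" href="[^"]*">(.*)</a>', row).group(1)

# Lazy: .*? grabs as little as possible, so it ends at the FIRST </a>.
lazy = re.search(r'<a id="[^"]*" href="[^"]*">(.*?)</a>', row).group(1)

print(lazy)    # A <b>must read</b> title
print(greedy)  # runs past the title into the user column
```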

If you do need a regex, you can do something like:

[\w\s(</?b>)(</?strong>)]+

That would allow only word characters, whitespace, and the characters that make up the <b> and <strong> tags.
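One thing worth checking with a bracketed pattern like the one above: inside square brackets, most regex flavours treat parentheses and the letters of a tag name as individual characters rather than as groups, so the class also accepts a stray < or > on its own. If only whole tags should be allowed, an alternation is one option. The sketch below uses Python's re module for illustration only (it does not run inside screen-scraper), and the names TITLE_CLASS and TITLE_ALT and the sample strings are invented.

```python
import re

# Character-class form: every character listed is allowed on its own.
TITLE_CLASS = re.compile(r'[\w\s(</?b>)(</?strong>)]+')

# Alternation form: word/space characters, or a complete <b>/<strong> tag.
TITLE_ALT = re.compile(r'(?:[\w\s]|</?b>|</?strong>)+')

good = 'A <b>must read</b> title'
odd = 'A title with a stray > in it'

print(bool(TITLE_CLASS.fullmatch(good)), bool(TITLE_ALT.fullmatch(good)))  # True True
print(bool(TITLE_CLASS.fullmatch(odd)), bool(TITLE_ALT.fullmatch(odd)))    # True False
```

Neither form accepts titles containing punctuation such as commas or apostrophes, so the allowed character set may need widening for real forum titles.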

jason on 10/30/2012 at 9:45 am

I had tried leaving it blank and it was getting too greedy :-)

Thank you for the regex. I will try to modify it myself to also include the other tags.
cheers,
boga

bogavante on 10/30/2012 at 10:54 am
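Extending the pattern to further tags mostly means adding alternatives. As a rough sketch (in Python rather than screen-scraper, with the tag list and the helper names build_title_pattern and strip_tags being assumptions made for the example), the alternation can be generated from a list of tag names, and the allowed tags can be removed from the captured title afterwards:

```python
import re

# Assumed set of inline tags that may appear in a title; adjust as needed.
INLINE_TAGS = ["b", "strong", "i", "em", "u"]

def build_title_pattern(tags):
    """Return a regex allowing word/space characters or the given bare tags."""
    tag_alt = "|".join(re.escape(t) for t in tags)
    return re.compile(rf"(?:[\w\s]|</?(?:{tag_alt})>)+")

def strip_tags(title, tags):
    """Remove the given bare opening/closing tags from a captured title."""
    tag_alt = "|".join(re.escape(t) for t in tags)
    return re.sub(rf"</?(?:{tag_alt})>", "", title)

pattern = build_title_pattern(INLINE_TAGS)
raw = "Something that is a <i>must</i> <strong>read</strong>"
assert pattern.fullmatch(raw)
print(strip_tags(raw, INLINE_TAGS))  # Something that is a must read
```

Tags that carry attributes (for example a span with a style) would not match this pattern and would need a looser tag expression.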
