How to compensate for bad HTML coders in Details Page?

I'm trying to scrape video game details from GameFly. I need the game description, which sits inside <p> tags following an <h3> tag.

But sometimes there are extra tags before, inside, and after the description paragraph.

What expression should I use? I want to ignore all tags between the first <p> and the last </p> in the Sub Extractor Pattern.

EX: Simplest format:

<h3>Game Description</h3>
<p>~@DESCRIPTION@~</p>
<div

Screwy formats:

<h3>Game Description</h3>
<p></p>
<p>~@DESCRIPTION@~</p>
<div

<h3>Game Description</h3>
<p><strong>blah blah</strong>~@DESCRIPTION@~</p>
<div

<h3>Game Description</h3>
<p>~@DESCRIPTION@~</p>
<br />
<br />
<div

Sometimes there is a combination of extra tags, even badly nested tags, such as <p><strong>foobar</p></strong>.

I swear I've tried every RegEx in the list, and even tried some of my own, but I don't really "get" them.

This is my first day with the free version, three hours after doing tutorials 1 and 2.

Panda on 10/26/2010 at 6:27 pm

screen-scraper public support

You're fighting unstructured data, and that is my bane. I would start with:

<h3>Game Description</h3>
~@DATARECORD@~
<div

Then you can make any number of sub-extractors to pull the description out of that captured block. Sorry, I don't know a better way.
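
For anyone who wants to see the same two-step idea outside screen-scraper, here is a minimal sketch in Python. It assumes the page HTML is already in a string, captures everything between the heading and the next "<div", and then drops the tags from that block. The function name extract_description and the sample markup are made up for illustration; they are not part of screen-scraper.

```python
import html
import re

# Capture everything between the heading and the next "<div", lazily,
# mirroring the ~@DATARECORD@~ token in the pattern above.
BLOCK_RE = re.compile(
    r"<h3>Game Description</h3>(.*?)<div",
    re.DOTALL | re.IGNORECASE,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_description(page_html):
    """Return the description text with all tags removed, or None."""
    match = BLOCK_RE.search(page_html)
    if match is None:
        return None
    # Replace each tag with a space so words on either side don't fuse,
    # then decode entities and collapse runs of whitespace.
    text = TAG_RE.sub(" ", match.group(1))
    text = html.unescape(text)
    return " ".join(text.split())


# Hypothetical samples modeled on the formats in the question.
samples = [
    "<h3>Game Description</h3>\n<p>A fun game.</p>\n<div",
    "<h3>Game Description</h3>\n<p></p>\n<p>A fun game.</p>\n<div",
    "<h3>Game Description</h3>\n<p>A fun game.</p>\n<br />\n<br />\n<div",
    "<h3>Game Description</h3>\n<p><strong>NOTE:</p></strong> A fun game.\n<div",
]
for sample in samples:
    print(extract_description(sample))
```

Because the tags are simply deleted, badly nested markup doesn't matter. The catch is that text inside inline tags (a bolded note, for example) stays in the result, so removing that would need a further rule.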

jason on 10/27/2010 at 8:59 am

Thanks

I found a pattern in the unstructured data. It shows up whenever the description has a "NOTE" attached to it. Certain notes use one kind of markup and others a different kind, but each kind of note is always marked up the same way.

I just decided to clean it up in post. (I open the text file I write to in Notepad++ and apply search-and-replace three times.)
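
The same cleanup could probably be scripted instead of done by hand. A rough sketch in Python, assuming the scraped text is in a string; the three patterns are placeholders only, since the actual note formats aren't shown in this thread.

```python
import re

# Hypothetical (pattern, replacement) pairs standing in for the three
# manual search-and-replace passes; the real note formats are not known.
REPLACEMENTS = [
    (re.compile(r"<strong>NOTE:.*?</strong>", re.DOTALL), ""),
    (re.compile(r"<br\s*/?>"), ""),
    (re.compile(r"</?p>"), ""),
]


def clean_description(text):
    """Apply each replacement in order and trim surrounding whitespace."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text.strip()


print(clean_description("<p><strong>NOTE: Online only.</strong>A fun game.</p><br />"))
```

The order of the pairs matters: the note is removed first, while its surrounding tags are still there to match against.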

Panda on 10/28/2010 at 8:35 am
