How to compensate for bad HTML coders in Details Page?
I'm trying to scrape video game details from GameFly. I need the game description, which sits inside <p> tags following an <h3> tag.
But sometimes there are extra tags before, inside, and after the description paragraph.
What expression should I use? In the Sub Extractor Pattern, I want to ignore every tag between the first <p> and the last </p>.
EX: Simplest Format::
<h3>Game Description</h3>
<p>~@DESCRIPTION@~</p>
<div
Screwy Formats::
<h3>Game Description</h3>
<p></p>
<p>~@DESCRIPTION@~</p>
<div

<h3>Game Description</h3>
<p><strong>blah blah</strong>~@DESCRIPTION@~</p>
<div

<h3>Game Description</h3>
<p>~@DESCRIPTION@~</p>
<br />
<br />
<div
Sometimes there is a combination of extra tags, even badly nested ones, such as <p><strong>foobar</p></strong>.
I swear I've tried every RegEx in the list, and even tried some of my own, but I don't really "get" them.
This is my first day with the free version, about 3 hours after finishing tutorials 1 and 2.
You're fighting unstructured data, and that is my bane. I would start with:
~@DATARECORD@~
<div
That captures everything up to the next <div as one data record. Then you can make any number of sub-extractors against that record to get your description. Sorry, I don't know a better way.
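If you ever end up doing this outside the extractor patterns, the same idea (grab everything between the heading and the next <div, then throw the tags away) can be sketched in a few lines. This is only a rough illustration in Python, not something screen-scraper runs for you. The function name extract_description is made up for the example, and it assumes the heading text is always "Game Description" and that the block always ends at the next <div, as in the samples above.

```python
import re
from html import unescape


def extract_description(html: str) -> str:
    """Return the text between the 'Game Description' heading and the next <div.

    Hypothetical helper: assumes the heading text and the trailing <div
    shown in the sample markup from this thread.
    """
    # Lazily capture everything after the heading, up to the next "<div".
    match = re.search(
        r"<h3>\s*Game Description\s*</h3>(.*?)<div",
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return ""
    block = match.group(1)
    # Replace every tag with a space; nesting is never inspected,
    # so badly nested tags are handled the same as well-formed ones.
    text = re.sub(r"<[^>]*>", " ", block)
    # Decode entities, then collapse the whitespace the removed tags left behind.
    return re.sub(r"\s+", " ", unescape(text)).strip()
```

Note that this keeps the text inside inline tags such as <strong>, so a leading "blah blah" would still be in the result and would need its own cleanup step if you don't want it.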
Thanks
I found a pattern in the unstructured data: it happens whenever the description has a "NOTE" attached to it. Different kinds of notes are coded differently, but each kind is coded the same way every time it appears.
I just decided to clean it up in post: I open the text file I write to in Notepad++ and run search-and-replace 3 times.
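For anyone who would rather script that cleanup than do it by hand, here is a minimal sketch in Python. The three patterns below are placeholders I made up for illustration (leftover tags, &nbsp; entities, runs of spaces), not the actual replacements I used, and clean_description is just a name for the example.

```python
import re

# Hypothetical (pattern, replacement) pairs, applied in order.
# Swap in whatever three replacements your own data needs.
REPLACEMENTS = [
    (r"<[^>]*>", ""),     # drop any leftover HTML tags
    (r"&nbsp;", " "),     # turn non-breaking-space entities into spaces
    (r"[ \t]{2,}", " "),  # collapse runs of spaces or tabs
]


def clean_description(text: str) -> str:
    """Apply each search-and-replace pass in turn and trim the result."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text.strip()
```

To run it over the output file, read the file into a string, pass it through clean_description, and write the result back out.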