How can I match a token conditionally, or pre-parse the page to remove certain noise words so that my extractor pattern matches?

This is my extractor pattern:

<td class="prd-select"><input type="checkbox" value="~@SKU@~" name="aComparedProducts[]" /> </td>
<td class="prd-img"><a href="~@COMPAREPRODUCTURL@~"><img src="~@PRODUCTIMG@~" alt="~@IGNORE2@~" height="~@HEIGHT@~" width="~@WIDTH@~" /></a> </td>
<td class="prd-details">
<a href="~@PRODUCTURL@~">LG ~@MODELNO@~ ~@IGNORE2@~</a>

~@PRODUCTTITLE@~

<img src="~@RESERVEANDCOLLECTIMG@~" alt="~@RESERVEANDCOLLECTALT@~" />

<img src="~@HOMEDELIVERYIMG@~" alt="~@INSTOCK@~" />
</td>
<td class="prd-amount-details">
~@PRICE@~~@IGNORE3@~

<a href="~@MOREINFOLINK@~">More information</a>
</td>
</tr>

My issue is that occasionally the link text in the ~@PRODUCTURL@~ anchor varies, and I want to do a replace on it before I parse it. For example, here are two different rows:

LG 19LD350 19" HD Ready LCD TV
LG Flatron M2262D 22" Full HD LCD TV

I am trying to get the model number. This is the only place on the page where it is shown. It usually comes right after "LG", but sometimes there is another word in between, such as Infinia or Flatron. How can I remove this optional word (e.g., Flatron) so that the pattern picks up M2262D? Or can a clever regular expression do this?

JulianGuppy on 08/20/2010 at 1:02 pm

screen-scraper public support

Tricky Regex Work Arounds

I have found that there are two options, and each has its benefits. The first is to have the extractor pattern get the information for you. This is only possible if there is some part of the pattern that you can key off of to tell whether there is an extra word.

In the two examples that you gave there is. There is always the LG, a possible word, the model number and then a number followed by a double quote (e.g., LG Flatron M2262D 22"). If this is always the case you can use something like:

>LG~@extra_word@~ ~@MODELNO@~ ~@number@~" ~@IGNORE2@~<

You would then give ~@extra_word@~ the non-HTML regex, and it will get what it needs as long as there are not two numbers followed by double quotes in the product name. That could cause it to get more information than it should. You could use a non-space regex ([^ ]* or [^\s]*) if it is always a single word that gets inserted, but then multiple words become a problem. Basically, you have to look for a pattern that more precisely defines the elements.

The non-HTML regex can match nothing, so if there isn't a word it will not have problems. If there is a word, the spaces in the pattern and the double quote will ensure that it doesn't go beyond where it should (as long as there is only one double quote and the number always follows the model number).
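To see how that pattern behaves on the two rows from the question, here is a rough equivalent written as a plain Python regular expression. It is only a sketch and not screen-scraper's own matching engine: the non-HTML regex is assumed to be [^<>]*, and the \S+ and \d+ sub-expressions for ~@MODELNO@~ and ~@number@~ are my own assumptions, as is the function name.

```python
import re

# Rough stand-in for: >LG~@extra_word@~ ~@MODELNO@~ ~@number@~" ~@IGNORE2@~<
# extra_word uses a non-HTML class and may match nothing at all.
LINK_TEXT = re.compile(
    r'>LG(?P<extra_word>[^<>]*?) (?P<MODELNO>\S+) (?P<number>\d+)" (?P<IGNORE2>[^<>]*)<'
)

def parse_link_text(html):
    """Return the captured tokens as a dict, or None if the pattern fails."""
    match = LINK_TEXT.search(html)
    if match is None:
        return None
    # Strip the leading space that extra_word picks up when a word is present.
    return {name: value.strip() for name, value in match.groupdict().items()}

rows = [
    '<a href="/p/1">LG 19LD350 19" HD Ready LCD TV</a>',
    '<a href="/p/2">LG Flatron M2262D 22" Full HD LCD TV</a>',
]
for row in rows:
    print(parse_link_text(row))
```

The number followed by a double quote is what anchors the match, so a product name without a screen size will not match at all.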

The other option is to extract the whole section and apply a regular expression in your script to strip out the parts. This allows for more logic to be applied including different regular expressions if one doesn't give you the expected results. One expression would be essentially a composite of the various extractor tokens in the current pattern and you could have as many tests as you want until you get something that is reliable for you.
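A minimal sketch of that scripted approach, written in Python purely for illustration. The noise-word list, the fallback patterns, and the function name are all assumptions; you would adapt them to the product names you actually see and port the idea to whichever scripting language your scrape uses.

```python
import re

# Assumed noise-word list; extend it with whatever extra words the site uses.
NOISE_WORDS = ["Flatron", "Infinia"]
NOISE_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, NOISE_WORDS)) + r")\b\s*",
    re.IGNORECASE,
)

# Patterns tried in order until one matches; each captures the model number.
MODEL_PATTERNS = [
    re.compile(r'^LG (\S+) \d+"'),  # LG <model> <size>"
    re.compile(r'(\S+) \d+"'),      # fallback: the word just before the size
]

def model_from_title(title):
    """Strip noise words, then return the first model number a pattern finds."""
    cleaned = NOISE_RE.sub("", title).strip()
    for pattern in MODEL_PATTERNS:
        match = pattern.search(cleaned)
        if match:
            return match.group(1)
    return None

print(model_from_title('LG 19LD350 19" HD Ready LCD TV'))
print(model_from_title('LG Flatron M2262D 22" Full HD LCD TV'))
```

Here the whole link text would be extracted with a single token, and the script decides afterwards which part is the model number.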

I tend to favor the first. If there is a pattern that I can isolate, I can avoid the scripting of my regular expressions. Sometimes there are just too many variables for the extractor pattern to be adequate and so I have to do it, but I try to avoid it even though it can handle outliers a lot better.

tylers on 08/20/2010 at 2:18 pm

regex wont do what i need

The point is that the extra word is only there occasionally, and I don't see how I can use a regex, because there are not two numbers followed by double quotes in every one. I also can't see how what you are suggesting would work, because there are multiple words and/or multiple numbers, and in different sequences. However, if I can remove the words "Infinia" and "Flatron" from the page (and maybe others as noise words), then maybe I can make this work. How can I remove a list of chosen noise words from my text before it is matched?

JulianGuppy on 08/21/2010 at 1:04 am

That's a big, hairy extractor

That's a big, hairy extractor pattern. I think for that, it would be best to use a DATARECORD token, so it would look like:

<td class="prd-select">
~@DATARECORD@~
</tr>

Then use sub-extractors to match each part. That way, if there is an extra bit of HTML in there, or something is missing, whatever is there will still match. That should also help take care of this issue.
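The idea can be sketched outside of screen-scraper as an outer pattern that grabs each row and small independent patterns applied to that row. This is a Python illustration only, not screen-scraper's API; the three sub-patterns and the function name are simplified assumptions based on the HTML in the original pattern.

```python
import re

# Outer pattern: one match per product row (plays the role of ~@DATARECORD@~).
RECORD = re.compile(r'<td class="prd-select">(.*?)</tr>', re.DOTALL)

# Sub-patterns, each applied to the row on its own (assumed, simplified).
SUB_PATTERNS = {
    "SKU": re.compile(r'type="checkbox" value="([^"]*)"'),
    "PRODUCTURL": re.compile(r'<td class="prd-details">\s*<a href="([^"]*)"'),
    "MOREINFOLINK": re.compile(r'<a href="([^"]*)">More information</a>'),
}

def extract_records(page):
    """Return one dict per row; a sub-pattern that fails yields None."""
    records = []
    for row in RECORD.finditer(page):
        body = row.group(1)
        record = {}
        for name, pattern in SUB_PATTERNS.items():
            match = pattern.search(body)
            record[name] = match.group(1) if match else None
        records.append(record)
    return records
```

Because each sub-pattern is matched separately, a row that lacks the "More information" link still yields its SKU and product URL instead of failing as a whole.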

jason on 08/23/2010 at 9:23 am
