How can I match a token conditionally, or pre-parse the page to remove certain noise words, so that my extractor pattern matches?
This is my extractor pattern:
<td class="prd-select"><input type="checkbox" value="~@SKU@~" name="aComparedProducts[]" /> </td>
<td class="prd-img"><a href="~@COMPAREPRODUCTURL@~"><img src="~@PRODUCTIMG@~" alt="~@IGNORE2@~" height="~@HEIGHT@~" width="~@WIDTH@~" /></a> </td>
<td class="prd-details">
<p class="prd-name"><strong><a href="~@PRODUCTURL@~">LG ~@MODELNO@~ ~@IGNORE2@~</a></strong></p>
<p class="prd-description">~@PRODUCTTITLE@~</p>
<p class="prd-services"><img src="~@RESERVEANDCOLLECTIMG@~" alt="~@RESERVEANDCOLLECTALT@~" /></p>
<p class="prd-services"><img src="~@HOMEDELIVERYIMG@~" alt="~@INSTOCK@~" /></p>
</td>
<td class="prd-amount-details">
<p class="prd-amount"><strong>~@PRICE@~</strong>~@IGNORE3@~
<p class="prd-more-info"><a href="~@MOREINFOLINK@~">More information</a></p>
</td>
</tr>
My issue is that the link text after ~@PRODUCTURL@~ occasionally varies, and I want to do a replace on it before I parse it. For example, here are two different rows:
LG 19LD350 19" HD Ready LCD TV
LG Flatron M2262D 22" Full HD LCD TV
I am trying to get the model number. This is the only place on the page where it is shown. It usually comes right after LG, but sometimes there is another word in between, such as Infinia or Flatron. How can I remove this optional word so that the pattern picks up M2262D? Or can a clever regular expression do this?
Tricky Regex Work Arounds
I have found that there are two options, and each has its benefits. The first is to have the extractor pattern get the information for you. This is only possible if there is some part of the pattern that you can key off of to tell whether there is an extra word.
In the two examples that you gave there is. There is always the LG, a possible word, the model number and then a number followed by a double quote (e.g., LG Flatron M2262D 22"). If this is always the case you can use something like:
You would then give ~@extra_word@~ the non-HTML regex, and it will get what it needs as long as there are not two numbers followed by double quotes in the product name. That could cause it to capture more than it should. You could use a non-space regex ([^ ] or [^\s]) if it is always a single word that gets inserted, but then multiple words become a problem. Basically, you have to look for a pattern that defines the elements more precisely.
The non-HTML regex can match nothing, so if there isn't an extra word it will not cause problems. If there is one, the spaces in the pattern and the double quote will ensure that the match doesn't go beyond where it should (as long as there is only one double quote and it always follows the number).
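To try the idea outside the scraper, here is a minimal Python sketch of the same logic as an ordinary regular expression. It is not the extractor pattern itself: the names `NAME_RE` and `parse_name` are made up for illustration, and it assumes the product name always starts with LG and that the model number is the single word right before the screen size and its double quote.

```python
import re

# Assumed shape of the name: LG [optional extra words] MODEL SIZE" ...
NAME_RE = re.compile(
    r'^LG\s+'
    r'(?:(?P<extra>[^<>]*?)\s+)?'  # optional extra word(s), non-HTML characters only
    r'(?P<model>\S+)\s+'           # model number: the word right before the size
    r'\d+"'                        # screen size followed by a double quote
)

def parse_name(name):
    """Return (extra_word, model), or None if the name does not fit the pattern."""
    m = NAME_RE.search(name)
    if m is None:
        return None
    return m.group("extra"), m.group("model")

if __name__ == "__main__":
    for name in ('LG 19LD350 19" HD Ready LCD TV',
                 'LG Flatron M2262D 22" Full HD LCD TV'):
        print(parse_name(name))
```

Because the optional group is lazy, it takes as little text as possible, so the model is always the word immediately before the first number-plus-quote that lets the whole expression match.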
The other option is to extract the whole section and apply a regular expression in your script to strip out the parts. This allows for more logic to be applied including different regular expressions if one doesn't give you the expected results. One expression would be essentially a composite of the various extractor tokens in the current pattern and you could have as many tests as you want until you get something that is reliable for you.
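As a rough illustration of this second option, the sketch below takes the whole product name as a string, strips a list of noise words, and then tries several expressions in turn. It is plain Python and does not use any scraper API. The noise-word list, the fallback order, and the function names are all assumptions for the example.

```python
import re

# Hypothetical noise-word list; extend it as new brand words show up.
NOISE_WORDS = ("Flatron", "Infinia")

# Tried in order; the first expression that matches wins.
MODEL_PATTERNS = (
    re.compile(r'^LG\s+(\S+)\s+\d+"'),  # model directly before the screen size
    re.compile(r'^LG\s+(\S+)'),         # fallback: first word after LG
)

def strip_noise(name, noise_words=NOISE_WORDS):
    """Remove whole noise words (case-insensitive) and collapse extra spaces."""
    alternation = "|".join(re.escape(w) for w in noise_words)
    cleaned = re.sub(r"\b(?:%s)\b" % alternation, "", name, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()

def extract_model(name):
    """Return the model number, or None if no expression matches."""
    cleaned = strip_noise(name)
    for pattern in MODEL_PATTERNS:
        m = pattern.search(cleaned)
        if m:
            return m.group(1)
    return None
```

Adding another test is just a matter of appending an expression to `MODEL_PATTERNS`, which is where this approach handles outliers better than a single extractor pattern.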
I tend to favor the first. If there is a pattern that I can isolate, I can avoid the scripting of my regular expressions. Sometimes there are just too many variables for the extractor pattern to be adequate and so I have to do it, but I try to avoid it even though it can handle outliers a lot better.
Regex won't do what I need
The point is that the extra word is only there occasionally, and I don't see how I can use a regex, because there are not two numbers followed by double quotes in every name. I can't see how what you are suggesting would work, because there are multiple words and/or multiple numbers, in different sequences. However, if I can remove the words "Infinia" and "Flatron" from the page (and maybe others, as noise words), then maybe I can make this work. How can I remove a list of chosen noise words from my text before it is matched?
That's a big, hairy extractor
That's a big, hairy extractor pattern. I think for that, it would be best to use a DATARECORD token, so it would look like:
~@DATARECORD@~
</tr>
Then use sub-extractor patterns to match each part. That way, if there is an extra bit of HTML in there, or something is missing, whatever is there will still match. That should also help take care of this issue.
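The same two-stage idea can be sketched in plain Python: first grab each row as one block, then run small independent expressions against that block. This only mimics the approach and is not how the scraper implements sub-extractors. It assumes each product sits in its own `<tr>` ... `</tr>` row, and the three fields shown are a hypothetical subset.

```python
import re

# Assumption: each product is wrapped in its own <tr> ... </tr> row.
RECORD_RE = re.compile(r'<tr\b[^>]*>(.*?)</tr>', re.DOTALL)

# One small, independent expression per field (illustrative subset).
SUB_PATTERNS = {
    "SKU": re.compile(r'<input type="checkbox" value="([^"]*)"'),
    "PRODUCTURL": re.compile(r'<p class="prd-name"><strong><a href="([^"]*)"'),
    "PRICE": re.compile(r'<p class="prd-amount"><strong>([^<]*)</strong>'),
}

def extract_records(html):
    """Return one dict per row; a field that is missing from a row is None."""
    records = []
    for block in RECORD_RE.findall(html):
        record = {}
        for field, pattern in SUB_PATTERNS.items():
            m = pattern.search(block)
            record[field] = m.group(1) if m else None
        records.append(record)
    return records
```

A row that lacks, say, the price still yields a record with the fields that are present, which is the benefit described above.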