Reg expression subset
Hello,
I'm a bit new to regular expressions, so perhaps what I'm asking isn't possible. I am trying to extract a value that appears before the unique part of the datarecord. This is the raw response:
<td class="row1"><span class="gen">Timezone</span></td>
<td class="row2"><select name="timezone">
<option value="7">GMT + 7 Hours</option>
<option value="8">GMT + 8 Hours</option>
<option value="9">GMT + 9 Hours</option>
<option value="9.5">GMT + 9.5 Hours</option>
<option value="10" selected="selected">GMT + 10 Hours</option>
<option value="11">GMT + 11 Hours</option>
I am trying to extract the value "10" which appears just before the text: "selected". This is the subextractor pattern I'm using:
>Timezone<~@user_timezone@~ected="selected">
I can't use "option value=" on its own because it is repeated, so I have to match the whole lot and then apply the regular expression: "\d+" sel
but this gives me a result of: 10" sel
All I want is the 10. I'm sure I could use a script to split the resulting string, but I'm wondering whether there is a way to do this with regular expressions, or some other way I'm missing. I'm sure the case of unique text appearing AFTER the desired value is common.
thanks
I understand the approach
I understand the approach you're trying to use, though I'd advise letting the Regex do a little more of the work. Try this as the main extractor pattern on the page:
This will grab out just the option box you're after. Then, try this sub-extractor, which will just scan within the 'select' construct:
That 'user_timezone' variable should have a slightly different pattern, because I see in that example text that there's a possible '9.5' value. A '.' is not a digit, and so '\d+' would actually fail to match a result in the event that the real value was '9.5' or any other fractional number. That being said, the pattern I'd use would allow for decimal places: [\d.]+ This way, it will match any combination of digits and decimal points. Theoretically, it would also match 9..1234.43.2.3, but that won't ever pop up on the page, so you're safe.
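To make the difference between the two patterns concrete, here is a small sketch using Python's `re` module rather than screen-scraper's own extractor syntax (that swap is purely for illustration). The sample line is hypothetical: it assumes the selected option happens to be the fractional one.

```python
import re

# Hypothetical sample: the selected option carries a fractional value.
html = '<option value="9.5" selected="selected">GMT + 9.5 Hours</option>'

# \d+ stops at the '.', so the closing quote never lines up and nothing matches.
digits_only = re.search(r'value="(\d+)" selected="selected"', html)

# [\d.]+ accepts digits and dots, so the whole value is captured.
digits_and_dots = re.search(r'value="([\d.]+)" selected="selected"', html)

print(digits_only)               # None
print(digits_and_dots.group(1))  # 9.5

# The character class is loose: it also accepts odd strings like this one.
print(bool(re.fullmatch(r'[\d.]+', '9..1234.43.2.3')))  # True
```

If the looseness ever mattered, a stricter pattern such as `\d+(?:\.\d+)?` would accept only an optional single decimal part, though for this page that is probably unnecessary.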
Basically, the concept I'm utilizing in my extractors is that the pattern simply will fail to match on all the other '<option value=' lines, due to the fact that my sub-extractor includes the 'selected="selected"' text. Including that text says to screen-scraper that the pattern is only allowed to match if 'selected="selected"' follows the number you're after. Otherwise, it'll just continue searching the DATARECORD text until it finds a matching entry.
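The same idea can be sketched outside screen-scraper with Python's `re` module. This is only an illustration under a few assumptions: the sample text is trimmed from the response quoted in the question, the closing `</select>` tag is assumed (it is not shown in the original snippet), and the names `page`, `block`, and `user_timezone` are just placeholders.

```python
import re

# Sample text adapted from the question; the closing </select> is an assumption.
page = '''<td class="row1"><span class="gen">Timezone</span></td>
<td class="row2"><select name="timezone">
<option value="9">GMT + 9 Hours</option>
<option value="9.5">GMT + 9.5 Hours</option>
<option value="10" selected="selected">GMT + 10 Hours</option>
<option value="11">GMT + 11 Hours</option>
</select></td>'''

# Stage 1 (plays the role of the main extractor): isolate the timezone select.
block = re.search(r'<select name="timezone">(.*?)</select>', page, re.DOTALL)

# Stage 2 (plays the role of the sub-extractor): the literal text after the
# capture group must be present, so the unselected options cannot match.
selected = re.search(r'<option value="([\d.]+)" selected="selected"',
                     block.group(1))

# Only the captured group is kept, not the trailing 'selected' text.
user_timezone = selected.group(1)
print(user_timezone)  # 10
```

Because the trailing `selected="selected"` sits outside the parentheses, it constrains where the match can happen without becoming part of the extracted value.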
If that's still a little unclear, I'd be happy to explain it a bit more. Let me know if there are any other issues with that!
Tim
wow! thankyou for the very
Wow! Thank you for the very detailed response, Timv. It took me a few days to get a chance to try it all, but it's working well now.
Part of the problem was that I was already doing the first part of the match as a sub-extractor pattern (using the main pattern to extract the body of the full record), so I couldn't drop down to a sub-sub-extractor pattern. After reading your post I realised I didn't need to, as there was only one complete data record on the page.
Now that I'm on a roll, I just need to understand the DB structure of phpBB. The reason I came across SS was that I need to scrape my forum off its free host (who won't give me a DB backup). It's starting to look like it could be a successful project.
Thanks again...
haha.. funny you're scraping
Haha, funny that you're scraping the site. We did a similar thing to get it moved into this website!