newbie in need of help.
I have been getting on OK with Screen-Scraper until this...
ANGEL WINEMAKING 3627 WEST BROADWAY VANCOUVER,BRITISH COLUMBIA Canada V6R 2B8 604-730-6060 BACCHUS GRAP CONNECTION 3511 HASTINGS STREET E VANCOUVER,BRITISH COLUMBIA Canada V5K 2A8 604-299-4848 BEYOND THE GRAPE 2603 KINGSWAY AVENUE VANCOUVER,BRITISH COLUMBIA Canada V5R 5H4 604-437-7100 GRAPE ESCAPE 902 COMMERCIAL DRIVE VANCOUVER,BRITISH COLUMBIA Canada V5L 3L7 604-254-1200 GRAPEVINES WINEMAKING 1314 SW MARINE DR VANCOUVER,BRITISH COLUMBIA Canada V6P 5Z6 604-261-2739 MOSAIC WINE MAKER 1263 PACIFIC BLVD VANCOUVER,BRITISH COLUMBIA Canada 604-602-9463 NEIGHBORHOOD WINEMAKERS 1680 DAVIE STREET VANCOUVER ,BRITISH COLUMBIA Canada V6G 1V9 604-683-7777 PURPLE GRAPE WINEMAKER 125-555 WEST 12TH AVE VANCOUVER,BRITISH COLUMBIA Canada V5Z 3X5 604-873-9669 THE WINE CELLAR 1659 RENFREW STREET VANCOUVER,BRITISH COLUMBIA Canada V5K 3X7 604-251-9461 WEST COAST U-BREW 1616 CLARKE DR. VANCOUVER,BRITISH COLUMBIA Canada V5L4Y2 604-875-0600 WINE CASTLE THE 4172 FRASER STREET VANCOUVER,BRITISH COLUMBIA Canada V5V 4E8 604-877-1177 WINEMASTER 4107 MACDONALD STREET VANCOUVER,BRITISH COLUMBIA Canada V6L 2P1 604-731-9463 |
The issue is that I cannot figure out the proper extractor pattern to distinguish one store's information from the next.
I have tried a number of different extractor patterns. The most successful is this one, but it gives me garbage for the first record and skips the last:
This one was good too, but it skipped every other record. SS starts looking for the next record AFTER the last character that identifies the previous record, so I could not use the ending of one record as the start of the next:
~@DATARECORD@~
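The every-other-record behaviour can be reproduced outside Screen-Scraper. The following is only a sketch in plain Python `re`, not Screen-Scraper syntax; the toy `html` string and the variable names are made up for illustration. It shows that when a pattern consumes the delimiter that starts the next record, that record is skipped, while a lookahead only checks for the delimiter and leaves it in place.

```python
import re

# Toy stand-in for the page: every record starts with a <b> tag.
html = "<b>A</b>1<br /><b>B</b>2<br /><b>C</b>3<br /><b>D</b>4<br />"

# The trailing "<b>" is consumed by each match, so the next search begins
# after it and the record that started there is never seen.
consuming = re.findall(r"<b>(.*?)<b>", html)

# A lookahead tests for the next "<b>" (or end of input) without consuming it,
# so every record remains available as the start of a new match.
lookahead = re.findall(r"<b>(.*?)(?=<b>|$)", html)

print(len(consuming), len(lookahead))
```

Whether Screen-Scraper's own pattern syntax accepts a lookahead in this position is not something this sketch establishes.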
I have tried plain extractor patterns, but some of the listings have URLs and emails and others don't, and I need to capture those items, so I really need to use sub-extractor patterns. Any assistance getting this working would be greatly appreciated.
Thanks, Carl.
newbie in need of help.
ceshelman,
In order to get multiple results from sub-extractors you've got to do a few twists and turns and be running either the professional or enterprise edition.
You'll need to make use of the [url=http://www.screen-scraper.com/support/docs/api_documentation.php#extractData]extractData method[/url]. Here's an example to follow (it's a bit much to try to explain in words only).
http://community.screen-scraper.com/script_repository/manual-extraction-example
I hope this helps. Sorry about needing to upgrade if you do.
-Scott
close but not quite there
Thanks for the reply. This is close but not quite there. SS does not appear to run sub-extractor patterns on a DATARECORD more than once, even if the DATARECORD contains multiple instances of the information. So all that is returned is a single row holding the first instance of the sub-extractor pattern data that SS comes across.
Thanks, Carl.
newbie in need of help.
Carl,
You've got the right idea; you just need to expand out a bit and not overlook those nice consistent tags they're using, too. Here's what I'd recommend for the main extractor pattern text:
<b~@DATARECORD@~<br />
</td>
</tr>
</table>
DATARECORD returns all of the store details, but now they'll be one long string stripped of hard returns and tabs. It looks like this:
>ANGEL WINEMAKING</b> <br />3627 WEST BROADWAY<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V6R 2B8<br />604-730-6060<br /> <br /><b>BACCHUS GRAP CONNECTION</b> <br />3511 HASTINGS STREET E<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V5K 2A8<br />604-299-4848<br /> <br /><b>BEYOND THE GRAPE</b> <br />2603 KINGSWAY AVENUE<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V5R 5H4<br />604-437-7100<br /> <br /><b>GRAPE ESCAPE</b> <br />902 COMMERCIAL DRIVE<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V5L 3L7<br />604-254-1200<br /> <br /><b>GRAPEVINES WINEMAKING</b> <br />1314 SW MARINE DR<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V6P 5Z6<br />604-261-2739<br /> <br /><b>MOSAIC WINE MAKER</b> <br />1263 PACIFIC BLVD<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br /> <br />604-602-9463<br /><a href="http://www.vinosaurs.com" class="textviolet" id="underline">www.vinosaurs.com</a> <br /> <br /><b>NEIGHBORHOOD WINEMAKERS</b> <br />1680 DAVIE STREET<br />VANCOUVER ,BRITISH COLUMBIA<br />Canada<br />V6G 1V9<br />604-683-7777<br /> <br /><b>PURPLE GRAPE WINEMAKER</b> <br />125-555 WEST 12TH AVE<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V5Z 3X5<br />604-873-9669<br /> <br /><b>THE WINE CELLAR</b> <br />1659 RENFREW STREET<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V5K 3X7<br />604-251-9461<br /> <br /><b>WEST COAST U-BREW</b> <br />1616 CLARKE DR.<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V5L4Y2<br />604-875-0600<br /> <br /><b>WINE CASTLE THE</b> <br />4172 FRASER STREET<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V5V 4E8<br />604-877-1177<br /> <br /><b>WINEMASTER</b> <br />4107 MACDONALD STREET<br />VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V6L 2P1<br />604-731-9463<br /><a href="http://www.mywinemaster.com" class="textviolet" id="underline">www.mywinemaster.com</a>
They've done you a big favor by accounting for the Postal code even when there is none. This means for your sub-extractor patterns you won't be forced to separate out each element but instead you'll be able to use each element's neighbors to help identify which element it is.
They didn't do the same for URLs, so you're going to need to handle those separately.
Here's what I suggest for your sub-extractor patterns:
[b]First[/b]
>~@STORE@~</b> <br />~@ADDRESS_ONE@~<br />~@CITY@~,~@PROVINCE@~<br />~@COUNTRY@~<br />~@POSTAL_CODE@~<br />~@PHONE@~<br />
Using regex on some of these will help. I suggest using the following.
POSTAL_CODE:
[A-Za-z0-9 ]*
PHONE (available under the drop-down):
\(?[\d]{3}[)-\. ]{1,2}[\d]{3}[-\. ]{1}[\d]{4}
For the rest use the standard non-html:
[^<>]*
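To check the token regexes against the sample data outside Screen-Scraper, the first sub-extractor pattern can be approximated with named groups. This is a rough sketch in plain Python `re`, not how Screen-Scraper evaluates patterns internally; the constant names, the `sample` string (two records copied from the DATARECORD text above), and the `records` list are assumptions for illustration.

```python
import re

# Token regexes taken from the suggestions above.
NON_HTML = r"[^<>]*"
POSTAL_CODE = r"[A-Za-z0-9 ]*"
PHONE = r"\(?[\d]{3}[)-\. ]{1,2}[\d]{3}[-\. ]{1}[\d]{4}"

# Each ~@NAME@~ token becomes a named group using its regex.
STORE_PATTERN = re.compile(
    rf">(?P<STORE>{NON_HTML})</b> <br />"
    rf"(?P<ADDRESS_ONE>{NON_HTML})<br />"
    rf"(?P<CITY>{NON_HTML}),(?P<PROVINCE>{NON_HTML})<br />"
    rf"(?P<COUNTRY>{NON_HTML})<br />"
    rf"(?P<POSTAL_CODE>{POSTAL_CODE})<br />"
    rf"(?P<PHONE>{PHONE})<br />"
)

# Two records from the sample; the second has a blank postal code.
sample = (
    ">ANGEL WINEMAKING</b> <br />3627 WEST BROADWAY<br />"
    "VANCOUVER,BRITISH COLUMBIA<br />Canada<br />V6R 2B8<br />"
    "604-730-6060<br /> <br />"
    "<b>MOSAIC WINE MAKER</b> <br />1263 PACIFIC BLVD<br />"
    "VANCOUVER,BRITISH COLUMBIA<br />Canada<br /> <br />"
    "604-602-9463<br />"
)

# One dict per store, keyed by token name.
records = [m.groupdict() for m in STORE_PATTERN.finditer(sample)]
for record in records:
    print(record["STORE"], "|", record["POSTAL_CODE"].strip(), "|", record["PHONE"])
```

Because the postal code slot is always present, even as a single space, the phone number always lands in the same position relative to its neighbours, which is what lets one pattern cover both records.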
[b]Second[/b]
<a href="~@URL@~" class="textviolet" id="underline">~@WEBSITE@~</a>
Also, if you anticipate them having a field for an email address you could do a third sub-extractor where it would look something like this.
<a href="mailto:~@EMAIL@~" class="textviolet" id="underline">~@EMAIL_ADDRESS@~</a>
The nice thing about sub-extractor patterns is that they're not required to match. So, if you never saw an email address it would be ok.
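The "not required to match" idea can be sketched in plain Python `re` as well. This is an illustration only, not Screen-Scraper's API: the `optional_fields` helper and the two shortened example records are hypothetical, and the `(?!mailto:)` guard is an addition here so that the website pattern does not also pick up email links.

```python
import re

# Website link; the negative lookahead keeps mailto: links out of this pattern.
URL_PATTERN = re.compile(
    r'<a href="(?!mailto:)(?P<URL>[^"<>]*)" class="textviolet" id="underline">'
    r'(?P<WEBSITE>[^<>]*)</a>'
)
# Email link, assuming the site would mark it up the same way as the website.
EMAIL_PATTERN = re.compile(
    r'<a href="mailto:(?P<EMAIL>[^"<>]*)" class="textviolet" id="underline">'
    r'(?P<EMAIL_ADDRESS>[^<>]*)</a>'
)

def optional_fields(record_html):
    """Return whichever optional fields are present; missing ones are simply absent."""
    fields = {}
    for pattern in (URL_PATTERN, EMAIL_PATTERN):
        match = pattern.search(record_html)
        if match:
            fields.update(match.groupdict())
    return fields

with_site = (
    '<b>WINEMASTER</b> <br />604-731-9463<br />'
    '<a href="http://www.mywinemaster.com" class="textviolet" id="underline">'
    'www.mywinemaster.com</a>'
)
without_site = '<b>ANGEL WINEMAKING</b> <br />604-730-6060<br />'

print(optional_fields(with_site))
print(optional_fields(without_site))
```

A record with no link yields an empty dict rather than an error, which mirrors a sub-extractor that finds nothing to match.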
Hope this helps,
Scott