RE: Extractor Pattern - Only getting one data set
MacBook
System: OS X
Version: 10.6.3
Screen-Scraper Basic
AIM: To compile a list of wineries (just their names) in Spain using the Spanish YellowPages website.
Notes: I thought this would be basically the same as the E-commerce site tutorial but evidently not :(
My computer skills are pretty poor, and after playing around with screen-scraper over the weekend
I have decided I should probably just ask for help as I'm getting nowhere. Below are the steps I have taken
in an attempt to extract this information, in case I'm doing something wrong before the 'Extractor Pattern' step.
Step 1:
- New Proxy session
- Start Proxy session; http://www.paginasamarillas.es/
- Search input under Actividad: 'Bodega' (meaning Winery); Leave Provincia: 'Todas' (All Regions)
- Hit 'Encontrar' (meaning Search) button (should get around 5000 wineries)
- Go to bottom of page and hit 'Siguiente' (meaning Next)
- Stop Proxy Session
Step 2:
- New Scraping Session
- Go to Proxy session; Progress tab; find 'GET http://www.paginasamarillas.es/resultados.asp?
activ=Bodega&mode=simple&site=paol&pg=2&id_busq=paol250100613041539XCB8767D3879DB961 HTTP/1.1'
(Number 23) and Generate scrapable file
- Go to scrapable file; parameters tab; change activ from 'Bodega' to ~#SEARCH#~ and pg from '2' to ~#PAGE#~
Step 3:
- New Script file; using Interpreted Java
- In text box:
// Set the session variables.
session.setVariable( "PAGE", "1" );
session.setVariable( "SEARCH", "bodega" );
- Go to scraping session; script tab; add just made script and Run scraping session
Step 4:
- Go to scrapable file; extractor patterns tab and add extractor pattern.
- ?
And here I'm stuck. I'm not sure what part of the source code I'm meant to use and where I place my tokens.
I've tried different bits as an example:
href="/functions/jump.aspdest=www%2Ecaviarhouse%2Dprunier%2Ecom&mode=simple&site=paol&
id_busq=paol250100613041539XCB8767D3879DB961&posicion=1&t=IL&producto=PAO&
c=253459894&a=005&gp_orden=LEXC" target="_blank">CAVIAR HOUSE & PRUNIER<
/span>
But like this, where:
~@PRODUCTID@~ is a URL GET PARAMETER and a saved session variable
~@PRODUCT_VALUE@~ is a non-HTML tag
href="/functions/jump.aspdest=www%2Ecaviarhouse%2Dprunier%2Ecom&mode=simple&site=paol&
id_busq=paol250100613041539XCB8767D3879DB961&posicion=~@PRODUCTID@~&t=IL&
producto=PAO&c=253459894&a=005&gp_orden=LEXC" target="_blank">~@PRODUCT_VALUE@~
But all I get is the first winery and none of the others, and I'm now at a loss as to what else to do.
If someone could help me out that would be great because I haven't a clue. lol
doing really well so far!
Congrats Smro, you've provided probably the best, most descriptive forum post I've seen in a long time. Without even seeing the website you're scraping I feel like I know it.
Let's cut straight to the chase, since it looks like you've got some of the navigation working and the search is getting you to a results page.
quote:
href="/functions/jump.aspdest=www%2Ecaviarhouse%2Dprunier%2Ecom&mode=simple&site=paol&
id_busq=paol250100613041539XCB8767D3879DB961&posicion=1&t=IL&producto=PAO&
c=253459894&a=005&gp_orden=LEXC" target="_blank">CAVIAR HOUSE & PRUNIER<
/span>
But like this, where:
~@PRODUCTID@~ is a URL GET PARAMETER and a saved session variable
~@PRODUCT_VALUE@~ is a non-HTML tag
href="/functions/jump.aspdest=www%2Ecaviarhouse%2Dprunier%2Ecom&mode=simple&site=paol&
id_busq=paol250100613041539XCB8767D3879DB961&posicion=~@PRODUCTID@~&t=IL&
producto=PAO&c=253459894&a=005&gp_orden=LEXC" target="_blank">~@PRODUCT_VALUE@~
You're on the right track with the above. The reason you're only getting 1 result is that you need to make your extractor pattern a bit more generic. This is actually a delicate balance because you want it to be generic enough that it gets all the desired results off the page, but specific enough that it doesn't grab the entire page.
Take the following example. Notice how caviarhouse is spelled out right at the beginning of the href? With something that specific in the extractor pattern you'll never get any results that don't have caviarhouse in them. My guess is that only 1 result on the page will have caviarhouse in it. Also, do you notice the long stretch of numbers with the XCB and DB in the id_busq GET parameter? My guess is that this is also unique, either to caviarhouse or to this particular search (the same value shows up in the results URL you generated the scrapable file from), even if it isn't spelled out. Some unique ID, possibly. You'll need to get rid of these unique values before you can get this pattern to match multiple listings in the phone book.
href="/functions/jump.aspdest=www%2Ecaviarhouse%2Dprunier%2Ecom&mode=simple&site=paol&
id_busq=paol250100613041539XCB8767D3879DB961&posicion=~@PRODUCTID@~&t=IL&
producto=PAO&c=253459894&a=005&gp_orden=LEXC" target="_blank">~@PRODUCT_VALUE@~
Here's a possible solution - or at least enough of a building block to hopefully jump start ya.
href="~@junk@~posicion=~@PRODUCTID@~&~@junk2@~>~@PRODUCT_VALUE@~<
This may be too greedy and you may find that it isn't specific enough for the results you're looking for; however, I believe it demonstrates my point. You need the first junk token to take everything from the opening double quote (") up to the string "posicion". Then you can scrape out the PRODUCTID. After that you want junk2 to soak up any unique stuff from the & immediately after the PRODUCTID up to the closing (>). This will make your pattern a little more generic, but it should still be specific enough that you only get the desired data points.
Remember that when you craft extractor patterns you need to watch out for unique pieces. If you leave them in, the pattern will only match the one overly specific item they came from.
Come to think of it, you could probably simplify this even further... possibly.
posicion=~@PRODUCTID@~&~@junk@~>~@PRODUCT_VALUE@~<
You likely won't even need the beginning part as posicion is very specific.
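If it helps to see the difference outside of screen-scraper, here's a rough sketch in plain Java. It is not how screen-scraper works internally; it just assumes each ~@TOKEN@~ behaves roughly like a regular-expression capture group. The sample HTML, the class name ExtractorSketch and the method extractNames are all made up for illustration, and the real page markup will differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractorSketch {
    // Made-up stand-in for the results page: three listings, simplified markup.
    static final String SAMPLE_HTML =
          "<a href=\"/jump?dest=bodega-uno&posicion=1&t=IL\" target=\"_blank\">BODEGA UNO</a>\n"
        + "<a href=\"/jump?dest=bodega-dos&posicion=2&t=IL\" target=\"_blank\">BODEGA DOS</a>\n"
        + "<a href=\"/jump?dest=bodega-tres&posicion=3&t=IL\" target=\"_blank\">BODEGA TRES</a>\n";

    // Too specific: the first listing's own "dest" value is baked into the pattern.
    static final String SPECIFIC =
        "href=\"/jump\\?dest=bodega-uno&posicion=([^&\"]*)&t=IL\" target=\"_blank\">([^<]*)<";

    // Rough analog of: posicion=~@PRODUCTID@~&~@junk@~>~@PRODUCT_VALUE@~<
    // Group 1 plays PRODUCTID, [^>]* plays junk, group 2 plays PRODUCT_VALUE.
    static final String GENERIC =
        "posicion=([^&\"]*)&[^>]*>([^<]*)<";

    // Collects the second capture group (the listing name) from every match.
    static List<String> extractNames(String html, String regex) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(extractNames(SAMPLE_HTML, SPECIFIC)); // [BODEGA UNO]
        System.out.println(extractNames(SAMPLE_HTML, GENERIC));  // [BODEGA UNO, BODEGA DOS, BODEGA TRES]
    }
}
```

The specific pattern only ever matches the listing it was copied from, while the generic one picks up every row, which is the same thing you're seeing with the caviarhouse pattern.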
Best of luck.
RE: Woo Hoo! Thank you :)
All I had to do was put ~@JUNK@~ around a few places and I was able to get it to work!
But I ended up using the Form id="FURL_0" as my ~@ProductID@~ because the posicion had duplicates. If that makes any sense :)
So thank you very much for getting back to me so quickly.
Kind Regards
Sachi