Setting Extractor Pattern Token within iframe tag?

Hello,

I've been using ScreenScraper to do some simple test scrapes from websites, and it works very well. I'm trying to step things up a bit and get SS to do even more things for me: namely automate the downloading of a PDF file. I've read the various posts about this, and am following the advice in the post "Can PDF be saved?" i.e.

Download PDF
then
session.downloadFile( session.getVariable( "PDF_URL" ), "C:/mydir/my_doc.pdf" );

But my problem is in setting up the extractor pattern from this chunk of html:

 <br />
                <a href="http://v3.espacenet.com/espacenetDocument.pdf?flavour=phantomFull&amp;locale=en_GB&amp;FT=D&amp;date=19890704&amp;CC=US&amp;NR=4844520A&amp;KC=A" target="MaxView">Click here to open the pdf in a separate window for better navigation</a></p>
<p>                <embed src="http://v3.espacenet.com/espacenetDocument.pdf?flavour=phantomFull&amp;locale=en_GB&amp;FT=D&amp;date=19890704&amp;CC=US&amp;NR=4844520A&amp;KC=A" border="0" height="636" width="704"/><br />

I can set up the extractor pattern, but when performing the scrape, the log states: "The pattern did not find any matches." I've tried setting the token up in the first <iframe part, and the second <a href part, but no luck. Am I right in thinking that the problem is due to the <iframe tag?

(Original page: http://v3.espacenet.com/publicationDetails/originalDocument?CC=US&NR=4844520A&KC=A&FT=D&date=19890704&DB=EPODOC&locale=en_gb)

Thanks for any light you can shine on this!

James

SlimJim on 07/08/2009 at 3:05 am

screen-scraper public support

The problem is that you've got an additional attribute between the closing quote of the href value and the closing ">" of the tag:

http://v3.espacenet.com/espacenetDocument.pdf?flavour=phantomFull&locale=en_GB&FT=D&date=19890704&CC=US&NR=4844520A&KC=A" target="MaxView"

i.e. target="MaxView"

I assume you've got the token set to use the 'non-double quotes' regex pattern? If not, you would be getting a match, but you'd be getting everything up to the

You could try:

as your extractor pattern (with non-double quotes as the regex).
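
To see why the extra attribute matters, here is a rough sketch in plain Java (not screen-scraper's own API). It approximates an extractor pattern as a regular expression in which the ~@PDF_URL@~ token becomes a [^"]* capture group and everything around it is matched literally. The class and method names are made up for illustration, and treating the surrounding text as strictly literal is my assumption about how the match behaves, not a statement about screen-scraper's internals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractorPatternSketch {
    // Hypothetical helper: converts an extractor-style pattern containing
    // exactly one ~@PDF_URL@~ token into a regex. The token becomes the
    // "non-double quotes" expression; the text around it is quoted literally.
    static Pattern toRegex(String extractorPattern) {
        String token = "~@PDF_URL@~";
        int at = extractorPattern.indexOf(token);
        String before = extractorPattern.substring(0, at);
        String after = extractorPattern.substring(at + token.length());
        return Pattern.compile(
            Pattern.quote(before) + "([^\"]*)" + Pattern.quote(after));
    }

    // Returns the captured URL, or null when the pattern finds no match.
    static String extract(String extractorPattern, String html) {
        Matcher m = toRegex(extractorPattern).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened version of the link from the question.
        String html = "<a href=\"http://v3.espacenet.com/espacenetDocument.pdf"
            + "?flavour=phantomFull&amp;locale=en_GB\" target=\"MaxView\">"
            + "Click here to open the pdf</a>";

        // Pattern that expects ">" right after the closing quote: no match,
        // because target="MaxView" sits in between.
        System.out.println(extract("<a href=\"~@PDF_URL@~\">", html));

        // Pattern that includes the extra attribute: the URL is captured.
        System.out.println(
            extract("<a href=\"~@PDF_URL@~\" target=\"MaxView\">", html));
    }
}
```

The first call finds nothing because the literal text after the token does not line up with the page source, which is the situation that produces "The pattern did not find any matches." in the log.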

If that's the HTML you're seeing in the scrapeable file's response tab, then it's probably not an iframe problem. If it is, you could try using the iframe tag to grab the URL, i.e.:
iframe src="~@PDF_URL@~"

since it holds the same URL anyway...
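
A pattern anchored on src has one advantage: it ends at the closing quote of the value, so any attributes that follow do not affect the match. The sketch below is plain Java rather than screen-scraper code, and the class and method names are hypothetical. It accepts either an iframe or an embed tag, since the HTML quoted in the question shows an embed, and it assumes src is the first attribute of the tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcAttributeSketch {
    // Matches an <iframe> or <embed> tag whose first attribute is src and
    // captures the value with the "non-double quotes" expression.
    static final Pattern SRC =
        Pattern.compile("<(?:iframe|embed) src=\"([^\"]*)\"");

    // Returns the first captured src value, or null when nothing matches.
    static String firstSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened version of the embed tag from the question.
        String html = "<p><embed src=\"http://v3.espacenet.com/"
            + "espacenetDocument.pdf?flavour=phantomFull&amp;locale=en_GB\" "
            + "border=\"0\" height=\"636\" width=\"704\"/><br />";

        // The trailing border/height/width attributes are ignored because
        // the regex stops at the quote that closes the src value.
        System.out.println(firstSrc(html));
    }
}
```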

shadders on 07/08/2009 at 8:11 pm
