scraping search engine data
Hi all,
I’m scraping search engine data and have recently run into a problem. I’ve set up screen-scraper to extract a pattern from a search engine results page requested with a certain set of keywords. Up until a few weeks ago, everything worked perfectly (and it worked well for close to 6 months), meaning the pattern would match and return many occurrences on the page for each keyword I fed into the dynamic URL, which is exactly what it was meant to do.
However, recently it has started to drop matches. For example, it requests the URL for a particular keyword in the search engine, but then does not match the pattern at all. I don't think the pattern spec is wrong, because it still finds a match for 50-60% of the keywords that are dynamically fed into the URL, but it is missing a lot of data which should be there. I’ve been trying to get it back to the way it was, but I can’t figure out what is happening. In my analysis, I have noticed a few strange things:
• The ‘last response’ page shown by screen-scraper looks quite different from how the same page displays in my browser. I had to completely modify my pattern so that it would match the HTML as screen-scraper shows it, but I don’t understand why it picks up HTML that differs so much from what the browser shows. Could this be the source of my problems?
• When scraping, I notice in the log that the “setting referrer to” URL is different from the requested URL. For example, if I am scraping two keywords in this order: “car hire”, “red balloon”, then when screen-scraper is requesting the page for the “red balloon” keyword, it will say ‘setting referrer to: URL – car hire’. I’m not sure if this is normal or a potential problem, because I didn’t pay attention to it before.
I also thought the issue might be due to my timeout settings in screen-scraper, but the connection timeout is set to 180 seconds and the pattern timeout to 30 seconds. Given that the URL seems to resolve each time, I don’t think this is the problem.
Can anybody suggest why this might be happening and how to fix it?
Thanks very much in advance.
scraper_0011,
I recommend that you call scrapeableFile.getContentAsString() and save out the HTML of each page in question. Load the problem pages in your browser and see if things are as they should be. Perhaps the site is either blocking or obfuscating in some way.
The HTML from the last response tab is going to look different than the HTML when you view the source in your browser. That is normal and shouldn't be of any concern.
Also, the way screen-scraper is setting the referrers in your log sounds like it is working as it should.
You're correct; I don't think timeouts are the issue either.
If you are still stumped you can try posting some example code for us to look at.
-Scott
thanks, still having trouble...
Hi Scott,
Thanks for your comments and suggestions. I added a call to getContentAsString() and, using it as a debugging tool, couldn't find any problems with my extractor pattern, so I think I can rule that out. I fully agree that it's possible the site is obstructing my scraping in some way. However, before I accept that, I still need to understand the page rendering issue.
This is the problem. When I'm first setting up a scraping file, I request a sample Google search page using the screen-scraper proxy, and based on the 'last response' tab I build my extractor pattern, which works perfectly when I test it. However, when I turn off the proxy and run the scrape on its own, the extractor doesn't match because the page comes back with completely different HTML. This doesn't make sense: when I first set up screen-scraper months ago, the extractor pattern I built from the proxy output HTML worked on both the proxy-captured search page and other pages.
So the page that is being fetched by screen-scraper outside of proxy mode is somehow faulty or different. How does screen-scraper actually fetch the HTML? Does it launch my own browser in the background and interface with it somehow? Will the page be fetched from my location, or is it routed through somewhere else? (If the proxy mode uses my location but the scrape outside proxy mode goes through the US or something, this could be the source of the problem.)
I'm just trying to understand what would make the proxy-fetched HTML different from the subsequent scraping file HTML.
I'm attaching below the last request log in Screen Scraper for a proxy session and a subsequent scraping file session. There are some differences there that I'm not sure are relevant but someone might pick something up.
Any help is greatly appreciated.
Thanks,
Dan
====================================
Initial Scraper Proxy Session
-----------------------------
GET http://www.google.com.au/search?q=health%20insurance%20review&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&source=hp&channel=np HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: www.google.com.au
Cookie: PREF=ID=be8ec0da2a0fcabf:U=f973ef667b253ba6:FF=0:TM=1316341729:LM=1316342375:S=3teyKnc05ImZ94rj; NID=51=TiBTqKt8wZ4MEzKnZr6y9kRfuLIcAArpox3Cvlu7YIQrqVX7NDOnyVrD3rcUkMJGWIvxmBbY_JNF1-h-ClP-hHnN6rDykXrq1xySeWI2eWqBCY6V11L_vO5knVdDuq1Q
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Proxy-Connection: keep-alive
====================================
When Running The Scrape File
-----------------------------
GET /search?q=pet+insurance+review&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla%3Aen-GB%3Aofficial&client=firefox-a&source=hp&channel=np&key=value&outfile=java.io.FileWriter%4050567b HTTP/1.1
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Language: en-us,en;q=0.5
Host: www.google.com.au
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: gzip
Referer: http://www.google.com.au/search?q=free+vouchers&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla%3Aen-GB%3Aofficial&client=firefox-a&source=hp&channel=np&key=value&outfile=java.io.FileWriter%4050567b
Cookie: NID=51=N3jf1e09cE6SgOzsC4XHBwUm6nq1nWqglDf9eaBC8juNBiLZtY2UCm-Tp3w8-lHeioLwJCASfjn5mJPXGPYypEJ7H-8hN4FTtC7nVwGnq2FC2Pnqec2nFpRX7LEL1L8_; PREF=ID=be57846d0296b96f:U=7d44f88d4d8f36a6:FF=0:TM=1316344027:LM=1316344029:S=TxlgW9UImZErzkE8
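One difference between the two request lines above can be checked mechanically: the query parameters. The following is a minimal Python sketch, not part of screen-scraper, that parses both URLs with the standard library and lists the parameters present only in the scraping-session request. The helper names `query_params` and `extra_params` are made up for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Request targets copied from the two logs above.
proxy_url = (
    "http://www.google.com.au/search?q=health%20insurance%20review&ie=utf-8&oe=utf-8"
    "&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&source=hp&channel=np"
)
scrape_url = (
    "/search?q=pet+insurance+review&ie=utf-8&oe=utf-8&aq=t"
    "&rls=org.mozilla%3Aen-GB%3Aofficial&client=firefox-a&source=hp&channel=np"
    "&key=value&outfile=java.io.FileWriter%4050567b"
)


def query_params(url):
    """Return the decoded query parameters of a URL as a name -> value dict."""
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Each parameter appears once in these URLs, so keep the first value.
    return {name: values[0] for name, values in parsed.items()}


def extra_params(reference, candidate):
    """Return parameters that appear in `candidate` but not in `reference`."""
    ref = query_params(reference)
    cand = query_params(candidate)
    return {name: cand[name] for name in cand if name not in ref}


extra = extra_params(proxy_url, scrape_url)
print(extra)
# {'key': 'value', 'outfile': 'java.io.FileWriter@50567b'}
```

The scraping-session request carries two parameters, `key` and `outfile`, that the proxy request does not, and the `outfile` value has the form of a Java object's default string representation. Whether these extra parameters affect the returned page is not established in this thread, but they are a concrete difference worth checking.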
Dan,
We have a relatively new feature available under the Last Request tab of each scrapeable file. If you click it after running your scraping session, you will be prompted to compare the current last request with its counterpart from your proxy transactions.
If you review each of the elements of your request in screen-scraper you will hopefully be able to identify the issue that needs correcting.
Give that a try and let us know what you find.
-Scott
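
The element-by-element comparison suggested above can also be done outside screen-scraper on the two request logs posted earlier in the thread. This is a rough Python sketch under stated assumptions: the header text is copied from those logs (the long Cookie and Accept lines are left out for brevity), and `parse_headers` and `diff_headers` are hypothetical helper names, not screen-scraper functions.

```python
# Headers copied from the two logs earlier in the thread
# (Cookie and Accept omitted here for brevity).
PROXY_HEADERS = """
Host: www.google.com.au
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
"""

SCRAPE_HEADERS = """
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Language: en-us,en;q=0.5
Host: www.google.com.au
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
Accept-Encoding: gzip
"""


def parse_headers(raw):
    """Parse 'Name: value' lines into a dict keyed by lower-cased header name."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only; values such as 'rv:6.0' contain colons.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def diff_headers(first, second):
    """Return {name: (first_value, second_value)} for headers whose values differ.

    A header missing from one side is reported with None on that side.
    """
    a = parse_headers(first)
    b = parse_headers(second)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


diff = diff_headers(PROXY_HEADERS, SCRAPE_HEADERS)
for name, (proxy_value, scrape_value) in diff.items():
    print(f"{name}:\n  proxy:  {proxy_value}\n  scrape: {scrape_value}")
```

For these logs the differing headers are Accept-Encoding, Accept-Language, and User-Agent. A server is free to return different markup depending on such headers, so making the scraping session send the same values the browser sent through the proxy is one thing to try. That is a suggestion to test, not a confirmed diagnosis.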