Complicated Site with Frames Looks Good in Proxy Session but not in Scraping Session
I am working on setting up a scrape of a site that requires an external proxy (which I have set up). The site brings the user to a license agreement page first and then to the main search page. From there the user can search multiple types of law journals.
Using a screen-scraper proxy session, I am able to capture the information that I need. The final search results are listed in a frame, and I have been able to discern which of the many entries in the Proxy Session Progress tab is the frame that I need. I've looked at the response in the proxy session and the data that I want is there.
I create scrapeable files from the entries in the proxy session and then run the scraping session. The response that I get is not correct - I can't even tell what it is that I'm getting back, and it contains no data. From what I can see, they are using BrowserHawk, as there are a lot of calls to BH functions in the response. I would be happy to send you the response in a separate email if that would be useful.
The session cookies appear to be propagating appropriately, although they might be expiring (?) as I see a timeout function that's being called regularly.
Any help or advice here would be much appreciated.
robind,
We ran into BrowserHawk once, back in '08. The solution ended up being pretty simple. All we needed to do was set a cookie before the scrape started, and the scraping session propagated it through the rest of the scrape.
Here is the cookie we set. Your cookie should resemble one from your own proxy session.
Hope this helps,
Scott
Thanks, Scott. I'll give it a try!
Well, it looks like it will be a little more complicated than that, as they appear to be setting a state, a timestamp and the current URL in the BrowserHawk cookie. I'll have to figure out how to capture those before setting the cookie. Any suggestions would be much appreciated...
robind,
I'm guessing the timestamp looks a little like this: 1322694237888. If so, you can generate a current timestamp like so...
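For reference, a millisecond timestamp of that form is what plain Java's System.currentTimeMillis() returns. This is only a minimal sketch: the class and method names are made up for illustration, and in a script you would likely need just the one call.

```java
// Illustrative wrapper; the only call that matters is System.currentTimeMillis().
class CurrentTimestamp {
    // Milliseconds since the Unix epoch, as a string (13 digits for present-day dates).
    static String nowMillis() {
        return String.valueOf(System.currentTimeMillis());
    }

    public static void main(String[] args) {
        System.out.println(nowMillis());
    }
}
```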
If your timestamp happens to be three digits shorter than my example (because yours is a Unix timestamp in seconds, whereas mine is a Java timestamp, which includes milliseconds), then you can convert a Java timestamp to a Unix timestamp simply by lopping off the last three digits, like so...
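A rough sketch of that conversion in plain Java: integer division by 1000 turns milliseconds into whole seconds, which drops the last three digits. The class and method names are hypothetical.

```java
// Illustrative helper for turning a Java millisecond timestamp into Unix seconds.
class UnixTimestamp {
    // Integer division discards the millisecond part (the last three digits).
    static long toUnixSeconds(long millis) {
        return millis / 1000L;
    }

    public static void main(String[] args) {
        // Uses the example value from this thread.
        System.out.println(toUnixSeconds(1322694237888L)); // prints 1322694237
    }
}
```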
Then, if you're able to construct your URL in a manner similar to how I had suggested in your previous forum post you could do something like this.
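As a very rough sketch of putting the pieces together in plain Java: the layout below (state, timestamp and URL joined as key=value pairs, with the URL percent-encoded) is purely an assumption for illustration, as are the class name, method name and field names. The real layout has to be copied from the cookie you see in your own proxy session.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the actual BrowserHawk cookie layout is NOT known here.
class CookieValueSketch {
    // Joins the three pieces into one string, URL-encoding the current URL.
    static String buildValue(String state, long timestamp, String currentUrl) {
        try {
            String encodedUrl = URLEncoder.encode(currentUrl, "UTF-8");
            // Assumed layout only; match it to what your proxy session shows.
            return "state=" + state + "&timestamp=" + timestamp + "&url=" + encodedUrl;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder state and URL, with a freshly generated timestamp.
        String value = buildValue("ready", System.currentTimeMillis(), "http://example.com/search");
        System.out.println(value);
    }
}
```

The resulting string would then be set as the cookie value before the scrape starts, the same way as the fixed cookie described earlier in the thread.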
I hope this helps.
-Scott