urls for java servlets are all identical
I am trying to scrape a site where all of the salient HTML is served by Java servlets. The problem I am encountering is that each page has the same URL, e.g. "http://www.website.com/DisplayServlet".
I am able to isolate the appropriate HTTP transaction from the "Progress" tab in the Proxy session, generate a scrapeable file, and successfully extract the desired data from the scrapeable file.
The problem arises when I run the scraping session. Looking at the log file, it appears that screen-scraper is trying to apply the extractor pattern to the first occurrence of the URL in the HTML response file and hence is unable to find any matches.
I am using screen-scraper_pro 4.0. Any suggestions on how to get beyond this issue?
Thanks.
urls for java servlets are all identical
timz,
You'll need to scrape that unique request ID from the page that precedes the one using it as a post parameter, save that string as a session variable, and use it as the post parameter when requesting each subsequent page.
Likely, that string will change each time so you'll want to scrape and use it each time.
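To make the idea concrete, here is a minimal sketch in plain Python rather than in screen-scraper itself. It assumes the request ID arrives in a hidden form field named "requestId"; that field name and both function names are hypothetical, and the URL is just the placeholder from the original post. The real page will need its own pattern.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL taken from the original post; every page shares it.
SERVLET_URL = "http://www.website.com/DisplayServlet"

# Hypothetical markup: assumes the ID sits in a hidden input named "requestId".
REQUEST_ID_PATTERN = re.compile(
    r'<input[^>]*name="requestId"[^>]*value="([^"]+)"', re.IGNORECASE
)


def extract_request_id(html):
    """Return the request ID found in the previous page, or None if absent."""
    match = REQUEST_ID_PATTERN.search(html)
    return match.group(1) if match else None


def build_next_request(request_id):
    """Build (but do not send) the POST for the next page using the scraped ID."""
    body = urlencode({"requestId": request_id}).encode("ascii")
    return Request(SERVLET_URL, data=body, method="POST")
```

Because the ID probably changes on every response, extract_request_id would be run again on each page before building the next request.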
-Scott
java servlets
Scott --
I'm replying to an old post. I had to abandon my efforts with screen-scraper because of the problem I noted in this thread. I've managed to automate this task with another scraper product, but much prefer the screen-scraper approach. Have there been any updates to SS in the past 9 months that might permit me to overcome the problem I've outlined? I had tried to track the request id, but without success.
Thanks,
-tim
Well, I'm sure that your problem can be overcome; screen-scraper just automates the processes that you either do manually or that the site does behind the scenes. That being said, it could still be tricky in a situation like the one you briefly outlined in the original post.
I know that it's been a while now, but can you recall the specific website, or give some specific samples about what was (or was *not*) happening? Without something direct, a reply like mine becomes strict rhetoric about "how screen-scraper works".
Based on what I know of the issue, I would guess that some sort of HTTP transaction is being fired from the applet, and then processed server-side, and then it's replying to the applet, rather than directing your browser to a new page.
If this assumption is correct, then it's probably being handled by one of two things: AJAX or a Java applet.
Frankly, in either case, the technique is generally straightforward; the little requests/replies that are made by either AJAX or an applet will show up in the screen-scraper proxy.
AJAX, being XML in nature, is dirt easy to parse, since everything is labeled. A reply to an applet would be similar, I imagine.
The trick is that you're not going to be able to "navigate" an applet very well, as you've discovered. You'll have to analyse the applet and watch your proxy server, and learn how to manually assemble the HTTP requests that the applet is sending out.
For instance, consider a page like the following: http://www.property24.com/search/PropertySearch.aspx?searchtype=Residential . There's a little IFRAME that dynamically reloads its content depending on the various choices you've made in the drop-down boxes.
It would be madness to try to send off the individual HTTP requests (in this case done by JavaScript) for the sake of reconstructing the *exact* process of browsing the page. Instead, watch the proxy server and you'll notice that the IFRAME's contents are coming to the page as separate entries in the proxy. If you pay attention to the URL of the IFRAME's contents (via the proxy server), you'll see that the numeric code it uses to construct the URL is simply the various "VALUE" attributes of the options inside the HTML "<select>" tags for the drop-down boxes.
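As a rough illustration of collecting those values, the sketch below pulls the "value" attributes out of a page's option elements using only Python's standard library. The class and function names are hypothetical, and it assumes the drop-downs are ordinary HTML rather than something built by script at load time.

```python
from html.parser import HTMLParser


class OptionValueParser(HTMLParser):
    """Collects the non-empty value attribute of every <option> it sees."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            value = dict(attrs).get("value")
            if value:  # skip placeholder entries such as "Any"
                self.values.append(value)


def option_values(html):
    """Return the option values found in the given HTML, in document order."""
    parser = OptionValueParser()
    parser.feed(html)
    return parser.values
```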
Since the URL to the data in the IFRAME can be requested independent of the page itself, I could frankly just build the URL to the IFRAME myself, so long as I know the values I want to give it. I could (and *did*, when I scraped this site recently) just proxy the IFRAME upon selecting all the various search options that I'm interested in. Now all I have to change is a single number in the IFRAME's URL to alter the Province that I'm requesting info for. Better yet, I never have to even visit the search page if I have the Province ID numbers from the drop-down box.
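A sketch of that shortcut, with loud assumptions: the real IFRAME URL is not shown in this thread, so the base address and the "province" parameter name below are invented placeholders, and build_iframe_url is a hypothetical helper. The point is only that each request is a fixed URL with one number swapped in.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the IFRAME URL observed in the proxy.
IFRAME_BASE = "http://www.example.com/search/results"


def build_iframe_url(province_id, base=IFRAME_BASE):
    """Build the IFRAME URL for one province without visiting the search page."""
    # "province" is an assumed parameter name, not the site's real one.
    return base + "?" + urlencode({"province": province_id})


# One URL per province ID taken from the drop-down box.
urls = [build_iframe_url(pid) for pid in (1, 2, 3)]
```

Each URL in the list can then be requested directly, which is the corner-cutting described above.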
Sorry for being long winded, but I hope I'm illustrating the idea. You can often cut some corners while scraping certain sites. In fact, when dealing with pages like the one you've described, you may very well have to cut those corners.
By all means, let me know if you have any questions or concerns!!
Tim (Tim *V*, not Tim *Z*!)
Thanks
Tim V --
Thanks for the extensive response. I hope to have time to explore this next week and will revert with questions.
urls for java servlets are all identical
Scott --
Yes, there is a unique request ID for each page.
-tim
urls for java servlets are all identical
timz,
Are there any post parameters being passed between pages?
-Scott