Scraping AJAX

I'm trying to figure out how to use screen-scraper to scrape pages that use Ajax, and I haven't found anything very helpful beyond references to some methods that should be used (e.g. setRequestEntity, addHTTPHeader, etc.).

One of the sites I'm trying to scrape is www.harvardpilgrim.org, specifically the doctor lookup. When I look at the pages captured in the proxy session, instead of the normal parameters, I see the following type of request:

{"facetNames":["Rating","PCQR","HospitalQuality","HospitalCondition"],"context":{"ClientId":"","SiteId":10020,"MasterSiteId":10020,"SessionMode":0,"SiteLanguage":'en-US',"VisitorGuid":'896d0443-4030-4bae-8600-67f66ceaadd5',"SessionGuid":'cfe717ba-b3d8-4fb3-94f5-c09a83665beb',"NavigationHistory":[{"Page":"http://www.providerlookuponline.com/Harvardpilgrim/po7/gateway.aspx","ReferreringPage":"null"},{"Page":"http://www.providerlookuponline.com/Harvardpilgrim/po7/Search.aspx","ReferreringPage":"http://www.providerlookuponline.com/Harvardpilgrim/po7/gateway.aspx"},{"Page":"http://www.providerlookuponline.com/Harvardpilgrim/po7/Results.aspx","ReferreringPage":"http://www.providerlookuponline.com/Harvardpilgrim/po7/Search.aspx"}],"SearchCriteria":{"FacetCriteria":{"FacetSelectionCriteria":[{"FacetSelection":{"FacetDescriptorName":"Network","SelectedItems":[{"ItemName":"PPO","ItemIds":["3465438"],"ItemCount":0}],"SelectionType":0},"IsVisible":true,"IsRemovable":false},{"FacetSelection":{"FacetDescriptorName":"Role","SelectedItems":[{"ItemName":"Hospitals","ItemIds":["4"],"ItemCount":0}],"SelectionType":0},"IsVisible":true,"IsRemovable":false},{"FacetSelection":{"FacetDescriptorName":"State","SelectedItems":[{"ItemName":"Massachusetts","ItemIds":["MA"],"ItemCount":0}],"SelectionType":0},"IsVisible":true,"IsRemovable":true}]},"BMSCriteria":null,"GeographicCriteria":null},"FeatureSettings":[{"Feature":"has_feedback","Value":"false"},{"Feature":"has_bms_match_score","Value":"false"},{"Feature":"has_bms_match_score_for_products","Value":"false"},{"Feature":"has_bms_analytical_data","Value":"false"},{"Feature":"has_sanctions","Value":"false"},{"Feature":"has_photos","Value":"false"},{"Feature":"has_preferred","Value":"false"},{"Feature":"has_hiq","Value":"false"},{"Feature":"has_hiq_data_values","Value":"false"},{"Feature":"has_cms","Value":"false"},{"Feature":"has_leapfrog","Value":"false"},{"Feature":"has_pcqr","Value":"false"},{"Feature":"has_compare","Value":"true"},{"Feature":"has_keep_current_selections","Value":"true"},{"Feature":"has_ncqa","Value":"false"},{"Feature":"has_vcard","Value":"true"},{"Feature":"has_maps","Value":"true"},{"Feature":"has_mass_transit","Value":"true"},{"Feature":"has_refer_a_friend","Value":"true"},{"Feature":"has_custom_provider_attributes","Value":"true"},{"Feature":"has_out_of_network","Value":"false"}],"SelectedProductHasBMS":true,"NavigationId":'bf3d2aa4-e401-463f-a509-e6ba011851f6',"SiteSettings":[{"Key":"DefaultMileage","Value":"10"},{"Key":"MaxSearchDistance","Value":"100"},{"Key":"ProvidersPerPage","Value":"20"},{"Key":"MaxProvidersForPaging","Value":"1000"},{"Key":"SuppressFacetFillThreshold","Value":"10"},{"Key":"min_compare","Value":"2"},{"Key":"max_compare","Value":"3"},{"Key":"product_selection_visible","Value":"true"},{"Key":"product_selection_required","Value":"true"},{"Key":"show_specialty_definitions","Value":"false"},{"Key":"show_condition_definitions","Value":"false"},{"Key":"UseLocalProviderPhotos","Value":"true"},{"Key":"OnlineDirectoryThreshold","Value":"500"},{"Key":"FaxDirectoryThreshold","Value":"500"},{"Key":"EmailDirectoryThreshold","Value":"15000"},{"Key":"FilterDetailsByProduct","Value":"true"},{"Key":"FilterDetailsByRole","Value":"false"},{"Key":"FilterDetailsBySpecialty","Value":"false"},{"Key":"harvest_provider_data","Value":"true"}]}}

I'm not sure how to process this, especially because the parameters seem to be grouped in some cases (e.g. "SearchCriteria":{"FacetCriteria":{"FacetSelectionCriteria":[{"FacetSelection":{"FacetDescriptorName":"Network","SelectedItems":[{"ItemName":"PPO","ItemIds":["3465438"],"ItemCount":0}]).

Any help deciphering this would be appreciated.

Error

Thanks, Scott.

I set up the session as you described using the script and substituting the session variables for the IDs. However, when I run the page, I get an error message ("An Error Occurred While Processing Your Request") instead of the results I need.

Is there some other setting I need to use?

jclerie,

You should just be able to pass that entire block of code through the POST payload using the scrapeableFile.setRequestEntity() method.

Even though I'm referring to the POST payload, I don't mean to suggest that you would place your code under the Parameters tab of your scrapeable file. Rather, you'll want to create a script that makes the method call. "POST payload" is simply the generic term for data that is passed between the browser (or screen-scraper, in this case) and the web server you're scraping, but that is not in the query string.

You'll likely need to pass dynamic variables in, as well. Here's a truncated example of what it might look like in your script.

// This example is only a part of the entire code block
// you should be passing in your request.

scrapeableFile.setRequestEntity("\"facetNames\":[\"Rating\",\"PCQR\",\"HospitalQuality\",\"HospitalCondition\"],\"context\":{\"ClientId\":\"\",\"SiteId\":10020,\"MasterSiteId\":10020,\"SessionMode\":0,\"SiteLanguage\":'en-US',\"VisitorGuid\":'" + session.getv("VisitorGuid") + "',\"SessionGuid\":'" + session.getv("SessionGuid") + "',\"NavigationHistory\"");

Note how each of the double quotes in the code block has been escaped with a backslash. Also note the references to the two session variables, VisitorGuid and SessionGuid.

Do something similar by extracting the values of variables that you need to pass from a previous scrapeable file's response or as input from, say, your database or a CSV file. Then, call your script "Before file is scraped" on the scrapeable file that needs the data passed to it.
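If escaping every quote by hand gets error-prone, one option is to keep the captured request as a template and swap the dynamic values in. What follows is only a stand-alone sketch in plain Java: the {{Name}} placeholder syntax, the PayloadSketch class and the fill helper are all made up for illustration and are not part of screen-scraper, and the template is a shortened fragment of the request above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-alone sketch: fills dynamic values into a request-body template.
class PayloadSketch {

    // Replaces each {{Name}} placeholder with its value.
    // The {{...}} syntax is hypothetical, chosen only for this example.
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened fragment of the captured request, with the two GUIDs
        // replaced by placeholders.
        String template = "{\"facetNames\":[\"Rating\",\"PCQR\"],"
                + "\"context\":{\"SiteId\":10020,"
                + "\"VisitorGuid\":'{{VisitorGuid}}',"
                + "\"SessionGuid\":'{{SessionGuid}}'}}";

        // Sample values; in a real scrape these would come from an earlier response.
        Map<String, String> values = new LinkedHashMap<>();
        values.put("VisitorGuid", "896d0443-4030-4bae-8600-67f66ceaadd5");
        values.put("SessionGuid", "cfe717ba-b3d8-4fb3-94f5-c09a83665beb");

        System.out.println(fill(template, values));
    }
}
```

Inside a screen-scraper script the values would presumably come from session.getv("VisitorGuid") and session.getv("SessionGuid"), and the filled-in string would be what you hand to scrapeableFile.setRequestEntity().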

-Scott

John,

There are a few required techniques that are unique to scraping .NET sites. Here is a blog entry I wrote on scraping .NET sites.

This particular site isn't too particular, though. Just pay close attention to the referer for each of your scrapeable files and be sure to pass the VIEWSTATE properly each time.

And...for the AJAX, be sure to add something like this prior to scraping each AJAX scrapeable file.

scrapeableFile.setContentType("application/json; charset=utf-8");
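For anyone who wants to see what this amounts to on the wire, here is a rough sketch of the same kind of request built with plain Java's java.net.http (Java 11 or later), outside screen-scraper. The JSON travels as the request body rather than in the query string, and the Content-Type header tells the server to read it as JSON. The endpoint URL and the AjaxRequestSketch class are placeholders of mine; the real AJAX URL is whatever your proxy session recorded. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Stand-alone sketch: builds (but does not send) a JSON POST request.
class AjaxRequestSketch {

    // Hypothetical endpoint, used only for illustration.
    static final String ENDPOINT = "http://example.com/ajax-endpoint";

    // The JSON goes in the request body, not in the query string.
    static HttpRequest build(String jsonBody) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Tiny stand-in body; a real one would be the full captured request.
        HttpRequest request = build("{\"facetNames\":[\"Rating\"]}");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("Content-Type").orElse("(none)"));
    }
}
```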

Let us know if you run into any snags.

-Scott

Parameters

Thanks, Scott.

The one missing piece is adding parameters. Even when I add scrapeable files for pages that include parameters (e.g. Get First Tier Filters and Add Facet Selections), I still get the same result set.

John,

These can be tricky sites to scrape. Send me an email if you would like me to give you a free quote on what we would charge to do the work for you.

-Scott
scottw [at] ourdomain