where to scrape ajax innerHTML?
I'm trying to scrape a branch locator site that returns results using AJAX to replace the contents of a span tag
https://www.bankofthewest.com/BOW/assets/vcmStaticContent/BOW%20Internet...
Any clues where I should begin?
TIA
Eric
Can screen-scraper process dynamic HTML?
I know I'm reviving a pretty old comment thread here but I'm not entirely clear on the last comment.
Can screen-scraper process dynamic HTML? Which is to say, when screen-scraper requests a page and that page contains JavaScript that dynamically puts HTML into the DOM, can that dynamic HTML be parsed by screen-scraper?
If so, any general tips on how to do that?
Thanks
I'll be honest, I'm having
I'll be honest, I'm having trouble using proxies against that site. A really solid proxy program called "Charles proxy" just times out while trying to monitor the site, and the screen-scraper proxy isn't yielding any of the information that you're after. The google-maps gadget isn't really important, I don't think. It's really just a question of assembling a URL that can query the website's database.
The reason you might be
The reason you might be having trouble proxying this site could be due to the number of requests being made by different domain/sub-domains using SSL. I was able to successfully proxy it using screen-scraper and Firefox 2 (FF3 limits access), but was prompted to confirm the normal secure certificate mis-match message multiple times.
In the resulting transaction log you'll find the following URL.
https://www.bankofthewest.com/BOW/assets/vcmStaticContent/BOW%20Internet%20Contents/GoogleMaps/inc/branches.js
It contains all of the results available for the address searched (regardless of the number of results you may have set on the form in your browser).
Although this URL does not require any GET or POST parameters to be passed, it does require the appropriate referrer. So, you'll need to first request the following URL (containing your address query) as a separate scrapeable file immediately before calling the branches.js URL.
https://www.bankofthewest.com/BOW/assets/vcmStaticContent/BOW%20Internet%20Contents/GoogleMaps/locator.html?frmMain.x=0&frmMain.y=0&address=95691
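For anyone who wants to try the same two-step sequence outside of screen-scraper, here is a rough sketch in plain Java. The class and method names (BranchFetch, locatorUrl, branchesUrl, get) are made up for illustration, and it assumes the Referer header is the only thing the server checks; the site may also want cookies or other headers, which this does not handle.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper class, not part of screen-scraper.
class BranchFetch {
    static final String BASE =
        "https://www.bankofthewest.com/BOW/assets/vcmStaticContent/BOW%20Internet%20Contents/GoogleMaps/";

    // Builds the locator.html URL for an address or zip code query.
    static String locatorUrl(String address) {
        try {
            return BASE + "locator.html?frmMain.x=0&frmMain.y=0&address="
                + URLEncoder.encode(address, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java runtime.
            throw new IllegalStateException(e);
        }
    }

    // URL of the JavaScript file that holds the branch data.
    static String branchesUrl() {
        return BASE + "inc/branches.js";
    }

    // Issues a GET request, optionally sending a Referer header,
    // and returns the response body as text.
    static String get(String url, String referer) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (referer != null) {
            conn.setRequestProperty("Referer", referer);
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        }
    }
}
```

Calling get(locatorUrl("95691"), null) and then get(branchesUrl(), locatorUrl("95691")) mirrors the two scrapeable files described above.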
-Scott
not working
In my scraping session I call locator.html with the zip code and then call branches.js, but I get the entire file back, not just the nearby branches.
Hm. Well it seems to me that
Hm. Well it seems to me that their website is relying on that file to calculate the nearby ATMs. I don't think they're using a simple radius from the Zipcode center because when I search my old zipcode in the state of Kentucky, it shows a few results from way out in other states, since there aren't any nearby locations to KY.
My gut tells me that you'll likely have to rely on your own algorithm to get what you want out of that JS file. You could stick an extractor pattern on that JS file (as a scrapeableFile, of course) which saves the name of the location, the lon/lat, whatever info you need. The JS is saving it into an array, and I think you'll have to do the same.
If you write up your big extractor pattern, you can then make a script to run "After pattern is applied" (not after each pattern application) which processes your DataSet:
// Interpreted Java
import java.util.Hashtable;

int numRecords = dataSet.getNumDataRecords();
// Guard against an empty DataSet before reading the first record.
int numVarsPerRecord = (numRecords > 0) ? dataSet.getDataRecord(0).size() : 0;
String[][] branches = new String[numRecords][numVarsPerRecord];
for (int i = 0; i < numRecords; i++)
{
    // 'location' will be a Hashtable object, where each variable
    // from the extractor pattern is a key.
    Hashtable location = dataSet.getDataRecord(i);
    // You know what the keys are, so just start yanking data out of
    // the 'location' variable and setting it in our array 'branches'.
    // get() returns an Object, so cast each value to String.
    branches[i][0] = (String) location.get("NAME");
    branches[i][1] = (String) location.get("URL");
    branches[i][2] = (String) location.get("etc, etc...");
}
// Now you can process those entries however you want... by distance from its long/lat, etc.
// You could save this String[][] array into a session variable for later use if you want.
Really, it's up to you how you want to filter that array into the desired results. Granted, this becomes your own algorithm, and not the one behind the curtain that they're using. *shrug*
Hope this is helping. If you want to try to get the array sorted right there in the script, you could import java.util.Arrays and then make a little class object which implements Comparator, so that you can use the following line of code:
Arrays.sort(branches, myComparatorClassObject);
where 'myComparatorClassObject' is an instantiation of the custom class I just mentioned.
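As a rough idea of what that little class might look like, here is a sketch in ordinary compiled Java. Arrays.sort takes the array first and the Comparator second. The class name BranchNameComparator and the helper sortedByName are made up, and it assumes column 0 of each row holds the branch name, as in the loop above. The interpreter inside screen-scraper may need the generics stripped out.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical comparator: orders branch rows by name (column 0), ignoring case.
class BranchNameComparator implements Comparator<String[]> {
    public int compare(String[] a, String[] b) {
        return a[0].compareToIgnoreCase(b[0]);
    }

    // Returns a sorted copy of the outer array so the original row order is left untouched.
    static String[][] sortedByName(String[][] branches) {
        String[][] copy = branches.clone();
        Arrays.sort(copy, new BranchNameComparator());
        return copy;
    }
}
```

Sorting by a different column, or by a computed distance, is just a matter of changing what compare() looks at.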
And, out of left field, you could choose to do this in a completely different programming language, like Python, whose capabilities are much more powerful and concise when it comes to scripting. (I personally don't like Java very much because it makes such basic tasks so amazingly complicated, when compared to other scripted languages.)
Either way you choose, it looks to me like you'll have to manually figure out a little radial distance thing to determine which results are closest to the zip code. If you can find the algorithm in the JS someplace, that's great, but otherwise, you could just hack one out to find the top 5 results by distance.
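A hacked-out version of that "radial distance thing" could look something like the sketch below. It uses the haversine great-circle formula, which is my own choice and not necessarily what the bank's site does. It assumes each row is {name, latitude, longitude} stored as strings, and that you already have a latitude/longitude for the zip code from somewhere else. The names NearestBranches, haversineKm and nearest are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper: ranks branch rows by distance from a reference point.
class NearestBranches {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in kilometres between two points given in degrees.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
            + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
            * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 so rounding error can never push asin() out of range.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // Returns up to 'limit' rows closest to (lat, lon).
    // Assumed row layout: {name, latitude, longitude}.
    static String[][] nearest(String[][] branches, final double lat, final double lon, int limit) {
        String[][] sorted = branches.clone();
        Arrays.sort(sorted, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Double.compare(dist(a), dist(b));
            }
            private double dist(String[] row) {
                return haversineKm(lat, lon,
                    Double.parseDouble(row[1]), Double.parseDouble(row[2]));
            }
        });
        return Arrays.copyOf(sorted, Math.min(limit, sorted.length));
    }
}
```

Calling nearest(branches, zipLat, zipLon, 5) would give the top 5 results by distance, where zipLat and zipLon stand for whatever coordinates you have for the searched zip code.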
You've given me something to think about
If this is the case, then I'm not sure what the advantages would be of using SS? I might as well write my own mobile Branch Locator.
thanks for your help
Eric
Some websites like this try
Some websites like this try to get extra tricky by calculating things server-side, without ever making obvious the method to invoke the calculation. Surely it happens someplace on the site, but this one has gone the extra mile to muddle up the process that we normally follow. I hope that you don't base your overall judgment of SS on this site and the way that it operates. In the end, this is precisely why SS empowers you with the ability to include custom scripts, to calculate things that are needed. For this situation, it is certainly *possible* to scrape the site the way you initially intended, but they're hiding so many parts of the site's operation that it's throwing darts in the dark to guess at what to do to harness that operation.
If it's of any use, I notice that the JS file you're getting saves the data into a variable named after a JSON object. JSON is a data format that sites often send as a custom request entity (the request body) instead of ordinary GET/POST parameters. SS has methods for setting a custom request entity, so once you find the syntax they're using for their JSON request object, you could set the request entity in a script run "Before file is scraped". The trick at hand is to figure out what the magical URL for that scrapeableFile would be. If you can find that, then you're more than half done. The request would return exactly the data that you want to scrape. It would likely be in XML format, making it extremely easy to parse.
They've likely obscured these details in JavaScript... I can't seem to figure it out yet, though.
thanks
This is tricky: the DOM source (available in the IE Developer Toolbar, etc.) has the info required, but the regular source accessible by SS does not :-(
The IE Developer Toolbar (and
The IE Developer Toolbar (and other realtime source monitors/debuggers) do show the right info, but if you use a normal IE, Firefox, Opera, Chrome, Safari (etc) installation and select its 'view source' option, you won't see the needed info either. Screen-scraper falls into this category of static source viewer. The only reason that IE Dev Toolbar can see the info you want is because it's a source "monitor", where it watches the javascript and dynamically adds/removes source info when javascript modifies it. Screen-scraper can't show you that info because screen-scraper isn't a javascript interpreting engine, and it can't show you changes made in the source by javascript because that is happening as a separate process from the actual HTTP request.
If you use Chrome's source monitor, Firefox's "Firebug" extension, or Opera's "Dragonfly" tool, you'd be able to see the needed info too. It's all about live monitoring of the source code, which screen-scraper won't do, since SS just makes the request for the file and doesn't process JavaScript.