Scraping Ajax inner HTML
I'm trying to scrape a site whose second menu fetches its options via Ajax, based on the selection in the preceding drop-down. I've read the earlier posts on the subject, but I don't really understand how to capture the responses in a scrapeable file. This is what the Ajax part looks like:
//If ajax support
if(ajax){
//leave only one element as option, exclude the rest
document.formSearch.districtList.options.length = 1;
idOpcao = document.getElementById("opcoes");
ajax.open("POST", "district.php", true);
ajax.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
ajax.onreadystatechange = function() {
//while processing... display message
if(ajax.readyState == 1) {
idOpcao.innerHTML = "Loading...";
}
//after processing - call function processXML that will fill the data
if(ajax.readyState == 4 ){
if(ajax.responseXML){
processXML(ajax.responseXML);
}else{
//If it's not an XML file, display the message below
idOpcao.innerHTML = "Select City first";
}
}
}
//get city code
var params = "city="+value;
//alert(params);
ajax.send(params);
}
}
function processXML(obj){
//get the city tag
var dataArray = obj.getElementsByTagName("district");
var sessionDistrict = "";
//total of elements in the city tag
if(dataArray.length > 0) {
//loop over the XML to extract the data
for(var i = 0 ; i < dataArray.length ; i++) {
var item = dataArray[i];
//content in the XML file
var codigo = item.getElementsByTagName("code")[0].firstChild.nodeValue;
var descricao = item.getElementsByTagName("description")[0].firstChild.nodeValue;
idOpcao.innerHTML = "Select a district";
//create a new option dynamically
var novo = document.createElement("option");
//set an ID attribute
novo.setAttribute("id", "option");
//Begin selection of last search
if( descricao == sessionDistrict ){
novo.setAttribute("selected", "selected");
}
//End selection of last search
//set the value
novo.value = codigo;
//set the display text
novo.text = descricao;
//add the new element
document.formSearch.districtList.options.add(novo);
}
}
else {
//if the XML returns empty, display the message below
idOpcao.innerHTML = "Select city first";
}
}
}
----
And that's it. Any leads on how I can set up a scrapeable file given the above logic?
Best,
Johan
That's a pretty intense batch of code.
Can you use the screen-scraper proxy to capture the HTTP request this code makes? If not, sometimes an alternate proxy like Charles or HTTPFox can show it to you. Once you can see the request, you should be able to make a scrapeable file to emulate it.
~Jason
Jason,
Thanks for the answer. I did use the proxy to capture the HTTP request and create a scrapeable file, as described in my reply to Tim below. All works well until I try to POST any of the menu options that contain special characters.
"Tim,
I did as you suggested and created a scrapeable file where I post the city name to the PHP script that the Ajax call triggers, which returns the districts beautifully. However, as soon as my darlings - the special chars - showed up, the responses ceased. As I understand it, SS uses UTF-8 encoding when posting to a URL? Seems like this Ajax creature does not like that. This is what a successful response looks like:
HTTP/1.1 200 OK
Date: Thu, 16 Jul 2009 16:09:01 GMT
Content-type: application/xml; charset=iso-8859-1
Server: Microsoft-IIS/6.0
Content-Length: 290
X-Powered-By: ASP.NET
Connection: close
Is it possible to configure SS to post using ISO-8859-1, or should I try another trick?"
And this is where I am stuck.
Any leads?
Yes, you can change the character set if that is what the site needs. You likely just need to override the character set in a script using http://community.screen-scraper.com/API/addHTTPHeader
I saw that script and it looks interesting. Unfortunately I am running the Pro and not Enterprise edition which is required to use it. Is there a workaround?
For starters, I would let the whole process happen in my browser with the SS proxy turned on. Ajax requests will actually show up in your proxy as separate transactions, and you can make a scrapeable file out of them as well. Ajax requests are usually pretty simple, so you can often just change a parameter on the Ajax request to be the value scraped from the preceding drop-down.
I usually experiment with the SS proxy running, so that I can see which requests get triggered by these Ajax drop-downs. Once you figure out which actions fire Ajax requests, you can get down to business.
Hope that helps,
Tim
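To make the "value scraped from the preceding drop-down" idea concrete, here is a rough sketch in plain JavaScript (not screen-scraper's own API) of pulling the code/description pairs out of the kind of XML that processXML expects. Only the district, code, and description tag names come from the posted code; the sample document, its root element, and the function name extractDistricts are made up for illustration.

```javascript
// Hypothetical sample of what district.php might return. Only the
// <district>, <code>, and <description> tag names come from processXML;
// the root element and the values are invented for this sketch.
const sampleXml =
  '<?xml version="1.0" encoding="ISO-8859-1"?>' +
  '<districts>' +
  '<district><code>12</code><description>Centro</description></district>' +
  '<district><code>47</code><description>Bela Vista</description></district>' +
  '</districts>';

// Collect every code/description pair found in the response text.
function extractDistricts(xml) {
  const districts = [];
  const pattern =
    /<district>\s*<code>([^<]*)<\/code>\s*<description>([^<]*)<\/description>\s*<\/district>/g;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    districts.push({ code: match[1], description: match[2] });
  }
  return districts;
}

console.log(extractDistricts(sampleXml));
```

Each extracted code could then be fed into whatever request follows, in the same way the browser would submit the selected option.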
Tim,
I did as you suggested and created a scrapeable file where I post the city name to the PHP script that the Ajax call triggers, which returns the districts beautifully. However, as soon as my darlings - the special chars - showed up, the responses ceased. As I understand it, SS uses UTF-8 encoding when posting to a URL? Seems like this Ajax creature does not like that. This is what a successful response looks like:
HTTP/1.1 200 OK
Date: Thu, 16 Jul 2009 16:09:01 GMT
Content-type: application/xml; charset=iso-8859-1
Server: Microsoft-IIS/6.0
Content-Length: 290
X-Powered-By: ASP.NET
Connection: close
<?xml version="1.0" encoding="ISO-8859-1"?>
Is it possible to configure SS to post using ISO-8859-1, or should I try another trick?
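
For reference, the difference between the two encodings shows up in the POST body itself. The sketch below is plain JavaScript, not a screen-scraper call, and assumes the server expects the city parameter percent-encoded as ISO-8859-1 bytes (which the response's charset suggests but does not prove). The function name encodeLatin1FormValue and the city value are hypothetical.

```javascript
// Percent-encode a form value as ISO-8859-1 bytes instead of UTF-8.
// Assumption: every character fits in one Latin-1 byte (code <= 0xFF).
function encodeLatin1FormValue(value) {
  let out = '';
  for (const ch of String(value)) {
    const code = ch.codePointAt(0);
    if (code > 0xff) {
      throw new RangeError('Not representable in ISO-8859-1: ' + ch);
    }
    if (/[A-Za-z0-9*\-._]/.test(ch)) {
      out += ch; // safe characters pass through unchanged
    } else if (ch === ' ') {
      out += '+'; // form encoding writes a space as "+"
    } else {
      out += '%' + code.toString(16).toUpperCase().padStart(2, '0');
    }
  }
  return out;
}

const city = 'S\u00e3o Paulo'; // hypothetical city value ("São Paulo")
console.log('city=' + encodeLatin1FormValue(city)); // city=S%E3o+Paulo
console.log('city=' + encodeURIComponent(city));    // city=S%C3%A3o%20Paulo (UTF-8)
```

If the POST captured in the proxy shows the two-byte form (%C3%A3) where the browser sends the one-byte form (%E3), that would point to the encoding as the reason the responses stop.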