Need help with scraping this page

Hi,

I want to extract the URLs of the individual details pages for all the companies listed when I search for a particular company at http://www.cro.ie/search/. For example, say I search for 'european investment corporation'. Seven companies are listed, and each one is a link to a page that gives details about that particular company. What I want is the URL of that details page. The way the URL is constructed makes it very hard for me to get it. First off, there is a cookie to set, but this can be done. Then there are a bunch of JavaScript functions that have a lot to do with random number generation. These make further calls to other functions which obfuscate the parameters further. Is there any way I can work around these?

Thanks and regards,
hemanth

Need help with scraping this page

Hi Frisket,
I have to agree with you here. There is no need to invite trouble by leaving messy footprints around. I think I will put in the 'code' parameter after all, because it is easy, especially since the JavaScript code is within the HTML page itself. Meanwhile I would love to have a look at that test HTML page that prints out the 'code' value. My email address is phemanth at gmail dot com. Thanks a lot for the effort.

Regards,
hemanth

Need help with scraping this page

Well, gosh, THAT'S no fun.

I dunno, if they were willing to go to the trouble of putting in the obfuscation mechanism, which they undoubtedly seem to be checking on the server side, it might set off a few alarms with them if a bunch of requests without the obfuscation code suddenly start coming in. Personally, I would want my scraping sessions to leave as delicate a footprint as possible. Just a personal opinion, mind you.

I mean, they didn't put the JavaScript code in as a separate include file, did they? My understanding of external JavaScript files is that they are not so easily looked at as HTML pages with the JavaScript embedded in them. I wouldn't want to do anything to motivate them to make life any more difficult for me than necessary.

And there you have it.

If you would like to see my little test HTML page that uses their own logic and prints out the obfuscated code value, just let me know.

Frisket

Need help with scraping this page

Hi Frisket,

Thanks a lot for your reply. It works fine now. But I also figured out that we don't need any of the obfuscation either. Just capturing the parameter called 'number', inserting it in the details page URL, and leaving the value of the 'code' parameter empty also works. The only difference is a small message on the page saying "An error has occurred on the server". I can live with that, because the rest of the details display just fine. :)
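In case it helps anyone else, here is a rough Python sketch of what I mean by leaving 'code' empty. The base URL below is only a placeholder I made up (it is not the real details page address), and `build_details_url` is just a name I picked for illustration. The only things taken from the site are the parameter names 'number' and 'code'.

```python
from urllib.parse import urlencode

# Placeholder address, NOT the real details page URL.
DETAILS_URL = "http://example.invalid/details"


def build_details_url(number, code=""):
    """Build a details URL carrying 'number' and a (possibly empty) 'code'."""
    # urlencode keeps parameters with empty values, so 'code=' stays in the URL.
    query = urlencode({"number": number, "code": code})
    return f"{DETAILS_URL}?{query}"


# '123456' is a made-up company number used only for this example.
print(build_details_url("123456"))
```

The 'code' parameter is still present in the query string, it just has no value.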

Thanks and regards,
hemanth

Need help with scraping this page

Excuse me...

as POST variables and it SHOULD work. Just use their own obfuscation code against 'em.
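For what it's worth, here is a small Python sketch of sending the two values back as POST variables. It only builds the request and does not send anything. The URL and both values are placeholders of my own, and `build_post_request` is a made-up helper name; only the field names "number" and "code" come from their page.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, number, code):
    """Build (but do not send) a POST request carrying 'number' and 'code'."""
    # Form-encode the two fields the same way a browser form submission would.
    body = urlencode({"number": number, "code": code}).encode("ascii")
    return Request(url, data=body, method="POST")


# Placeholder URL and values, for illustration only.
request = build_post_request("http://example.invalid/details", "123456", "PLACEHOLDER")
print(request.get_method(), request.data)
```

Passing the resulting object to `urllib.request.urlopen` would actually send it, ideally with the session cookie attached as well.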

Frisket

Need help with scraping this page

Yep. That seems to work. They just take a random number between 0 and 11 and fiddle it to death. Copying their own code into your own page SEEMS to work just fine. Just send "number" and "code" back as POST

Need help with scraping this page

Well, I'm new at this scraping stuff myself...but...

Here's how I'd look at it if I were trying to do this: all of their coding stuff is just programmatic gyrations to obfuscate. You can SEE the JavaScript they're using to obfuscate. Therefore you can COPY it. Use a scrapable file to capture the plug-replaceable bits (like the 'JNOPJMQPGNPOE4' and the 'var sessConcatenated = "408253700"' when I ran it), and then run a script that is a copy of their own code, copied and pasted into the script. Screen-Scraper allows you to run JavaScript, so use their own obfuscating JavaScript against 'em. Don't know that it'd work, but I'd start there.
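To show what I mean by capturing a plug-replaceable bit, here is a rough sketch in Python rather than a Screen-Scraper extractor pattern. The sample HTML is a stand-in I made up; only the `var sessConcatenated = "408253700"` line is what I actually saw on their page. The regular expression and the function name are my own guesses at something workable, not anything taken from their code.

```python
import re

# Made-up stand-in for the results page; only the sessConcatenated line
# matches what appeared on the real page.
SAMPLE_HTML = """
<script type="text/javascript">
var sessConcatenated = "408253700";
</script>
"""

# Capture whatever sits between the double quotes of the assignment.
SESS_RE = re.compile(r'var\s+sessConcatenated\s*=\s*"([^"]*)"')


def extract_sess_concatenated(html):
    """Return the value assigned to sessConcatenated, or None if absent."""
    match = SESS_RE.search(html)
    return match.group(1) if match else None


print(extract_sess_concatenated(SAMPLE_HTML))
```

The captured value would then be plugged into the copy of their obfuscation script before running it.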