Execute page JavaScript before file is scraped
Is there a way to tell screen-scraper to execute all page javascript before scraping the file, much like a browser would?
The problem is that I have a number of sites using the same framework, Taleo, which builds the page from a JavaScript function called at the end of the page. My instincts tell me that this is purely an anti-scraper countermeasure. I've tried to find an "accessible" or scriptless version of the same page - to no avail.
It's not an onLoad or a button click (I tried playing with the Actions system in screen-scraper, to no avail - I may just be using it wrong), just an inline script that sets up the page and basically inserts the data from a nondescript array into the main page data. The JavaScript is not completely unparseable, but parsing it directly is so fragile that it's barely feasible, and I shudder at the number of man-hours that would be wasted if they changed just the ordering of the variables, for instance.
Thanks!
We've played with some things
We've played with some things that would run JavaScript, but generally haven't found a good means to do it.
I rather doubt that the JavaScript you're seeing is there specifically to hinder scraping. Odds are they are using AJAX/HTML5 things to make the site dynamic and exciting, and it just looks like it's intentionally hard to read.
Odds are you will need to proxy the site and look for subsequent HTTP requests/responses as the page loads, since the JavaScript will be making little requests for its content. If you're lucky, the subsequent responses will be in JSON or XML and prove easier to scrape than most HTML.
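To illustrate the idea outside of screen-scraper, here is a rough Python sketch of what you might do once the proxy log shows a follow-up request that returns JSON. Everything specific in it is an assumption made for the example: the ENDPOINT URL is a placeholder, and the "jobs", "title" and "location" keys are invented. Taleo's real responses may look quite different, so treat this as a starting point only.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the request URL observed in the proxy log.
ENDPOINT = "https://example.com/careers/search.json"


def fetch_body(url, timeout=30):
    """Request the URL directly, much as the page's own script would."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def extract_jobs(body):
    """Return (title, location) pairs from a JSON body.

    The "jobs", "title" and "location" keys are assumed for illustration;
    use whatever keys the captured response actually contains.
    """
    payload = json.loads(body)
    return [(job.get("title"), job.get("location")) for job in payload.get("jobs", [])]


# Canned body standing in for a response captured through the proxy.
sample = '{"jobs": [{"title": "Engineer", "location": "Provo"}]}'
rows = extract_jobs(sample)
```

With a real capture you would call extract_jobs(fetch_body(ENDPOINT)) instead of using the canned sample. Reading fields by key rather than by position is what makes this less fragile than parsing the inline script: a reordering of the data on their end should not break it.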