Ignoring javascript in scraping session, sanitising output of javascript

Hi,

I know now how to ignore/sanitise html from my scraping session output. The thread to this is here: http://community.screen-scraper.com/node/2286

How can I do the same for javascript, i.e. ignore/sanitise it in the output too, so that I am left with just the text and no code from the webpage?

Basically, my scrapes have no html in them but still lots of javascript etc.

Thanks

It sounds like you're dealing

It sounds like you're dealing with a page that may not have any HTML. JavaScript may be rendering the data from JSON, XML, or some other source. If you proxy the page and use the "find" to locate one of the results, that should help you see where the data comes from.

The page is HTML

Hi Jason,

Thanks. The page is a standard HTML page, and I am scraping all of the data in one go from < to because there are about 3000 pages and they all have different html codes, so setting up sub extractor patterns is not going to work.

The javascript is in the head, and in different places in the body of the html. I was hoping that, as in the thread I quoted, there would be a way to remove javascript like I remove html.

Basically I am getting the results I want but there is just a bit of a mess with the unwanted javascript in the results.

The 3000 files are stored locally, and I'm just getting SS to scrape off the local urls in my folder.

Any ideas to help?

Could you use a replaceAll

Could you use a replaceAll like

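content = scrapeableFile.getContentAsString();
// (?is) = case-insensitive, and "." also matches line breaks, so multi-line script blocks are caught
content = content.replaceAll("(?is)<script.*?</script>", " ");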
?
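
For what it's worth, here is the same replaceAll idea as a small standalone Java sketch, so it can be tried outside screen-scraper first. The class and method names (ScriptStripper, stripScripts) and the sample HTML are made up for illustration. The one thing worth noting is the (?is) prefix: by default "." does not match line breaks in Java regexes, so script blocks that span several lines would otherwise be left in. A regex like this is only a rough filter and assumes every script block has a closing tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of screen-scraper.
class ScriptStripper {
    // (?i) ignores case (<SCRIPT>), (?s) lets "." match line breaks.
    // .*? is reluctant, so each block ends at the nearest </script>.
    private static final Pattern SCRIPT_BLOCK =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    // Replaces every script block with a single space.
    static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Made-up sample: one multi-line block in the head, one inline in the body.
        String html = "<head><SCRIPT type=\"text/javascript\">\nvar a = 1;\n</SCRIPT></head>"
                + "<body>Hello<script>alert('x');</script>world</body>";
        System.out.println(stripScripts(html));
        // prints: <head> </head><body>Hello world</body>
    }
}
```

The HTML tags are still there afterwards, so this would run before whatever you already use to strip the html.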