Strip HTML
Hi,
Some of the content I'm looking to scrape contains legacy HTML. Does anyone know of a bit of Java that I can add to a script to strip it out?
Some of the HTML is basic stuff, but some looks to be old MS Word tags, and there are occasionally some font tags.
rgds/alex
Strip HTML
Scott,
I'm having trouble with comments. Any ideas?
E.g., I'm trying to remove the likes of
or
print '< DMY '.$fromday.' '.$frommon.' '.$fromyear.'';
include "calendar.inc";
} ?-->
Strip HTML
Works wonderfully, thank you.
Strip HTML
Thanks, I'll give this a go.
I've only got the pro version at the moment, as most of the scraping I'm doing is one-off or infrequent, to generate XML data files.
I now know how to add new regexes to the library, but a larger initial library of filters would be a nice addition.
I've found the tool to be very useful and your help even more so.
Thanks again
Alex
Strip HTML
Alex,
Both ways will work to call a function.
Again, if you're running the enterprise edition you can easily convert HTML entities into ASCII by checking the "Convert HTML entities" box under the advanced tab for any given token.
Otherwise, give this a try.
{
if (value != null)
{
//Strip all HTML tags except for formatting tags
value = value.replaceAll(",", "\\,");
value = value.replaceAll("\"", "\'");
value = value.replaceAll("<ol[^<>]*>", "ol_open_!HOLD!");
value = value.replaceAll("<li[^<>]*>", "li_!HOLD!");
value = value.replaceAll("</ol>", "ol_close_!HOLD!");
value = value.replaceAll("<ul[^<>]*>", "ul_open_!HOLD!");
value = value.replaceAll("</ul>", "ul_close_!HOLD!");
value = value.replaceAll("<p[^<>]*>", "p_open_!HOLD!");
value = value.replaceAll("</p>", "p_close_!HOLD!");
value = value.replaceAll("<br/>", "br_!HOLD!");
value = value.replaceAll("
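For anyone following along, the placeholder idea in the script above can be sketched roughly as below. The script is cut off in this thread, so the strip and restore steps are an assumption about where it was heading, not Scott's actual code. The class and method names are hypothetical, and only a few of the formatting tags are shown.

```java
// A sketch only: FormatPreservingStripper and stripExceptFormatting are
// hypothetical names, not part of any scraping tool's API.
class FormatPreservingStripper {
    // Marker suffix borrowed from the script above; it must never occur in real content.
    private static final String HOLD = "_!HOLD!";

    static String stripExceptFormatting(String value) {
        if (value == null) {
            return null;
        }
        // 1. Swap the tags we want to keep for plain-text placeholders.
        //    "(\\s[^<>]*)?" stops <p ...> from also matching <pre> or <param>.
        value = value.replaceAll("(?i)<p(\\s[^<>]*)?>", "p_open" + HOLD);
        value = value.replaceAll("(?i)</p>", "p_close" + HOLD);
        value = value.replaceAll("(?i)<br\\s*/?>", "br" + HOLD);

        // 2. Remove every tag that is still left (font tags, Word markup, etc.).
        value = value.replaceAll("<[^<>]*>", "");

        // 3. Turn the placeholders back into clean, attribute-free tags.
        value = value.replace("p_open" + HOLD, "<p>");
        value = value.replace("p_close" + HOLD, "</p>");
        value = value.replace("br" + HOLD, "<br/>");
        return value;
    }
}
```

Restoring the placeholders with String.replace rather than replaceAll avoids having to escape anything, since replace treats both arguments as literal text.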
Hidden characters
Scot,
Do you also have a helpful script for stripping non-visible characters?
Alex
Strip HTML
Scot,
Thank you very much for this, but can I ask you one more question? I'm starting from a very low level with the scripting. How do I pass the content I scrape through this function?
is it
or
myVariable = fixstring(session.getVariable("myVariable"))
Alex
Strip HTML
Alex,
If you want to remove the tags completely and you're using the enterprise edition, you can make short work of it by enabling the "Strip HTML" feature for each token under its advanced tab.
For basic and professional, you'll need to write a function in one of your scripts that gets called (typically) just prior to writing out the data.
{
if (value != null)
{
value = value.replaceAll("\"", "\'");
value = value.replaceAll("&amp;", "&");
value = value.replaceAll("
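Since the function above is cut off in this thread, here is a minimal, self-contained sketch of the same kind of helper. It is not the original script: the class and method names are hypothetical, and the list of entities is deliberately short. It removes HTML comments first (including ones that wrap PHP or span several lines), then any remaining tags, then decodes a few common entities.

```java
// A sketch only: SimpleHtmlStripper and stripHtml are hypothetical names.
class SimpleHtmlStripper {
    static String stripHtml(String value) {
        if (value == null) {
            return null;
        }
        // Remove comments first. (?s) lets "." match line breaks, and the
        // reluctant ".*?" stops at the first "-->" so text between two
        // separate comments is kept.
        value = value.replaceAll("(?s)<!--.*?-->", "");

        // Remove any remaining tags, e.g. <font ...> or old Word markup.
        value = value.replaceAll("<[^<>]*>", "");

        // Decode a few common entities. &amp; goes last so that an escaped
        // entity such as "&amp;lt;" ends up as "&lt;" rather than "<".
        value = value.replace("&nbsp;", " ");
        value = value.replace("&lt;", "<");
        value = value.replace("&gt;", ">");
        value = value.replace("&quot;", "\"");
        value = value.replace("&amp;", "&");
        return value.trim();
    }
}
```

A regex approach like this is fine for tidying scraped fragments, but it is not a real HTML parser, so badly malformed markup may still slip through.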