Strip HTML

Hi,

Some of the content I'm looking to scrape has some legacy HTML. Does anyone know of a bit of Java that I can add to a script to strip this out?

Some of the HTML is basic stuff, but some looks to be old MS Word tags, and there are occasionally some font tags.

rgds/alex

Strip HTML

Scott,

I'm having trouble with comments, any ideas?

e.g. trying to remove the likes of

<!--if (!navigator.cookieEnabled) {  document.write(' Cookies must be enabled in order to login ');}//-->

or

<!--? if ($_GET['lid'] == 4099) {
print '<  DMY '.$fromday.' '.$frommon.' '.$fromyear.'';
  include "calendar.inc";
} ?-->
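
For comments like the two above, a single reluctant regex is usually enough. This is only a minimal sketch in plain Java, not something taken from the tool's own library: the class name `CommentStripper` and the method name `stripComments` are made up for illustration. It does not depend on what is inside the comment, so the stray `<` in the second example is not a problem.

```java
import java.util.regex.Pattern;

final class CommentStripper {
    // (?s) turns on DOTALL so '.' also matches line breaks.
    // .*? is reluctant, so each match stops at the first "-->" it finds
    // instead of swallowing everything up to the last comment on the page.
    private static final Pattern HTML_COMMENT = Pattern.compile("(?s)<!--.*?-->");

    // Hypothetical helper: removes every <!-- ... --> block from the value.
    static String stripComments(String value) {
        if (value == null) {
            return null;
        }
        return HTML_COMMENT.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "before<!--if (!navigator.cookieEnabled) {\n"
                + "  document.write('Cookies must be enabled');}//-->after";
        System.out.println(stripComments(html));
    }
}
```

Run this before any tag-stripping replacements, so that markup hidden inside a comment never reaches them.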

Strip HTML

Works wonderfully, thank you.

Strip HTML

Thanks I'll give this a go.

I've only got the pro version at the moment as most of the scraping I'm doing is one off / infrequent to generate XML data files.

I now know how to add new regex to the library but a larger initial library of filters would be a nice addition.

I've found the tool to be very useful and your help even more so.

Thanks again

Alex

Strip HTML

Alex,

Both ways will work to call a function.
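
To make the two call styles concrete, here is a rough sketch that runs outside the tool. `FakeSession` is a hypothetical stand-in for the scraper's `session` object (only `getVariable` appears in this thread; `setVariable` is assumed here), and the `fixString` body is a simplified placeholder, not the full function.

```java
import java.util.HashMap;
import java.util.Map;

final class CallFixStringDemo {
    // Hypothetical stand-in for the scraper's session object,
    // included only so this sketch compiles and runs on its own.
    static final class FakeSession {
        private final Map<String, Object> vars = new HashMap<>();
        Object getVariable(String name) { return vars.get(name); }
        void setVariable(String name, Object value) { vars.put(name, value); }
    }

    // Simplified placeholder for the cleaning function discussed in this thread.
    static String fixString(String value) {
        if (value != null) {
            value = value.replaceAll("\"", "'");
        }
        return value;
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession();
        session.setVariable("myVariable", "say \"hi\"");

        // Style 1: clean a local variable that already holds the text.
        String myVariable = (String) session.getVariable("myVariable");
        myVariable = fixString(myVariable);

        // Style 2: pass the session value straight into the function.
        String direct = fixString((String) session.getVariable("myVariable"));

        // Store the cleaned text back so later steps see it.
        session.setVariable("myVariable", myVariable);
        System.out.println(myVariable.equals(direct));
    }
}
```

Either way, the cleaned value only sticks if it is assigned back to the variable or stored again in the session.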

Again, if you're running enterprise edition you can easily convert HTML entities into ASCII by checking the Convert HTML entities box under the advanced tab for any given token.

Otherwise, give this a try.

String prepareStringForOutput( String value )
{
        if (value != null)
        {
                //Strip all html tags except for formatting tags
                value = value.replaceAll(",", "\\,");
                value = value.replaceAll("\"", "\'");
                value = value.replaceAll("<ol[^<>]*>", "ol_open_!HOLD!");
                value = value.replaceAll("<li[^<>]*>", "li_!HOLD!");
                value = value.replaceAll("</ol>", "ol_close_!HOLD!");
                value = value.replaceAll("<ul[^<>]*>", "ul_open_!HOLD!");
                value = value.replaceAll("</ul>", "ul_close_!HOLD!");
                value = value.replaceAll("<p[^<>]*>", "p_open_!HOLD!");
                value = value.replaceAll("</p>", "p_close_!HOLD!");
                value = value.replaceAll("<br/>", "br_!HOLD!");
                value = value.replaceAll("

Hidden characters

Scot,

Do you also have a helpful script for stripping non-visible characters?

Alex
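
One possible approach, sketched in plain Java: remove Unicode control and format characters with regex character classes, and turn non-breaking spaces into ordinary spaces. The class and method names are made up for illustration, and which characters count as "non-visible" is an assumption here; adjust the classes to suit the data.

```java
final class HiddenCharStripper {
    // Hypothetical helper: removes characters that normally render as nothing.
    static String stripHidden(String value) {
        if (value == null) {
            return null;
        }
        // \p{Cf} = Unicode "format" characters, e.g. zero-width space (U+200B)
        // and the byte-order mark (U+FEFF).
        value = value.replaceAll("\\p{Cf}", "");
        // \p{Cc} = control characters; the intersection keeps tab, LF and CR
        // so that line structure survives.
        value = value.replaceAll("[\\p{Cc}&&[^\\t\\n\\r]]", "");
        // A non-breaking space looks like a space but does not compare equal to one.
        return value.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        System.out.println(stripHidden("price:\u00A0100\u200B"));
    }
}
```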

Strip HTML

Scot,

Thank you very much for this, but can I ask you one more question? I'm starting from a very low level with the scripting. How do I pass the content I scrape through this function?

Is it

myVariable = fixString(myVariable)

or

myVariable = fixString(session.getVariable( "myVariable" ))

Alex

Strip HTML

Alex,

If you want to remove the tags completely and you're using the enterprise edition, you can make short work of it by enabling the "Strip HTML" feature for each token under its advanced tab.

For basic and professional, you'll need to write a function in one of your scripts that gets called (typically) just prior to writing out the data.

String fixString(String value)
{
        if (value != null)
        {
                value = value.replaceAll("\"", "\'");
                value = value.replaceAll("&amp;", "&");
                value = value.replaceAll("