Interpreted Java and Regular Expressions

I have finally made the leap from vbScript to interpreted Java, but I am having a problem with one of my scripts. Can anyone help?

I am applying a function to extracted data before saving it to a database, with the intention of removing unwanted (but not all) HTML tags and attributes. The regex works fine in vbScript but not in Java, so I assume I am missing something basic. I used the help file here to work out what I needed to escape, but have probably made a hash of it!

Here's my code:

        CleanHTML(strText)
        {
                // First remove the html tags
                // This is the vbScript version of the regex "<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>";
                strRegEx = "<[/]\\?(font\\|span\\|xml\\|del\\|ins\\|[ovwxp]:\\\\w+)[^>]\\*\\?>";
                strResult = strText.replace(strRegEx, "");
                // Now remove attributes
                // This is the vbScript version of the regex strRegEx = "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>";
                strRegEx = "<([^>]\\*)(\\?:class\\|lang\\|style\\|size\\|face\\|[ovwxp]:\\\\w+)=(\\?:'[^']\\*'|\'[^\']\\*\'|[^\\\\s>]+)([^>]\\*)>";
                //strRegEx = "(\\S+)=[\"']?((?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?";
                strResult = strResult.replace(strRegEx,"$1");
                return strResult;
        }

Also, can anyone tell me why SS doesn't like me using the word function before declaring the method, e.g. "function CleanHTML(strText) ".

Apologies if this post appears twice - I tried to post it earlier but I can't see it.

Regular Expressions in Java

The reason you can't use the function keyword when declaring a method is that Interpreted Java doesn't have one. When interpreting the Java we default to BeanShell, whose syntax is very similar to Java's, so a method is declared with just its name and parameter list, as in your script.

As for the regular expression, it looks like you escaped the regex special characters, which turns them into literal characters. In the first regular expression you have "<[/]\\?(font\\|"... Everywhere you have "\\" in the string, the regex engine sees a single "\". So the regex that actually does the matching looks like "<[/]\?(font\|"..., and it would only match literal text that looks like "</?font|"... instead of treating ? and | as regex operators.

The same regex you used in the vbScript should work in Interpreted Java, but you'll need to change how it is escaped inside the string literal. vbScript uses "" to represent a single " inside a string, while Java uses \". Java also uses the backslash as its string escape character, so a regex escape such as \w has to be written as \\w.
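To make the escaping point concrete, here is a small sketch in plain Java rather than the original script. The class name EscapeDemo and the helper found() are made up for illustration, and the patterns are shortened versions of the ones in the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class EscapeDemo {
    // Over-escaped, as in the question: the regex engine receives <[/]\?(font\|span)
    static final String OVER_ESCAPED = "<[/]\\?(font\\|span)";
    // Intended: ? and | keep their regex meaning
    static final String INTENDED = "<[/]?(font|span)";
    // vbScript writes a quote inside a string as ""; a Java literal writes it as \"
    static final String QUOTED_VALUE = "=\"[^\"]*\"";

    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(OVER_ESCAPED, "<font size=2>"));   // false
        System.out.println(found(INTENDED, "<font size=2>"));       // true
        System.out.println(found(OVER_ESCAPED, "</?font|span"));    // true
        System.out.println(found(QUOTED_VALUE, "class=\"intro\"")); // true
    }
}
```

The over-escaped pattern only finds the literal text "</?font|span", which is why it never matched a real tag.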

I've modified your code so it should work in Java. I think it works the same as you were expecting, but it is possible I missed something. Also note that I used the replaceAll method rather than replace: replaceAll treats its first argument as a regular expression, whereas replace looks for that exact text. Please let me know if any of the changes don't make sense.

        CleanHTML(strText)
        {
                // First remove the html tags
                // This is the vbScript version of the regex "<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>";
                strRegEx = "<[/]?(font|span|xml|del|ins|[ovwxp]:\\w+)[^>]*>";
                strResult = strText.replaceAll(strRegEx, "");

                // Now remove attributes
                // This is the vbScript version of the regex strRegEx = "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>";
                // Group 1 is the text before the attribute, group 2 the attribute name, group 3 the rest of the tag
                strRegEx = "<([^>]*)(?:\\s(class|lang|style|size|face|[ovwxp]:\\w+))=(?:'[^']*'|\"[^\"]*\"|[^\\s>]+)([^>]*)>";
                //strRegEx = "(\\S+)=[\"']?((?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?";
                strResult = strResult.replaceAll(strRegEx,"<$1$3>");
                return strResult;
        }
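On the switch from replace to replaceAll, a minimal sketch of the difference in plain Java. The class ReplaceDemo and its two helper methods are hypothetical names used only for this illustration.

```java
// Hypothetical demo class contrasting String.replace with String.replaceAll.
final class ReplaceDemo {
    static final String TAG_REGEX = "<[/]?(font|span)[^>]*>";

    // replace() searches for the pattern text itself, character for character.
    static String withReplace(String html) {
        return html.replace(TAG_REGEX, "");
    }

    // replaceAll() compiles its first argument as a regular expression.
    static String withReplaceAll(String html) {
        return html.replaceAll(TAG_REGEX, "");
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial\">Hello</font>";
        System.out.println(withReplace(html));    // <font face="Arial">Hello</font>
        System.out.println(withReplaceAll(html)); // Hello
    }
}
```

Because the input never contains the pattern text itself, replace leaves it untouched, which matches what the original script was doing.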

Thanks, that has done the trick. There were a few other tweaks I had to make but I was getting completely mixed up with my regex!

Here's the final function, in case it's of use to anyone.

        CleanHTML(strText)
        {
                // First remove the html tags
                strRegEx = "<[/]?(img|iframe|noframe|font|span|xml|del|ins|[ovwxp]:\\w+)[^>]*?>";
                strResult = strText.replaceAll(strRegEx, "");
                // Now remove attributes
                // Group 1 is the text before the attribute, group 2 the rest of the tag after it
                strRegEx = "<([^>]*)(?:[\\s]class|[\\s]lang|[\\s]style|[\\s]size|[\\s]face|[ovwxp]:\\w+)=(?:'[^']*'|\"[^\"]*\"|[^\\s>]+)([^>]*)>";
                // Write both groups back so attributes after the removed one are kept
                strResult = strResult.replaceAll(strRegEx,"<$1$2>");
                return strResult;
        }
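For anyone trying this outside the scraper, here is a rough standalone sketch in plain Java. It reuses the two patterns from the final function; the class name HtmlCleaner, the clean() method and the sample input are assumptions made for the example. Two things are additions rather than part of the posted function: it writes back both captured groups ("<$1$2>") so that whatever follows the removed attribute is kept, and it repeats the attribute pass until nothing changes, because the pattern matches a whole tag and so removes only one attribute per tag on each pass.

```java
// Hypothetical standalone version of the cleaning function; names are illustrative.
final class HtmlCleaner {
    static final String TAG_REGEX =
        "<[/]?(img|iframe|noframe|font|span|xml|del|ins|[ovwxp]:\\w+)[^>]*?>";
    static final String ATTR_REGEX =
        "<([^>]*)(?:[\\s]class|[\\s]lang|[\\s]style|[\\s]size|[\\s]face|[ovwxp]:\\w+)"
        + "=(?:'[^']*'|\"[^\"]*\"|[^\\s>]+)([^>]*)>";

    static String clean(String html) {
        // First remove the unwanted tags entirely.
        String result = html.replaceAll(TAG_REGEX, "");
        // Then remove unwanted attributes, one per tag per pass, until stable.
        String previous;
        do {
            previous = result;
            result = result.replaceAll(ATTR_REGEX, "<$1$2>");
        } while (!result.equals(previous));
        return result;
    }

    public static void main(String[] args) {
        String html = "<p class=\"intro\" style=\"color:red\" id=\"x\">"
            + "<font face=\"Arial\">Hi</font> <o:p>there</o:p></p>";
        System.out.println(clean(html)); // <p id="x">Hi there</p>
    }
}
```

A regex is only a rough tool for HTML, so this is best treated as a cleanup for known, fairly regular input rather than a general sanitizer.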