Java XML processing with JAXB and special characters

When processing XML, it is important to understand that there are two kinds of special characters:
  • Characters that are permitted, but must either be escaped or inside a CDATA block.
  • Control characters that are not permitted at all.

Characters that fall into the first category are:

  • & – can be replaced by &
  • < - can be replaced by &lt;
  • > – can be replaced by &gt;
  • ” – can be replaced by &quot;
  • ‘ – can be replaced by &apos;

If you are creating XML with Java and JAXB, you do not need to do anything to escape these characters, it will be done automatically.

What about control characters that are not allowed at all? The specification is here: https://www.w3.org/TR/REC-xml/#NT-Char

The range of valid characters is defined as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 
This means that all of the first 32 ASCII characters are prohibited, except:
  • 9 = horizontal tab
  • xA = 10 = line feed
  • xD = 13 = carriage return

Bizarrely, JAXB will allow you to generate XML even when your input contains control characters. Any control character will hit the default escape mechanism and be converted to its hex representation. e.g. control character 2 becomes:

&#2;

This means you can generate XML with JAXB that you cannot read in with JAXB, as these characters will cause a hard failure!!! How can you deal with this? If you Google this problem, you will find a number of posts about writing your own escape handler. For example: https://stackoverflow.com/questions/4435934/handling-xml-escape-characters-e-g-quotes-using-jaxb-marshaller

I investigated this and found two important points:

Firstly, I found two implementations of JAXB on my classpath, both in the com.sun package. One was the “internal” one, the other was the Reference Implementation (RI). This is important because when trying to override the escape handler, you need specify the correct package in the property name, and obviously extend the correct class. If you aren’t picking up the RI by default, you can force this by setting a startup property:

javax.xml.bind.context.factory=com.sun.xml.bind.v2.ContextFactory

Secondly, custom escape handlers won’t be invoked at all if you call the method

marshal( Object jaxbElement, javax.xml.transform.Result result )
While the most common use case for creating XML involves calling marshal, we are also applying an XSLT, so we call the above method. This uses a completely different writer which does not use the standard escape handler! Hence I had to look for another place to intercept the character stream. We were creating a transformer like this:
SAXTransformerFactory stf = new com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl();
TransformerHandler transformerHandler = stf.newTransformerHandler();

We cannot alter the real transformer, but we can wrap it in another transformer, that can strip the control characters, so what I wrote is below. In this implementation, I have only blocked the low value ASCII control characters, as those are the ones actually appearing in our input data. It would easy to extend the method to block the high range characters if you want to.

import org.xml.sax.*;

import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.sax.TransformerHandler;
import java.util.HashSet;
import java.util.Set;

public class ControlStrippingTransformerHandler implements TransformerHandler {

    private TransformerHandler transformerHandler;

    public ControlStrippingTransformerHandler(TransformerHandler transformerHandler) {
        this.transformerHandler = transformerHandler;
    }

    // this is the only method we need to override!!
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // remove dodgy control characters
        // for performance, we only want to create a new array if there are characters to strip
        // we only want to initialise the set of characters if there are any
        Set<Integer> charactersToStrip = null;
        for (int i = start; i < start + length; i++) {
            if (isControlCharacter(ch[i])) {
                if (charactersToStrip == null) {
                    charactersToStrip = new HashSet<>();
                }
                charactersToStrip.add(i);
            }
        }
        if (charactersToStrip != null && charactersToStrip.size() > 0) {
            // this array only needs to be the specific section of the input string
            // at most this array will be 255 characters in length, since the
            // input SAX class has a 256 character array
            char[] newArray = new char[length - charactersToStrip.size()];
            int newArrayIndex = 0;
            for (int i = start; i < start + length; i++) {
                if (!charactersToStrip.contains(i)) {
                    newArray[newArrayIndex] = ch[i];
                    newArrayIndex++;
                }
            }
            ch = newArray;
            start = 0;
            length = length - charactersToStrip.size();
        }

        transformerHandler.characters(ch, start, length);
    }

    private boolean isControlCharacter(char c) {
        if (c < 9 || c == 11 || c == 12 || (c > 13 && c < 32)) {
            return true;
        }
        return false;
    }

    // other methods just call the wrapped transformer methods
}
This entry was posted in Java, XML and tagged , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

HTML tags are not allowed.

515,023 Spambots Blocked by Simple Comments