1. A sample program making use of XXE native DOM

This sample program:

  1. loads an XHTML file;

  2. traverses the loaded document searching for h1, h2, h3 headings;

  3. adds an empty <a name="tocentryNNN"/> to each of these headings;

  4. for each of the traversed headings, adds an indented line containing <a href="#tocentryNNN">text of the heading</a> to the div that will be used as a Table of Contents (TOC);

  5. inserts the div used as a TOC as the first child of body;

  6. saves the modified document to disk.

Excerpts from AddTOC.java:

public class AddTOC {
    private static final Name BODY = Name.get(Namespace.XHTML, "body"); // (1)
    private static final Name DIV = Name.get(Namespace.XHTML, "div");
    private static final Name H1 = Name.get(Namespace.XHTML, "h1");
    private static final Name H2 = Name.get(Namespace.XHTML, "h2");
    private static final Name H3 = Name.get(Namespace.XHTML, "h3");
    private static final Name A = Name.get(Namespace.XHTML, "a");
    private static final Name BR = Name.get(Namespace.XHTML, "br");

    private static final Name CLASS = Name.get("class");
    private static final Name NAME = Name.get("name");
    private static final Name HREF = Name.get("href");

    private static final class Info {
        public int headingCount;
        public Element toc;
        public Element body;
    }

    public static void processDocument(Document doc) {
        final Info info = new Info();

        Element b = new Element(Name.get(Namespace.XHTML, "b"));
        b.putAttribute(CLASS, "toctitle");
        b.appendChild(new Text("Contents"));

        info.toc = new Element(DIV);
        info.toc.putAttribute(CLASS, "toc");
        info.toc.appendChild(b);

        Traversal.traverse(doc.getRootElement(), new Traversal.HandlerBase() { // (2)
            public Object enterElement(Element element) {
                Name name = element.getName();

                if (name == H1 || name == H2 || name == H3) {
                    processHeading(element, info);
                    return Traversal.LEAVE_ELEMENT;
                } else {
                    if (name == BODY) {
                        info.body = element;
                    }
                    return null;
                }
            }
        });

        if (info.body != null) {
            info.toc.appendChild(new Element(BR));
            info.toc.appendChild(new Element(Name.get(Namespace.XHTML, "hr")));

            add(info.body, info.toc);
        }
    }

    private static void processHeading(Element heading, Info info) {
        String id = "tocentry" + info.headingCount++;

        Element target = new Element(A);
        target.putAttribute(CLASS, "tocentry");
        target.putAttribute(NAME, id);

        add(heading, target);

        Traversal.TextGrabber grabber = new Traversal.TextGrabber(); // (3)
        Traversal.traverse(heading, grabber);
        String headingText =
            XMLText.collapseWhiteSpace(grabber.grabbed.toString()); // (4)

        Element link = new Element(A);
        link.putAttribute(HREF, "#" + id);
        link.appendChild(new Text(headingText));

        int indentation;
        Name headingName = heading.getName();
        if (headingName == H1) {
            indentation = 4;
        } else if (headingName == H2) {
            indentation = 8;
        } else {
            indentation = 12;
        }
        StringBuilder indent = new StringBuilder();
        while (indentation > 0) {
            indent.append('\u00A0'); // &nbsp;
            --indentation;
        }
        
        info.toc.appendChild(new Element(BR));
        info.toc.appendChild(new Text(indent.toString()));
        info.toc.appendChild(link);
    }

    private static void add(Element parent, Element added) { // (5)
        Name addedName = added.getName();
        String addedClass = added.getAttribute(CLASS);

        boolean replaced = false;
        loop: for (Node child = parent.getFirstChild();
                   child != null;
                   child = child.getNextSibling()) {
            switch (child.getType()) {
            case TEXT:
            case COMMENT:
            case PROCESSING_INSTRUCTION:
                break;
            case ELEMENT:
                {
                    Element element = (Element) child;

                    if (element.getName() == addedName &&
                        addedClass.equals(element.getAttribute(CLASS))) {
                        parent.replaceChild(element, added);
                        replaced = true;
                        break loop;
                    }
                }
                break;
            }
        }

        if (!replaced) {
            parent.insertChild(parent.getFirstChild(), added);
        }
    }

Names and namespaces

1

Element and attribute names are not plain strings, they are Name objects. A Name is the aggregation of a Namespace object and a string local part.

Name.get("body") is equivalent to Name.get(Namespace.NONE, "body"). Namespace.NONE is used to specify absence of namespace. Other commonly used namespaces are defined as constants, for example: Namespace.XML (that is, http://www.w3.org/XML/1998/namespace).

Names and namespaces are managed as symbols in a symbol table. For example, it is not possible to invoke new Name(new Namespace("http://foo.com"), "bar") to get a name with "http://foo.com" as its namespace URI and "bar" as its local name. To do this, invoke Name.get(Namespace.get("http://foo.com"), "bar").

Because of this, names and namespaces can be compared for equality using == rather than using equals.
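The symbol-table idea can be sketched in a few lines of plain Java. This is only an illustration of the interning technique, not the XXE implementation: the class InternedName, its fields and its key layout are all hypothetical. The constructor is private, so get() is the only way to obtain an instance, and it always returns the same instance for the same namespace URI and local part.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a symbol-table-managed name; NOT the XXE Name class.
final class InternedName {
    // The symbol table: one entry per distinct (namespace URI, local part) pair.
    private static final Map<List<String>, InternedName> TABLE =
        new ConcurrentHashMap<>();

    final String namespaceUri; // "" stands for "no namespace" in this sketch
    final String localPart;

    // Private: instances can only be obtained through get().
    private InternedName(String namespaceUri, String localPart) {
        this.namespaceUri = namespaceUri;
        this.localPart = localPart;
    }

    // Returns the unique instance for this (namespace URI, local part) pair,
    // creating and registering it on first use.
    static InternedName get(String namespaceUri, String localPart) {
        return TABLE.computeIfAbsent(
            List.of(namespaceUri, localPart),
            key -> new InternedName(namespaceUri, localPart));
    }

    // Shorthand for a name without a namespace.
    static InternedName get(String localPart) {
        return get("", localPart);
    }
}
```

With such a table, two lookups of the same name yield the same object, which is why a test like name == H1 is safe in the AddTOC sample.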

Document nodes

5

A document is composed of Nodes: Text, Comment, ProcessingInstruction, Element, DocumentTypeDeclaration, Document. Notice that a Document is itself a Node. Document and Element are Trees, that is, Node containers.

Attributes are not Nodes. Attribute is just a simple data structure which groups together the attribute name, the attribute value and the element having the attribute. This simple data structure is mainly used by the Iterator returned by Element.getAttributes.

Function add() in the AddTOC sample shows how an Element can be used. This function inserts element added as the first child of element parent. If parent already contains a child element with the same element name and the same class attribute value as added, added replaces this child element.

The for loop shows how to enumerate the child Nodes of a Tree. The switch construct shows how to test the type of a Node. Note that in production code, it would have been simpler to test if a node is an Element by writing if (node instanceof Element).

Element has many convenience functions to access its attributes or child nodes, for example: getIntAttribute(name, min, max, fallback) or getChildElement(index).

Document traversal

2

Traversal is a set of utility functions that can be used to traverse a Tree in both directions (Traversal.traverse, Traversal.traverseBackwards, etc.) or to traverse document nodes after or before a given node (Traversal.traverseAfter, Traversal.traverseBefore, etc.).

During the traversal, Traversal functions notify a Traversal.Handler which must implement: processText, processComment, processPI, enterElement, leaveElement.

Traversal.HandlerBase can be used as the base class of a handler if most notification methods are not useful.

Document traversal can be controlled by returning a value from notification methods. Return null to continue traversal. Return an Object to stop traversal and to get this Object as the result of the traversal (imagine a document traversal used to implement “find something”). Return special value Traversal.LEAVE_ELEMENT to continue traversal after skipping the element being traversed.

In the AddTOC example, Traversal.LEAVE_ELEMENT is used to skip useless traversal of h1, h2 and h3 headings.
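The three possible return values can be illustrated with a miniature, self-contained traversal. Everything in this sketch (MiniTraversal, Elem, Handler) is hypothetical and much simpler than the real Traversal class; it only mirrors the contract described above: null continues, the LEAVE_ELEMENT sentinel skips the children of the current element, and any other object stops the traversal and becomes its result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature tree and traversal; NOT the XXE Traversal class.
final class MiniTraversal {
    // Sentinel meaning "skip the children of this element, then continue".
    static final Object LEAVE_ELEMENT = new Object();

    // A minimal element: a name and a list of child elements.
    static final class Elem {
        final String name;
        final List<Elem> children = new ArrayList<>();

        Elem(String name, Elem... kids) {
            this.name = name;
            for (Elem kid : kids) {
                children.add(kid);
            }
        }
    }

    interface Handler {
        Object enterElement(Elem element);
    }

    // Depth-first traversal. Returns the first value returned by the handler
    // that is neither null nor LEAVE_ELEMENT, or null if nothing stopped it.
    static Object traverse(Elem element, Handler handler) {
        Object result = handler.enterElement(element);
        if (result == LEAVE_ELEMENT) {
            return null; // children skipped; the caller moves on to the next sibling
        }
        if (result != null) {
            return result; // stop the whole traversal with this result
        }
        for (Elem child : element.children) {
            Object found = traverse(child, handler);
            if (found != null) {
                return found;
            }
        }
        return null;
    }
}
```

A "find something" traversal is then a handler that returns the element it is looking for, for example element -> element.name.equals("p") ? element : null.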

3

You do not always need to define your own Traversal.Handler. Class Traversal contains many predefined, ready-to-use, Traversal.Handlers for simple tasks. Traversal.TextGrabber used in the AddTOC example is one of them. You'll also find Traversal.TextNodeFinder, Traversal.NodeMatcher, etc.

4

XMLText contains a lot of utility functions related to lexical aspects of XML. It defines functions that trim whitespace, that escape and unescape XML text and attribute values, that escape and unescape URIs, etc.
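To make the whitespace handling used for heading text concrete, here is a minimal sketch of whitespace collapsing. It assumes that collapsing means what the XML Schema "collapse" facet means: leading and trailing XML whitespace is removed and each inner run is replaced by a single space. The class WhitespaceSketch and its collapse method are hypothetical; the exact behavior of XMLText.collapseWhiteSpace may differ.

```java
// Hypothetical helper; NOT the XXE XMLText class.
final class WhitespaceSketch {
    // XML whitespace characters: space, tab, line feed, carriage return.
    private static boolean isXmlSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    // Strips leading and trailing XML whitespace and replaces each inner run
    // of XML whitespace by a single space character.
    static String collapse(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean pendingSpace = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isXmlSpace(c)) {
                // Remember the run; it is ignored if no text precedes it.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        // A pending run at the end is trailing whitespace and is dropped.
        return out.toString();
    }
}
```

Note that the non-breaking space U+00A0 used for the TOC indentation is not XML whitespace, so a function written this way leaves it untouched.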

Loading and saving a document

    public static void main(String[] args) 
        throws IOException {
        if (args.length != 2) {
            System.err.println(
                "usage: java AddTOC in_xhtml_file out_xhtml_file");
            System.exit(1);
        }
        String inFileName = args[0];
        String outFileName = args[1];

        Document doc = LoadDocument.load(new File(inFileName)); // (1)

        AddTOC.processDocument(doc);

        SaveDocument.save(doc, new File(outFileName)); // (2)
    }

1

The document is loaded using the high-level document loader LoadDocument. There is also a low-level document loader, DocumentLoader, which is used to implement LoadDocument.

Both document loaders automatically add some properties to the loaded document. Example: a NamespacePrefixMap as the value of property NAMESPACE_PREFIX_MAP_PROPERTY. (See node properties below.)

Both loaders are XML catalog aware. Note that in build.xml we use system property xml.catalog.files to specify to these loaders which catalogs to use. This can also be done programmatically using XMLCatalogs.

However, there are many advantages to using LoadDocument rather than using DocumentLoader. The two main advantages are:

  • It systematically adds a DocumentType to the loaded document. This object is the value of property DOCUMENT_TYPE_PROPERTY.

  • Using DocumentType, it can intelligently strip ignorable whitespace from the loaded document.

2

The document is saved using the high-level document writer SaveDocument. There are also low-level document writers, DocumentWriter and DocumentIndenter, which are used to implement SaveDocument.

Node properties

Any XML node can have application-level properties. These properties are generally added by document loaders at load time but nothing prevents a Java™ developer from adding their own properties at any time.

What follows is a comparison between element attributes and properties.

Attribute | Property
----------|---------
Part of document content. | Not part of document content.
User can edit attributes. | User cannot edit properties.
Can be loaded and saved to disk as XML. | Transient.
XML nodes other than elements cannot have attributes. | Any XML node can have properties.
An attribute name is a Name. An attribute value is a string. | A property name is also a Name. A property value is an Object.
Views are notified when attributes are changed by means of an AttributeEvent. | Views are also notified when properties are changed by means of a PropertyEvent.

Properties used to implement XXE have their names defined as constants in com.xmlmind.xml.doc.Constants and in com.xmlmind.xmledit.edit.Constants.
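The difference between attributes and properties can be sketched with a toy element model. The class SketchElement below is hypothetical and unrelated to the real Element class: it uses plain strings as names instead of Name objects and ignores events. It only shows the two points that matter most in practice: property values are arbitrary Objects, and serialization writes attributes but never properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical element model; NOT the XXE Element class.
final class SketchElement {
    final String name;

    // Attributes: string values, part of the document content.
    final Map<String, String> attributes = new LinkedHashMap<>();

    // Properties: arbitrary objects, application-level and transient.
    final Map<String, Object> properties = new LinkedHashMap<>();

    SketchElement(String name) {
        this.name = name;
    }

    // Serializes the element as an empty tag. Only attributes are written;
    // properties never reach the saved document.
    String toXml() {
        StringBuilder out = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> attribute : attributes.entrySet()) {
            out.append(' ').append(attribute.getKey())
               .append("=\"").append(escape(attribute.getValue())).append('"');
        }
        return out.append("/>").toString();
    }

    // Escapes the characters that are not allowed in a double-quoted attribute value.
    private static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }
}
```

For instance, after attributes.put("class", "toc") and properties.put("loadedAt", 42L) on a div, toXml() contains the class attribute but no trace of loadedAt.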