Chapter 4. Programming the Document Object Model

Table of Contents

1. Names and Namespaces
2. Document nodes
3. Document traversal
4. Loading and saving a document
5. Tree properties

The XMLmind XML Editor (XXE) Document Object Model (DOM) is somewhat similar to, though in our opinion simpler than, W3C DOM or JDOM. This chapter describes how to program the XXE DOM, using AddTOC.java as an example.

This sample program:

  1. loads an XHTML file,

  2. traverses the loaded document searching for h1, h2, h3 headings,

  3. adds an empty <a name="tocentryNNN"/> to each of these headings,

  4. for each of the traversed headings, adds an indented line containing <a href="#tocentryNNN">text of the heading</a> to the div that will be used as a TOC,

  5. inserts the div used as a TOC as the first child of body,

  6. saves the modified document to disk.

import java.io.File;
import java.io.IOException;
import com.xmlmind.xmledit.xmlutil.*;
import com.xmlmind.xmledit.doc.*;
import com.xmlmind.xmledit.doctype.DocumentType;
import com.xmlmind.xmledit.edit.Loader;
import com.xmlmind.xmledit.edit.Formatter;

public class AddTOC {
    private static final Name BODY = Name.get("body"); // (1)
    private static final Name DIV = Name.get("div");
    private static final Name H1 = Name.get("h1");
    private static final Name H2 = Name.get("h2");
    private static final Name H3 = Name.get("h3");
    private static final Name A = Name.get("a");
    private static final Name BR = Name.get("br");

    private static final Name CLASS = Name.get("class");
    private static final Name NAME = Name.get("name");
    private static final Name HREF = Name.get("href");

    private static final class Info {
        public int headingCount;
        public Element toc;
        public Element body;
    }

    public static void processDocument(Document doc) {
        final Info info = new Info();

        Element b = new Element(Name.get("b"));
        b.putAttribute(CLASS, "toctitle");
        b.appendChild(new Text("Contents"));

        info.toc = new Element(DIV);
        info.toc.putAttribute(CLASS, "toc");
        info.toc.appendChild(b);

        Traversal.traverse(doc.getRootElement(), new Traversal.HandlerBase() { // (2)
            public Object enterElement(Element element) {
                Name name = element.getName();

                if (name == H1 || name == H2 || name == H3) {
                    processHeading(element, info);
                    return Traversal.LEAVE_ELEMENT;
                } else {
                    if (name == BODY)
                        info.body = element;
                    return null;
                }
            }
        });

        if (info.body != null) {
            info.toc.appendChild(new Element(BR));
            info.toc.appendChild(new Element(Name.get("hr")));

            add(info.body, info.toc);
        }
    }

    private static void processHeading(Element heading, Info info) {
        String id = "tocentry" + info.headingCount++;

        Element target = new Element(A);
        target.putAttribute(CLASS, "tocentry");
        target.putAttribute(NAME, id);

        add(heading, target);

        Traversal.TextGrabber grabber = new Traversal.TextGrabber(); // (3)
        Traversal.traverse(heading, grabber);
        String headingText =
            XMLUtil.collapseWhiteSpace(grabber.grabbed.toString()); // (4)

        Element link = new Element(A);
        link.putAttribute(HREF, "#" + id);
        link.appendChild(new Text(headingText));

        int indentation;
        Name headingName = heading.getName();
        if (headingName == H1)
            indentation = 4;
        else if (headingName == H2)
            indentation = 8;
        else
            indentation = 12;
        StringBuffer indent = new StringBuffer();
        while (indentation > 0) {
            indent.append('\u00A0'); // &nbsp;
            --indentation;
        }
        
        info.toc.appendChild(new Element(BR));
        info.toc.appendChild(new Text(indent.toString()));
        info.toc.appendChild(link);
    }

    private static void add(Element parent, Element added) { // (5)
        Name addedName = added.getName();
        String addedClass = added.getAttribute(CLASS);

        boolean replaced = false;
        loop: for (Node child = parent.getFirstChild();
                   child != null;
                   child = child.getNextSibling()) {
            switch (child.getNodeType()) {
            case Node.TEXT:
            case Node.COMMENT:
            case Node.PROCESSING_INSTRUCTION:
                break;
            case Node.ELEMENT:
                {
                    Element element = (Element) child;

                    // Guard against an added element without a class attribute.
                    if (element.getName() == addedName &&
                        addedClass != null &&
                        addedClass.equals(element.getAttribute(CLASS))) {
                        parent.replaceChild(element, added);
                        replaced = true;
                        break loop;
                    }
                }
                break;
            }
        }

        if (!replaced) 
            parent.insertChild(parent.getFirstChild(), added);
    }

1. Names and Namespaces

1

Element and attribute names are not Strings; they are Name objects. A Name is the aggregation of a Namespace object and a String local part.

Name.get("body") is equivalent to Name.get(Namespace.NONE, "body"). Namespace.NONE is used to specify absence of namespace. Other commonly used namespaces are defined as constants, for example: Namespace.XML (that is, http://www.w3.org/XML/1998/namespace).

Names and Namespaces are managed as symbols in a symbol table. For example, it is not possible to invoke new Name(new Namespace("http://foo.com"), "bar") to get a name with "http://foo.com" as its namespace URI and "bar" as its local name. To do this, invoke Name.get(Namespace.get("http://foo.com"), "bar").

Because of this, Names and Namespaces can be compared for equality using == rather than using equals.
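To see why == is safe for interned symbols, the following self-contained sketch mimics a symbol table in plain Java. It does not use the XXE classes: InternedName and its get methods are hypothetical stand-ins, and the real Name and Namespace classes may well be implemented differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an interned name; NOT the XXE Name class.
final class InternedName {
    // The symbol table: one entry per distinct (namespace URI, local part) pair.
    private static final Map<String, InternedName> TABLE = new HashMap<>();

    final String namespaceUri; // "" stands for "no namespace" in this sketch
    final String localPart;

    // Private constructor: instances can only be obtained through get().
    private InternedName(String namespaceUri, String localPart) {
        this.namespaceUri = namespaceUri;
        this.localPart = localPart;
    }

    // Returns the unique instance for this (namespace URI, local part) pair.
    static synchronized InternedName get(String namespaceUri, String localPart) {
        String key = "{" + namespaceUri + "}" + localPart;
        return TABLE.computeIfAbsent(
            key, k -> new InternedName(namespaceUri, localPart));
    }

    // Shorthand for a name without a namespace.
    static InternedName get(String localPart) {
        return get("", localPart);
    }
}

class InternedNameDemo {
    public static void main(String[] args) {
        InternedName a = InternedName.get("http://foo.com", "bar");
        InternedName b = InternedName.get("http://foo.com", "bar");
        System.out.println(a == b);                       // true
        System.out.println(a == InternedName.get("bar")); // false
    }
}
```

Because the constructor is private, two names with the same namespace URI and local part are always the same object, so a reference comparison is enough.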

2. Document nodes

5

A document is composed of Nodes: Text, Comment, ProcessingInstruction, Element, DocumentTypeDeclaration, Document. Notice that a document is itself a Node. Document and Element are Trees, that is, Node containers.

Attributes are not Nodes. Attribute is just a simple data structure which groups together the attribute name, the attribute value and the element having the attribute. This simple data structure is mainly used by the Enumeration returned by Element.getAttributes().

Function add() in the AddTOC sample shows how an Element node can be used. This function inserts Element added as the first child of Element parent. If parent already contains a child element with the same element name and the same class attribute value as added, added replaces this child element.

The for loop shows how to enumerate the child Nodes of a Tree. The switch construct shows how to test the type of a Node. Note that in production code, it would be simpler to test whether a node is an Element by writing if (node instanceof Element).

Element has many convenience functions to access its attributes or child nodes, for example: getIntAttribute(name, min, max, fallback) or getChildElement(index).
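The enumeration pattern described above (follow the first child, then each next sibling, and test the node type with instanceof) can be sketched without the XXE library. MiniNode, MiniText and MiniElement below are hypothetical classes written for this illustration only; they are not the XXE Node, Text and Element classes.

```java
// Hypothetical minimal node model; NOT the XXE Node/Element classes.
abstract class MiniNode {
    MiniNode nextSibling;
}

final class MiniText extends MiniNode {
    final String text;
    MiniText(String text) { this.text = text; }
}

final class MiniElement extends MiniNode {
    final String name;
    MiniNode firstChild;
    private MiniNode lastChild;

    MiniElement(String name) { this.name = name; }

    // Appends a child at the end of the sibling chain; returns this for chaining.
    MiniElement append(MiniNode child) {
        if (firstChild == null) firstChild = child;
        else lastChild.nextSibling = child;
        lastChild = child;
        return this;
    }

    // Returns the first child element having the given name, or null if none.
    MiniElement findChildElement(String childName) {
        for (MiniNode child = firstChild;
             child != null;
             child = child.nextSibling) {
            if (child instanceof MiniElement) { // skips text nodes
                MiniElement element = (MiniElement) child;
                if (element.name.equals(childName)) return element;
            }
        }
        return null;
    }
}

class MiniNodeDemo {
    public static void main(String[] args) {
        MiniElement body = new MiniElement("body")
            .append(new MiniText("\n"))
            .append(new MiniElement("h1"))
            .append(new MiniElement("div"));
        System.out.println(body.findChildElement("div").name); // div
        System.out.println(body.findChildElement("p"));        // null
    }
}
```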

3. Document traversal

2

Traversal is a set of utility functions that can be used to traverse a Tree in both directions (Traversal.traverse, Traversal.traverseBackwards, etc.) or to traverse document nodes after or before a given node (Traversal.traverseAfter, Traversal.traverseBefore, etc.).

During the traversal, Traversal functions notify a Traversal.Handler which must implement: processText, processComment, processPI, enterElement, leaveElement.

Traversal.HandlerBase can be used as the base class of a handler if most notification methods are not useful.

Traversal can be controlled by returning a value from notification methods. Return null to continue traversal. Return an Object to stop traversal and to get this Object as the result of the traversal (imagine traversal used to implement a find). Return special value Traversal.LEAVE_ELEMENT to continue traversal after skipping the Element being traversed.

In the AddTOC example, Traversal.LEAVE_ELEMENT is used to skip useless traversal of h1, h2 and h3 headings.
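The three possible return values (null to continue, a special marker to skip a subtree, any other Object to stop with a result) form a small protocol that can be sketched in plain Java. Elem, ElemHandler, Walker and Walker.SKIP are hypothetical names invented for this sketch; they only imitate the behavior described above and are not the XXE Traversal API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; NOT the XXE Traversal API.
final class Elem {
    final String name;
    final List<Elem> children = new ArrayList<>();

    Elem(String name, Elem... kids) {
        this.name = name;
        for (Elem kid : kids) children.add(kid);
    }
}

interface ElemHandler {
    // Return null to continue, Walker.SKIP to skip the element's
    // descendants, or any other object to stop with that result.
    Object enter(Elem element);
}

final class Walker {
    static final Object SKIP = new Object();

    // Depth-first traversal in document order. Returns the object that
    // stopped the traversal, or null if the whole tree was visited.
    static Object walk(Elem element, ElemHandler handler) {
        Object result = handler.enter(element);
        if (result == SKIP) return null;   // skip subtree, go on with siblings
        if (result != null) return result; // stop the whole traversal
        for (Elem child : element.children) {
            Object found = walk(child, handler);
            if (found != null) return found;
        }
        return null;
    }
}

class WalkerDemo {
    public static void main(String[] args) {
        Elem h2 = new Elem("h2");
        Elem root = new Elem("html",
            new Elem("body",
                new Elem("h1", new Elem("em")),
                new Elem("div", h2)));

        // Skip the content of h1, as AddTOC does for headings.
        List<String> visited = new ArrayList<>();
        Walker.walk(root, e -> {
            visited.add(e.name);
            return e.name.equals("h1") ? Walker.SKIP : null;
        });
        System.out.println(visited); // [html, body, h1, div, h2]

        // Use the traversal as a "find": stop at the first h2.
        Object found = Walker.walk(root, e -> e.name.equals("h2") ? e : null);
        System.out.println(found == h2); // true
    }
}
```

Returning the found node directly from the handler is what makes a "find" a one-line handler in this design.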

3

You do not always need to define your own Traversal.Handler. Class Traversal contains many predefined, ready-to-use Traversal.Handlers for simple tasks. Traversal.TextGrabber, used in the AddTOC example, is one of them. You'll also find Traversal.TextNodeFinder, Traversal.NodeMatcher, etc.

4

XMLUtil contains many utility functions related to the lexical aspects of XML. It defines functions that trim whitespace, that escape and unescape XML text and attribute values, that escape and unescape URIs, etc.
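As an illustration of what collapsing whitespace usually means for XML text, here is a minimal sketch. It assumes the conventional rule (leading and trailing XML whitespace removed, each inner run of space, tab, carriage return or line feed replaced by one space). WhiteSpaceSketch.collapse is a hypothetical helper, not the implementation of XMLUtil.collapseWhiteSpace, whose exact behavior may differ.

```java
// Hypothetical helper; NOT the XMLUtil implementation.
final class WhiteSpaceSketch {
    // Removes leading and trailing XML whitespace and replaces each
    // inner run of XML whitespace characters by a single space.
    static String collapse(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean pendingSpace = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                // Remember the run, except at the very beginning.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) out.append(' ');
                pendingSpace = false;
                out.append(c);
            }
        }
        // A trailing run is never appended, which trims the end.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("  Hello \n\t world  ") + "]"); // [Hello world]
    }
}
```

Note that the non-breaking space U+00A0 used by AddTOC for indentation is not XML whitespace, so this sketch leaves it untouched.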

4. Loading and saving a document

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println(
                "usage: java AddTOC in_xhtml_file out_xhtml_file");
            System.exit(1);
        }
        String inFileName = args[0];
        String outFileName = args[1];

        Loader docLoader = new Loader(); // (1)
        docLoader.setAddedProperties(0x0);
        Document doc = docLoader.load(inFileName);

        DocumentType docType = 
            (DocumentType) doc.getProperty(StandardProperty.DOCUMENT_TYPE); // (2)

        AddTOC.processDocument(doc);

        Formatter docWriter = new Formatter(docType); // (3)
        docWriter.writeDocument(doc, outFileName);
    }
}

1

Compared to low-level DocumentLoader, Loader has many advantages.

  • It systematically adds a DocumentType to the loaded document.

  • It can automatically add other application-level properties such as StyleSheetInfo, UndoManager, etc., to the loaded document (see Tree properties below).

  • Using DocumentType, it can intelligently strip ignorable whitespace from the loaded document.

Both loaders are XML catalog aware. Note that in build.xml we use system property xml.catalog.files to specify to these loaders which catalogs to use. This can also be done programmatically using XMLCatalogs.

3

Because the low-level DocumentWriter is not DocumentType aware, it cannot output indented XML. The AddTOC example therefore uses Formatter instead.

5. Tree properties

2

Trees, that is, Documents and Elements, can have application-level properties. These properties are generally added by document loaders at load time, but nothing prevents a programmer from adding and removing custom properties at any time.

What follows is a comparison between Element attributes and properties.

Attribute                                                     Property
------------------------------------------------------------  ------------------------------------------------------------
Part of document content.                                     Not part of document content.
User can edit attributes.                                     User cannot edit properties.
Can be loaded and saved to disk as XML.                       Transient.
Documents cannot have attributes.                             Documents and Elements can both have properties.
An attribute name is a Name. An attribute value is a String.  A property name is an Object. A property value is an Object.
Views are notified when attributes are changed.               Views are not notified when properties are changed.

Properties used to implement XXE have their names defined as constants in StandardProperty. One such key is StandardProperty.DOCUMENT_TYPE.
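The distinction between attributes (saved content) and properties (transient, application-level data) can be illustrated with a small self-contained sketch. PropertyHolder, its methods and the GENERATED key are hypothetical names made up for this example; they are not the XXE Element class or any StandardProperty constant.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an element; NOT the XXE Element class.
final class PropertyHolder {
    // Hypothetical property key. Any Object can serve as a key.
    static final Object GENERATED = new Object();

    final String name;
    // Attributes are content: String values that end up in the saved XML.
    final Map<String, String> attributes = new TreeMap<>();
    // Properties are transient: arbitrary Objects that are never saved.
    private final Map<Object, Object> properties = new HashMap<>();

    PropertyHolder(String name) { this.name = name; }

    void putProperty(Object key, Object value) { properties.put(key, value); }
    Object getProperty(Object key) { return properties.get(key); }
    Object removeProperty(Object key) { return properties.remove(key); }

    // Serializes the element as an empty tag. Only attributes are written;
    // attribute values are not escaped in this sketch.
    String toXml() {
        StringBuilder out = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            out.append(' ').append(a.getKey())
               .append("=\"").append(a.getValue()).append('"');
        }
        return out.append("/>").toString();
    }

    public static void main(String[] args) {
        PropertyHolder div = new PropertyHolder("div");
        div.attributes.put("class", "toc");
        div.putProperty(GENERATED, Boolean.TRUE);

        System.out.println(div.toXml());                // <div class="toc"/>
        System.out.println(div.getProperty(GENERATED)); // true
    }
}
```

Using an Object constant as the key, rather than a String, avoids accidental clashes between properties defined by unrelated parts of an application.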