Introduction

Microsoft® Word is an amazingly popular writing tool. However, its main drawback is that, once your document is complete, you cannot do much with it beyond printing it, converting it to PDF, or sending it as is by email.

XMLmind Word To XML aims at nothing less than removing Microsoft® Word's main drawback. This 100% Java™ software component lets you automate the publishing, in its widest sense, of content created using Microsoft® Word 2007+.

More precisely, XMLmind Word To XML (w2x for short) automatically converts DOCX files to:

Clean, styled, valid XHTML+CSS, looking very much like the source DOCX files.

Because the generated XHTML+CSS file is clean and valid, you can easily restyle it, or extract metadata or an abstract from it, before publishing it.

Unstyled, valid, semantic XML (DITA, DocBook, XHTML, your custom schema, etc.).

In this case, most styles are converted to semantic tags. For example, numbered paragraphs are converted to proper ordered lists.

Generating semantic XML out of DOCX files is useful for interchange reasons (e.g. to implement open data) or because you want to port your existing documentation to a structured document format where form and content are completely separated (e.g. to implement single-source publishing).

Of course, deploying w2x does not require installing MS-Word on the machines hosting the software. Also note that w2x does not require the authors to change their habits while using MS-Word: no strict writing discipline, no specific styles, no specific document templates, no specific macros, etc.

This document explains:

how to install and use w2x;

how to customize the output of w2x;

how to embed a w2x processor in a Java application, since w2x has been designed to be easily embedded in any Java application, whether desktop or server-side.