XMLmind Word To XML

Change history

1.2.2 (April 14, 2017)


  • Added parameter edit.ids.generate-section-ids. Setting this parameter to yes (default value is no) ensures that all the sections found in the semantic XHTML resulting from the conversion of a DOCX file have a unique ID.

    When this ID is missing, it is computed using the content of the h1, h2, ..., h6 heading which is the first child of the section. Example:

    <div class="role-section2" id="Title_of_this_section">
      <h2>Title of this section</h2>

    The maximum length of the automatically computed ID may be specified using parameter edit.ids.section-id-max-length. The default value of this parameter is 32.

    Setting edit.ids.generate-section-ids to yes is especially useful when converting a DOCX file to a DITA map or bookmap. With this parameter, the filenames of the topics referenced by the generated map are guaranteed to have meaningful values (e.g. "Introduction.dita" rather than "d0e35.dita").

  • Added XSLT parameter shortdesc-class-name to W2X_install_dir/xslt/topic.xslt, the XSLT stylesheet which is used to convert the intermediate semantic XHTML document to a DITA topic.

    This parameter is used to specify the class name of the XHTML <p> which acts as a short description of the section. Examples: -p transform.shortdesc-class-name p-Shortdesc, -p transform.shortdesc-class-name p-Abstract.

    When this parameter is not specified (or is specified as the empty string which is its default value), the following style mapping, created by the w2x-app wizard:

    -p edit.blocks.convert "p-Shortdesc p class='p-Shortdesc'"
    <xsl:template match="h:p[@class='p-Shortdesc']">
        <xsl:call-template name="processCommonAttributes"/>

    causes DITA <shortdesc> elements to be generated inside topic bodies, which is invalid.

    After specifying -p transform.shortdesc-class-name p-Shortdesc, this issue is fixed and DITA <shortdesc> elements are generated before topic bodies.

  • Added an "Other parameters" screen to the w2x-app wizard. This new screen lets the user specify parameters which are not supported by the "Output format options" and "MS-Word style to XML element map" screens. For example, when generating a DITA document, the other screens do not let the user specify -p transform.pre-element-name codeblock (default value being pre).
  • Upgraded XMLmind Web Help Compiler (whc for short) to version 1.4.2_03.
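
As an illustration of what edit.ids.generate-section-ids and edit.ids.section-id-max-length do, here is a minimal Python sketch of deriving a section ID from heading text. It is not the actual w2x implementation: the function name make_section_id, the replacement of whitespace by underscores (inferred only from the "Title_of_this_section" example above) and the numeric suffix used to keep IDs unique are all assumptions.

```python
import re


def make_section_id(heading_text, used_ids, max_length=32):
    """Derive a unique ID from heading text (hypothetical helper, not part of w2x)."""
    # Assumption: runs of whitespace become a single underscore.
    candidate = re.sub(r"\s+", "_", heading_text.strip())
    # Assumption: any other character unsuitable for an ID becomes an underscore.
    candidate = re.sub(r"[^\w.-]", "_", candidate)
    # Truncate to the maximum length (32 is the documented default).
    candidate = candidate[:max_length] or "section"
    # Assumption: duplicates get a numeric suffix, which may exceed max_length.
    unique = candidate
    counter = 2
    while unique in used_ids:
        unique = f"{candidate}_{counter}"
        counter += 1
    used_ids.add(unique)
    return unique


seen = set()
print(make_section_id("Title of this section", seen))
print(make_section_id("Title of this section", seen))
```

With IDs of this kind, a topic file named after the ID reads as "Introduction.dita" rather than "d0e35.dita".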

Bug fixes:

  • For some DOCX paragraphs, significant whitespace was removed by XMLmind Word To XML. This gave incorrect results when these DOCX paragraphs were converted to DocBook programlisting, DITA pre, XHTML pre, etc.
  • In the source DOCX file, fields having an empty code (that is, somewhat abnormal fields) caused XMLmind Word To XML to raise a StringIndexOutOfBoundsException.
  • When generating semantic XHTML of any kind with parameter edit.convert-tabs.to-table set to no (the default value), attribute class="role-tabs-XXX" and elements <span class="role-tab"> were not discarded.

    Not only is this markup useless, but it also prevented some style mappings created by the w2x-app wizard from working. For example, the following style mapping of MS-Word paragraph style Note to a DITA element <note>:

    -p edit.blocks.convert "p-Note p class='p-Note'"
    <xsl:template match="h:p[@class='p-Note']">
        <xsl:call-template name="processCommonAttributes"/>

    failed for the following paragraph (intermediate semantic XHTML preceding the transformation to DITA):

    <p class="role-tabs-35.45-0-117 p-Note">Note:
    <span class="role-tab"> </span>Body of the note here.</p>
  • In rare cases, foot/end notes were numbered starting from 2 and not starting from 1 as expected.


  • w2x_all.jar, the self-contained JAR file, is no longer used by the following scripts: bin/w2x, w2x.bat, w2x-app, w2x-app-c.bat. Using it prevented advanced users from easily modifying the scripts found in subdirectories xed/ and xslt/. This self-contained JAR file is still available, but its use should be reserved for embedding w2x in a third-party application.

1.2.1 (November 24, 2016)


  • Conversion of images found in the DOCX file (TIFF, WMF, EMF, etc) to standard formats (SVG, PNG, JPEG) may now be controlled using environment variable (or Java™ property) W2X_IMAGE_CONVERSIONS. The default value of this variable is (all specifications on a single line):
    .wmf.svg java:com.xmlmind.w2x_ext.wmf_converter.WMFConverterFactory;
    .tiff.png java:com.xmlmind.w2x.docx.image.ImageConverterFactoryImpl

    On Windows, the default value of W2X_IMAGE_CONVERSIONS is (all specifications on a single line):

    .wmf.svg java:com.xmlmind.w2x_ext.wmf_converter.WMFConverterFactory;
    .emf.png java:com.xmlmind.w2x_ext.emf2png.EMF2PNG resolution 0;
    .tiff.png java:com.xmlmind.w2x.docx.image.ImageConverterFactoryImpl
  • Added two new image converters:
    External image converter

    This image converter executes an external program to perform the conversion.

    Examples of W2X_IMAGE_CONVERSIONS specifications (see above): convert EMF to SVG using OpenOffice/LibreOffice:

    .emf.svg soffice --headless --convert-to svg --outdir %~po %i

    Convert EMF/WMF to PNG using ImageMagick:

    .emf.png.wmf.png magick convert -density 288 "%I" -scale 25% "%O"

    EMF2PNG image converter

    This image converter is available only on Windows. It leverages Windows' own GDI+ to convert EMF (in fact, Windows metafiles of any kind, including WMF) to PNG.

    This is not ideal because, unlike com.xmlmind.w2x_ext.wmf_converter.WMFConverterFactory, which converts WMF (Windows vector graphics format) to SVG (standard vector graphics format), EMF2PNG converts a vector graphics format to a raster image format. However, having EMF2PNG is better than nothing at all.

  • Upgraded XMLmind Web Help Compiler (whc for short) to version 1.4.2, which leverages jQuery v3.1.1 and jQuery UI v1.12.1. This implies that the Web Help generated by w2x no longer supports Internet Explorer 8 and older versions.
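
To make the shape of a W2X_IMAGE_CONVERSIONS value easier to see, here is a rough Python sketch that breaks such a value into (input extension, output extension, converter) triples. The authoritative syntax is whatever the w2x manual defines. This sketch only assumes what the examples above suggest: specifications are separated by ";", and each one starts with one or more ".in.out" extension pairs followed by a space and the converter (a java: class name or an external command). The function name parse_image_conversions is hypothetical.

```python
def parse_image_conversions(value):
    """Split a W2X_IMAGE_CONVERSIONS-like string into (in_ext, out_ext, converter) triples.

    Hypothetical helper: the syntax is inferred from the examples, not from the w2x sources.
    """
    conversions = []
    for spec in value.split(";"):
        spec = spec.strip()
        if not spec:
            continue
        # The first space-delimited token holds the extension pairs, e.g. ".emf.png.wmf.png".
        ext_part, _, converter = spec.partition(" ")
        extensions = [ext for ext in ext_part.split(".") if ext]
        if not extensions or len(extensions) % 2 != 0:
            raise ValueError(f"expected .in.out extension pairs, got {ext_part!r}")
        # Consecutive extensions are read as (input, output) pairs.
        for i in range(0, len(extensions), 2):
            conversions.append((extensions[i], extensions[i + 1], converter.strip()))
    return conversions


default_value = (
    ".wmf.svg java:com.xmlmind.w2x_ext.wmf_converter.WMFConverterFactory; "
    ".tiff.png java:com.xmlmind.w2x.docx.image.ImageConverterFactoryImpl"
)
for in_ext, out_ext, converter in parse_image_conversions(default_value):
    print(f"{in_ext} -> {out_ext}: {converter}")
```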

Bug fixes:

  • Images which were used to statically render objects embedded in the DOCX file (e.g. a PowerPoint slide) were ignored.

1.2 (August 01, 2016)


  • Desktop application w2x-app now has a setup assistant (AKA “wizard” style dialog box) that makes it quick and easy to create w2x option files. This new setup assistant has a screen which may be used to map MS-Word character and paragraph styles (e.g. p-CodeSample) to XML elements possibly having attributes (e.g. DITA pre outputclass="code-sample").
  • New “semantic” output formats:
    • Multi-page semantic XHTML 1.0 Strict (-o frameset_strict), XHTML 1.0 Transitional (-o frameset_loose), XHTML 1.1 (-o frameset1_1), XHTML 5 (-o frameset5).
    • Web Help containing semantic XHTML 1.0 Strict (-o webhelp_strict), XHTML 1.0 Transitional (-o webhelp_loose), XHTML 1.1 (-o webhelp1_1), XHTML 5 (-o webhelp5).
    • EPUB 2 containing semantic XHTML 1.1 (-o epub1_1).
  • MS-Word math (that is, OpenXML math) is now automatically converted to MathML. However not all output formats may embed MathML. By default, MathML elements are added only to documents having the following formats: XHTML 5, EPUB (through the use of <ops:switch>), DITA and DocBook 5. When targeting any other format, XMLmind Word To XML generates external files containing MathML then adds elements pointing to these external ".mml" files. XHTML 1 example: <object data="doc_files/math-010.mml" type="application/mathml+xml"/>.

    The parameters related to MathML support are: convert.create-mathml-object, edit.finish-styles.mathjax (MathJax support).

  • Added a useful variant of parameter edit.blocks.convert called edit.blocks.convert-to-pre. This new parameter is best explained by comparing it to edit.blocks.convert.

    When using MS-Word, there are two ways to represent code samples:

    1. Use a sequence of paragraphs having the same style. Each paragraph contains one line of the code sample. Let's call the style of these paragraphs Code1.
    2. Use a single paragraph containing the whole code sample, which means that this single paragraph contains significant whitespace and line breaks. Let's call the style of this paragraph Code2.

    A sequence of Code1 paragraphs may be converted to an XHTML pre using:

    -p edit.blocks.convert "p-Code1 span g:id='pre' g:container='pre'"

    A Code2 paragraph may be converted to an XHTML pre using:

    -p edit.blocks.convert-to-pre "p-Code2 pre"
  • New parameter transform.pre-element-name may be used to specify the DocBook or DITA element to which an HTML pre element is converted. The default value of transform.pre-element-name is pre when generating DITA and literallayout when generating DocBook.
  • When converting a DOCX file to semantic XHTML, new parameter remove-styles.preserved-classes may be used to preserve some of the classes (e.g. c-Code, p-Note, etc) used to style the elements found in the intermediate, automatically generated, styled XHTML document.

    Moreover specifying both parameters prune.preserve and remove-styles.preserved-classes is currently the only way to keep in the generated semantic XHTML empty paragraphs having a given MS-Word style. For example, specifying -p prune.preserve p-PlaceHolder and -p remove-styles.preserved-classes p-PlaceHolder may be used to keep in the semantic XHTML output all empty paragraphs having the p-PlaceHolder style.

  • The conversion to DITA may now generate some DITA 1.3 elements and attributes, for example: equation-block, equation-inline, mathml, line-through, entry/@rotate.

Bug fixes:

  • DOCX to styled HTML: fixed a couple of bugs related to numbering.
  • In some cases, option transform.generate-xref-text=yes (the default value) generated "???" (e.g. "See example ???.") rather than useful hyperlink text like "above" or "below" (e.g. "See example below.").
  • Specifying parameters split.use-id-as-filename=true and webhelp.use-id-as-filename=true caused w2x to generate files having incorrect names when the input DOCX had duplicate bookmarks or when it had bookmarks containing the '.' character.
  • In some cases, changing the style of the footnote number automatically created by MS-Word caused w2x to raise a NullPointerException.

1.1 (March 15, 2016)

It's now possible to convert a DOCX document to the following styled HTML formats (that is, XHTML+CSS): multi-page styled HTML, Web Help and EPUB.

Files generated this way look like the source DOCX document. Previously, the only way to generate Web Help or EPUB was to first convert the source DOCX document to DITA or DocBook (semantic XML) and then to convert the intermediate DITA or DocBook files to Web Help or EPUB using external tools such as DITA Open Toolkit, XMLmind DITA Converter, DocBook XSL stylesheets. However, in such a case, the generated Web Help or EPUB does not look like the source DOCX document.

Note that a frameset is automatically generated along with the multi-page styled HTML pages. While framesets are an obsolete HTML feature, a frameset makes it easy to browse these HTML pages. Moreover, the table of contents used as the left frame is a convenient way to programmatically list all the generated HTML pages. Example: excerpts from w2x_install_dir/doc/manual/manual-TOC.html:

<p class="toc-entry-0"><a href="manual-0.html" target="contentFrame">XMLmind Word To XML Manual</a></p>
<p class="toc-entry-1"><a href="manual-1.html" target="contentFrame">Contents</a></p>
<p class="toc-entry-1"><a href="intro.html" target="contentFrame">1 Introduction</a></p>
<p class="toc-entry-1"><a href="install.html" target="contentFrame">2 Installing w2x</a></p>
<p class="toc-entry-2"><a href="distribution.html" target="contentFrame">2.1 Contents of
the installation directory</a></p>
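
The following Python sketch shows one way to list the generated pages from such a table of contents. It assumes only the markup visible in the excerpt above (links inside p elements whose class starts with "toc-entry-"). The class TocPageLister and the function list_pages are hypothetical names, not part of w2x.

```python
from html.parser import HTMLParser


class TocPageLister(HTMLParser):
    """Collect the href of every link found inside a <p class="toc-entry-N">."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._in_toc_entry = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "p":
            css_class = attributes.get("class") or ""
            self._in_toc_entry = css_class.startswith("toc-entry-")
        elif tag == "a" and self._in_toc_entry:
            href = attributes.get("href")
            if href:
                # Drop any fragment so that each page is listed only once.
                page = href.split("#", 1)[0]
                if page and page not in self.pages:
                    self.pages.append(page)

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_toc_entry = False


def list_pages(toc_html):
    """Return the page file names referenced by the TOC, in document order."""
    parser = TocPageLister()
    parser.feed(toc_html)
    parser.close()
    return parser.pages


sample = (
    '<p class="toc-entry-1"><a href="intro.html" target="contentFrame">1 Introduction</a></p>\n'
    '<p class="toc-entry-1"><a href="install.html" target="contentFrame">2 Installing w2x</a></p>\n'
)
print(list_pages(sample))
```

In practice the string passed to list_pages would be the content of the generated TOC file, read from wherever the conversion wrote it.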

How does this work?

In order to generate these 3 new formats, we need to automatically split the source DOCX document into parts. A new part is created each time a paragraph having an outline level less than or equal to the value of the split-before-level parameter is found in the source. An outline level is an integer between 0 (e.g. style Heading 1) and 8 (e.g. style Heading 9). The default value of parameter split-before-level is 0, which means: for each Heading 1, create a new page starting with this Heading 1.

Example: for each Heading 1 and Heading 2, create a new page (out/manual-1.html, out/manual-2.html, ..., out/manual-N.html) starting with this Heading 1 or Heading 2:

w2x -p split.split-before-level 1 -o frameset manual.docx out/manual.html
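
The splitting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not w2x code: the function split_into_parts and the representation of paragraphs as (outline level, text) pairs are assumptions, as is the choice to put any content preceding the first heading into the first part.

```python
def split_into_parts(paragraphs, split_before_level=0):
    """Group paragraphs into parts, starting a new part at each qualifying heading.

    paragraphs: iterable of (outline_level, text) pairs, where outline_level is
    an integer from 0 (Heading 1) to 8 (Heading 9), or None for body text.
    """
    parts = []
    for outline_level, text in paragraphs:
        starts_new_part = (
            outline_level is not None and outline_level <= split_before_level
        )
        # Assumption: content found before the first heading forms the first part.
        if starts_new_part or not parts:
            parts.append([])
        parts[-1].append(text)
    return parts


document = [
    (0, "Introduction"),
    (None, "Some text."),
    (1, "Installing"),
    (None, "More text."),
    (0, "Usage"),
]
print(len(split_into_parts(document, split_before_level=0)))
print(len(split_into_parts(document, split_before_level=1)))
```

With split_before_level set to 1, both Heading 1 and Heading 2 paragraphs start a new part, which corresponds to the command line shown above.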

Important tip

Generating any of these 3 new formats should work well if, for the DOCX document to be converted, you can use MS-Word's "References > Table of Contents" button to automatically create a table of contents. Note that the source DOCX document is not required to have a table of contents, but MS-Word should be able to automatically create a good one. In other words, automatically creating a table of contents using MS-Word is the best way to check that your outline levels are OK.

Other enhancements:

  • When a DOCX document is converted to styled HTML of any kind (as opposed to semantic XML), the generated processing instructions are now automatically removed and all the footnotes and endnotes are now automatically given a number. If you don't want this to happen, pass parameters -p edit.do.remove-pis "" and -p edit.do.number-footnotes "" to w2x.
  • New parameter -p edit.finish-styles.custom-styles-url-or-file CSS_URL_OR_FILE makes it easy to customize the CSS styles used by the generated styled HTML pages. The custom CSS styles found in file CSS_URL_OR_FILE are simply appended to the automatically generated CSS styles.
  • New parameter -p convert.lower-case-resource-names yes (default value: no) is needed to keep epubcheck quiet on platforms where filenames are case-sensitive (e.g. Linux). Not for general use.

Bug fixes:

  • w2x-app: added a workaround for an Apple Java bug which caused any scrolled window to become garbled when scrolling quickly. This bug seems to be specific to Apple Java and to non-Retina Macs running El Capitan.

1.0.0_01 (December 4, 2015)

Bug fix: a span class=role-tabs having a negative X coordinate caused expand-tabs.js to loop forever.

1.0.0 (November 17, 2015)

First version of the commercial product.


  • Text runs aligned on tab stops are now processed as follows:
    • When generating XHTML+CSS, some JavaScript™ code is added to the output file. This code computes and gives a width to all <span class="role-tab">. This makes it possible to decently emulate tab stops in any modern Web browser.

      If you don't want this code to be added to the output file, pass option -p edit.do.expand-tabs "" to w2x.

    • When generating semantic XHTML and all the other semantic XML formats (DocBook, DITA, etc), it's now possible to convert consecutive paragraphs containing text runs aligned on tab stops to a borderless table.

      However because, in the general case, it's not possible to emulate tab stops using tables, this XED script is disabled by default. If you really want to emulate tab stops using tables, pass option -p edit.convert-tabs.to-table yes to w2x.

    Note that the alignment of a tab stop (right, center, etc) is ignored. That is, the text run is always considered to be left aligned.

  • DOCX files using the "Strict Open XML Document" format are now supported. DOCX files using this format conform to the Strict profile of the Open XML standard (ISO/IEC 29500). This profile of Open XML doesn't allow a set of features that are designed specifically for backward-compatibility with existing binary documents, as specified in Part 4 of ISO/IEC 29500.
  • Tested XMLmind Word To XML against the DOCX files created using MS-Word 2016.
  • Desktop application w2x-app now works fine on computers having very high resolution (HiDPI) screens. For example, it now works fine on a Mac having a Retina® screen and on a Windows computer having a UHD (“4K”) screen. On Windows, all DPI scale factors (100%, 125%, 150%, 200%, etc) are supported.

    On a Linux computer having a HiDPI screen, HiDPI is not automatically detected. You'll have to specify the display scaling factor you prefer using the -putpref command-line option. Example: w2x-app -putpref displayScaling 200.

1.0.0-beta04 (September 8, 2015)


  • The “Word To XML” servlet now provides the user with minimal work-in-progress feedback during the execution of a lengthy conversion.

Bug fixes:

  • Added more DOCX files coming from different origins to the test suite of XMLmind Word To XML. Had to slightly modify the software to cope with some specificities of these DOCX files.
  • XMLmind Word To XML add-on for XMLmind XML Editor: a user preferring to use the native file chooser on Windows or on the Mac forced the add-on to also use the native file chooser. Using the native file chooser in the context of the add-on is not convenient as this prevents the file filters specified by the add-on (DOCX, TXT, XML, DITA, etc) from working.

1.0.0-beta03 (July 13, 2015)

New “Word To XML” servlet is a Java™ Servlet (server-side standard component) which has the same functions as the w2x-app desktop application.

The “Word To XML” servlet comes in a software distribution of its own: w2x_servet-1_0_0_beta03.zip. This distribution contains a ready-to-deploy binary w2x.war, as well as the full Java™ source code of the servlet.


1.0.0-beta02 (May 6, 2015)

  • New graphical application w2x-app should be easier to use than the w2x command-line utility.
  • New application w2x-app is also available as an add-on for XMLmind XML Editor. This add-on adds an "Import DOCX" item to the File menu. The "Import DOCX" menu item displays a non-modal dialog box almost identical to w2x-app. XML output files created using the "Import DOCX" dialog box are automatically opened in XMLmind XML Editor.

    This add-on is compatible with XMLmind XML Editor v6.3+. In order to install it, please follow the instructions found in XMLmind Word To XML Manual, Installing the "Word To XML" add-on.

  • Added parameter edit.headings.convert, which makes it easy to convert paragraphs not having an outline level property to h1, h2, ..., h6 headings.

1.0.0-beta01 (March 30, 2015)

First public release.

© 2003-2017 Pixware SARL. Updated on 2017/4/12.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Acrobat and PostScript are trademarks of Adobe Systems Incorporated.