XMLmind Word To XML

Change history

1.2 (August 01, 2016)

Enhancements:

  • Desktop application w2x-app now has a setup assistant (a “wizard”-style dialog box) that makes it quick and easy to create w2x option files. This new setup assistant has a screen which may be used to map MS-Word character and paragraph styles (e.g. p-CodeSample) to XML elements, possibly having attributes (e.g. DITA pre outputclass="code-sample").
  • New “semantic” output formats:
    • Multi-page semantic XHTML 1.0 Strict (-o frameset_strict), XHTML 1.0 Transitional (-o frameset_loose), XHTML 1.1 (-o frameset1_1), XHTML 5 (-o frameset5).
    • Web Help containing semantic XHTML 1.0 Strict (-o webhelp_strict), XHTML 1.0 Transitional (-o webhelp_loose), XHTML 1.1 (-o webhelp1_1), XHTML 5 (-o webhelp5).
    • EPUB 2 containing semantic XHTML 1.1 (-o epub1_1).
  • MS-Word math (that is, OpenXML math) is now automatically converted to MathML. However, not all output formats can embed MathML. By default, MathML elements are added only to documents having the following formats: XHTML 5, EPUB (through the use of <ops:switch>), DITA and DocBook 5. When targeting any other format, XMLmind Word To XML generates external files containing MathML, then adds elements pointing to these external ".mml" files. XHTML 1 example: <object data="doc_files/math-010.mml" type="application/mathml+xml"/>.

    The parameters related to MathML support are: convert.create-mathml-object, edit.finish-styles.mathjax (MathJax support).

  • Added a useful variant of parameter edit.blocks.convert called edit.blocks.convert-to-pre. This new parameter is best explained by comparing it to edit.blocks.convert.

    When using MS-Word, there are two ways to represent code samples:

    1. Use a sequence of paragraphs having the same style. Each paragraph contains one line of the code sample. Let's call the style of these paragraphs Code1.
    2. Use a single paragraph containing the whole code sample, which means that this single paragraph contains significant whitespace and line breaks. Let's call the style of this paragraph Code2.

    A sequence of Code1 paragraphs may be converted to an XHTML pre using:

    -p edit.blocks.convert "p-Code1 span g:id='pre' g:container='pre'"

    A Code2 paragraph may be converted to an XHTML pre using:

    -p edit.blocks.convert-to-pre "p-Code2 pre"
  • New parameter transform.pre-element-name may be used to specify the DocBook or DITA element to which an HTML pre element is converted. The default value of transform.pre-element-name is pre when generating DITA and literallayout when generating DocBook.
  • When converting a DOCX file to semantic XHTML, new parameter remove-styles.preserved-classes may be used to preserve some of the classes (e.g. c-Code, p-Note, etc) used to style the elements found in the intermediate, automatically generated, styled XHTML document.

    Moreover, specifying both parameters prune.preserve and remove-styles.preserved-classes is currently the only way to keep empty paragraphs having a given MS-Word style in the generated semantic XHTML. For example, specifying -p prune.preserve p-PlaceHolder and -p remove-styles.preserved-classes p-PlaceHolder keeps all empty paragraphs having the p-PlaceHolder style in the semantic XHTML output.

  • The conversion to DITA may now generate some DITA 1.3 elements and attributes, for example: equation-block, equation-inline, mathml, line-through, entry/@rotate.
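
As a rough illustration of the external ".mml" files mentioned in the MathML item above, the following Python sketch lists the files referenced by an XHTML 1 page through object elements of type application/mathml+xml. Only the object element itself comes from the text above. The sample page and the function name list_mathml_files are assumptions made for this example. The sketch also assumes the page is well-formed XML without named entities such as &nbsp;, which the standard library parser does not resolve.

```python
import xml.etree.ElementTree as ET

MATHML_TYPE = "application/mathml+xml"


def list_mathml_files(xhtml_text):
    """Return the data attribute of every MathML <object> found in an XHTML page."""
    root = ET.fromstring(xhtml_text)
    files = []
    for elem in root.iter():
        # Compare local names so the sketch works with or without the XHTML namespace.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "object" and elem.get("type") == MATHML_TYPE:
            data = elem.get("data")
            if data:
                files.append(data)
    return files


# Hypothetical page; only the <object> element is taken from the change history.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body>
<p>Equation: <object data="doc_files/math-010.mml" type="application/mathml+xml"/></p>
</body>
</html>"""

if __name__ == "__main__":
    print(list_mathml_files(SAMPLE))
```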

Bug fixes:

  • DOCX to styled HTML: fixed a couple of bugs related to numbering.
  • In some cases, option transform.generate-xref-text=yes (the default value) generated "???" (e.g. "See example ???.") rather than useful hyperlink text such as "above" or "below" (e.g. "See example below.").
  • Specifying parameters split.use-id-as-filename=true and webhelp.use-id-as-filename=true caused w2x to generate files having incorrect names when the input DOCX had duplicate bookmarks or when it had bookmarks containing the '.' character.
  • In some cases, changing the style of the footnote number automatically created by MS-Word caused w2x to raise a NullPointerException.

1.1 (March 15, 2016)

It's now possible to convert a DOCX document to the following styled HTML formats (that is, XHTML+CSS): multi-page styled HTML, Web Help and EPUB.

Files generated this way look like the source DOCX document. Previously the only way to generate Web Help or EPUB was to first convert the source DOCX document to DITA or DocBook (semantic XML) and then to convert the intermediate DITA or DocBook files to Web Help or EPUB using external tools such as DITA Open Toolkit, XMLmind DITA Converter or the DocBook XSL stylesheets. However, in such a case, the generated Web Help or EPUB does not look like the source DOCX document.

Note that a frameset is automatically generated along with the multi-page styled HTML pages. While framesets are an obsolete HTML feature, a frameset makes it easy to browse these HTML pages. Moreover, the table of contents used as the left frame is a convenient way to programmatically list all the generated HTML pages. Example: excerpts from w2x_install_dir/doc/manual/manual-TOC.html:

...
<body>
<p class="toc-entry-0"><a href="manual-0.html" target="contentFrame">XMLmind Word To XML Manual</a></p>
<p class="toc-entry-1"><a href="manual-1.html" target="contentFrame">Contents</a></p>
<p class="toc-entry-1"><a href="intro.html" target="contentFrame">1 Introduction</a></p>
<p class="toc-entry-1"><a href="install.html" target="contentFrame">2 Installing w2x</a></p>
<p class="toc-entry-2"><a href="distribution.html" target="contentFrame">2.1 Contents of
the installation directory</a></p>
...
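
A minimal sketch of how the table of contents could be used to list the generated pages, using only the Python standard library. It assumes that every page link in the TOC file is an a element with target="contentFrame", as in the excerpt above. The class and function names are made up for this example, and stripping "#fragment" suffixes is a precaution rather than something the excerpt shows.

```python
from html.parser import HTMLParser


class TocLinkCollector(HTMLParser):
    """Collects the pages referenced by TOC links, in document order."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href and attributes.get("target") == "contentFrame":
            # Drop any "#fragment" so each page is listed only once.
            page = href.split("#", 1)[0]
            if page and page not in self.pages:
                self.pages.append(page)


def list_generated_pages(toc_html):
    collector = TocLinkCollector()
    collector.feed(toc_html)
    collector.close()
    return collector.pages


# Sample input copied from the manual-TOC.html excerpt above.
TOC_EXCERPT = """<body>
<p class="toc-entry-0"><a href="manual-0.html" target="contentFrame">XMLmind Word To XML Manual</a></p>
<p class="toc-entry-1"><a href="manual-1.html" target="contentFrame">Contents</a></p>
<p class="toc-entry-1"><a href="intro.html" target="contentFrame">1 Introduction</a></p>
<p class="toc-entry-1"><a href="install.html" target="contentFrame">2 Installing w2x</a></p>
<p class="toc-entry-2"><a href="distribution.html" target="contentFrame">2.1 Contents of
the installation directory</a></p>
</body>"""

if __name__ == "__main__":
    for page in list_generated_pages(TOC_EXCERPT):
        print(page)
```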

How does this work?

In order to generate these 3 new formats, we need to automatically split the source DOCX document into parts. A new part is created each time a paragraph having an outline level less than or equal to the value of the split-before-level parameter is found in the source. An outline level is an integer between 0 (e.g. style Heading 1) and 8 (e.g. style Heading 9). The default value of parameter split-before-level is 0, which means: for each Heading 1, create a new page starting with this Heading 1.

Example: for each Heading 1 and Heading 2, create a new page (out/manual-1.html, out/manual-2.html, ..., out/manual-N.html) starting with this Heading 1 or Heading 2:

w2x -p split.split-before-level 1 -o frameset manual.docx out/manual.html
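
The splitting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code used by w2x. Paragraphs are modeled as hypothetical (outline_level, text) pairs where outline_level is None for body text, and placing any content found before the first heading into an initial part is an assumption of this sketch.

```python
def split_into_parts(paragraphs, split_before_level=0):
    """Group paragraphs into parts, starting a new part before each paragraph
    whose outline level is less than or equal to split_before_level."""
    parts = []
    for level, text in paragraphs:
        starts_new_part = level is not None and level <= split_before_level
        # Content preceding the first heading goes into an initial part.
        if starts_new_part or not parts:
            parts.append([])
        parts[-1].append(text)
    return parts


# Outline level 0 corresponds to Heading 1, level 1 to Heading 2, and so on.
DOCUMENT = [
    (0, "Heading 1: Introduction"),
    (None, "Body text"),
    (1, "Heading 2: Scope"),
    (None, "More body text"),
    (2, "Heading 3: Details"),
    (0, "Heading 1: Installing"),
]

if __name__ == "__main__":
    for number, part in enumerate(split_into_parts(DOCUMENT, split_before_level=1), 1):
        print(number, part)
```

With split_before_level=1 the sample document yields three parts (one per Heading 1 or Heading 2), while the default value 0 yields two parts (one per Heading 1).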

Important tip

Generating any of these 3 new formats should work great if, for the DOCX document to be converted, you can use MS-Word's "References > Table of Contents" button to automatically create a table of contents. Note that the source DOCX document is not required to have a table of contents, but MS-Word should be able to automatically create a good one. In other words, automatically creating a table of contents using MS-Word is the best way to check that your outline levels are OK.

Other enhancements:

  • When a DOCX document is converted to styled HTML of any kind (as opposed to semantic XML), the generated processing instructions are now automatically removed and all the footnotes and endnotes are now automatically given a number. If you don't want this to happen, pass parameters -p edit.do.remove-pis "" and -p edit.do.number-footnotes "" to w2x.
  • New parameter -p edit.finish-styles.custom-styles-url-or-file CSS_URL_OR_FILE makes it easy to customize the CSS styles used by the generated styled HTML pages. The custom CSS styles found in file CSS_URL_OR_FILE are simply appended to the automatically generated CSS styles.
  • New parameter -p convert.lower-case-resource-names yes (default value: no) is needed to keep epubcheck quiet on platforms where filenames are case-sensitive (e.g. Linux). Not for general use.
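
When w2x is driven from a script, parameters having an empty value (such as -p edit.do.remove-pis "" above) are easier to pass as an argument list than as a shell string. The following Python sketch only builds such a list. The helper name build_w2x_command is hypothetical, and the argument order simply mirrors the command-line examples found in this document.

```python
import shlex


def build_w2x_command(in_file, out_file, output_format=None, params=None,
                      executable="w2x"):
    """Build an argument list of the form:
    w2x -p NAME VALUE ... -o FORMAT in_file out_file"""
    argv = [executable]
    for name, value in (params or {}).items():
        # Values must be strings; an empty string is passed as an empty argument.
        argv += ["-p", name, value]
    if output_format:
        argv += ["-o", output_format]
    argv += [in_file, out_file]
    return argv


if __name__ == "__main__":
    argv = build_w2x_command(
        "manual.docx",
        "out/manual.html",
        output_format="frameset",
        params={"edit.do.remove-pis": "", "edit.do.number-footnotes": ""},
    )
    # Display the command as it would be typed in a POSIX shell.
    print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv) on a machine where w2x is installed and found in the PATH.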

Bug fixes:

  • w2x-app: added a workaround for an Apple Java bug which caused any scrolled window to become garbled when scrolling quickly. This bug seems to be specific to Apple Java and to non-Retina Macs running El Capitan.

1.0.0_01 (December 4, 2015)

Bug fix: a span class=role-tabs having a negative X coordinate caused expand-tabs.js to loop forever.


1.0.0 (November 17, 2015)

First version of the commercial product.

Enhancements:

  • Text runs aligned on tab stops are now processed as follows:
    • When generating XHTML+CSS, some JavaScript™ code is added to the output file. This code computes and gives a width to all <span class="role-tab">. This makes it possible to emulate tab stops reasonably well in any modern Web browser.

      If you don't want this code to be added to the output file, pass option -p edit.do.expand-tabs "" to w2x.

    • When generating semantic XHTML and all the other semantic XML formats (DocBook, DITA, etc), it's now possible to convert consecutive paragraphs containing text runs aligned on tab stops to a borderless table.

      However because, in the general case, it's not possible to emulate tab stops using tables, this XED script is disabled by default. If you really want to emulate tab stops using tables, pass option -p edit.convert-tabs.to-table yes to w2x.

    Note that the alignment of a tab stop (right, center, etc) is ignored. That is, the text run is always considered to be left aligned.

  • DOCX files using the "Strict Open XML Document" format are now supported. DOCX files using this format conform to the Strict profile of the Open XML standard (ISO/IEC 29500). This profile of Open XML doesn't allow a set of features that are designed specifically for backward-compatibility with existing binary documents, as specified in Part 4 of ISO/IEC 29500.
  • Tested XMLmind Word To XML against the DOCX files created using MS-Word 2016.
  • Desktop application w2x-app now works fine on computers having very high resolution (HiDPI) screens. For example, it now works fine on a Mac having a Retina® screen and on a Windows computer having a UHD (“4K”) screen. On Windows, all DPI scale factors (100%, 125%, 150%, 200%, etc) are supported.

    On a Linux computer having a HiDPI screen, HiDPI is not automatically detected. You'll have to specify the display scaling factor you prefer using the -putpref command-line option. Example: w2x-app -putpref displayScaling 200.


1.0.0-beta04 (September 8, 2015)

Enhancements:

  • The “Word To XML” servlet now gives the user minimal progress feedback during the execution of a lengthy conversion.

Bug fixes:

  • Added more DOCX files coming from different origins to the test suite of XMLmind Word To XML. Had to slightly modify the software to cope with some peculiarities of these DOCX files.
  • XMLmind Word To XML add-on for XMLmind XML Editor: a user preferring to use the native file chooser on Windows or on the Mac forced the add-on to also use the native file chooser. Using the native file chooser in the context of the add-on is not convenient as this prevents the file filters specified by the add-on (DOCX, TXT, XML, DITA, etc) from working.

1.0.0-beta03 (July 13, 2015)

New “Word To XML” servlet is a Java™ Servlet (server-side standard component) which has the same functions as the w2x-app desktop application.

The “Word To XML” servlet comes in a software distribution of its own: w2x_servet-1_0_0_beta03.zip. This distribution contains a ready-to-deploy binary w2x.war, as well as the full Java™ source code of the servlet.

More information.


1.0.0-beta02 (May 6, 2015)

  • New graphical application w2x-app should be easier to use than the w2x command-line utility.
  • New application w2x-app is also available as an add-on for XMLmind XML Editor. This add-on adds an "Import DOCX" item to the File menu. The "Import DOCX" menu item displays a non modal dialog box almost identical to w2x-app. XML output files created using the "Import DOCX" dialog box are automatically opened in XMLmind XML Editor.

    This add-on is compatible with XMLmind XML Editor v6.3+. In order to install it, please follow the instructions found in XMLmind Word To XML Manual, Installing the "Word To XML" add-on.

  • Added parameter edit.headings.convert, which makes it easy to convert paragraphs not having an outline level property to h1, h2, ..., h6 headings.

1.0.0-beta01 (March 30, 2015)

First public release.


© 2003-2016 Pixware SARL. Updated on 2016/7/31.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Acrobat and PostScript are trademarks of Adobe Systems Incorporated.