Limitations and implementation specificities

The Convert step does not support the following MS-Word features.

By “does not support”, we mean that w2x will not generate something useful corresponding to such features. We don’t mean that using such features in a DOCX file would cause w2x to fail or to generate invalid XML documents.

Right to left scripts.

Enclose characters.

Asian layout.

Cover Page. Blank Page.

Text wrapping of tables and pictures other than the simplest one.

Picture formats other than GIF, PNG, JPEG, BMP, TIFF and WMF are not supported. EMF pictures are supported only on Windows.

Clip Art. Shapes. SmartArt. Chart.

Header. Footer. Page Number.

Japanese Greetings. Text Box. WordArt. Drop Cap.

Object.

All features related to Page Layout except (to a minimal extent) page and column breaks and end of sections.

All features related to Mailings.

All features related to Spelling & Grammar, except of course the various languages used in the document (i.e. lang attribute).

Comments.

All features related to Change Tracking.

When a DOCX file contains revision information (i.e. "Track Changes"), w2x applies its own automatic and very crude interpretation of "Accept All Changes". For this reason, a warning is issued advising the user to accept or reject the tracked changes manually in MS-Word before submitting the DOCX file to w2x.

All features related to (document) Compare, (document) Protect.

Macros.

Controls.
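Because tracked changes are only crudely handled (see the Change Tracking item above), it may help to check a DOCX file for them before conversion. The following is a minimal sketch, not part of w2x. It assumes that the main document part is stored as word/document.xml, and it only counts insertions (w:ins) and deletions (w:del), ignoring other kinds of revision marks. The function name count_tracked_changes is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(docx):
    """Count w:ins and w:del elements in the main document part.

    `docx` is a file path or a binary file-like object.
    Assumption: the main part is stored as word/document.xml.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    insertions = sum(1 for _ in root.iter(f"{{{W_NS}}}ins"))
    deletions = sum(1 for _ in root.iter(f"{{{W_NS}}}del"))
    return insertions + deletions
```

A result greater than zero suggests opening the file in MS-Word and accepting or rejecting the changes manually first.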

The Convert step generates XHTML+CSS documents having the following specificities:

Tab stops are converted to <span class="role-tab"> </span>. See About tab stops.

MS-Word document properties having no standard meta equivalent are given names starting with “ms-”. Example:

<meta content="Hussein Shafie" name="ms-cp-lastModifiedBy" />

MS-Word “styles” having no CSS equivalent are given a “-ms-” prefix. Example:

.p-Heading3 {
    -ms-outlineLvl: 2;
    color: #4F81BD;
    font-family: Cambria;
    ...

Page breaks are translated to <?break-page?>. Column breaks are translated to <?break-column?>. End of sections are signaled by <?end-of-section?>.
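The break markers described above are ordinary XML processing instructions, so any XML parser that preserves processing instructions can locate them. The sketch below uses xml.dom.minidom from the Python standard library. It is an illustration only; the helper names iter_nodes and find_breaks are hypothetical, and the input is assumed to be well-formed XHTML.

```python
from xml.dom import minidom, Node

# Processing instruction targets generated for breaks, as listed above.
BREAK_TARGETS = {"break-page", "break-column", "end-of-section"}


def iter_nodes(node):
    """Yield a node and all its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from iter_nodes(child)


def find_breaks(xhtml):
    """Return the targets of all break markers, in document order."""
    doc = minidom.parseString(xhtml)
    return [
        node.target
        for node in iter_nodes(doc)
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
        and node.target in BREAK_TARGETS
    ]
```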

WMF pictures are converted to SVG.

OpenXML math is converted to MathML.

Conversion from OpenXML math to MathML is implemented by an XSLT 1.0 stylesheet called omml2mml.xsl, which comes from the open source project XSL stylesheets for TEI XML. If you have access to a better XSLT stylesheet than the open source omml2mml.xsl, you may use it by setting the environment variable (or Java™ system property) W2X_MATH_CONVERTER_XSLT. Example:

set W2X_MATH_CONVERTER_XSLT=C:\Users\john\My better omml2mml.xsl

All simple and most complex fields are converted to a <?field code?> having a <span class="role-field"> parent. Example:

<span class="role-field">
  <?field DATE \@ "MMMM d, yyyy" \* MERGEFORMAT ?>
  August 27, 2014
</span>
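A post-processing tool can recover both the field code and the text MS-Word last displayed for it. The following sketch assumes well-formed XHTML in which the field processing instruction and the displayed text are direct children of the span, as in the example above; fields whose result contains nested elements would need more work. The function name field_codes is hypothetical.

```python
from xml.dom import minidom, Node


def field_codes(xhtml):
    """Return (field code, displayed text) pairs for each converted field."""
    doc = minidom.parseString(xhtml)
    results = []
    for span in doc.getElementsByTagName("span"):
        if "role-field" not in span.getAttribute("class").split():
            continue
        code = None
        text_parts = []
        # Only direct children are examined (an assumption of this sketch).
        for child in span.childNodes:
            if (child.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                    and child.target == "field"):
                code = child.data.strip()
            elif child.nodeType == Node.TEXT_NODE:
                text_parts.append(child.data)
        if code is not None:
            results.append((code, "".join(text_parts).strip()))
    return results
```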

Smart tags are enclosed between <?begin-smartTag tag?> and <?end-smartTag tag?>. Example:

<?begin-smartTag {urn:schemas-microsoft-com:office:smarttags}PersonName#0?>
  <?begin-smartTag {urn:schemas:contacts}GivenName#1?>
    Bill
  <?end-smartTag {urn:schemas:contacts}GivenName#1?>
  <?begin-smartTag {urn:schemas:contacts}Sn#2?>
    Gates
  <?end-smartTag {urn:schemas:contacts}Sn#2?>
<?end-smartTag {urn:schemas-microsoft-com:office:smarttags}PersonName#0?>
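Since smart tags are delimited by paired processing instructions rather than by elements, the text they enclose has to be collected by walking the document in order and tracking which tags are currently open. The sketch below is one possible way to do this. It assumes that begin and end markers are properly nested, which holds for the example above but is not stated as a guarantee; the helper names walk and smart_tag_text are hypothetical.

```python
from xml.dom import minidom, Node


def walk(node):
    """Yield a node and all its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)


def smart_tag_text(xhtml):
    """Map each smart tag (e.g. '{ns}Name#0') to the text it encloses."""
    open_tags = []  # stack of smart tags that are currently open
    texts = {}      # smart tag -> text collected so far
    for node in walk(minidom.parseString(xhtml)):
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            tag = node.data.strip()
            if node.target == "begin-smartTag":
                open_tags.append(tag)
                texts[tag] = ""
            elif node.target == "end-smartTag":
                # Assumption: markers are properly nested.
                if not open_tags or open_tags[-1] != tag:
                    raise ValueError(f"unbalanced smart tag: {tag}")
                open_tags.pop()
        elif node.nodeType == Node.TEXT_NODE:
            # Text belongs to every smart tag that is open at this point.
            for tag in open_tags:
                texts[tag] += node.data
    return texts
```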

Controls are enclosed between <?begin-sdt control_id?> and <?end-sdt control_id?>. Example:

<?begin-sdt comboBox#6?>
<td class="tc-TableGrid--bb tc-TableGrid"
    style="padding-bottom: 7.2pt; padding-left: 7.2pt;
           padding-right: 7.2pt; padding-top: 7.2pt;">
  <p class="tp-TableGrid p-Normal" lang="fr-FR">
    <span class="c-PlaceholderText">Choose an item.</span>
  </p>
</td>
<?end-sdt comboBox#6?>

The language of DOCX files written in an East Asian language is not correctly detected.

Unfortunately, this will always be the case because w2x never examines the characters actually contained in a text span having <w:lang w:eastAsia="ja-JP" w:val="en-US"/> to determine whether this text span is written in ja-JP, in en-US, or in a mix of both languages.

However, a partial workaround for this limitation is to specify, for example, -p convert.set-lang ja-JP or -p convert.default-lang ja-JP. When parameter convert.set-lang or parameter convert.default-lang is set to a language code starting with ja, zh or ko, the language of a text span is determined from attribute w:lang/@w:eastAsia rather than from attribute w:lang/@w:val.

Note that -p convert.default-lang ja-JP is just used as a hint to favor attribute w:lang/@w:eastAsia over attribute w:lang/@w:val. Given the way MS-Word sets these two attributes, using parameter -p convert.default-lang ja-JP will not cause a vastly incorrect detection of the language when converting a German DOCX file, for example.
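The attribute selection rule described above can be summarized by the following sketch. It is an illustration of the rule as stated here, not w2x's actual code: the function name pick_lang is hypothetical, and falling back to w:val when w:eastAsia is absent is an assumption.

```python
# Language code prefixes that make w:eastAsia take precedence over w:val.
EAST_ASIAN_PREFIXES = ("ja", "zh", "ko")


def pick_lang(val, east_asia, hint=None):
    """Choose the language of a text span from its w:lang attributes.

    val, east_asia: values of w:lang/@w:val and w:lang/@w:eastAsia (or None).
    hint: value of convert.set-lang or convert.default-lang, if specified.
    """
    prefers_east_asia = (
        hint is not None and hint.lower().startswith(EAST_ASIAN_PREFIXES)
    )
    if prefers_east_asia and east_asia:
        return east_asia
    # Assumption: fall back to w:val when w:eastAsia is missing or not preferred.
    return val
```

For a span having <w:lang w:eastAsia="ja-JP" w:val="en-US"/>, pick_lang("en-US", "ja-JP") returns "en-US", while pick_lang("en-US", "ja-JP", hint="ja-JP") returns "ja-JP".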

w2x can generate DITA indexterm elements having index-sort-as children and DocBook indexterm/primary, secondary, tertiary elements having sortas attributes. For this to happen, the input DOCX file must contain XE (index entry) fields having \y "yomi" (first phonetic character for sorting indexes) field arguments.

Unlike MS-Word which considers \y "yomi" only for East Asian languages, w2x uses this XE field argument to sort the index entries whatever the language of the document. English examples: {XE "<span>" \y "span"}, {XE "Operation:+" \y ":Addition"}.
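To illustrate why a \y argument is useful outside East Asian languages, the sketch below parses XE field codes and sorts them by their "yomi" when present. It is a simplification: it ignores escaped quotes, other XE switches and the colon-separated entry levels, and it does not claim to reproduce the collation w2x or the downstream DITA or DocBook tools actually apply. The names parse_xe and sort_key are hypothetical.

```python
import re


def parse_xe(field_code):
    """Split an XE field code into (entry text, yomi or None)."""
    entry = re.match(r'\s*XE\s+"([^"]*)"', field_code)
    if entry is None:
        raise ValueError(f"not an XE field: {field_code!r}")
    yomi = re.search(r'\\y\s+"([^"]*)"', field_code)
    return entry.group(1), (yomi.group(1) if yomi else None)


def sort_key(field_code):
    """Sort by the yomi when one is given, otherwise by the entry text."""
    entry, yomi = parse_xe(field_code)
    return (yomi if yomi is not None else entry).casefold()


fields = [r'XE "table"', r'XE "<span>" \y "span"', r'XE "abbr"']
ordered = [parse_xe(f)[0] for f in sorted(fields, key=sort_key)]
```

Here ordered places "<span>" between "abbr" and "table"; without the \y argument, the leading "<" would sort it before both.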