1.2. The "Paste from Word" engine

The "Paste from Word" engine converts the non-filtered HTML generated by MS-Word 2003+ to XML. This engine is embedded in command pasteFromWord. It is also available as a command-line utility.

The conversion to XML comprises 3 phases:

  1. Parse phase: parse the non-filtered HTML generated by MS-Word 2003+ and convert it to well-formed XHTML (embedding a CSS stylesheet by means of the style XHTML element). The XHTML document obtained this way is completely invalid (e.g. it contains many foreign elements and attributes), highly redundant and unstructured (e.g. no lists; just styled paragraphs).

  2. Edit phase: modify the XHTML document in place in order to clean it up and to structure it. The XHTML document obtained after this phase is a clean, almost completely structured, valid "XHTML 1.0 Transitional" document.

    This phase is implemented by an elaborate sequence of edit steps. Most edit steps are implemented using XED scripts, which are documented in "XMLmind XML Editor - Support of XPath 1.0". However, a small number of edit steps are still implemented in Java™.

    The edit steps and their parameters are documented in Section 1.2.1, “Edit steps”.

  3. Transform phase: transform the "XHTML 1.0 Transitional" document to the target document type using XSLT 1.0 stylesheets.

    The XSLT stylesheets and their parameters are documented in Section 1.2.3, “Transform stylesheets”.

1.2.1. Edit steps

Note

The documentation found in this section is currently insufficient to be able to parameterize and/or customize the edit steps. For now, you'll have to read the XED source of these steps. All these XED scripts are found in addon_install_dir/xed/.

The edit steps are invoked in the following order by the main XED script (addon_install_dir/xed/main.xed):

after-parse

XED script addon_install_dir/xed/after-parse.xed. Delete some foreign elements (e.g. w:Sdt[@docparttype="Table of Contents"]).

No parameters.

styles

Compiled step. This step processes head/style elements as well as class and style attributes:

  • Remove all the elements found in head, except title.

  • Parse the contents of style elements found in head.

    Parsed styles are saved as <!--styles--> comments for reference by the developer of XED scripts.

  • Attributes class and style are moved to urn:x-mlmind:namespace:style, the style namespace.

    The value of a style attribute is parsed and possibly split into several attributes belonging to the style namespace.

    The CSS cascade (inheritance, selectors, computed property value, etc) is applied to these style attributes.
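
As a rough illustration of the attribute handling described above, the following Python sketch moves class to the style namespace and splits a style attribute into one attribute per CSS property. It is not the compiled step itself: the helper name split_style_attribute is hypothetical, the split on ";" is naive, and the CSS cascade is not applied.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:x-mlmind:namespace:style"  # the style namespace named above


def split_style_attribute(element):
    """Move class and style to the style namespace (hypothetical helper).

    Each "name: value" declaration of the style attribute becomes its own
    attribute in the style namespace. The CSS cascade is NOT applied here.
    """
    css_class = element.attrib.pop("class", None)
    if css_class is not None:
        element.set("{%s}class" % STYLE_NS, css_class)

    style = element.attrib.pop("style", None)
    if style is None:
        return
    # Naive split: a ";" inside a quoted string or url() would break it.
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip() and value.strip():
            qname = "{%s}%s" % (STYLE_NS, name.strip().lower())
            element.set(qname, value.strip())


p = ET.fromstring('<p class="MsoNormal" style="font-size:10.0pt; color:red">x</p>')
split_style_attribute(p)
print(sorted(p.attrib.items()))
```

After the call, the element carries s:class, s:font-size and s:color attributes (with "s:" bound to the style namespace) and no longer has plain class or style attributes.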

No parameters.

prune

XED script addon_install_dir/xed/prune.xed. Delete useless elements (e.g. divs which are only used to style their contents). Replace each non-empty span containing only whitespace and/or non-breaking spaces by a single space character.

No parameters.

lang

Compiled step. This step simplifies the lang attributes found in the document.

Parameter: lang.lang
Value: remove | simple | simplify | ISO 639-1 two-letter language code
Default value: simplify
Description:

remove

Remove all lang attributes.

simple

Move the lang attribute from the body element to the html root element. Remove all other lang attributes.

simplify

Minimize the number of lang attributes found in the document.

A user-specified language code such as de, fr-CA, etc

Remove all lang attributes. Set the lang attribute of the html root element to this user-specified language code.

In all cases, remove all s:mso-*-language attributes.
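
The four modes can be sketched in Python as follows. This is only an approximation under stated assumptions: the helper names are hypothetical, elements are assumed to be in no namespace (the real document uses the XHTML namespace), and "simplify" is read here as "drop a lang attribute that merely repeats the inherited one", which may differ from what the compiled step actually does.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "{urn:x-mlmind:namespace:style}"


def apply_lang(html, mode):
    """Apply one lang.lang mode to an html root element (hypothetical helper)."""
    # In all cases, drop the s:mso-*-language attributes.
    for el in html.iter():
        for name in list(el.attrib):
            if name.startswith(STYLE_NS + "mso-") and name.endswith("-language"):
                del el.attrib[name]

    if mode == "simplify":
        _drop_inherited(html, None)
        return

    body = html.find("body")
    body_lang = body.get("lang") if body is not None else None
    for el in html.iter():
        el.attrib.pop("lang", None)

    if mode == "remove":
        return
    if mode == "simple":
        if body_lang is not None:
            html.set("lang", body_lang)
    else:
        # Any other value is taken as a user-specified language code.
        html.set("lang", mode)


def _drop_inherited(el, inherited):
    """Assumed reading of "simplify": drop lang when it repeats the inherited one."""
    lang = el.get("lang")
    if lang is not None and lang == inherited:
        del el.attrib["lang"]
    for child in el:
        _drop_inherited(child, lang if lang is not None else inherited)


SAMPLE = '<html><body lang="fr"><p lang="fr">a</p><p lang="en">b</p></body></html>'
doc = ET.fromstring(SAMPLE)
apply_lang(doc, "simple")
print(ET.tostring(doc, encoding="unicode"))
```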

title

XED script addon_install_dir/xed/title.xed. Process the title of the document.

No parameters.

biblio

XED script addon_install_dir/xed/biblio.xed. Process bibliography entries.

No parameters.

index

XED script addon_install_dir/xed/index.xed. Process index entries.

No parameters.

xrefs

XED script addon_install_dir/xed/xrefs.xed. Process anchors and links.

No parameters.

inlines

XED script addon_install_dir/xed/inlines.xed. Process styled spans.

Parameter: inlines.generate-big-small
Value: no | yes
Default value: yes
Description: If yes, convert span to big or small depending on the font size of the style of the span.

tables

Compiled step. This step mainly reduces the number of align, valign and width attributes found inside tables.

  • align attributes are removed from list items and tables found in td.

  • align attributes which are common to all td/p are moved to the td.

  • align attributes which are common to all td belonging to the same column are moved to the corresponding colgroup.

  • valign attributes which are common to all tr/td are moved to the tr.

  • width attributes are removed from all td elements. Colgroup elements are added to specify the width of all columns. A width is specified as a percentage.

Moreover this step:

  • Groups in a tbody all tr found directly in a table.

  • Sets the table width to 100%.

  • Adds attribute style="-cell-rotate:NNN;", where NNN is 270 or 90, to all rotated td elements.

Parameter: tables.set-column-number
Value: no | yes
Default value: no
Description: If yes, insert <?column-number N?> before the first child of each table cell, where N is the column number of the cell. The first column is column #1.

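
To make the "common attribute is moved up" idea concrete, here is a minimal Python sketch of one of the rules listed above: an align value shared by all p children of a td is moved to the td. The function name is hypothetical and the compiled step may handle more cases (e.g. cells mixing p with other blocks) than this sketch does.

```python
import xml.etree.ElementTree as ET


def hoist_common_align(td):
    """Move an align value shared by every p child of td up to the td."""
    paragraphs = [child for child in td if child.tag == "p"]
    if not paragraphs:
        return
    values = {p.get("align") for p in paragraphs}
    # Hoist only when every paragraph has the same, explicitly set value.
    if len(values) == 1 and None not in values:
        td.set("align", values.pop())
        for p in paragraphs:
            del p.attrib["align"]


cell = ET.fromstring('<td><p align="center">a</p><p align="center">b</p></td>')
hoist_common_align(cell)
print(ET.tostring(cell, encoding="unicode"))
```

The same pattern (collect the values, hoist only when they all agree) would apply to the td-to-colgroup and td-to-tr rules.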
captions

XED script addon_install_dir/xed/captions.xed. Process table and figure captions.

No parameters.

headings

XED script addon_install_dir/xed/headings.xed. Convert paragraphs having an outline level to headings (h1, h2, ..., h6). Simplify headings.

No parameters.

lists

XED script addon_install_dir/xed/lists.xed. Convert sequences of paragraphs styled as list items to proper lists.

No parameters.

footnotes

XED script addon_install_dir/xed/footnotes.xed. Process footnotes and endnotes.

No parameters.

sections

XED script addon_install_dir/xed/sections.xed. Leverage headings (h1, h2, ..., h6) to create sections (<div class="role-sectionN">).

Parameter: sections.max-level
Value: integer; a negative or zero value means no limit.
Default value: -1
Description: Specifies how deeply sections can nest.

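
The following Python sketch shows one way a flat sequence of headings can be turned into nested section divs. It rests on several assumptions that are not confirmed by this documentation: the function name is hypothetical, N in role-sectionN is taken to be the nesting depth, and a heading that would nest deeper than max_level is simply left as ordinary content of the enclosing section.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])$")


def make_sections(body, max_level=-1):
    """Wrap each heading and the content that follows it in a section div."""
    children = list(body)
    for child in children:
        body.remove(child)
    stack = [(0, body)]  # (heading level, container element)
    for child in children:
        match = HEADING.match(child.tag)
        if match:
            level = int(match.group(1))
            # Close the sections opened by headings of the same or deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            depth = len(stack)  # 1 for a top-level section
            if max_level <= 0 or depth <= max_level:
                attrs = {"class": "role-section%d" % depth}
                div = ET.SubElement(stack[-1][1], "div", attrs)
                stack.append((level, div))
        stack[-1][1].append(child)


SOURCE = "<body><h1>A</h1><p>a</p><h2>B</h2><p>b</p><h1>C</h1></body>"
body = ET.fromstring(SOURCE)
make_sections(body)
print(ET.tostring(body, encoding="unicode"))
```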
ids

XED script addon_install_dir/xed/ids.xed. Move id attributes from headings and captions to their parent containers (section, table, figure, etc).

No parameters.

finish

XED script addon_install_dir/xed/finish.xed. Delete empty elements. Optionally, set <!DOCTYPE>.

Parameter: finish.set-doctype
Value: no | yes
Default value: no
Description: If yes, add an "XHTML 1.0 Transitional" <!DOCTYPE> to the document being edited.

before-save

Compiled step. This step performs the final clean-up needed before saving the XHTML result document to disk.

  • It removes the <!--styles--> comments found in the head.

  • It removes all the "s:" and "g:" prefixed attributes.

  • It removes the "s:" and "g:" prefixes.

  • It removes all foreign elements and attributes created by TagSoup, the HTML parser (their namespace starts with "urn:x-prefix:").

Parameter: before-save.allow-flow
Value: no | yes
Default value: no
Description:

If yes, allow flow elements (e.g. li, td) to contain text and inline elements (e.g. b, i) in addition to block elements (e.g. p, pre, table).

If no, do not allow flow elements to contain text and inline elements. In order to implement this, wrap the text and inline elements into <p class="role-inline-wrapper">.
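
The wrapping performed when allow-flow is no can be sketched as follows in Python. This is an illustration only: the function name is hypothetical and the BLOCKS set is an assumed, incomplete list of block elements.

```python
from xml.dom import minidom

BLOCKS = {"p", "pre", "table", "ul", "ol", "div"}  # assumed, incomplete list


def wrap_inlines(flow):
    """Wrap each run of text and inline children of flow in a wrapper p."""
    doc = flow.ownerDocument
    wrapper = None
    for node in list(flow.childNodes):
        is_block = node.nodeType == node.ELEMENT_NODE and node.tagName in BLOCKS
        if is_block:
            wrapper = None  # a block element ends the current run
            continue
        if wrapper is None:
            if node.nodeType == node.TEXT_NODE and not node.data.strip():
                continue  # do not start a wrapper with whitespace-only text
            wrapper = doc.createElement("p")
            wrapper.setAttribute("class", "role-inline-wrapper")
            flow.insertBefore(wrapper, node)
        wrapper.appendChild(node)  # appendChild moves node out of flow


li = minidom.parseString("<li>Some <b>bold</b> text<p>A paragraph</p></li>").documentElement
wrap_inlines(li)
print(li.toxml())
```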

1.2.2. XPath extension functions for use by the edit steps

In the following reference, prefix "f:" is bound to namespace "urn:x-mlmind:namespace:function".

string f:alias(style_name)

Returns the reference (English) style name corresponding to the specified style name, using the alias declarations found in text file addon_install_dir/xed/aliases.txt. Example: alias("TitelZchn") returns "TitleChar". Returns style_name unchanged when no reference style name is found.

number f:cm(number)

Converts number, a number expressed in centimeters, to points. Example: cm(2.54) returns 72. Returns NaN when the conversion fails.

string f:color(color)

Converts color to its six-digit, upper-case hexadecimal representation. Examples: color("red") returns "#FF0000", color("rgb(255,0,0)") returns "#FF0000". Returns "" when the conversion fails.
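
A minimal Python sketch of such a normalization is shown below. It only approximates f:color: the NAMED table is a small assumed subset of the CSS color keywords, and the set of accepted syntaxes (names, #RGB, #RRGGBB, integer rgb()) is a guess.

```python
import re

# Small subset of CSS color keywords, for illustration only.
NAMED = {
    "black": "#000000",
    "white": "#FFFFFF",
    "red": "#FF0000",
    "green": "#008000",
    "blue": "#0000FF",
}


def color(value):
    """Return value as "#RRGGBB" in upper case, or "" when it cannot be parsed."""
    text = value.strip().lower()
    if text in NAMED:
        return NAMED[text]
    m = re.fullmatch(r"#([0-9a-f]{6})", text)
    if m:
        return "#" + m.group(1).upper()
    m = re.fullmatch(r"#([0-9a-f]{3})", text)
    if m:
        # Expand the short form: #f00 becomes #FF0000.
        return "#" + "".join(c * 2 for c in m.group(1)).upper()
    m = re.fullmatch(r"rgb\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)", text)
    if m:
        parts = [int(g) for g in m.groups()]
        if all(part <= 255 for part in parts):
            return "#%02X%02X%02X" % tuple(parts)
    return ""


print(color("red"), color("rgb(255,0,0)"))  # prints "#FF0000 #FF0000"
```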

boolean f:contains-font-family(style, family, ..., family)

Returns true() if specified font-family style property contains any of the specified typefaces (case insensitive). Example: contains-font-family(../@s:font-family, "Times New Roman", "Courier New", "Menlo") returns true(). Returns false() when this cannot be determined.
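
The case-insensitive test can be sketched in Python as follows. The function below is an illustration written for this section, not the add-on's implementation; it assumes the font-family value is a plain comma-separated list with optional quotes.

```python
def contains_font_family(style, *families):
    """True if the font-family value lists any of the given typefaces."""
    listed = {name.strip().strip("\"'").lower() for name in style.split(",")}
    return any(family.strip().lower() in listed for family in families)


print(contains_font_family('"Courier New", monospace', "Times New Roman", "courier new"))
# prints "True"
```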

number f:content-type(node?)

Returns a numeric code indicating the type of contents of specified element (or parent element of specified node in case specified node is not an element). Parameter node defaults to the context element (or the parent element of the context node if the context node is not an element). Returns -1 when the content type cannot be determined.

Code  Description
0     Empty.
1     Whitespace only. Important: non-breaking space characters (&nbsp;) are considered to be whitespace.
2     Element only.
3     Elements and whitespace.
4     Text other than whitespace (words) but no elements.
6     Words and elements.

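
The codes above behave like bit flags (1 for whitespace, 2 for elements, 4 for words, with words taking precedence over whitespace). The Python sketch below reproduces the table under that assumption; it is an illustration, it does not return -1, and it relies on ElementTree discarding comments and processing instructions when parsing.

```python
import xml.etree.ElementTree as ET

WHITESPACE = " \t\r\n\u00a0"  # non-breaking space counts as whitespace


def content_type(element):
    """Return the content-type code of element, following the table above."""
    has_elements = len(element) > 0
    # Text directly inside element: its own text plus the tail of each child.
    chunks = [element.text or ""] + [child.tail or "" for child in element]
    text = "".join(chunks)
    has_words = text.strip(WHITESPACE) != ""
    has_space = text != "" and not has_words
    return (4 if has_words else 0) | (2 if has_elements else 0) | (1 if has_space else 0)


print(content_type(ET.fromstring("<p>word <b/></p>")))  # prints 6
```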
number f:em(number)

Converts number, a number expressed in em, to points, using the font size of the styled element containing the context node. Example: em(1) returns 10. Returns NaN when the conversion fails.

number f:ex(number)

Converts number, a number expressed in ex, to points, using the font size of the styled element containing the context node. Example: ex(1) returns 5. Returns NaN when the conversion fails.

boolean f:font-family(family, node?)

Returns true() if the styled element containing specified node uses specified typeface (case insensitive). Parameter node defaults to the styled element containing the context node. Examples: font-family("Times New Roman") returns true(), font-family("Times New Roman", .//html:tt) returns true(). Returns false() when this cannot be determined.

number f:font-size(node?)

Returns the font size, expressed in points, of the styled element containing specified node. Parameter node defaults to the styled element containing the context node. Examples: font-size() returns 10, font-size(./html:b) returns 10. Returns NaN when the font size cannot be determined.

number f:length(length)

Converts length to points. Example: length("1in") returns 72. Returns NaN when the conversion fails.

For some units, function length() has to use the font size of the styled element containing the context node in order to perform the conversion. Example: length("1em") returns 10.
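
The unit conversions documented in this reference (cm, in, mm, pc, pt, px at 96DPI, em, ex) can be summarized by the Python sketch below. It is an approximation: the font size is passed explicitly instead of being taken from the context node, and 1ex is assumed to be half of 1em, which is consistent with the em(1) and ex(1) examples given in this section but is not stated by the documentation.

```python
import math
import re

# Points per unit for the absolute units documented in this reference.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
    "pc": 12.0,
    "px": 72.0 / 96.0,  # 96DPI resolution
}


def length(value, font_size=10.0):
    """Convert a CSS-like length such as "1in" to points; NaN on failure."""
    m = re.fullmatch(r"\s*([+-]?(?:\d+\.?\d*|\.\d+))\s*([a-z]+)\s*", value.lower())
    if not m:
        return math.nan
    number, unit = float(m.group(1)), m.group(2)
    if unit in POINTS_PER_UNIT:
        return number * POINTS_PER_UNIT[unit]
    if unit == "em":
        return number * font_size
    if unit == "ex":
        return number * font_size / 2  # assumption: 1ex = 0.5em
    return math.nan


print(length("1in"), length("100px"), length("10pc"))  # prints "72.0 75.0 120.0"
```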

number f:line-height(node?)

Returns the line height, expressed in points, of the styled element containing specified node. Parameter node defaults to the styled element containing the context node. Examples: line-height() returns 12, line-height(.//html:p) returns 12. Returns NaN when the line height cannot be determined.

number f:in(number)

Converts number, a number expressed in inches, to points. Example: in(1) returns 72. Returns NaN when the conversion fails.

number f:mm(number)

Converts number, a number expressed in millimeters, to points. Example: mm(25.4) returns 72. Returns NaN when the conversion fails.

boolean f:monospaced-font-family(node?)

Returns true() if the styled element containing specified node uses a monospaced font. Parameter node defaults to the styled element containing the context node. Examples: monospaced-font-family() returns false(), monospaced-font-family(.//html:b) returns false(). Returns false() when this cannot be determined.

number f:parse-list-value(ol_type, pattern, value, start)

Parses label value (e.g. "III.D.") of a list item using the format specified by the combination of ol_type ("1", "a", "A", "i", "I") and pattern (e.g. "%1)", "%1.%2."). Returns the number corresponding to the label. Returns start (e.g. 1) when the label cannot be parsed. Examples: parse-list-value("A", "%1.%2.", "III.D.", 1) returns 4; parse-list-value("1", "%1)", "(two)", 0) returns 0.
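
One plausible way to obtain these results is sketched below in Python. The approach is an assumption, not the documented algorithm: the pattern is turned into a regular expression, the last %N placeholder is taken to be the item's own counter, and that counter is decoded according to ol_type.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def _roman(text):
    digits = [ROMAN[c] for c in text.lower()]  # KeyError on a non-Roman letter
    total = 0
    for i, d in enumerate(digits):
        # A smaller digit placed before a larger one is subtracted (IV = 4).
        total += -d if i + 1 < len(digits) and d < digits[i + 1] else d
    return total


def _alpha(text):
    total = 0
    for c in text.lower():
        if not "a" <= c <= "z":
            raise ValueError(text)
        total = total * 26 + (ord(c) - ord("a") + 1)  # a=1, ..., z=26, aa=27
    return total


def parse_list_value(ol_type, pattern, value, start):
    # Turn "%1.%2." into a regex: each %N captures a counter, the rest is literal.
    parts = re.split(r"(%\d)", pattern)
    regex = "".join(
        r"(\w+)" if re.fullmatch(r"%\d", part) else re.escape(part) for part in parts
    )
    match = re.fullmatch(regex, value.strip())
    if not match or not match.groups():
        return start
    counter = match.groups()[-1]  # assumed: last placeholder = this item's counter
    try:
        if ol_type == "1":
            return int(counter)
        if ol_type in ("a", "A"):
            return _alpha(counter)
        if ol_type in ("i", "I"):
            return _roman(counter)
    except (KeyError, ValueError):
        pass
    return start


print(parse_list_value("A", "%1.%2.", "III.D.", 1))  # prints 4
print(parse_list_value("1", "%1)", "(two)", 0))      # prints 0
```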

number f:percent(percent)

Converts percent to a number. Example: percent("10%") returns 10. Returns NaN when the conversion fails.

number f:pc(number)

Converts number, a number expressed in picas, to points. Example: pc(10) returns 120. Returns NaN when the conversion fails.

number f:pt(number)

Converts number, a number expressed in points, to points. Example: pt(10) returns 10. Returns NaN when the conversion fails.

number f:px(number)

Converts number, a number expressed in pixels, to points, using a 96DPI resolution. Example: px(100) returns 75. Returns NaN when the conversion fails.

number f:vertical-align(node?)

Returns the vertical align, an offset from the baseline expressed in points, of the styled element containing specified node. Parameter node defaults to the styled element containing the context node. Examples: vertical-align() returns -3, vertical-align(.//html:p) returns -3. Returns NaN when the vertical align cannot be determined or is a keyword (sub, super) and not a length or percentage.

1.2.3. Transform stylesheets

Note

The documentation found in this section is currently insufficient to be able to parameterize and/or customize the transform stylesheets. For now, you'll have to read the XSLT 1.0 source of these stylesheets. All these stylesheets are found in addon_install_dir/xslt/.

addon_install_dir/xslt/docbook5.xslt

Converts the "XHTML 1.0 Transitional" document created during phase #2 to a DocBook 5 document.

addon_install_dir/xslt/docbook.xslt

Converts the "XHTML 1.0 Transitional" document created during phase #2 to a DocBook 4 document.

addon_install_dir/xslt/topic.xslt

Converts the "XHTML 1.0 Transitional" document created during phase #2 to a DITA topic.

addon_install_dir/xslt/xhtml1_1.xslt

Converts the "XHTML 1.0 Transitional" document created during phase #2 to an XHTML 1.1 document.

addon_install_dir/xslt/xhtml5.xslt

Converts the "XHTML 1.0 Transitional" document created during phase #2 to an XHTML 5 document.

addon_install_dir/xslt/xhtml_loose.xslt

Converts the "XHTML 1.0 Transitional" document created during phase #2 to a completely structured "XHTML 1.0 Transitional" document.

addon_install_dir/xslt/xhtml_strict.xslt

Converts the "XHTML 1.0 Transitional" document created during phase #2 to an "XHTML 1.0 Strict" document.

1.2.4. Engine options

Process options:

-p name value

Set parameter name to value.

Parameters starting with "transform." are passed to the XSLT stylesheet, if any, after removing the "transform." prefix. All the other parameters are passed as is to the main .xed script, if any.

-pu name URL_or_file

Same as "-p", except that parameter value URL_or_file is first converted to a URL.

URL_or_file is a URL or an absolute or relative (to current working directory) filename.

-s xed_URL_or_file

Specifies which main .xed script to use to modify the document.

Specify an empty string ("") to suppress the edit phase.

Default script: paste-from-word:xed/main.xed.

-t xslt_URL_or_file

Specifies which XSLT 1.0 stylesheet to use to transform the document.

Specify an empty string ("") to suppress the transform phase.
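
The routing rule described for "-p" above (the "transform." prefix selects the XSLT stylesheet, everything else goes to the main .xed script) can be sketched in Python as follows. The function name is hypothetical, and "transform.foo" is a made-up parameter used only to show the prefix being stripped.

```python
def route_parameters(params):
    """Split parameters into (xed_params, xslt_params) following the -p rule."""
    prefix = "transform."
    xed_params, xslt_params = {}, {}
    for name, value in params.items():
        if name.startswith(prefix):
            # Passed to the XSLT stylesheet without the "transform." prefix.
            xslt_params[name[len(prefix):]] = value
        else:
            # Passed as is to the main .xed script.
            xed_params[name] = value
    return xed_params, xslt_params


xed, xslt = route_parameters({
    "lang.lang": "fr",
    "sections.max-level": "3",
    "transform.foo": "bar",  # hypothetical stylesheet parameter
})
print(xed, xslt)
```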

Process options modifying default script "paste-from-word:xed/main.xed":

-parse

Save XHTML without fully processing it. (Stop processing after edit step "styles".)

-i step xed_URL_or_file

Insert the specified .xed script before step.

-a step xed_URL_or_file

Add the specified .xed script after step.

-r step xed_URL_or_file

Replace step by the specified .xed script.

Step may be a single step name or a range: "first..last" or "..last" or "first..".

-d step

Delete step.

Step may be a single step name or a range.
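
The range notation can be resolved against the step order given in Section 1.2.1, as in the Python sketch below. The function name is hypothetical, and treating both ends of a range as inclusive is an assumption.

```python
# Edit steps in the order in which main.xed invokes them (Section 1.2.1).
STEPS = [
    "after-parse", "styles", "prune", "lang", "title", "biblio", "index",
    "xrefs", "inlines", "tables", "captions", "headings", "lists",
    "footnotes", "sections", "ids", "finish", "before-save",
]


def resolve_steps(spec, steps=STEPS):
    """Expand "step", "first..last", "..last" or "first.." to a list of steps."""
    first, sep, last = spec.partition("..")
    if not sep:
        if spec not in steps:
            raise ValueError("unknown step: %r" % spec)
        return [spec]
    # list.index raises ValueError for an unknown step name.
    start = steps.index(first) if first else 0
    end = steps.index(last) if last else len(steps) - 1
    if start > end:
        raise ValueError("empty range: %r" % spec)
    return steps[start:end + 1]  # assumed: both ends are inclusive


print(resolve_steps("title..index"))  # prints "['title', 'biblio', 'index']"
```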