Validate OOXML

Ensure that OOXML documents created from scratch conform to the standard.

This document is also posted to my blog on Linuxquestions.org.

Disclaimer

None of this is my invention: I neither discovered nor developed the procedures described on this page. Everything that I reproduce here was communicated to me by NevemTeve in a discussion on LinuxQuestions.org.

There is now a second thread on LinuxQuestions.org dealing with OOXML validation.

Introduction

Although modern word processors offer many functions that ease the writing of complex text documents, some recurring tasks, performed often and in the same way within a document, cannot be fully automated with the commands that the program offers. Because documents are produced for specific purposes and the needs of individual users cannot be anticipated in every detail, software companies add scripting interfaces to their office software.

Where such a scripting interface is not present, you can still automate the generation and manipulation of office-documents.

Modern word processors read and write compressed XML files. Both the Microsoft® file format OOXML (used, for example, in docx files) and ODF are based on XML. To read and manipulate the content and formatting of such documents, you only need to edit the XML files that you find after unzipping an ODT or DOCX file:

user@machine:/tmp/docx$ unzip ../rudi.docx
Archive:  ../rudi.docx
  inflating: _rels/.rels             
  inflating: docProps/core.xml       
  inflating: docProps/app.xml        
  inflating: word/_rels/document.xml.rels  
  inflating: word/document.xml       
  inflating: word/styles.xml         
  inflating: word/fontTable.xml      
  inflating: word/settings.xml       
  inflating: [Content_Types].xml  

You can find the meaning of each XML tag, all the possible XML attributes, and the rules for their use in specific contexts on specialised web sites. Here, I concentrate on OOXML only: http://officeopenxml.com/index.php

Motivation

Writing OOXML from scratch can be complicated. As long as you modify only text nodes, little can go wrong. But as soon as you manipulate XML tags or introduce more tags and more complex tag structures into your document, you must strictly obey the rules of the OOXML standard. Where programmed routines perform those manipulations, they can rapidly and profoundly alter the file structure together with the actual content.

Even if everything looks fine when you open the resulting document in your word processor, other programs may run into trouble if your OOXML code is not what they expect. Interoperability, comparability and comprehension are what standards are meant to achieve in the first place. You should therefore routinely validate your own OOXML documents against the OOXML standard, to be sure that the routines which generate or modify OOXML files work reliably in all situations.

This document describes a way to validate OOXML wordprocessor files against the pertinent OOXML Schemas, in order to locate and identify potential errors.

XML Schema validation with xmllint

I will first present the command line that you execute to validate a word-processor file and explain its components. The remaining task is then to ensure that the conditions for a successful execution of the command are met (read on below).

One last remark: a surprising number of file manipulations are needed before you can validate OOXML with the procedure presented on this page. I consider this unsatisfactory and am still looking for a simplification. Note, however, that once the preparations are complete, repeated validations are as easy as launching xmllint with the few arguments shown here:

xmllint -noout -debugent -schema ooxml_xsd/wml.xsd document.xml
xmllint
xmllint is a general-purpose XML parser. Consult the xmllint man page for a complete description of its many options. On a Linux system, xmllint ships with libxml2 (some distributions package it separately, for example as libxml2-utils).
-noout
This option tells xmllint to produce no output other than potential error and warning messages.
-debugent
Prints debugging information about the entities defined in the source document.
-schema
The location of the top-level schema file against which the source document is validated.
document.xml
The XML document to validate. document.xml is also the main component of an OOXML word-processor file. It holds the textual content and the structure of the enclosing tags, as in this example of a document.xml file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
mc:Ignorable="w14 wp14">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading1" />
        <w:bidi w:val="0" />
        <w:spacing w:before="240" w:after="120" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Validate OOXML</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Ensure the standard-conformance of OOXML-documents
        created from scratch.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading2" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Contents</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading2" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:bookmarkStart w:id="0" w:name="intro" />
      <w:bookmarkEnd w:id="0" />
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Introduction</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Even if modern wordprocessors are enriched with many
        functions which facilitate the redaction of complex
        text-documents, some recurring tasks, which are performed
        often and in the same way within the same document, cannot
        be completely automated with the commands that the program
        offers. As documents are produced for specific purposes and
        the needs of individual users cannot be anticipated in all
        detail, software-companies integrate scripting interfaces
        to their office-software.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Where such a scripting interface is not present, you
        can still automate the generation and manipulation of
        office-documents.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Modern wordprocessors read from and write to
        compressed XML-files. The Microsoft® file-format OOXML –
        e.g in docx-files – as well as ODF base on XML. To read and
        manipulate the content and formatting of such documents you
        only need to edit the XML-files which you discover after
        unzipping an ODT- or DOCX-file.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>You can find the meaning of each of the XML-tags all
        the possible XML-attributes, as well as the rules for their
        deployment in specific contexts on specialised web-sites.
        Here, I want to concentrate on OOXML only: [ OOXML -
        reference goes here ]</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading2" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:bookmarkStart w:id="1" w:name="motivation" />
      <w:bookmarkEnd w:id="1" />
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Motivation</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="TextBody" />
        <w:bidi w:val="0" />
        <w:spacing w:lineRule="auto" w:line="276" w:before="0"
        w:after="140" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
        <w:t>Writing OOXML from scratch can be complicated. As long
        as you do only modify text-nodes, nothing can happen. But
        as soon as you manipulate XML-nodes or introduce more tags
        and complexer tag-structures to your document, you have to
        be careful to obey st</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Normal" />
        <w:bidi w:val="0" />
        <w:jc w:val="left" />
        <w:rPr></w:rPr>
      </w:pPr>
      <w:r>
        <w:rPr></w:rPr>
      </w:r>
    </w:p>
    <w:sectPr>
      <w:type w:val="nextPage" />
      <w:pgSz w:w="12240" w:h="15840" />
      <w:pgMar w:left="1134" w:right="1134" w:header="0"
      w:top="1134" w:footer="0" w:bottom="1134" w:gutter="0" />
      <w:pgNumType w:fmt="decimal" />
      <w:formProt w:val="false" />
      <w:textDirection w:val="lrTb" />
    </w:sectPr>
  </w:body>
</w:document>

Before you can validate anything, you must ensure that all the necessary schemas, in the form of *.xsd files, can be accessed by an XML-parser.
I will show you the steps to establish this validating-environment.

Set-up step by step

I. Provide the schema catalog
Ensure that the file /usr/local/etc/xml/catalog exists; otherwise create it, as root:
$ mkdir -p /usr/local/etc/xml
$ cat >/usr/local/etc/xml/catalog <<DONE
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
  <nextCatalog catalog="file:///etc/xml/catalog"/>
</catalog>
DONE
II. Provide xml.xsd
Ensure that /usr/local/etc/xml/catalog contains the line
  <uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
If it is missing, insert the line before the tag <nextCatalog>, just as shown above.
You also need the actual schema file xml.xsd, stored under the file name that the catalog entry refers to:
wget -O /usr/local/etc/xml/xml_2009_01.xsd http://www.w3.org/2009/01/xml.xsd
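Steps I and II can be combined in a small script. What follows is only a sketch under my own assumptions: ensure_catalog is a name made up for illustration, the catalog directory is passed as an absolute path so that you can try the function on a scratch directory without root privileges, and an existing catalog is assumed to keep each tag on a line of its own.

```shell
#!/bin/sh
# Sketch: make sure an XML catalog exists and maps the XML namespace
# to a local copy of xml.xsd. ensure_catalog is a hypothetical helper.

XML_NS='http://www.w3.org/XML/1998/namespace'

ensure_catalog() {
    dir=$1                      # absolute path, e.g. /usr/local/etc/xml
    catalog=$dir/catalog
    uri_line="  <uri name=\"$XML_NS\" uri=\"file://$dir/xml_2009_01.xsd\"/>"

    mkdir -p "$dir" || return 1

    if [ ! -f "$catalog" ]; then
        # Step I: no catalog yet, so write a complete one.
        cat >"$catalog" <<EOF
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
$uri_line
  <nextCatalog catalog="file:///etc/xml/catalog"/>
</catalog>
EOF
    elif ! grep -qF "name=\"$XML_NS\"" "$catalog"; then
        # Step II: the catalog exists but lacks the entry. Insert it before
        # <nextCatalog>, or before </catalog> if there is no <nextCatalog>.
        awk -v line="$uri_line" '
            /<nextCatalog|<\/catalog>/ && !done { print line; done = 1 }
            { print }
        ' "$catalog" >"$catalog.tmp" && mv "$catalog.tmp" "$catalog"
    fi
}

# Usage (as root for the system-wide location):
#   ensure_catalog /usr/local/etc/xml
```

Calling the function a second time changes nothing, so it is safe to keep it in a set-up script that runs repeatedly.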
III. Provide the OOXML-xsd files
The schema files can be downloaded from https://repo1.maven.org/maven2/org/apache/poi/ooxml-schemas/1.4/.
Choose the file ooxml-schemas-1.4.jar and download it.
Unzip the file (a jar file is an ordinary zip archive), e.g. to your temporary directory, and locate the xsd-files in the sub-directory schemaorg_apache_xmlbeans/src. Move all the xsd-files to a directory that will be accessible later, when calling the XML-parser, e.g. a sub-directory of your working-directory:
:~/project$ mkdir ooxml_xsd
:~/project$ cd ooxml_xsd
:~/project/ooxml_xsd$ mv /tmp/schemaorg_apache_xmlbeans/src/*.xsd ./
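If you prefer not to unpack the whole archive, unzip can extract only the xsd-files and drop their directory part in one go. This is a sketch, not a tested recipe for every unzip version: extract_xsd is a hypothetical helper name, and I assume the Info-ZIP unzip that most Linux distributions install.

```shell
#!/bin/sh
# Sketch of step III as one function. extract_xsd is a hypothetical name.

extract_xsd() {
    jar=$1      # the downloaded ooxml-schemas-1.4.jar
    dest=$2     # directory that shall receive the *.xsd files

    if [ ! -f "$jar" ]; then
        echo "extract_xsd: no such file: $jar" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1

    # -j drops the directory part of each member, -o overwrites without
    # asking, -d names the target directory.
    unzip -j -o "$jar" 'schemaorg_apache_xmlbeans/src/*.xsd' -d "$dest" >/dev/null || return 1

    # wml.xsd is the entry point for the validation, so check that it arrived.
    [ -f "$dest/wml.xsd" ]
}

# Usage:
#   extract_xsd /tmp/ooxml-schemas-1.4.jar ~/project/ooxml_xsd
```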
IV. Complete wml.xsd
Open the schema file wml.xsd and find the <xsd:import> tag with the id xml (the fourth at the time of writing). Add the schemaLocation attribute to this line, or replace the line, so that it is identical to the following. The schemaLocation value is the namespace URI itself; the catalog from step I maps it to the local copy of xml.xsd:
  <xsd:import id="xml" namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/XML/1998/namespace"/> 
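The same edit can be made without opening an editor. The following sketch rests on two assumptions of mine that you should check against your copy of wml.xsd: the <xsd:import> tag sits on a single line, and id="xml" is its first attribute. patch_wml_import is a made-up name.

```shell
#!/bin/sh
# Sketch of step IV. patch_wml_import is a hypothetical helper.
# Assumes the import tag is on one line and starts with id="xml".

patch_wml_import() {
    xsd=$1
    new='<xsd:import id="xml" namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/XML/1998/namespace"/>'

    # Replace the whole tag; "|" is the sed delimiter because the
    # replacement text contains slashes. Indentation is left untouched.
    sed "s|<xsd:import id=\"xml\"[^>]*>|$new|" "$xsd" >"$xsd.tmp" &&
        mv "$xsd.tmp" "$xsd"
}

# Usage:
#   patch_wml_import ~/project/ooxml_xsd/wml.xsd
```

Because the complete tag is replaced, running the function on an already patched file leaves it unchanged.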
V. Consolidate duplicated import of the same namespace in dml-wordprocessingDrawing.xsd
Create an xsd-file dml-wordprocessingDrawing_import.xsd with the following content:
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/drawingml/2006/main"
   elementFormDefault="qualified"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:include schemaLocation="dml-graphicalObject.xsd"/>
  <xsd:include schemaLocation="dml-documentProperties.xsd"/>
</xsd:schema> 
Now open the schema file dml-wordprocessingDrawing.xsd. Replace the two <xsd:import> tags whose schemaLocation values are dml-graphicalObject.xsd and dml-documentProperties.xsd with one single line that imports only the newly created schema-file:
  <xsd:import schemaLocation="dml-wordprocessingDrawing_import.xsd" namespace="http://schemas.openxmlformats.org/drawingml/2006/main" /> 
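Step V, too, can be scripted. The sketch below assumes that the two imports to be replaced are the ones pointing at dml-graphicalObject.xsd and dml-documentProperties.xsd (the two files that the new schema includes) and that each of them occupies a single line. consolidate_dml_imports is a name invented for this example.

```shell
#!/bin/sh
# Sketch of step V. consolidate_dml_imports is a hypothetical helper.
# Assumes each <xsd:import> tag is written on one line.

consolidate_dml_imports() {
    dir=$1      # directory that holds the xsd-files
    target=$dir/dml-wordprocessingDrawing.xsd
    ns='http://schemas.openxmlformats.org/drawingml/2006/main'

    # Write the small schema that includes both files under one namespace.
    cat >"$dir/dml-wordprocessingDrawing_import.xsd" <<EOF
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema targetNamespace="$ns"
   elementFormDefault="qualified"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:include schemaLocation="dml-graphicalObject.xsd"/>
  <xsd:include schemaLocation="dml-documentProperties.xsd"/>
</xsd:schema>
EOF

    # Drop the two original imports; print the single new import in
    # place of the first one that is found.
    awk -v ns="$ns" '
        /<xsd:import/ && /schemaLocation="dml-(graphicalObject|documentProperties)\.xsd"/ {
            if (!done) {
                print "  <xsd:import schemaLocation=\"dml-wordprocessingDrawing_import.xsd\" namespace=\"" ns "\" />"
                done = 1
            }
            next
        }
        { print }
    ' "$target" >"$target.tmp" && mv "$target.tmp" "$target"
}

# Usage:
#   consolidate_dml_imports ~/project/ooxml_xsd
```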

Invoking xmllint

The call to xmllint has already been shown above, but before executing the command you must set the environment variable XML_CATALOG_FILES to the location of the schema catalog; otherwise only the standard path /etc/xml/catalog would be read. This is an example of a successful validation with xmllint after the preparatory tasks listed above have been completed:

user@machine:/tmp$ export XML_CATALOG_FILES=/usr/local/etc/xml/catalog 
user@machine:/tmp$ xmllint -noout -debugent -schema ~/prog/ooxml_xsd/wml.xsd ./docx/word/document.xml 
new input from file: /prog/ooxml_xsd/wml.xsd
new input from file: /prog/ooxml_xsd/shared-customXmlSchemaProperties.xsd
new input from file: /prog/ooxml_xsd/shared-math.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing_import.xsd
new input from file: /prog/ooxml_xsd/dml-graphicalObject.xsd
new input from file: /prog/ooxml_xsd/dml-documentProperties.xsd
new input from file: /prog/ooxml_xsd/dml-baseTypes.xsd
new input from file: /prog/ooxml_xsd/shared-relationshipReference.xsd
new input from file: /prog/ooxml_xsd/dml-shapeGeometry.xsd
new input from file: /prog/xml.xsd
new input from file: docx/word/document.xml
docx/word/document.xml validates
DOCUMENT
No entities in internal subset
No entities in external subset 
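For repeated validations, the two commands can be wrapped so that you pass the docx-file directly. This is a sketch only: validate_docx is a made-up name, the default catalog path is the one from step I, and I left out -debugent to keep the output short. The function extracts word/document.xml to a temporary directory and hands xmllint's exit status back to the caller.

```shell
#!/bin/sh
# Sketch: validate word/document.xml of a docx-file in one call.
# validate_docx is a hypothetical helper, not part of xmllint.

validate_docx() {
    docx=$1
    schema_dir=$2                               # directory with wml.xsd
    catalog=${3:-/usr/local/etc/xml/catalog}    # catalog from step I

    if [ ! -f "$docx" ]; then
        echo "validate_docx: no such file: $docx" >&2
        return 2
    fi

    work=$(mktemp -d) || return 2

    # -p writes the archive member to stdout instead of unpacking the tree.
    if ! unzip -p "$docx" word/document.xml >"$work/document.xml"; then
        rm -rf "$work"
        return 2
    fi

    # The variable is set for this one command only.
    status=0
    XML_CATALOG_FILES=$catalog xmllint -noout -schema "$schema_dir/wml.xsd" "$work/document.xml" || status=$?

    rm -rf "$work"
    return "$status"
}

# Usage:
#   validate_docx ../rudi.docx ~/prog/ooxml_xsd && echo "document is valid"
```

A status of zero means that the document validates; any other value means that the file could not be read or that xmllint reported a problem.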

Now please just believe me: This is cool.