Ensure the standard-conformance of OOXML-documents created from scratch.
This document is also posted to my blog on Linuxquestions.org.
None of this is my own invention: I neither discovered nor developed the procedures described on this page. All of the knowledge that I merely reproduce here was communicated to me by NevemTeve in a discussion on LinuxQuestions.org.
There is now a second thread on LinuxQuestions.org, dealing with OOXML validation.
Even though modern wordprocessors offer many functions which facilitate the writing of complex text-documents, some recurring tasks, which are performed often and in the same way within the same document, cannot be completely automated with the commands that the program offers. Because documents are produced for specific purposes and the needs of individual users cannot be anticipated in every detail, software-companies integrate scripting interfaces into their office-software.
Where such a scripting interface is not present, you can still automate the generation and manipulation of office-documents.
Modern wordprocessors read from and write to compressed XML-files. The Microsoft® file-format OOXML – e.g. in docx-files – as well as ODF are based on XML. To read and manipulate the content and formatting of such documents, you only need to edit the XML-files which you find after unzipping an ODT- or DOCX-file:
user@machine:/tmp/docx$ unzip ../rudi.docx
Archive:  ../rudi.docx
  inflating: _rels/.rels
  inflating: docProps/core.xml
  inflating: docProps/app.xml
  inflating: word/_rels/document.xml.rels
  inflating: word/document.xml
  inflating: word/styles.xml
  inflating: word/fontTable.xml
  inflating: word/settings.xml
  inflating: [Content_Types].xml
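The same inspection can also be done from a script. The following is only a minimal sketch in Python, using nothing but the standard library. The helper names list_parts and read_part, as well as the tiny stand-in archive built in memory, are assumptions made for illustration and are not part of the procedure described on this page.

```python
import io
import zipfile


def list_parts(docx_file):
    """Return the names of all parts stored in a DOCX (ZIP) container."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_part(docx_file, part_name):
    """Return the raw bytes of one part, e.g. 'word/document.xml'."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(part_name)


# Build a tiny stand-in archive in memory, so that the sketch runs
# without a real docx-file (hypothetical content, for illustration only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")

parts = list_parts(buffer)
document_xml = read_part(buffer, "word/document.xml")
print(parts)
```

With a real file you would pass its path, for example list_parts("rudi.docx"), instead of the in-memory buffer.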
You can find the meaning of each of the XML-tags, all the possible XML-attributes, as well as the rules for their deployment in specific contexts on specialised web-sites. Here, I want to concentrate on OOXML only: http://officeopenxml.com/index.php
Writing OOXML from scratch can be complicated. As long as you only modify text-nodes, nothing can go wrong. But as soon as you manipulate XML-tags or introduce more tags and more complex tag-structures into your document, you have to be careful to adhere strictly to the rules of the OOXML standard. Where programmed routines are responsible for those manipulations, they can rapidly and profoundly alter the file-structures together with the actual content.
Even if everything looks fine, and just as you want it, after opening the resulting document in your wordprocessor, other programs can run into trouble if your OOXML code is not what they expect. Interoperability, comparability and comprehension are, after all, what standards are meant to achieve. You should therefore routinely validate your own OOXML-documents against the OOXML standard, to be sure that the routines which generate or modify OOXML files work reliably in all situations.
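To illustrate the harmless case, in which only text-nodes change while the tag-structure stays untouched, here is a small Python sketch. The function replace_text and the sample document are assumptions of mine, made for illustration only. Note that the sketch only finds text which lies completely inside one single w:t element; text that a wordprocessor has split over several runs is not matched.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

# Hypothetical, heavily shortened document.xml, for illustration only.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p><w:r>'
    "<w:t>Validate OOXML</w:t>"
    "</w:r></w:p></w:body></w:document>" % W_NS
)


def replace_text(root, old, new):
    """Replace old by new in every w:t text-node.

    Returns the number of text-nodes that were changed.
    """
    changed = 0
    for node in root.iter("{%s}t" % W_NS):
        if node.text and old in node.text:
            node.text = node.text.replace(old, new)
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
count = replace_text(root, "Validate", "Check")
texts = [node.text for node in root.iter("{%s}t" % W_NS)]
tags = [node.tag.split("}")[1] for node in root.iter()]
print(count, texts, tags)
```

The list of tags is the same before and after the call, which is exactly why this kind of change does not put the conformance of the document at risk.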
This document describes a way to validate OOXML wordprocessor files against the pertinent OOXML Schemas, in order to locate and identify potential errors.
I prefer to present first the command-line which you will execute to validate a wordprocessor-file, and to explain its components. The objective is then to ensure that the conditions for a successful execution of the command are met (read on below).
One last remark: a surprising number of file-manipulations are needed before you can validate OOXML with the procedure I chose to present on this page. I consider this unsatisfactory and am still looking for simplifications. But note also that, once the preparations are completed, repeated validations are as easy as launching xmllint with the few arguments that are included in the command shown here:
xmllint -noout -debugent -schema ooxml_xsd/wml.xsd document.xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?> <w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" mc:Ignorable="w14 wp14"> <w:body> <w:p> <w:pPr> <w:pStyle w:val="Heading1" /> <w:bidi w:val="0" /> <w:spacing w:before="240" w:after="120" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Validate OOXML</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Ensure the standard-conformance of OOXML-documents created from scratch.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Heading2" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Contents</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Heading2" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:bookmarkStart w:id="0" w:name="intro" /> <w:bookmarkEnd w:id="0" /> <w:r> <w:rPr></w:rPr> <w:t>Introduction</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Even if 
modern wordprocessors are enriched with many functions which facilitate the redaction of complex text-documents, some recurring tasks, which are performed often and in the same way within the same document, cannot be completely automated with the commands that the program offers. As documents are produced for specific purposes and the needs of individual users cannot be anticipated in all detail, software-companies integrate scripting interfaces to their office-software.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Where such a scripting interface is not present, you can still automate the generation and manipulation of office-documents.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Modern wordprocessors read from and write to compressed XML-files. The Microsoft® file-format OOXML – e.g in docx-files – as well as ODF base on XML. To read and manipulate the content and formatting of such documents you only need to edit the XML-files which you discover after unzipping an ODT- or DOCX-file.</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>You can find the meaning of each of the XML-tags all the possible XML-attributes, as well as the rules for their deployment in specific contexts on specialised web-sites. 
Here, I want to concentrate on OOXML only: [ OOXML - reference goes here ]</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Heading2" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:bookmarkStart w:id="1" w:name="motivation" /> <w:bookmarkEnd w:id="1" /> <w:r> <w:rPr></w:rPr> <w:t>Motivation</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="TextBody" /> <w:bidi w:val="0" /> <w:spacing w:lineRule="auto" w:line="276" w:before="0" w:after="140" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> <w:t>Writing OOXML from scratch can be complicated. As long as you do only modify text-nodes, nothing can happen. But as soon as you manipulate XML-nodes or introduce more tags and complexer tag-structures to your document, you have to be careful to obey st</w:t> </w:r> </w:p> <w:p> <w:pPr> <w:pStyle w:val="Normal" /> <w:bidi w:val="0" /> <w:jc w:val="left" /> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r> </w:p> <w:sectPr> <w:type w:val="nextPage" /> <w:pgSz w:w="12240" w:h="15840" /> <w:pgMar w:left="1134" w:right="1134" w:header="0" w:top="1134" w:footer="0" w:bottom="1134" w:gutter="0" /> <w:pgNumType w:fmt="decimal" /> <w:formProt w:val="false" /> <w:textDirection w:val="lrTb" /> </w:sectPr> </w:body> </w:document>
Before you can validate anything, you must ensure that all the necessary schemas, in the form of *.xsd files, can be accessed by an XML-parser. I will show you the steps to establish this validating-environment.
$ mkdir -p /usr/local/etc/xml
$ cat >/usr/local/etc/xml/catalog <<DONE
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
  <nextCatalog catalog="file:///etc/xml/catalog"/>
</catalog>
DONE
<uri name="http://www.w3.org/XML/1998/namespace" uri="file:///usr/local/etc/xml/xml_2009_01.xsd"/>
If it is missing, insert the line before the tag <nextCatalog>, just as it is shown above.
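The check for the uri entry can be sketched in Python as follows. The function name ensure_uri_entry and the shortened sample catalog are assumptions for illustration; the namespace, the name and the uri are the ones from the catalog shown above.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"
XML_NS_NAME = "http://www.w3.org/XML/1998/namespace"
XSD_URI = "file:///usr/local/etc/xml/xml_2009_01.xsd"


def ensure_uri_entry(catalog_root):
    """Insert the <uri> mapping before <nextCatalog> if it is missing.

    Returns True when an entry was added, False when it was already there.
    """
    uri_tag = "{%s}uri" % CATALOG_NS
    next_tag = "{%s}nextCatalog" % CATALOG_NS
    for entry in catalog_root.iter(uri_tag):
        if entry.get("name") == XML_NS_NAME:
            return False
    children = list(catalog_root)
    position = len(children)  # append if there is no <nextCatalog>
    for index, child in enumerate(children):
        if child.tag == next_tag:
            position = index
            break
    new_entry = ET.Element(uri_tag, {"name": XML_NS_NAME, "uri": XSD_URI})
    catalog_root.insert(position, new_entry)
    return True


# Hypothetical catalog without the <uri> line, for illustration only.
SAMPLE = (
    '<catalog xmlns="%s">'
    '<nextCatalog catalog="file:///etc/xml/catalog"/>'
    "</catalog>" % CATALOG_NS
)
root = ET.fromstring(SAMPLE)
added_first = ensure_uri_entry(root)
added_second = ensure_uri_entry(root)
order = [child.tag.split("}")[1] for child in root]
print(added_first, added_second, order)
```

Be aware that ElementTree does not keep the DOCTYPE declaration when a parsed catalog is written back to disk, so for a real catalog it may be simpler to use such a function only as a check and to insert the line by hand.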
wget -O /usr/local/etc/xml/xml_2009_01.xsd http://www.w3.org/2009/01/xml.xsd
:~/project$ mkdir ooxml_xsd
:~/project$ cd ooxml_xsd
:~/project/ooxml_xsd$ mv /tmp/schemaorg_apache_xmlbeans/src/*.xsd ./
xml (the fourth at the time of this writing). Complete this line with the schemaLocation attribute or replace it, so that it is identical to the following:
<xsd:import id="xml" namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/XML/1998/namespace"/>
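If you prefer to script this edit, a text-level replacement is one possible approach. The sketch below is an assumption of mine, not part of the original procedure: the function patch_xml_import and the shortened sample schema are hypothetical. I deliberately work on the text rather than on a parsed tree, because schema files carry prefixed names inside attribute values, and a parse-and-serialise round-trip may not preserve the prefix declarations that those names depend on.

```python
import re

XML_NS = "http://www.w3.org/XML/1998/namespace"

# The complete line, exactly as it should appear in the schema file.
REPLACEMENT = (
    '<xsd:import id="xml" namespace="%s" schemaLocation="%s"/>'
    % (XML_NS, XML_NS)
)

# Matches an empty <xsd:import .../> tag that imports the XML namespace,
# with or without a schemaLocation attribute.
PATTERN = re.compile(
    r'<xsd:import\b[^>]*\bnamespace="%s"[^>]*/>' % re.escape(XML_NS)
)


def patch_xml_import(schema_text):
    """Return (patched_text, number_of_replaced_tags)."""
    return PATTERN.subn(REPLACEMENT, schema_text)


# Hypothetical, shortened schema, for illustration only.
SAMPLE = (
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n'
    '  <xsd:import id="xml" namespace="%s"/>\n'
    "</xsd:schema>\n" % XML_NS
)

patched, count = patch_xml_import(SAMPLE)
patched_again, count_again = patch_xml_import(patched)
print(patched)
```

Applying the function a second time leaves the text unchanged, so it should be safe to run it repeatedly on the same file.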
<?xml version="1.0" encoding="utf-8"?>
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/drawingml/2006/main" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:include schemaLocation="dml-graphicalObject.xsd"/>
  <xsd:include schemaLocation="dml-documentProperties.xsd"/>
</xsd:schema>
Now open the schema file dml-wordprocessingDrawing.xsd. Replace the two <xsd:import> tags whose schemaLocation attributes are dml-graphicalObject.xsd and dml-documentProperties.xsd with one single line which imports only the newly created schema-file:
<xsd:import schemaLocation="dml-wordprocessingDrawing_import.xsd" namespace="http://schemas.openxmlformats.org/drawingml/2006/main" />
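This replacement, too, can be sketched as a text-level edit in Python. The function merge_imports is hypothetical, and the sample schema is a shortened stand-in which assumes that the two imports to be replaced point to the same two files that the wrapper schema includes; the third import with the namespace urn:example:other is invented, only to show that imports of other namespaces are left alone.

```python
import re

DML_MAIN_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NEW_IMPORT = (
    '<xsd:import schemaLocation="dml-wordprocessingDrawing_import.xsd" '
    'namespace="%s" />' % DML_MAIN_NS
)
IMPORT_TAG = re.compile(r"<xsd:import\b[^>]*/>")


def merge_imports(schema_text):
    """Replace all imports of the DrawingML main namespace by one import.

    Returns (new_text, number_of_imports_that_were_replaced).
    """
    replaced = []

    def substitute(match):
        tag = match.group(0)
        if 'namespace="%s"' % DML_MAIN_NS not in tag:
            return tag  # imports of other namespaces stay as they are
        replaced.append(tag)
        # The first match becomes the new import, later ones are dropped.
        return NEW_IMPORT if len(replaced) == 1 else ""

    return IMPORT_TAG.sub(substitute, schema_text), len(replaced)


# Hypothetical, shortened schema, for illustration only.
SAMPLE = (
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n'
    '  <xsd:import schemaLocation="dml-graphicalObject.xsd" namespace="%s"/>\n'
    '  <xsd:import schemaLocation="dml-documentProperties.xsd" namespace="%s"/>\n'
    '  <xsd:import schemaLocation="other.xsd" namespace="urn:example:other"/>\n'
    "</xsd:schema>\n" % (DML_MAIN_NS, DML_MAIN_NS)
)

merged, count = merge_imports(SAMPLE)
print(merged)
```

The dropped tag leaves an empty line behind, which is harmless for the XML-parser.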
The call to xmllint is already shown above, but before executing the command you must remember to set the environment variable XML_CATALOG_FILES to the location of the schema catalog; otherwise, the standard path /etc/xml/catalog would be read. This is an example of a successful validation with xmllint after having completed the preparatory tasks listed above:
user@machine:/tmp$ export XML_CATALOG_FILES=/usr/local/etc/xml/catalog
user@machine:/tmp$ xmllint -noout -debugent -schema ~/prog/ooxml_xsd/wml.xsd ./docx/word/document.xml
new input from file: /prog/ooxml_xsd/wml.xsd
new input from file: /prog/ooxml_xsd/shared-customXmlSchemaProperties.xsd
new input from file: /prog/ooxml_xsd/shared-math.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing.xsd
new input from file: /prog/ooxml_xsd/dml-wordprocessingDrawing_import.xsd
new input from file: /prog/ooxml_xsd/dml-graphicalObject.xsd
new input from file: /prog/ooxml_xsd/dml-documentProperties.xsd
new input from file: /prog/ooxml_xsd/dml-baseTypes.xsd
new input from file: /prog/ooxml_xsd/shared-relationshipReference.xsd
new input from file: /prog/ooxml_xsd/dml-shapeGeometry.xsd
new input from file: /prog/xml.xsd
new input from file: docx/word/document.xml
docx/word/document.xml validates
DOCUMENT
No entities in internal subset
No entities in external subset
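Where the validation is to be launched from a program which generates or modifies OOXML files, the call can be wrapped as in the following Python sketch. The function names are my own assumptions; the arguments of xmllint and the variable XML_CATALOG_FILES are those used on this page. The sketch assumes that xmllint ends with exit status 0 when the document validates, and it returns None when xmllint is not installed.

```python
import os
import shutil
import subprocess


def xmllint_command(schema, document):
    """Build the argument list for the xmllint call used on this page."""
    return ["xmllint", "-noout", "-debugent", "-schema", schema, document]


def catalog_environment(catalog, base=None):
    """Return a copy of the environment with XML_CATALOG_FILES set."""
    env = dict(os.environ if base is None else base)
    env["XML_CATALOG_FILES"] = catalog
    return env


def validate(schema, document, catalog):
    """Run xmllint on document.

    Returns True if xmllint exits with status 0, False otherwise,
    and None if xmllint cannot be found on this machine.
    """
    if shutil.which("xmllint") is None:
        return None
    result = subprocess.run(
        xmllint_command(schema, document),
        env=catalog_environment(catalog),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


command = xmllint_command("ooxml_xsd/wml.xsd", "docx/word/document.xml")
env = catalog_environment("/usr/local/etc/xml/catalog", base={})
print(" ".join(command))
```

Setting the variable only in the environment of the child process has the advantage that a forgotten export in the calling shell cannot silently send xmllint back to the standard catalog.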
Now please just believe me: This is cool.