Word-Processor File-Types for Scripting

Written for my blog on linuxquestions.org

Posted 29th May 2019 at 11:30 by Michael Uplawski

Tags abiword, ooxml, opendocument, xml

The Big Unification

Current word-processors read and write different XML-based file-formats. The standards which these programs implement, more or less, are OOXML and OpenDocument.

I write “more or less” because, for various reasons, office-suites differ in the way they compose documents of a given type. The result can be a completely valid document or one that deviates from the standard and the existing schema-definitions (xsd). There is always the risk that different word-processors interpret the file-content in different ways, depending on their own conformance or non-conformance to the standard.

It gets worse if a word-processor misinterprets valid XML code: the document held in memory while the program is in use may no longer correspond to the intentions of the original author. When it is saved again, the “changes” that the word-processor imposed upon reading the file become permanent in the updated version.

For these reasons, when you must exchange office-documents with other people, you either try to ensure that the important aspects of the document are honored by all the office-programs and program versions involved, or you simply do not distribute this kind of file. In the same way, you cannot be sure that your own text-processor handles a foreign document the way it was meant to be handled if the author used different software or a different version.

But the one question which should be dealt with at some point is: What are these standards good for in the first place?

Answer: To make you dream of a better world. That appears to be all.

Do it yourself

If it is still, de facto, up to “the user” (which can be a piece of software) how she/he/it creates and interprets the XML code of an office-file, why can't I do the same?

When you generate your own office-files, you quickly arrive at the point where the different ways in which office-programs interpret your code limit your freedom of expression.

An example: I have a script which generates a document like the following:

[Screenshot: table in a word-processor file]

This is a word-processor file containing a table with date values for the month of July 2019. The above screenshot was made with TextMaker, the text-processor component of the SoftMaker office-suite.

What if I try to open the same file, which is in OOXML (docx) format, in LibreOffice? The attempt is answered with the message that the file is corrupt but that LibreOffice can try to repair it. If you let the program go ahead and repair, it comes back telling you that it is still not possible to open the file.

AbiWord tries something else but apparently does not know what:

[Screenshot: AbiWord failure]

This resembles the hex-dump of a zip-file, and it is the result of attempting to read a binary file as text. OOXML and OpenDocument can both have complex file-structures, and a docx-file is just the zip-archive of this structure (ODT is, too, most of the time). As AbiWord normally knows how to extract these file-types prior to display, something must be very wrong with my file...

It has nothing to do with XML, though. How the file-structure of an office-file is read from its zipped version is decided by the office-program, not by me. In my test-case above, it is the order of the files in the archive which renders the zip-file unpalatable to LibreOffice and AbiWord, though not to TextMaker.
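
To see in which order the entries of an archive are stored, and to write them in an order of your own choosing, Python's zipfile module is enough. What follows is only a sketch: the function names are mine, and which order LibreOffice or AbiWord actually expect is not established here. A reasonable experiment is to list the entries of a docx saved by a word-processor and to imitate that order.

```python
import os
import zipfile


def entry_order(archive):
    """Return the entry names of a zip archive in their stored order."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


def zip_in_order(src_dir, archive, first=()):
    """Zip all files below src_dir into archive.

    Entries named in `first` are written before all others; the
    remaining entries follow, sorted by name. Returns the order used.
    """
    names = []
    for root, _dirs, files in os.walk(src_dir):
        for filename in files:
            rel = os.path.relpath(os.path.join(root, filename), src_dir)
            names.append(rel.replace(os.sep, "/"))
    names.sort()

    ordered = [n for n in first if n in names]
    ordered += [n for n in names if n not in ordered]

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in ordered:
            zf.write(os.path.join(src_dir, name), arcname=name)
    return ordered
```

Calling entry_order() on the file that a word-processor wrote and on the file from your own script shows at a glance whether the two differ in the order of their entries.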

Cool. That was docx, now how about ODT?

The problem with the file-structure does not exist for ODT, meaning that you can just blindly zip all the files that are to be part of your document, and TextMaker, LibreOffice and AbiWord will be able to read it... in a way. Between LibreOffice and AbiWord, few difficulties are to be expected.

Just one of the oddities of TextMaker is that it interprets white-space between tags and multiple white-spaces in text-nodes, which transgresses the rules of XML and leads to false alignments and indentations. This is never a problem for documents that you create in the text-processor, as most people do. But if you are about to compose your own ODT-document in XML and indent the code for readability, you must remove all such additional white-space from your document prior to display in TextMaker... This is always a good idea, because it renders the file much smaller, but during development, you want to have your tags indented in a way that lets you easily evaluate the structure of the current XML-code.
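
One way to keep the indented version for development and still hand TextMaker a compact file is to strip the indentation in a last step. The following is a minimal sketch, not a full XML processor: the function name is mine, and the regular expression only removes white-space between two tags when that white-space contains a line-break, on the assumption that this is what indentation looks like. A single blank between two inline tags is left alone, and multiple white-spaces inside text-nodes are not touched. Comments or CDATA sections that contain such a sequence would be altered, too.

```python
import re

# White-space between a closing ">" and the next "<" that spans a line-break.
_INDENT_BETWEEN_TAGS = re.compile(r">\s*\n\s*<")


def strip_indentation(xml_text):
    """Remove line-breaks and indentation between adjacent tags."""
    return _INDENT_BETWEEN_TAGS.sub("><", xml_text)


if __name__ == "__main__":
    indented = (
        "<text:p>\n"
        "  <text:span>a</text:span> <text:span>b</text:span>\n"
        "</text:p>"
    )
    print(strip_indentation(indented))
```

Applied to content.xml just before zipping, this leaves the source that you edit untouched.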

One for the road

AbiWord is cool. Its own file-type “abw” is XML, too! But unlike OOXML and OpenDocument, AbiWord's usual way to save documents is in one single XML-file. Although this possibility is covered by the OpenDocument standard, too, I do not know of a program which saves ODT files in that way. So, you can just open any file that you saved from AbiWord as *.abw in an ordinary text-editor to see the tag-structure, formatting and content of the whole document. These files are not even compressed, so you do not need to unzip them first.

I find the XML written by AbiWord the easiest to imitate in my own scripts. You define your paragraph-styles at the top of the file, then use the pertinent tags in the content-section.

abw is probably the best choice if you must generate word-processor files from a programmed routine or want to transform them into something else: as the format is simpler, addressing tags and their attributes in order to extract or modify content is much easier than with the other formats.
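
To give an idea of how little is needed to address tags and attributes in such a file, here is a sketch using Python's xml.etree.ElementTree. Mind that the SAMPLE string is a hand-written stand-in, only loosely modelled on what AbiWord saves: the tag and attribute names in it (styles, s, section, p, c, style, props) are assumptions to verify against a real *.abw file in a text-editor. The helper names are mine.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an abw file (assumed structure, not AbiWord output).
SAMPLE = """<abiword xmlns="http://www.abisource.com/awml.dtd">
<styles>
<s type="P" name="Normal" props="font-size:12pt"/>
<s type="P" name="Heading 1" basedon="Normal" props="font-weight:bold"/>
</styles>
<section>
<p style="Heading 1">July 2019</p>
<p style="Normal">First <c props="font-weight:bold">bold</c> entry</p>
</section>
</abiword>
"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def paragraphs(root):
    """Return (style, text) for every <p> element below root."""
    found = []
    for element in root.iter():
        if local(element.tag) == "p":
            found.append((element.get("style"), "".join(element.itertext())))
    return found


if __name__ == "__main__":
    for style, text in paragraphs(ET.fromstring(SAMPLE)):
        print(style, "|", text)
```

For a file on disk, ET.parse(path).getroot() delivers the root element instead of ET.fromstring(). Modifying content works the same way: change the text or an attribute of an element, then write the tree back.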

And even if it does not say so, LibreOffice reads AbiWord files, too.
Ω