
Documentation Project

WIP

Motivation

"Automating the generation of documentation"

Automation is usually about saving time or simplifying things.

But I don’t care about that. I do it because the process is messy and I like the challenge, and also because many programs lack explanatory documentation, notably my own.

I want to describe the tools I use to generate software documentation quickly and at scale. The objectives are diverse, and I'm uncertain about priorities from the start. That's why I call it Work in Progress until further notice.

The files I create require a particular sequence of programs, but you can treat the tool chain presented below as a jumble of independent utilities and techniques. Some tools are described in their own right, so you can choose to adopt only the parts that suit your workflow. I also list alternatives at the end of the page for those who want to explore other options.

Accomplishments

Today, I can write one text file, run a single command in the terminal, and generate several different files. The original document, written in reStructuredText (reST), is converted directly into a man page and an HTML file. The HTML then serves as the foundation for a PDF transformation.

Tool chain

Some tools in this pipeline are standard (e.g., rst2html5, rst2man), while others are custom scripts (manheader, htidy, make_doc). The pipeline is designed for my personal use and may not be portable or reproducible in other environments.

A text file in reStructuredText format is written in a basic text editor (not a word processor). Tool: Vim. Example: flnews_post_proc.rst (reStructuredText)
The only other tool I call directly on the command line is my own shell script make_doc. It finds all reST files in the current directory (where make_doc is called) and transforms them one by one. Tool: make_doc (my script)
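The driving loop of such a script can be sketched as follows; this is a minimal stand-in under my own assumptions, not the actual make_doc:

```shell
#!/bin/sh
# Minimal sketch of a make_doc-style driver (not the real script):
# find every reST file in the current directory and announce the
# conversions it would run, one file at a time.
for rst in ./*.rst; do
  [ -e "$rst" ] || continue            # glob matched nothing: skip
  name=${rst##*/}                      # strip the leading ./
  name=${name%.rst}                    # strip the extension
  echo "converting $name.rst to $name.1 and $name.html"
done
```

In the real pipeline, the echo line would be replaced by the rst2man and rst2html5 calls described below.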
make_doc calls rst2man to create a man page from the original reStructuredText. Tool: rst2man (Docutils)
The man pages created by rst2man still lack field values for the creation date, section number, version, and category. While rst2man could set some of these values, I prefer to make sure that all of them are present in the man page by calling my Ruby script manheader. Tool: manheader (my script)
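What manheader guarantees is a complete .TH header line. As an illustration only (the real script is written in Ruby, and the field values below are placeholders), the idea can be sketched in shell:

```shell
# Illustration of the manheader step, not the real Ruby script:
# complete a bare .TH line with section, date, version and category.
# All field values here are placeholders.
printf '.TH FLNEWS_POST_PROC\n.SH NAME\n' > sketch.1
sed -i 's/^\.TH FLNEWS_POST_PROC$/.TH FLNEWS_POST_PROC 1 "2025-01-01" "1.0" "User commands"/' sketch.1
head -n 1 sketch.1
```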
The final man page is compressed with gzip. Tool: gzip (GNU). Example: flnews_post_proc.1.gz (man page)
make_doc then calls rst2html5, which produces an HTML file. I require that the CSS code from an explicitly named style sheet be embedded. Tools: rst2html5 (Docutils), CSS (style sheet)
To normalize the HTML5 output and improve conformance toward XHTML5, HTML Tidy is used. A small shell script, htidy, centralizes some declarations and configuration options. Tool: htidy (my script). Example: flnews_post_proc.html
For the creation of a PDF file, Apache FOP is invoked on the previously created HTML file with either a project-specific XSL-FO style sheet or one for general use. Two actions are performed in the FOP/Java environment, and it appears to be normal that the question of who does what is not easy to answer:
  1. Somehow, the HTML is transformed to FO (“Formatting Objects”), probably by delegating the XSL transformation to some Xalan-derived, Java-internal XSL processor (see below for an alternative).
  2. FOP produces the PDF from the FO code.
All is fine in the end, and I do not really care to know why this is so. Asserting suavely that this or that tool or library does this or that would likely expose me to this or that fatal punishment. I won't.
Tools: FOP (Apache), stylesheet.xsl (XSL-FO style sheet), html2Pdf.xsl (XSL-FO style sheet)
The first PDF file is then converted with Ghostscript to a new one that claims PDF version 2.0. Tool: ghostscript (Artifex)
A few metadata fields are set with the cpdf tool. Tool: cpdf (Coherent Graphics). Example: flnews_post_proc.pdf

Decisions and constraints

I chose procedures that leave me much freedom to style the final documents the way I want them. Given the range of tools available, I first had to identify which ones suited my needs.
Then, weighing their virtues and shortcomings, I organized how the different tools interact and modified their configuration through trial and error until I could accept the results.

You can always grab any tool, feed it content, and accept the default output. But when you want to control the appearance of the resulting documents, you'll notice that individualism has its price: facilitating a task usually means surrendering freedom. Finding an acceptable compromise is what I'm exploring here.

Specifics of the created files

At first, I was not interested in tools; I simply wanted halfway readable and, why not, agreeably styled documents.

HTML

The HTML output is styled using CSS from an external style sheet that I maintain. A generally applicable base style sheet exists and can be used as-is; in other cases it is adapted to the needs of a particular project. Each document therefore references a project-specific CSS file, whether modified or not.

HTML is generated directly from the original reStructuredText source using rst2html5, a tool provided by the Docutils package. I run HTML Tidy on the output — primarily to normalize and clean up the markup before converting it to PDF.

Useful commands
rst2html5
--rfc-references --title=`capitalize "$NAME"` \
--xml-declaration --stylesheet="$CSS" ./"$rst" "$HTML"

This command creates HTML from reST. The options to rst2html5 are listed when you execute the program with the usual --help argument. I do not explain them here.

tidy
-i -m -n -asxhtml --indent-cdata yes --output-xhtml yes \
--vertical-space yes --strict-tags-attributes yes --add-xml-decl yes \
--doctype "$DOCTYPE" --drop-proprietary-attributes yes -utf8 "$PAGE"

Tidy ensures that the HTML code is XHTML-style output (more or less, according to the current living standard’s prestige) and suppresses the DOCTYPE declaration, which would otherwise interfere with the later call to FOP when a PDF is generated.
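Should a DOCTYPE declaration ever survive into a file destined for FOP, it can also be stripped after the fact. A sketch, independent of htidy, using a synthetic file:

```shell
# Sketch, not part of htidy: delete a DOCTYPE declaration that the
# later XSL step would otherwise try to resolve. The file content
# here is synthetic, for illustration only.
printf '<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml"></html>\n' > sketch.html
sed -i '/^<!DOCTYPE/d' sketch.html
head -n 1 sketch.html
```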

PDF

My PDF file embeds font subsets and otherwise closely resembles the HTML output. However, all styling must be coded in an XSL style sheet (XSL-FO). When FOP is invoked, two transformations are performed in a row, which is why it is better called an XSL-FO processor.

Another characteristic of the PDF file is the presence of a “Bookmark Tree,” an arborescent structure that can be displayed as an alternative to the page previews in the sidebar of PDF viewers. It allows you to navigate the document without having to return to some “table of contents” or similar. This structure is explicitly defined in the XSL-FO style sheet. I have chosen code that is reusable across projects, so I do not have to invest extra time and effort to regenerate the Bookmark Tree each time chapters of my documents are removed, added, or renamed. Many examples I found on the Web are not written this way but always target specific chapter titles, which is uncomfortable when these titles need to be modified or moved. My first version of such a “dynamic” bookmark tree dates from before 2010; I was paid for the development then.
That this possibility is still widely ignored by users of XSL-FO transformations is a big surprise to me. Using a technology designed for automation, only to then counteract its purpose, calls the value of such technology into question.

Useful commands
export FOP_OPTS="-Djavax.xml.accessExternalStylesheet=all -Djavax.xml.accessExternalDTD=all"

For security reasons, the default behavior of FOP is to refuse nested style sheets. With these options set, they are accepted again. Those of you who enjoy a conceptual challenge can add FOP options that will cause another XSL processor to be used for the first part of the XSL-FO transformation. This way, you will know — once and for all — which tool does what.

~/bin/fop
-c ~/bin/fop-2.11/fop/conf/fop.xconf -xml "$HTML" -xsl "$XSL" -pdf "$PDF"

I am using a recent version of Apache FOP that I have installed in my home directory. Although this is currently the only version on this computer, I want to avoid potential conflicts should it one day have to coexist with an older package from the Debian repositories.

gs
-dEmbedAllFonts=true -dSubsetFonts=false -dPDFSETTINGS=/prepress \
-sDEVICE=pdfwrite -dOmitXMP=true -dCompatibilityLevel=2.0 \
-dNOPAUSE -dBATCH -o "$PDF_2" "$PDF"

This call to Ghostscript converts the first PDF file to a format that approaches version 2.0 as closely as possible and removes the XMP metadata tag, which is not needed in this kind of PDF file (I know it is not, but you may like XMP for some impossible reason). For full PDF 2.0 compliance, other tools would be required.
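Whether the result really claims version 2.0 can be read from the first bytes of the file. A sketch, with a synthetic file standing in for the Ghostscript output:

```shell
# The claimed PDF version is stated in the first bytes of the file.
# A synthetic file stands in for the real Ghostscript output here.
printf '%%PDF-2.0\n' > sketch.pdf
head -c 8 sketch.pdf    # prints %PDF-2.0
```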

cpdf
"$PDF" -set-creator "$VIM, $DOCUTILS" AND \
-set-producer "$FOP, $GS, $CPDF" AND \
-set-author "<michael.uplawski at uplawski.eu>" AND \
-set-title "$NAME manual" AND -o "$PDF_2"

cpdf is another powerful tool, but I use it merely to set some metadata fields to reasonable values. Ghostscript above can do that too, with the exception of the “Producer” field. XMP metadata could be copied from an XML file using cpdf, but I consider this effort useless for the time being, and XMP an idiotic standard.

Alternatives

Documentation files can be created in many more ways than those explained above.

rst2pdf

You can create PDF from reST directly, using one of the tools named rst2pdf. One is part of the Docutils package; the other is a completely independent project. I do not like the output from the Docutils tool.

If you have not yet decided which technology you want to use, maybe try the tool from rst2pdf.org. It offers the possibility of user-defined style sheets, which are plain text files in YAML format.

As you may have noticed in the description of the tool chain above, I already write CSS style sheets for my HTML and XSL style sheets for my PDFs. While I have written YAML to configure software, this begins to look like a jolly zoo of styling languages. I stick with XSLT for now and will not include rst2pdf in my tool chain. XSL-FO transformations are what I am most familiar with when the production of a PDF must be automated.

Saxon XSLT and XQuery processor

If you want to use a recent open-source version of the Saxon XSL processor, maybe because you need XSLT 3.0 functionality, or because the FOP way of turning all <xsl:message /> output into a “Warning” disturbs you, an easy way to “replace” the XSL processor (whatever that means) is to run Saxon on the input file to produce an FO file, then call FOP only on that FO file to produce the final PDF. No pottering about with the classpath, Java options, or the FOP_OPTS explained above.

java -jar ~/bin/saxon-12.9/saxon-he-12.9.jar \
-s:"infile.html" -xsl:"stylesheet.xsl" -o:"output.fo" loglevel='debug'

This command calls the Saxon processor on infile.html and produces an FO file, applying the styling defined in stylesheet.xsl. You can pass variable values like loglevel to the XSL process, if the style sheet defines them internally. The result can be fed to FOP for the production of a PDF file.

LibTIFF tools for unchangeable PDF-content

With the LibTIFF library come a few utility programs which transform TIFF images to PDF. The resulting files cannot be modified except with graphics software.
This alone is not a sufficient argument for the procedure. But there will also be no display issues on computers that do not provide the necessary fonts, nor with printers that would choke on some of the glyphs provided by a font embedded in a PDF file.

Depending on what you are accustomed to, the files created with the LibTIFF tools will be bigger or smaller. A word processor, for example, can embed a lot of information in a PDF, some of it unnecessary or redundant. In this case, transforming each page of a PDF file to an image and back to a PDF may result in much smaller PDF files.
On the other hand, a PDF file created via an XSL-FO transformation, as explained above, is already very small, especially when it does not itself contain other images. Here, the conversion to images and back may blow the file up to several times the size of the original PDF.

gs
-sDEVICE=tiff24nc -r400x400 -sPAPERSIZE=a4 \
-sOutputFile=page_%04d.tif -dNOPAUSE \
-- infile.pdf

This call to gs creates one TIFF image from each page of the PDF file infile.pdf. If the PDF file was flnews_post_proc.pdf, the result is 7 TIFF files, each with a numerical suffix indicating the position of the corresponding page in the PDF.
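The %04d in -sOutputFile is a printf-style, zero-padded page counter; the resulting file names for a seven-page PDF can be previewed like this:

```shell
# Preview of the file names gs produces for a seven-page PDF:
# %04d expands to a zero-padded page counter.
for page in 1 2 3 4 5 6 7; do
  printf 'page_%04d.tif\n' "$page"
done
```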

I chose a resolution of 400 dpi to keep the text in the “picturized” version of the PDF readable. This value does, however, have a significant impact on the file size.

tiffcp
page*.tif new_file.tif

tiffcp assembles one multi-page TIFF file from all the previously created TIFF files.

tiff2pdf
-z new_file.tif -o new_file.pdf

Eventually, from all the pages that were transformed to images and then gathered into one big multi-page TIFF file, a PDF file is created again. The argument "-z" to tiff2pdf applies ZIP compression. As announced further up, when the original PDF file was very small, the one created from images can become extremely large: the new version flnews_post_proc_b.pdf is a file of 1.6M, i.e. 18 times the size of the one created with XSL-FO.

Writing man-Pages in a word processor

For many people “writing stuff” means using a word processor.
And most people who know how to write perfectly structured and comprehensible documentation are not computer enthusiasts and may not be interested in mastering a technology stack like the one I present on this page.

The good news is that word processors nowadays write XML, unless told otherwise. Each word processor has a file type that is considered its default, but most can produce other XML formats as well:

Microsoft Word®
Writes DOCX, which is “OOXML”. The files have the extension .docx and are compressed. When you unzip such a file, the textual content is found in the file document.xml in a sub-directory named word.
Apache OpenOffice
Writes ODT, which is “Open Document Format”, the file extension is odt. These files are normally compressed and when unzipped, the textual content is in content.xml.
LibreOffice
Also writes ODT.
SoftMaker TextMaker®
Writes TMDX, which is a subset of OOXML and can otherwise be handled like DOCX.
AbiWord
Writes its own XML format in files with the extension .abw. These files are not necessarily compressed; all content and the definitions of the styles used are found in the same file.

All of the above programs can save files in OOXML format (docx) and ODF (odt).

While most modern word processors, if not all, do a good job of exporting to PDF, man pages are a foreign universe to them. The choice of an XML file format, however, opens the possibility of converting word processor output to just about anything you like …

You could, for example, transform an AbiWord file into Troff typesetting instructions and thus into a man page. Good luck with that! (I do not doubt for a fraction of a second that there are people claiming that coding Troff is the only acceptable way to write man pages.)
But you can also just produce reStructuredText from any of the XML file types listed above. While you are at it, you can identify sections in the original document that do not need to be included in end-user documentation, and encourage the author to use paragraph and character formats dedicated to communicating explanations to a potential translator.

Oh Ye Of Little Faith!
AbiWord

Here is example XML code from AbiWord: flnews_post_proc.abw, and here is the XSLT style sheet: abw2reST.xsl. Fed to the Saxon XSL processor, they become this reStructuredText: test.rst.
Note that, again, the DOCTYPE declaration has to be removed from the input document.

Do it better

When different documents use the same templates, these will be recognizable in the resulting XML files. Many transformations to reStructuredText (and man pages) can thus benefit from the same preparatory work and the same XSL style sheet. It is never necessary to know the precise definitions of the styles configured in the original word processor; the name of a style is all that is needed to reconstruct the structure of a document. The XSL style sheet does not need to reproduce the styles from the word processor.
But as style sheets can be nested, nothing prevents you from preparing a special treatment for some documents, irrespective of the template they were based on.

... or be a lamentable heap of misery

You can ask an AI to convert whatever you want to TROFF and be proud of it.
