Paradigms exist to be broken

or: How to create a

Dynamic bookmark-tree with Apache-FOP

Introduction

This page explains how you can dynamically generate a bookmark-tree in PDF-documents with the Apache-FOP XSL/FO-processor.

But before I show you the mere technicalities, you need to realize what is special about this procedure and what it means to break a paradigm. In fact, the alternative titles on this page should appeal to those who need a bookmark-tree just as much as to those who seek the greatest possible freedom of expression in their creations, in code or other media.

XSL/FO

Maybe you need a quick introduction or recapitulation of the basics. If not, jump to the next chapter, below.
An XSL/FO-processor like FOP reads files in XML-format and produces a document in one of several output-formats. FOP is called an XSL/FO-processor because it performs two transformations in sequence:

  1. First it reads the XML of the original file and creates a different one. This new file is in FO-format (Formatting Objects), which describes how the final document presents its contents. This first transformation follows user-defined rules to ensure that an arbitrary number of different documents will be formatted similarly. The rules are read from an XSL-style-sheet that needs to be provided.
  2. During the second transformation, the FO-file is transformed into one of the output-types which the FO-processor supports. FOP can create documents of many different types from one and the same FO-file.

In the context of this page and the generation of the bookmark-tree, the second transformation from FO to PDF is of no interest. We concentrate on the XSL-style-sheet and the formatting-rules defined therein.
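To make the first transformation more tangible, here is a minimal sketch of an XSL-style-sheet that turns an XML-file into an FO-file. The root tag <document/>, the master-name "page" and the page dimensions are assumptions chosen for illustration only; they do not come from any particular project.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch of the first transformation: XML in, FO out.
     The input root tag <document/> is a hypothetical example. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:template match="/document">
    <fo:root>
      <!-- Page geometry: one A4 page layout named "page" -->
      <fo:layout-master-set>
        <fo:simple-page-master master-name="page"
            page-height="297mm" page-width="210mm">
          <fo:region-body margin="20mm"/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <!-- Content: everything below <document/> ends up in one block -->
      <fo:page-sequence master-reference="page">
        <fo:flow flow-name="xsl-region-body">
          <fo:block><xsl:apply-templates/></fo:block>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>

</xsl:stylesheet>
```

With no further templates defined, the built-in rules of XSL simply copy the text of the source-document into the block. Every formatting-rule discussed below is an addition to a skeleton of this kind.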

XML-code defines a hierarchical structure in which the document-content is organized. XSL, in contrast, defines how the nodes of the XML-structure shall be treated to generate the final PDF-output. As the input-document is organized hierarchically, the author of the XSL-style-sheet must honor the relations between the XML-tags. Handling a container-tag includes handling the children of that container.

A container-structure can include further hierarchically structured sub-divisions, and child-tags can in turn be the parents of others. The rule-set in the XSL-style-sheet must take this possibility into account and automatically delegate the processing of sub-tags to the pertinent formatting-rule. That rule may be the very same one, if a child-tag is of the same type as the tag that is currently being processed. Recursion is not merely tolerated but actively sought when an XSL-style-sheet is written.

Recursion helps to keep the number of processing-rules small, and it ensures that the same kind of content is always formatted in the same way, no matter how big the original XML-document is, or where and how often the same kind of content needs to be produced.
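The recursion described above can be sketched as follows. The tag names <section/> and <title/> are hypothetical; the point is only that a single rule handles a section at any depth, because nested sections re-enter the same template.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: one template handles <section/> at any nesting depth.
     <section/> and <title/> are hypothetical tag names. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:template match="section">
    <fo:block space-before="6pt">
      <!-- The title of this section, printed bold -->
      <fo:block font-weight="bold">
        <xsl:value-of select="title"/>
      </fo:block>
      <!-- Delegate everything except the title. A nested <section/>
           among the children triggers this same template again. -->
      <xsl:apply-templates select="node()[not(self::title)]"/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>
```

The style-sheet is a fragment in the sense that it produces no fo:root of its own; it is meant to be combined with a template for the document root.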

XSL is a language and it lets you choose your style to express the formatting-rules. There are even alternative ways to define the same rule.

The paradigm: XSL-templates

An XSL-template defines a processing-rule to be applied when the XSL-processor finds, in a source-document, a tag that matches the template's match-pattern.

For example, you can define that the text within a tag <house/> shall always be red in the resulting PDF-file. The template will match all the tags <house/> and transform the included text-node into red text in the PDF.

<xsl:template match="house">
    <fo:inline color="#ff0000"><xsl:value-of select="."/></fo:inline>
</xsl:template>

Once defined, the same template will automatically become active for each and every <house/>-tag in the XML-file.

An alternative way to always handle one kind of tag in the same way is a for-each statement. To the beginner, it appears simpler to just state clearly that the content of each <house/>-tag shall be printed in red, and that would be it...

<xsl:template name="redhouse">
    <xsl:for-each select="//house">
        <fo:inline color="#ff0000"><xsl:value-of select="."/></fo:inline>
    </xsl:for-each>
</xsl:template>

Unfortunately, to honor the hierarchical structure of the original XML, many nested for-each statements would have to be written and if there is a doll's house anywhere in the big house, real problems begin.
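The problem with the doll's house can be reproduced with a very small input. The following sketch assumes a hypothetical source-document, shown in the comment, and calls the for-each loop directly from the root template. It outputs bare fo:inline elements rather than a complete FO-document, because only the behavior of the loop is of interest here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input:
       <street><house>big <house>doll's</house></house></street>
     The expression //house selects both <house/>-tags, and the string
     value of the outer one already contains the text of the inner one. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:template match="/">
    <xsl:call-template name="redhouse"/>
  </xsl:template>

  <xsl:template name="redhouse">
    <!-- Visits the outer house first, then the inner house -->
    <xsl:for-each select="//house">
      <fo:inline color="#ff0000"><xsl:value-of select="."/></fo:inline>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

For the input above, the loop emits one inline with the text "big doll's" and a second one with the text "doll's", so the text of the inner house appears twice. A matching template that uses xsl:apply-templates instead of xsl:value-of would visit each house once and nest the inner inline inside the outer one.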

So XSL-templates are not just best practice, they are really useful and necessary to keep the XSL-code efficient and readable.

As often happens when incomparable techniques are evaluated as if they were just two ways to do the same thing, one of the two gets the blessing of expert opinion and the label "right", the other contempt and the label "wrong". Ask in an XSL-centric community which way to process XML-tags is the best one, for-each or a template, and without having to explain your current XSL/FO project in much detail, you will receive the answer: Use a template!

Bookmark-tree

The bookmark-tree is the vertical structure which, when you see it in a PDF-reader, references the chapters of the document by naming their headers. You can even click on any bookmark to access the pertinent chapter directly.

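The definition of such a bookmark includes the following details: its target (the place in the document to which a click leads), its title (the text shown in the tree) and its position within the tree.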

Bookmark-tree with Apache-FOP

With Apache-FOP, bookmark-trees are usually created apart from the remainder of the PDF-document, that is: in a template which is called to create the structure, and not in a succession of templates which are triggered by tags in the source-document.

Although the opposite is possible, there is hardly any advantage to be gained from it. The reason is that bookmarks are mostly constructed explicitly, with a jump target AND the text of the bookmark in mind.

You need to grasp my last statement, so I summarize it briefly and put it another way: While the XSL/FO-processor is there to automate the production of a PDF-document from an arbitrary XML-file, the bookmark-tree is defined explicitly, by naming the text of the bookmark AND the position in the content to which a mouse-click shall catapult us. I hope you notice how dumb this is. Here is an excerpt from a typical template, like dozens that you find on the Internet, often published to explain how to create bookmark-trees with Apache-FOP:

<fo:bookmark-tree >
   <fo:bookmark internal-destination="toc" >
      <fo:bookmark-title> Table des matières </fo:bookmark-title>
      <fo:bookmark internal-destination="chapitre1">
            <fo:bookmark-title>Mon premier chapitre </fo:bookmark-title>
      </fo:bookmark>
      <fo:bookmark internal-destination="chapitre2">
            <fo:bookmark-title>Mon deuxieme chapitre </fo:bookmark-title>
      </fo:bookmark>
   </fo:bookmark>
</fo:bookmark-tree>

So, where is the problem? Why is this dumb?

This way of creating the bookmarks is like buying a fine power-drill and then pounding on it with a mallet when you want it to pierce a wall. Each new document requires a new definition of the bookmark-tree, even if you keep the template for its generation in a separate style-sheet. Even then, you must make sure each time that the linked style-sheet corresponds to the document that you are about to create. You must also adapt the bookmark-tree whenever a chapter is added to or removed from the source-document, or its header is changed. Unnerving work, when you consider that all the information needed for the bookmark-tree is always present in the source-document!

Bookmark-tree with Apache-FOP, the real thing

What I want from an XSL-style-sheet is that it relieves me of the obligation to read the source-document. Furthermore, I want to blindly apply the XSL/FO-transformation to just any suitable file and expect that a bookmark-tree is created in the resulting PDFs.

Now the drop of bitterness: XSL-templates are just not up to the task. It appears to be impossible (at least I have not been shown any example that could convince me of the opposite) to dynamically map the chapter headers in the contents of a document to the hierarchical structure of the bookmark-tree by means of matching templates.

My alternative approach takes into consideration that the three properties of any bookmark, target, title and position, must be anticipated before the XSL/FO-processor has had an opportunity to identify them in the document. Here is the excerpt of a style-sheet which provides a bookmark-tree for XHTML-files and transforms the headers from <h1> to <h5> into bookmarks:

XSL

It is too cumbersome to format the xslt-code in the file for display on this page. Please open the linked file in a text-editor or viewer of your choice.

Note that the target of each bookmark is identified by means of a unique id. This id is simply created upon handling any header-tag from <h1> to <h5>, in the templates which appear further down in the XSL-file.
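For readers who cannot open the linked file, the following is a minimal sketch of the idea, not a copy of that style-sheet. It makes several assumptions: the source is XHTML in the namespace http://www.w3.org/1999/xhtml (bound here to the prefix h), headers are never nested inside other headers, and only <h1> and <h2> are handled. The template name "bookmarks" and the master-name "page" are invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name="page"
            page-height="297mm" page-width="210mm">
          <fo:region-body margin="20mm"/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <!-- The bookmark-tree goes between the layout and the first page-sequence -->
      <xsl:call-template name="bookmarks"/>
      <fo:page-sequence master-reference="page">
        <fo:flow flow-name="xsl-region-body">
          <fo:block><xsl:apply-templates select="//h:body"/></fo:block>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>

  <xsl:template name="bookmarks">
    <!-- fo:bookmark-tree must not be empty, so emit it only if an h1 exists -->
    <xsl:if test="//h:h1">
      <fo:bookmark-tree>
        <xsl:for-each select="//h:h1">
          <xsl:variable name="h1-id" select="generate-id()"/>
          <fo:bookmark internal-destination="{$h1-id}">
            <fo:bookmark-title>
              <xsl:value-of select="normalize-space(.)"/>
            </fo:bookmark-title>
            <!-- Every h2 whose nearest preceding h1 is the current one -->
            <xsl:for-each select="following::h:h2[generate-id(preceding::h:h1[1]) = $h1-id]">
              <fo:bookmark internal-destination="{generate-id()}">
                <fo:bookmark-title>
                  <xsl:value-of select="normalize-space(.)"/>
                </fo:bookmark-title>
              </fo:bookmark>
            </xsl:for-each>
          </fo:bookmark>
        </xsl:for-each>
      </fo:bookmark-tree>
    </xsl:if>
  </xsl:template>

  <!-- The headers in the content receive the same ids as jump targets -->
  <xsl:template match="h:h1 | h:h2">
    <fo:block id="{generate-id()}" font-weight="bold">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>
```

The sketch relies on generate-id() returning the same value for the same node throughout one transformation, so the id written on the header block and the internal-destination of the bookmark agree without any manual bookkeeping. Deeper levels would follow the same pattern with further nested for-each statements, but each level then also has to check that no higher-level header lies in between.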

Finally

I have used many words to explain the creation of a bookmark-tree, but they are not worth much if I failed to demonstrate something else: Use a for-each loop where a for-each loop is due! This is just one example of a situation that I have encountered often during my professional career as a software-developer. A structure, expression, design-pattern or custom that was recognized as useful at one point in time is recommended to anybody who is confronted with a new task, which may bear only a slight resemblance to the previous one. I can list a few of these annoying paradigms, as that is what they are:

Use an XSL-template, no matter what.
Even though a for-each statement may be better suited to the task at hand!
Make your class inherit (from any other arbitrary class).
There are people who state cold-bloodedly that inheritance is the best and original core-feature of object-oriented programming (and OOP began with Java, you know? I hope you did not). Avoid them like the plague.
Avoid threads (to avoid problems with threads).
And, by all means, avoid demonstrating that with a little investment in good documentation, anybody can master threads and write more efficient software.
Assign just any initial value to a variable (to keep the variable from being NULL).
Because the fact that a variable has not yet had an opportunity to adopt a value is generally of no interest to anybody. Base your programming work on assumptions; that is what everybody does. And see in which state they left IT. Do not be like everybody.
Make sure that each function returns a value.
It does not matter which value, just return anything, because someone might have a use for it, some day, somewhere... or find a use for it... and pay... What do I know?
Always try to catch and handle each possible exception.
Let's say an OutOfMemory-Exception... You do not want your users to notice what your program just did to their system, do you? Read the history of the Apache-Tomcat Web-Container and hear from me how everybody came to accept the frequent crashes as part of their everyday work. Do not be like everybody.

It rests with you to notice where the recommendation fails to meet the requirements. Dare!

To conclude, I present to you a PDF-file that may look familiar, apart from the fact that you have not yet had the opportunity to admire its bookmark-tree: Paradigms exist to be broken. Ω