XML MATTERS #3 -- Getting Started with the DocBook XML Dialect --

XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many readers will be most familiar with SGML via one particular document type, HTML. XML documents are similar to HTML in being composed of text interspersed with and structured by markup tags in angle-brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes: magazine articles and user documentation, files of structured data (like CSV or EDI files), messages for interprocess communication between programs, architectural diagrams (like CAD formats), and many other purposes. A set of tags can be created to capture any sort of structured information one might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.

Introduction

This column arises out of a very practical personal concern of its author. Electronic document formats are a mess. Over the years, your author has written a fairly large number of academic papers (mostly in Humanities topics), and I now wish to make all these papers available on my web-site (or even just be able to read them myself). Unfortunately, over the years my wordprocessors and platforms have changed numerous times, and I have many documents on disk that were composed using programs that I no longer own, could probably not obtain if I wanted to, and that are unlikely to run on current computers. In the best of cases, I have been able to locate conversion programs that perform a moderately adequate job of converting to a program I can currently run; failing that, in some other cases the original wordprocessor format is mostly ASCII, with only a moderate amount of typographic fluff interspersed.

In other words, my electronic archives are a mess. Many individuals and organizations are in even worse shape. For corporations and agencies especially, truly massive numbers of important archival documents are being lost to changes in technologies. And these losses only get worse with time.

On the plus side, the means are available to create documents that will age much better than those I have accumulated. There are (at least) two keys to creating electronic textual documents that are resistant to changes in technology. XML/SGML generally, and DocBook specifically, provide both keys. The first (and more important, ultimately) key to time-resistant documents is open standards for document formats. There are two elements to these open standards: syntax and semantics (or, "what must a document look like?" and "what does a document mean?"). The syntax of a DocBook document is wholly contained in the simple rules of XML markup, and in the DocBook DTD that is inherent to every DocBook document. The semantics are slightly less distinct. Certain semantic features are contained in a DTD (what elements are the type of thing that can/must occur inside other elements?). And the DocBook tags are chosen in such a way as to have a certain "common-sense" semantic content, at least to English speakers. But some more detailed semantic issues must rely on background documentation like that in the Resources section, and on traditions of use and editorial judgements (e.g. what type of list is appropriate here?).

The second key is of less theoretic importance, but still of considerable practical significance. How easy is a document format to interpret and use outside of formal specifications? An old binary stream format is difficult to make sense of using a text viewer. An XML document is usually pretty reasonable looking to a human reader even without formal validation and processing (of course, plain ASCII is even better for casual eyeballs). Furthermore, some formats are simply a lot easier to reconstruct than others, even apart from the presence of formal specification. Imagine an historian 100 years in the future who finds two documents: one in MS-Word 97 accompanied by an MSDN file-format specification CD, and one in an XML format (even one missing a DTD). This hypothetical historian has a lot less work in front of her to reconstruct the XML document's contents (look at the many vendors--including Microsoft themselves with later versions--that have done a poor-to-mediocre job of writing converters even with specs). For that matter, imagine yourself five years in the future, after your employer has "upgraded" all your workstations to MS-Office 2005.

As a result of these considerations of portability and technological change, I've decided to start a project of getting my past academic writing into DocBook format. I believe this project will help preserve my writing, and will also make it easier to offer that writing in current and future popular document formats (via conversions).

What Is DocBook?

DocBook is an SGML dialect developed by O'Reilly and HaL Computer Systems in 1991. It is currently maintained by the Organization for the Advancement of Structured Information Standards (OASIS). The purpose of DocBook is to describe the content of articles, books, technical manuals, and other prose documents. DocBook has a focus on technical writing styles, but is general enough to describe everything that goes into most styles of prose writing. An XML variant of the DocBook DTD is also available (and is the one that will be discussed in this article, inasmuch as small details differ).

The most important thing to keep in mind in understanding DocBook is that what is annotated in a DocBook document is entirely the semantics of the document, not its typography or appearance. This focus on document semantics stands in contrast to wordprocessors, HTML, and even TeX. Wordprocessors often allow style-sheets that mark conceptual categories like "Header, Level 2"--but even where style-sheets are used, they are not usually uniform across documents, and increasingly most of what wordprocessors do is attempt to deliver "what you see is what you get" (WYSIWYG). Aiming for WYSIWYG brings in all sorts of assumptions about page size and layout, available fonts, typestyles of elements, etc. Most of these assumptions have little to do with the actual concepts addressed in the writing, and almost all of them make it more difficult to adapt the document to a different format (whether a different printed layout, onscreen display, speech-synthesized versions, indexes for web-robots, or whatever). HTML started out close to DocBook (albeit simpler), but has increasingly added more and more typographic tags; what exists now is a hodge-podge of semantics and typography (<h2> versus <b>, for example).

As an easy-to-understand example, many different conceptual elements are frequently rendered with italics in printed books. Different books use different conventions, but any of these DocBook tags might get rendered as italics when actually typeset: <emphasis>, <abbrev>, <citetitle>, <foreignphrase>, <classname>, <email> (and numerous other DocBook tags). Of course, any one of them might not get rendered in this manner. The decision how to render these, and other, elements is really accidental to the nature of the document considered as concepts; these decisions are the business of publishers and book-designers, not of authors (or at least they should be). DocBook gives you the essential concepts that go into composition of a prose work, not the accidents of how the work is finally rendered (which might be many different ways, in fact). Another advantage to DocBook-style conceptual markup, besides the separation of content and appearance, is that it lets us do things systematically on element types. For example, you could quickly compile a list of every foreign phrase used in a document by searching for all occurrences of the tag <foreignphrase> (perhaps you decide a glossary is needed for such phrases). Just looking for everything marked as italics in a wordprocessor is much less effective toward this goal.
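To make the point concrete, here is a minimal sketch of semantic markup. The sentence is invented purely for illustration (it is not from any real document, and the address is a placeholder); only the element names come from the list above. Notice that nothing in the markup says "italics": a stylesheet is free to render each element differently, or all of them the same way.

```xml
<!-- Hypothetical sentence; only the element names are DocBook's -->
<para>The author borrows the term
  <foreignphrase>Weltanschauung</foreignphrase> from
  <citetitle>A Placeholder Book Title</citetitle>, and she
  <emphasis>insists</emphasis> that comments be sent to
  <email>reader@example.com</email>.</para>
```

A search for <foreignphrase> in this fragment finds exactly one phrase, whereas a search for "things in italics" in a typeset version might well find all four.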

Ready, Set, Mark Up!

My first DocBook project will be a big one, but I'll do it in increments: convert my doctoral dissertation to DocBook. Besides being rather long as dissertations go, the specific document poses several challenges for a documentation system. It contains a number of names that require roman diacritics (but no non-European character sets); it has footnotes, cross-references, page numbering, multiple section levels; it uses some diagrams; there is a need to approximate some original typography for commentary, and it uses some unusual layout for specific effect; it has epigraphs, a bibliography, appendices, a dedication, an abstract; it uses limited, but required, mathematical notations; it references not just books, but also URLs and email addresses. Overall, I happen to have written something that will provide a good workout for a large portion of the tags in DocBook. The whole dissertation is already available in its original WordPerfect 7 format, and in two differently formatted PDF versions. But none of these versions is all that portable or flexible. DocBook will improve things on all grounds (of course, for this article, we will only get as far as the markup, not the processing into target formats; but eventually everything will be in order).

Mertz Dissertation XML Document

<?xml version="1.0"?>
<!-- David Mertz Dissertation -->
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "http://gnosis.cx/download/docbook/4.12/docbookx.dtd" [
    <!ENTITY bookinfo SYSTEM "bookinfo.sgm">
        <!ENTITY abstract SYSTEM "abstract.sgm">
    <!ENTITY chap1 SYSTEM "chap1.sgm">
    <!ENTITY chap2 SYSTEM "chap2.sgm">
    <!ENTITY chap3 SYSTEM "chap3.sgm">
    <!ENTITY chap4 SYSTEM "chap4.sgm">
    <!ENTITY chap5 SYSTEM "chap5.sgm">
        <!ENTITY chap5_1 SYSTEM "chap5_1.sgm">
        <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
        <!ENTITY chap5_3 SYSTEM "chap5_3.sgm">
    <!ENTITY chap6 SYSTEM "chap6.sgm">
    <!ENTITY chap7 SYSTEM "chap7.sgm">
    <!ENTITY chap8 SYSTEM "chap8.sgm">
    <!ENTITY appendix1 SYSTEM "appendix1.sgm">
    <!ENTITY appendix2 SYSTEM "appendix2.sgm">
    <!ENTITY biblio SYSTEM "biblio.sgm">

    <!ENTITY Zizek "&Zcaron;i&zcaron;ek">
    <!ENTITY Mocnik "Mo&ccaron;nik">

]>
<book>
&bookinfo;
&chap1;
&chap2;
&chap3;
&chap4;
&chap5;
&chap6;
&chap7;
&chap8;
&appendix1;
&appendix2;
&biblio;
</book>

This first step is mostly planning, obviously. Creating the contents of the component-level elements (e.g. chapters) will be the real work. But by creating entity references to these component-level elements, we have divided the creation into more manageable chunks (and also made it easier to publish/export the individual chapters as their own documents). What we have indicated so far is that the type of document being created is a book, and that it will include a set of component-level elements pulled in from external files.

Some entities defined at this top level are not used immediately, but only within the included files. For example, the entity &abstract; is only inserted within the bookinfo.sgm document. The same goes for the sections inside Chapter 5. What to divide out is a judgement call; my criterion was that I wanted to break into separate files the documents I might publish separately. I will probably make adjustments as I extend this DocBook project. The other type of entity I decided to define at this point covers some names I know I mention that do not fit in US-ASCII. I cannot type the diacritics directly, but typing e.g. &Zizek; is an inconspicuous approximation of what I actually want. You could also use abbreviations of whole phrases and the like, if you wished.
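As a minimal sketch of how such a name entity might be used, consider a paragraph inside one of the included files. The prose here is a placeholder, not text from the dissertation; the entity names are the ones declared in the master document above.

```xml
<!-- Inside an included file such as chap1.sgm (placeholder prose) -->
<para>This chapter draws on arguments made by &Zizek; and by
  &Mocnik;.</para>
```

When the master document is parsed, &Zizek; expands to &Zcaron;i&zcaron;ek. This assumes that the character entities &Zcaron;, &zcaron; and &ccaron; are themselves declared, presumably by the character entity sets that the DocBook DTD pulls in.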

Inclusions

The files that are included using the above master document setup will consist of bare document root tags and their contents. No document type declarations or processing instructions will be included in the included files (this is required). The document type is already declared in the book master document, so we can keep it in one place. For example, the file bookinfo.sgm contains just the following:

Included XML/SGML Subdocument

<bookinfo>
  <title>The Speculum and The Scalpel</title>
  <subtitle>The Politics of Impotent Representation and
            Non-Representational Terrorism</subtitle>
  <author><firstname>David</firstname><surname>Mertz</surname></author>
  &abstract;
</bookinfo>

Similarly, each of the chapter files starts with a <chapter> tag and ends with a </chapter> tag.
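A hypothetical skeleton of one such chapter file might look like the following. The title and paragraph are placeholders rather than actual dissertation content; the point is only that the file holds a bare <chapter> element with no XML declaration and no DOCTYPE of its own.

```xml
<!-- Hypothetical skeleton of an included file such as chap1.sgm -->
<chapter>
  <title>Placeholder Chapter Title</title>
  <para>Placeholder paragraph standing in for the chapter text.</para>
</chapter>
```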

One advantage of the modular structure laid out is that it is easy to pull out individual components for separate publication. For example, I intend to first convert (and separately distribute) versions of Chapter 5. In anticipation of doing so, I have created a smaller wrapper for the chapter alone:

Chapter Level Subdocument Wrapper

<?xml version="1.0"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "file://g:/articles/scratch/docbook/4.12/docbookx.dtd" [
  <!ENTITY chap5_1 SYSTEM "chap5_1.sgm">
  <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
  <!ENTITY chap5_3 SYSTEM "chap5_3.sgm">
]>
<chapter>
  <title>Hegemony, and Other Passing Fads</title>
  <epigraph>
    <attribution>Gould, 1987b, quoting Gunnar Myrdal, <citetitle>An
      American Dilemma</citetitle> (1944)</attribution>
    <para>But there must be still other countless errors of the same
      sort that no living man can yet detect, because of the fog within which
      our type of Western culture envelops us.  Cultural influences have set
      up the assumptions about the mind, the body, and the universe with which
      we begin; pose the questions we ask; influence the facts we seek;
      determine the interpretations we give these facts; and direct our
      reaction to these interpretations and conclusions.</para>
  </epigraph>
&chap5_1;
&chap5_2;
&chap5_3;
</chapter>

The bulk of the marked-up content is in the three section files (each with a top-level <sect1> as its root). But I have the option of processing the same section content as part of either the book-level or chapter-level wrapper (and I will probably also pull out Section 2 as an article by itself, since an article obeys the same structure as a chapter).
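An article-level wrapper for Section 2 could follow the same pattern as the chapter wrapper. The following is only a sketch under a few assumptions: the title is a placeholder, the system identifier is simply copied from the book-level listing above, and the wrapper assumes that chap5_2.sgm has a <sect1> root and that <article> accepts a <title> followed by <sect1> content. It has not been validated against the DTD.

```xml
<?xml version="1.0"?>
<!-- Hypothetical article wrapper; title is a placeholder -->
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "http://gnosis.cx/download/docbook/4.12/docbookx.dtd" [
  <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
]>
<article>
  <title>Placeholder Title for the Standalone Article</title>
&chap5_2;
</article>
```

Because the section file itself is untouched, the same chap5_2.sgm serves the book, the chapter, and the article without any duplication of content.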

Continuing...

What this column has begun is only enough for readers (and its author) to get a general sense of DocBook. Subsequent columns will address some greater details of the DocBook tags, and how they are structured. In addition, we have yet to get around to discussing how to convert DocBook documents to more directly readable formats, how to validate them, and how to perform processing operations on them. Stay tuned.

In the meanwhile, it is a good idea to start skimming through some of the DocBook reference material in the Resources. DocBook has a lot of tags available--probably more than anyone is going to remember. But once you get a sense of what types of tags to look for, and how to put them together, the going gets easier. It probably will not hurt to keep a reference on your lap while you work with DocBook (even if you use specialized tools to help with the editing).

Resources

By all means, the best place to get started in a more detailed understanding of DocBook is. The ink-on-paper version is:

Organization for the Advancement of Structured Information Standards (OASIS) home page:

In some respect, a format even more portable and time-protected than DocBook is plain ASCII--or "smart ASCII" that incorporates simple style annotations (in the way evolved on Usenet). Of course, ASCII will not be able to capture all the semantic structure of DocBook, but a lot of times you do not need to. Project Gutenberg is an example of attempts to preserve and utilize texts in this neutral manner:

For an important tool with a somewhat overlapping purpose to DocBook, TeX is a tool worth learning about. The focus of TeX is closer to typography, but TeX also has many elements of semantic markup (especially for mathematics). A good starting point is:

My own articles (including the draft of this one) have used a similar "smart ASCII" format for their originals. Markup is automated using the tool Txt2Html (also see the ASCII version of this article):

The obscure philosophy dissertation I have undertaken to convert will probably hold minimal interest for most XML developers (or even make much sense to them). But the actual formats used might be of some interest. The document was written originally in WordPerfect 7 (with portions imported from other wordprocessor formats along the way). A moderate effort was made to use stylesheets for significant elements to make global changes easier. As one prior attempt at (web) publication, I output the document to PDF format in a style that typographically resembles a printed magazine/journal article more than a submitted thesis. PDF is not a bad format, but it fails to separate content from layout (or merely does not attempt to):

About The Author

It might be catachretic, but it is not a malapropism to describe David Mertz's juxtapositions of interests herein as sylleptic. Words is words. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.
