David Mertz, Ph.D.
Archivist, Gnosis Software, Inc.
October 2000
This column continues the discussion of DocBook begun in XML Matters #3. In this column we will look at some DocBook tags in greater detail, and also at the actual composition of DocBook documents. In the next column, we will examine ways to transform DocBook documents to other formats using XSLT (Extensible Stylesheet Language Transformations).
XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many readers will be most familiar with SGML via one particular document type, HTML. XML documents are similar to HTML in being composed of text interspersed with and structured by markup tags in angle-brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes: magazine articles and user documentation, files of structured data (like CSV or EDI files), messages for interprocess communication between programs, architectural diagrams (like CAD formats), and many other purposes. A set of tags can be created to capture any sort of structured information one might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.
DocBook is an SGML dialect developed by O'Reilly and HaL Computer Systems in 1991. It is currently maintained by the Organization for the Advancement of Structured Information Standards (OASIS). The purpose of DocBook is to describe the content of articles, books, technical manuals, and other prose documents. DocBook has a focus on technical writing styles, but is general enough to describe everything that goes into most styles of prose writing. An XML variant of the DocBook DTD is also available (and is the one that will be discussed in this article, inasmuch as small details differ).
DocBook is a rather complicated DTD with hundreds of elements. Few people will be familiar with all the elements of DocBook; but fortunately, there is really no need to know all of DocBook in order to work with it productively. The basic elements are arranged in a logical way, and most elements follow similar patterns for nesting of child elements.
The key to working with DocBook is having a good reference handy while you are working. I am personally partial to using a written text (which now means O'Reilly's excellent text, see Resources), but the identical material is also available online (see Resources). There are two general approaches to creating DocBook content (I have played with both in the process of working on these columns): use a specialized XML editor, or use a generic text editor plus an external validator. The main thing is that DocBook is detailed enough that you will need some automation in assuring conformance to the DTD; it is easy to make small typos. You can, of course, work for stretches and validate only occasionally (fixing minor glitches should not take long).
If you decide to use a specialized XML editor, you will probably be presented with some assistance in entering elements and attributes. Many of these programs provide context-sensitive prompts for available (sub-)tags, or at least lists of tags that exist in the current (DocBook) DTD. On the downside, though, specialized editors are generally less flexible in other ways than good general-purpose text editors.
Unfortunately, the quality of tools available for working with XML is still disappointing (at least to me). I have tested a fairly large number of XML validation and transformation tools, and almost all of them fail in some respects when trying to work with DocBook. Specifically, I have yet to locate a wholly accurate command-line XML validator (reader suggestions are greatly welcomed). What I have settled for as good-enough is using XML Spy under Win32 (see Resources for my review of this product), and Xeena under other (Java-supporting) platforms. Both of these do a good job of validation, although with more overhead than should really be necessary. Hopefully, these matters will improve over time.
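As a much weaker first pass than the validators just mentioned, a well-formedness check is easy to sketch. The example below is only an illustration under stated assumptions: Python and the function name check_well_formed are not part of this column, and the expat parser used here does not validate against the DocBook DTD, so it catches unbalanced tags and similar typos but not misuse of DocBook elements.

```python
import xml.parsers.expat


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message).

    This checks well-formedness only; it does NOT validate against the
    DocBook DTD, so it complements a real validator rather than replacing it.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        return False, f"line {err.lineno}, column {err.offset}: {err}"
    return True, None


# Hypothetical sample documents, not taken from the column's files.
good = "<chapter><title>Test</title><para>Fine.</para></chapter>"
bad = "<chapter><title>Test</title><para>Oops.</chapter>"

print(check_well_formed(good))
print(check_well_formed(bad))
```

A check like this is cheap enough to run after every save from a generic text editor, leaving the slower DTD validation for occasional passes.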
The first thing to do in preparing an XML DocBook document is to prepare its declaration. We already saw a couple of examples of this in the previous column, but without explanation. Let us look at a document template, and see what is going on in it:
<?xml version="1.0"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
    "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
  <!ENTITY Zizek "Žižek">
  <!ENTITY Mocnik "Močnik">
]>
<?xml-stylesheet type="text/xsl" href="chapter.xsl"?>
<chapter>
  <!-- The actual chapter contents are here -->
</chapter>
The very first thing that occurs is the <?xml> declaration; this is just there to show that the document is XML. The next thing we include is the document type declaration (that is, the <!DOCTYPE> tag). It is worth looking in some detail at what makes up the document type declaration.
The first thing to notice in the <!DOCTYPE> tag is that it immediately names the root element that will be used in this specific document. Which root element to use is an important decision: basically, it says what purpose this document will serve, at least in broad terms. The choice of root element generally determines the rough size of the document in question. At the broadest scale, a declaration might be for a set, which includes two or more books. Use this if what is intended is a whole reference collection, or the like (notice that we will not necessarily put everything in the same file, though; we might use inclusions, as sketched in the previous column). More likely, you will be creating a book, which is a collection of parts or chapters (plus some other bits at the same conceptual level as chapters and parts). Or even more modestly, you might be creating a chapter or an article. That is what we are working on here: a chapter. In practice, a chapter or article is the smallest conceptual part used for a DocBook document.
Continuing with the "attributes" of the <!DOCTYPE> declaration, we next see the PUBLIC identifier and the system identifier. The part that comes after the word PUBLIC is an SGML feature, and you do not really need to use it in XML documents. But if you do include it, be sure to spell it in exactly the way indicated in the actual DTD. The actual DTD is indicated in the system identifier as a URL. That is where all the actual DocBook definitions live (go ahead and download the URL, if you'd like to look at it; it references a number of other files in the same domain). Spell this right also, or else your validating programs won't be able to find the DTD.
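To make the parts of the document type declaration concrete, here is a minimal sketch that reads them back out of a document. It assumes Python's standard xml.dom.minidom module, which is not something this column otherwise uses; the title and paragraph text are placeholders. Note that minidom does not download or apply the DTD, it only reports what the declaration says.

```python
from xml.dom import minidom

# A cut-down chapter with the same declaration as the template;
# the title and paragraph are placeholders for illustration.
DOC = """<?xml version="1.0"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
    "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter>
  <title>Placeholder title</title>
  <para>Placeholder paragraph.</para>
</chapter>"""

document = minidom.parseString(DOC)
doctype = document.doctype

# The name in the declaration should match the actual root element.
print("declared root:", doctype.name)
print("actual root:  ", document.documentElement.tagName)
print("public id:    ", doctype.publicId)
print("system id:    ", doctype.systemId)
```

Comparing the declared name with the actual root element is a quick way to catch a template that was copied from a chapter into an article without updating the declaration.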
Inside the square brackets in the <!DOCTYPE> tag is the "internal subset." This is an odd name, but all it amounts to is a way to declare some special features of your specific document. In this case, we create a couple of entities that act as aliases for names that are hard to type on a US keyboard.
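The effect of such entity declarations can be sketched in a few lines. This example assumes Python's standard xml.etree.ElementTree module (an illustration, not a tool discussed in this column) and leaves out the PUBLIC and system identifiers so that only the internal subset is in play; the paragraph text is a placeholder.

```python
import xml.etree.ElementTree as ET

# Only the internal subset is declared here; the external DocBook DTD
# is deliberately omitted to keep the illustration self-contained.
DOC = """<?xml version="1.0"?>
<!DOCTYPE chapter [
  <!ENTITY Zizek "Žižek">
  <!ENTITY Mocnik "Močnik">
]>
<chapter>
  <para>&Zizek; and &Mocnik; are easier to type this way.</para>
</chapter>"""

root = ET.fromstring(DOC)

# The parser replaces each entity reference with its declared text.
para_text = root.findtext("para")
print(para_text)
```

The author types plain ASCII entity names in the body, and any XML parser that reads the internal subset sees the accented names.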
After the document type declaration tag, we have a processing instruction in our example. This part is not really necessary, and we will not go into detail about XSLT until the next column. But the idea here is very similar to that with cascading style sheets (CSS) for HTML documents. We added a reference to an XSL document that contains some rules for how we plan on transforming the DocBook document. A processing instruction like this is quite optional, even for a transformation tool (much as with CSS). Depending on the tool used, you can generally specify a transformation using whatever XSLT you want. Adding the processing instruction is just a polite suggestion about one way to do it.
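Because the processing instruction is only a suggestion, a tool has to look for it explicitly if it wants to honor it. The sketch below shows one way to do that, assuming Python's standard xml.dom.minidom module; the function name stylesheet_hints is hypothetical, and the sample document is a placeholder rather than one of the column's files.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="chapter.xsl"?>
<chapter><title>Placeholder title</title></chapter>"""


def stylesheet_hints(xml_text):
    """Return the data of each top-level xml-stylesheet processing instruction."""
    document = minidom.parseString(xml_text)
    return [
        node.data
        for node in document.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        and node.target == "xml-stylesheet"
    ]


hints = stylesheet_hints(DOC)
print(hints)
```

A transformation tool is free to ignore the result and apply whatever stylesheet it was given on its command line instead.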
The final bit is the root element just mentioned. We already effectively promised to use the <chapter> tag in the declaration, so we had better keep our promise and put it in. Everything that makes up the blood-and-guts of the chapter goes inside this root element.
A <chapter> has a similar structure to an <appendix> or a <preface>. An <article> also has mostly the same structure (the main difference is that the front-matter of an article is generally enclosed in an <artheader> element, which DocBook V4.0 and later call <articleinfo>). Things like chapters, articles, prefaces, bibliographies, and so on, are all kinds of "components" of documents. That is to say, a component is something that addresses a single topic at a moderate level of specificity. As with writing in general, judgement calls are necessary to decide just how closely related ideas are. But the element names are a good guide, since they carry their ordinary English meanings.
A component, in turn, has some front-matter, followed by some sections and/or block elements. A <title> is usually required as front-matter for components, and also for sections. Most other front-matter is optional, but might include author information, abstracts, graphics, or other information that has more to do with describing a component than with really constituting it. Let's look at an example: a valid, but highly abridged, chapter (assume declarations as discussed):
<chapter>
  <title>Hegemony, and Other Passing Fads</title>
  <epigraph>
    <attribution>
      Gould, 1987b, quoting Gunnar Myrdal,
      <citetitle>An American Dilemma</citetitle> (1944)
    </attribution>
    <para>
      But there must be still other countless errors of the same
      sort that no living man can yet detect, because of the fog
      within which our type of Western culture envelops us.
      Cultural influences have set up the assumptions about the
      mind, the body, and the universe with which we begin; pose
      the questions we ask; influence the facts we seek; determine
      the interpretations we give these facts; and direct our
      reaction to these interpretations and conclusions.
    </para>
  </epigraph>
  <sect1>
    <title>Day-Care Devil Worshipers</title>
    <!-- para's, sect2's, epigraph's, and other block elements -->
  </sect1>
  <sect1>
    <!-- more blocks -->
  </sect1>
</chapter>
For a moderately large component, you will probably want to divide it into sections (as above). But for a short component, you have the option of launching directly into block elements.
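The component/section layering is regular enough that an outline can be pulled out mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name outline is hypothetical, and the second section title and the paragraphs are placeholders added so that the sample is complete.

```python
import xml.etree.ElementTree as ET

# An abridged chapter; the second section title and the paragraphs
# are placeholders, not text from the original document.
CHAPTER = """<chapter>
  <title>Hegemony, and Other Passing Fads</title>
  <sect1>
    <title>Day-Care Devil Worshipers</title>
    <para>Placeholder paragraph.</para>
  </sect1>
  <sect1>
    <title>Placeholder second section</title>
    <para>Placeholder paragraph.</para>
  </sect1>
</chapter>"""


def outline(component):
    """Return (component title, [titles of its top-level sect1 sections])."""
    root = ET.fromstring(component)
    section_titles = [sect.findtext("title") for sect in root.findall("sect1")]
    return root.findtext("title"), section_titles


chapter_title, section_titles = outline(CHAPTER)
print(chapter_title)
for title in section_titles:
    print("  " + title)
```

A short component with no sections would simply yield an empty list of section titles.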
Let us clarify what these things mean. A block element is basically either a paragraph, or something at the same conceptual/hierarchical level as a paragraph (such as a list, an equation, an illustration, and so on). The only thing "smaller" than a block element is an inline element. Most usually, a block element will be set apart from other blocks with vertical whitespace, framing boxes, or the like; an inline element will be continuous with the words around it, but might be marked by a different font, color, hyperlink, or the like.
As an example, in the above chapter, the epigraph is much like a short section. It contains two blocks: the attribution, and the epigraph <para> (a different epigraph might have multiple paragraphs). This attribution contains a <citetitle>, but that citation will likely be rendered inline when printed, perhaps by italics or underlining. Or maybe it will be a hotlink to the bibliography if rendered to HTML.
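The block/inline distinction also shows up when a document is processed. The sketch below assumes Python's standard xml.etree.ElementTree module and uses a placeholder paragraph in place of the full quotation; it lists the blocks inside the epigraph, then shows that the inline <citetitle> sits inside the running text of the attribution rather than beside it.

```python
import xml.etree.ElementTree as ET

# The attribution comes from the example chapter; the paragraph text
# is a placeholder standing in for the full quotation.
EPIGRAPH = """<epigraph>
  <attribution>Gould, 1987b, quoting Gunnar Myrdal,
    <citetitle>An American Dilemma</citetitle> (1944)</attribution>
  <para>Placeholder for the quoted paragraph.</para>
</epigraph>"""

epigraph = ET.fromstring(EPIGRAPH)

# Blocks are the direct children of the epigraph.
blocks = [child.tag for child in epigraph]

# Inline elements are embedded in a block's text.
attribution = epigraph.find("attribution")
inline_tags = [child.tag for child in attribution]

# Joining all text pieces gives the attribution as a reader would see it,
# with runs of whitespace collapsed to single spaces.
running_text = " ".join("".join(attribution.itertext()).split())

print(blocks)
print(inline_tags)
print(running_text)
```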
Sections are bigger than blocks, and are in fact just a list of blocks. How big they are is a matter for authorial and editorial judgement. But there are two main strategies for making sections. You can either use the <sect1>, <sect2> ... <sect5> hierarchy, or you can use the <section> element nested recursively. For my own purpose--writing philosophical prose--I felt that explicitly numbered section levels were better. I had a distinct sense of how important each type of section must be, and the numbering matched that well. However, for something like a technical reference, you are more likely to find that your material might appropriately be nested in different places and at different depths. In that case, the <section> element works better (and can nest to more than five levels). There are some other specialized block types, but the above are the most general ones.
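The two sectioning strategies are closely related: the numbered elements can be rewritten mechanically as nested <section> elements, because the nesting already encodes the level. The following is a minimal sketch of that idea, assuming Python's standard xml.etree.ElementTree module; the function names to_recursive_sections and max_section_depth are hypothetical, the sample content is a placeholder, and the sketch does not re-validate the result against the DTD.

```python
import re
import xml.etree.ElementTree as ET

# Placeholder chapter using the numbered section hierarchy.
NUMBERED = """<chapter>
  <title>Placeholder chapter</title>
  <sect1>
    <title>Top level</title>
    <sect2>
      <title>Nested</title>
      <para>Placeholder paragraph.</para>
    </sect2>
  </sect1>
</chapter>"""


def to_recursive_sections(root):
    """Rename sect1..sect5 elements to <section>, in place; nesting is kept."""
    for element in root.iter():
        if re.fullmatch(r"sect[1-5]", element.tag):
            element.tag = "section"
    return root


def max_section_depth(element, depth=0):
    """Return the deepest level of <section> nesting at or below element."""
    here = depth + 1 if element.tag == "section" else depth
    return max([here] + [max_section_depth(child, here) for child in element])


root = to_recursive_sections(ET.fromstring(NUMBERED))
print(ET.tostring(root, encoding="unicode"))
print("deepest section level:", max_section_depth(root))
```

Going the other way is only possible when the recursive nesting is no deeper than five levels, which is one reason the <section> form suits material that may be moved around.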
The elements and structure outlined here are enough to get started on creating your own DocBook documents. Take a look at those I created (see Resources) for some more details, or check out the more extensive tag documentation in the Resources. In the next column we will get around to transforming our DocBook source document into some other formats, and introduce Extensible Stylesheet Language Transformations (useful outside DocBook also).
OASIS's recommendations on XML tools:
http://www.oasis-open.org/docbook/tools/index.html
IBM alphaWorks' Xeena XML Editor (free-of-cost license):
http://www.alphaworks.ibm.com/tech/xeena
David Mertz's XML Spy review:
http://webreview.com/wr/pub/2000/09/01/feature/index04.html
Icon Information-Systems' XML Spy Homepage (commercial XML editor):
http://www.xmlspy.com/
Scholarly Technology Group's Web-based XML Validation (source available and liberally licensed):
http://www.stg.brown.edu/service/xmlvalid/
SoftQuad's XMetal Homepage (commercial XML editor):
http://softquad.com/index_main.html
Extensibility's XML Instance (commercial XML editor):
http://www.extensibility.com/products/xml_instance/index.htm
Sablotron XSL Processor (open source):
http://www.gingerall.com/charlie-bin/get/webGA/act/sablotron.act
By all means, the best place to get started on a more detailed understanding of DocBook is DocBook: The Definitive Guide. The ink-on-paper version is:
DocBook: The Definitive Guide, Norman Walsh & Leonard Muellner, O'Reilly, Cambridge, MA 1999.
If you wish to use an electronic version, refer to:
http://www.docbook.org/tdg/index.html
Organization for the Advancement of Structured Information Standards (OASIS) home page:
http://www.oasis-open.org/
The obscure philosophy dissertation I have undertaken to convert will probably be of minimal interest to most XML developers (and may not even make much sense to them). But the markup might well be of some interest as an example. Both one possible HTML presentation and the XML/DocBook source are at the links below:
http://gnosis.cx/publish/mertz/chap5.html
http://gnosis.cx/publish/mertz/chap5.xml
Files used and mentioned in this article:
http://gnosis.cx/download/xml_matters_4.zip
David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future columns are welcomed.