XML MATTERS #3
Getting Started with the DocBook XML Dialect
David Mertz, Ph.D.
Archivist, Gnosis Software, Inc.
September 2000
Getting started with working with DocBook. Why to use
DocBook. Planning a large project. Modularizing your
documentation.
WHAT IS XML?
------------------------------------------------------------------------
XML is a simplified dialect of the Standard Generalized Markup
Language (SGML). Many readers will be most familiar with SGML
via one particular document type, HTML. XML documents are
similar to HTML in being composed of text interspersed with and
structured by markup tags in angle-brackets. But XML
encompasses many systems of tags that allow XML documents to be
used for many purposes: magazine articles and user
documentation, files of structured data (like CSV or EDI
files), messages for interprocess communication between
programs, architectural diagrams (like CAD formats), and many
other purposes. A set of tags can be created to capture any
sort of structured information one might want to represent,
which is why XML is growing in popularity as a common standard
for representing diverse information.
INTRODUCTION
-------------------------------------------------------------------
This column arises out of a very practical personal concern of
its author. Electronic document formats are a mess. Over the
years, your author has written a fairly large number of
academic papers (mostly in Humanities topics), and I now wish
to make all these papers available on my web-site (or even just
be able to read them myself). Unfortunately, over the years
my wordprocessors and platforms have changed numerous times,
and I have many documents on disk that were composed using
programs that I no longer own, could probably not obtain if I
wanted to, and are unlikely to run on current computers. In
the best of cases, I have been able to locate conversion
programs that perform a moderately adequate job of converting
to a program I can currently run; and failing that, in some
other cases the original wordprocessor format is mostly ASCII,
with only a moderate amount typographic fluff interspersed.
In other words, my electronic archives are a mess. Many
individuals and organizations are in even worse shape. For
corporations and agencies especially, truly massive numbers of
important archival documents are being lost to changes in
technologies. And these losses only get worse with time.
On the plus side, the means are available to create documents
that will age much better than those I have accumulated. There
are (at least) two keys to creating electronic textual
documents that are resistent to changes in technology.
XML/SGML generally, and DocBook specifically, provide both
keys. The first (and more important, ultimately) key to
time-resistent documents is open standards for document
formats. There are two elements to these open standards:
syntax and semantics (or, "what must a document look like?" and
"what does a document mean?"). The syntax of a DocBook
document is wholly contained in the simple rules of XML markup,
and in the DocBook DTD that is inherent to every DocBook
document. The semantics are slightly less distinct. Certain
semantic features are contained in a DTD (what elements are the
type of thing that can/must occur inside other elements?). And
the DocBook tags are chosen in such a way as to have a certain
"common-sense" semantic content, at least to English speakers.
But some more detailed semantic issues must rely on background
documentation like that in the Resources section, and on
traditions of use and editorial judgements (e.g. what type of
list is appropriate here?).
The second key is of less theoretic importance, but still of
considerable practical significance. How easy is a document
format to interpret and use outside of formal specifications?
An old binary stream format is difficult to make sense of using
a text viewer. An XML document is usually pretty reasonable
looking to a human reader even without formal validation and
processing (of course, plain ASCII is even better for casual
eyeballs). Furthermore, some formats are simply a lot easier
to reconstruct than others, even apart from the presence of
formal specification. Imagine an historian 100 years in the
future who finds two documents: one in MS-Word 97 accompanied
by an MSDN file-format specification CD, and one in an XML
format (even one missing a DTD). This hypothetical historian
has a lot less work in front of her to reconstruct the XML
document's contents (look at the many vendors--including
Microsoft themselves with later versions--that have done a
poor-to-mediocre job of writing converters even with specs).
For that matter, imagine yourself five years in the future,
after your employer has "upgraded" all your workstations to
MS-Office 2005.
As a result of considerations of portability and technological
change, I've decided to start a project of getting my past
academic writing into DocBook format. I believe this project
will help in preserving my writing, as well as in making it
easier to make it available in current and future popular
document formats (via conversions).
WHAT IS DOCBOOK?
------------------------------------------------------------------------
DocBook is an SGML dialect developed by O'Reilly and HaL
Computer Systems in 1991. It is currently maintained by the
Organization for the Advancement of Structured Information
Standards (OASIS). The purpose of DocBook is to describe the
content of articles, books, technical manuals, and other prose
documents. DocBook has a focus on technical writing styles,
but is general enough to describe everything that goes into
most styles of prose writing. An XML variant of the DocBook
DTD is also available (and is the one that will be discussed in
this article, inasmuch as small details differ).
The most important thing to keep in mind in understanding
DocBook is that what is annotated in a DocBook document is
entirely the *semantics* of the document, not its typography or
appearance. This focus on document semantics stands in
contrast to wordprocessors, HTML, and even TeX. Wordprocessor
often allow style-sheets that mark conceptual categories like
"Header, Level 2"--but increasingly most of what they do is
attempt to deliver "what you see is what you get" (WYSIWYG).
And even where style-sheets are used, they are not usually
uniform across documents. Doing this brings in all sorts of
assumptions such as page size and layout, available fonts,
typestyles of elements, etc. Most of these assumptions have
little to do with the actual *concepts* addressed in the
writing, and almost all of them make it more difficult to adapt
the document to a different format (whether a different printed
layout, onscreen display, speech-synthesized versions, indexes
for web-robots, or whatever). HTML started out close to
DocBook (albeit simpler), but increasingly has added
more-and-more typographic tags; what exists is a hodge-podge of
semantics and typography ('
' versus '', for example).
As an easy to understand example, many different conceptual
elements are frequently rendered with italics in printed books.
Different books use different conventions, but any of these
DocBook tags might get rendered as italics when actually
typeset: '', '', '',
'', '', '' (and numerous other
DocBook tags). Of course, any one of them might -not- get
rendered in this manner. The decision how to render these, and
other, elements is really accidental to the nature of the
document considered as concepts; these decisions are the
business of publishers and book-designers, not of authors (or
at least they should be). DocBook gives you the essential
concepts that go into composition of a prose work, not the
accidents of how the work is finally rendered (which might be
many different ways, in fact). Another advantage to DocBook
style conceptual markup, besides the seperation of content and
appearance, is that is lets us do things systematically on
element types. For example, you could quickly identify a list
of every foreign phrase used in a document by searching for all
occurrences of the tag '' (perhaps you decide a
glossary is needed for such phrases). Just looking for
everything marked as italics in a wordprocessor is much less
effective in this goal.
READY, SET, MARK UP!
------------------------------------------------------------------------
My first DocBook project will be a big one, but I'll do it in
increments: convert my doctoral dissertation to DocBook.
Besides being rather long as dissertations go, the specific
document poses several challenges for a documentation system.
It contains a number of names that require roman diacritics
(but no non-European character sets); it has footnotes,
cross-references, page numbering, multiple section levels; it
uses some diagrams; there is a need to approximate some
original typography for commentary, and it uses some unusual
layout for specific effect; it has epigrams, a bibliography,
appendices, a dedication, an abstract; it uses limited, but
required, mathematical notations; it references not just books,
but also URLs and email addresses. Overall, I happen to have
written something that will provide a good workout a large
portion of all the tags in DocBook. The whole dissertation is
already available in its original WordPerfect 7 format, and in
two differently formatted PDF versions. But none of the
versions of all that portable or flexible. DocBook will
improve things on all grounds (of course, for this article, we
will only get as far as the markup, not the processing into
target formats; but eventually everything will be in order).
Enough prefacing, let us create the document:
#----------- Mertz Dissertation XML Document ------------#
]>
&bookinfo;
&chap1;
&chap2;
&chap3;
&chap4;
&chap5;
&chap6;
&chap7;
&chap8;
&appendix1;
&appendix2;
&biblio;
This first step is mostly planning, obviously. Creating the
contents of the component-level elements (e.g. chapters) will
be the real work. But by creating entity references to these
component-level elements, we have divided the creation into more
manageable chunks (and also made it easier to publish/export
the individual chapters as their own documents). What we have
indicated so far is that the type of document being created is
a -book-, and that it will include set of component level
elements pulled in from external files.
Some entities defined at this top level are not used
immediately, but only within the included files. For example,
the entity '&abstract;' is only inserted within the
'bookinfo.sgm' document. Similarly with the sections inside
Chapter 5. It is a judgement call about what to divide out;
but my criterion was that I wanted to break into separate files
the documents I might publish separately. I will probably make
adjustments as I extend this DocBook project. The other type
of entity I decided to define at this point were some names
that do not fit in US-ASCII that I know I mention. I cannot
type the diacritics directly, but typing e.g. '&Zizek;' is an
inconspicuous approximation of what I actually want. You could
also use abbreviations of whole phrases and the like, if you
wished.
INCLUSIONS
------------------------------------------------------------------------
The files that are included using the above master document
setup will consist of bare document root tags and their
contents. No document type declarations or processing
instructions will be included in the included files (this is
required). The document type is already declared in the book
master document, so we can keep it in one place. For example,
the file 'bookinfo.sgm' contains just the following:
#------------ Included XML/SGML Subdocument -------------#
The Speculum and The Scalpel
The Politics of Impotent Representation and
Non-Represenational Terrorism
DavidMertz
&abstract;
Similarly, the chapter files each resepectively start and end
with the '' and '' tags.
One advantage of the modular structure laid out is that it is
easy to pull out individual components for separate
publication. For example, I intend to first convert (and
separately distribute) versions of Chapter 5. In anticipation
of doing so, I have created a smaller wrapper for the chapter
alone:
#---------- Chapter Level Subdocument Wrapper -----------#
]>
Hegemony, and Other Passing Fads
Gould, 1987b, quoting Gunnar Myrdal, An
American Dilemma (1944)
But there must be still other countless errors of the same
sort that no living man can yet detect, because of the fog within which
our type of Western culture envelops us. Cultural influences have set
up the assumptions about the mind, the body, and the universe with which
we begin; pose the questions we ask; influence the facts we seek;
determine the interpretations we give these facts; and direct our
reaction to these interpretations and conclusions.
&chap5_1;
&chap5_2;
&chap5_3;
The bulk of the marked-up content is in the three sections
(each with a top-level 'sect1' as their root). But I have the
option of processing the same section content as part of either
the book-level or chapter-level wrapper (and I will probably
also pull out Section 2 as an 'article' by itself, which obeys
the same structure as a 'chapter').
CONTINUING...
------------------------------------------------------------------------
What this column has begun is only enough for readers (and its
author) to get a general sense of DocBook. Subsequent columns
will address some greater details of the DocBook tags, and how
they are structured. In addition, we have yet to get around to
discussing how to convert DocBook documents to more directly
readable formats, how to validate them, and how to perform
processing operations on them. Stay tuned.
In the meanwhile, it is a good idea to start skimming through
some of the DocBook reference material in the Resources.
DocBook has a lot of tags available--probably more than anyone
is going to remember. But once you get a sense of what types
of tags to look for, and how to put them together, the going
gets easier. It probably will not hurt to keep a reference on
your lap while you work with DocBook (even if you use
specialized tools to help with the editing)
RESOURCES
------------------------------------------------------------------------
By all means, the best place to get started in a more detailed
understanding of DocBook is. The ink-on-paper version is:
_DocBook: The Definitive Guide_, Norman Walsh & Leonard
Muellner, O'Reilly, Cambridge, MA 1999.
If you wish to use an electronic version, refer to:
http://www.docbook.org/tdg/index.html
Organization for the Advancement of Structured Information
Standards (OASIS) home page:
http://www.oasis-open.org/
In some respect, a format even more portable and time-protected
than DocBook is plain ASCII--or "smart ASCII" that incorporates
simple style annotations (in the way evolved on Usenet). Of
course, ASCII will not be able to capture all the semantic
structure of DocBook, but a lot of times you do not need to.
Project Gutenberg is an example of attempts to preserve and
utilize texts in this neutral manner:
http://www.gutenberg.net/
For an important tool with a somewhat overlapping purpose to
DocBook, TeX is a tool worth learning about. The focus of TeX
is closer to typography, but TeX also has many elements of
semantic markup (especially for mathematics). A good starting
point is:
http://www.tug.org/
My own articles (including the draft of this one) have used a
similar "smart ASCII" format for their originals. Markup is
be automated using the tool Txt2Html (also see the ASCII
version of this article):
http://gnosis.cx/cgi-bin/txt2html.cgi
http://gnosis.cx/publish/programming/xml_matters_3.txt
The obscure philosophy dissertation I have undertaken to
convert will probably have minimal interest to most XML
developers (or even make much sense). But the actual formats
used might be of some interest. The document was written
originally in WordPerfect 7 (with portions imported from other
wordprocessor format along the way). A moderate effort was
made to use stylesheets for significant elements to make global
changes easier. As one prior attempt at (web) publication, I
output the document to PDF format in a style that
typographically resembles a printed magazine/journal articles
more than a submitted thesis. PDF is not a bad format, but it
fails to seperate content from layout (or merely does not
attempt to):
http://gnosis.cx/publish/mertz/Dissertation_PageFormat.pdf
http://gnosis.cx/publish/mertz/master.dis
Files used and mentioned in this article:
http://gnosis.cx/download/xml_matters_3.zip
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
It might be catachretic, but it is not a malapropism to describe
David Mertz juxtopositions of interests herein as sylleptic. Words
is words. David may be reached at mertz@gnosis.cx; his life pored
over at http://gnosis.cx/publish/. Suggestions and recommendations
on this, past, or future, columns are welcomed.