XML Zone Tip: The txt2dw Utility

Writing developerWorks content the easy way


David Mertz, Ph.D.
Luddist, Gnosis Software, Inc.
October, 2001

IBM developerWorks is moving towards a custom XML dialect as the source format for the articles that appear here. But writing XML is always going to be difficult for people, however easy it is for machines. One approach to the "human interface" problem is the public domain txt2dw utility that author David Mertz uses for his own articles.

Source Formats And Human Interfaces

There are lots of good reasons why the organization you work at is likely to adopt XML dialects for many of its documentation needs. For all these same reasons, IBM developerWorks--starting with the XML Zone--has developed its own XML DTD for articles. Once you have an XML source, either a shared standard like DocBook or TEI or an in-house dialect, it is easy to transform the source into arbitrary target formats (HTML, PDF, other XML, etc.). Moreover, validation against a DTD provides a nice check that a document contains all the parts it needs to have, with all the right relationships between them. In addition, XML is a much more platform- and tool-neutral format than those used by proprietary (or even open source) word processors and publishing applications.

The problem with XML, however, is that it is a really crummy human interface. Even though XML is just ASCII bytes, typing them in a text editor is a lot of work. Besides the litter of angle brackets and punctuation that interrupts the flow of a touch typist, it is difficult to make sure consistently that every tag gets closed in the right order; and it is especially difficult to know a moderately complex DTD well enough to remember exactly what elements and attributes are allowed at each point in a document. Worst of all, the cruft of XML tags makes visually scanning a document significantly harder.

Making It Easy For Writers

At least two approaches ease the pain of editing XML documents with a text editor. One approach is to use a higher level tool for the editing. An XML-aware editor can automate conformance with a DTD, and some of them can even hide or highlight the XML markup to make visual scanning easier. Many IBM writers, myself included, are particularly fond of XMetal, but many excellent programs exist. All these programs, however, run on specific platforms; they each have their own set of quirks (different from those of a favorite text editor); and many of them will set you back a large number of dollars.

The second approach is the one txt2dw takes. Let writers write using the tools that don't get in their way. Then let computers worry about how the documents need to be formatted. Word processors try to take this approach, but the state of tools for getting from a word processor to XML is still crude. To my own mind, a better idea is to use the "smart ASCII" markup format that has informally evolved in email, on Usenet, and in READMEs and project documentation for open source software projects. One can formalize it just a little bit without getting in the way of writers (while simultaneously aiding the converter).

Using 'txt2dw'

Using txt2dw could hardly be simpler. The utility reads "smart ASCII" input from STDIN and writes valid XML to STDOUT. For example:

% txt2dw.py < MyArticle.txt > MyArticle.xml
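
The filter shape behind that command line can be sketched in a few lines of Python. This is not the txt2dw source: the names convert and main are hypothetical, and convert is a placeholder that only escapes XML special characters and wraps the text in an invented <article> element.

```python
import sys
from xml.sax.saxutils import escape


def convert(text):
    """Placeholder conversion, NOT the real txt2dw logic.

    Escapes XML special characters and wraps the result in a
    hypothetical <article> root element.
    """
    return "<article>\n%s</article>\n" % escape(text)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read the whole document from STDIN and write the result to STDOUT,
    # leaving all file naming to shell redirection.
    stdout.write(convert(stdin.read()))

# A script built this way would end with a call to main().
```

Because the streams are parameters, the same function can be exercised with in-memory file objects as easily as with shell redirection.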

At this point, one has an XML-formatted document. Most likely, the eventual target will be something other than XML. In my own case--and for many writers--the eventual target format is not really all that interesting (that is for editors and publishers to worry about, and change as needed). All that really matters is that, if the utility does its job, the XML version is valid according to article.dtd.

However, someone will want to transform the XML to something else. XSLT is a common transformation technique, and one for which IBM developerWorks uses the custom stylesheet article-html.xsl. Assuming you want the HTML version developerWorks will use, just run something like:

% xslt article-html.xsl < MyArticle.xml > MyArticle.html

The exact details will vary with the XSLT engine one uses, but the idea will be the same.

"Smart ASCII" Format

For the most part, "smart ASCII" is what you have been writing for years if you use email and the Usenet. Most of the details are documented at the top of the script. Asterisks surround bold or heavily emphasized phrases; dashes surround italicized or lightly emphasized phrases; underscores introduce Book or Series Titles. I have adopted the use of single quotes to set apart appnames and filenames (usually rendered in a fixed font), and square brackets to indicate libraries and modules. Take a look at the ASCII version of this Tip in the Resources for how these features started out. These conventions are not quite universal, but they will also not be unfamiliar to readers. They are all very quick to type.
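
As a rough illustration of how such inline markers can be recognized, here is a minimal Python sketch. It is not taken from txt2dw: the function name markup_inline and the element names (b, i, cite, code, module) are invented for the example and are not the ones article.dtd defines. The rule that a marker must sit at a word edge is also an assumption, made so that ordinary hyphens and apostrophes are left alone.

```python
import re
from xml.sax.saxutils import escape

# (opening marker, closing marker, hypothetical element name).
# The element names are invented for illustration only.
INLINE_RULES = [
    (r"\*", r"\*", "b"),       # *bold or heavy emphasis*
    (r"-", r"-", "i"),         # -italic or light emphasis-
    (r"_", r"_", "cite"),      # _Book or Series Title_
    (r"'", r"'", "code"),      # 'appname' or 'filename'
    (r"\[", r"\]", "module"),  # [library] or [module]
]


def markup_inline(text):
    """Turn "smart ASCII" inline markers into XML-style tags."""
    text = escape(text)  # protect &, < and > before adding tags
    for opener, closer, tag in INLINE_RULES:
        # Markers must sit at a word edge, so the hyphen in
        # "platform-neutral" and the apostrophe in "don't" are ignored.
        pattern = r"(?<!\w)%s(\w(?:[^\n]*?\w)?)%s(?!\w)" % (opener, closer)
        text = re.sub(pattern, r"<%s>\1</%s>" % (tag, tag), text)
    return text
```

A sketch this naive will still misfire on some input (a double dash followed later by another dash, for instance), which is one reason a real converter needs the conventions formalized a little.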

Anything that looks like a URL is turned into a link automatically. A fairly simple special format with curly braces and the ALT text before a colon is used to insert images, such as charts and graphs. See the author blurb for an example with the author's photograph.
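
The two rules in this paragraph can be sketched together in Python. Everything named here is hypothetical: link_and_embed, TOKEN_RE, the loose definition of "looks like a URL", and the HTML-style a and img elements all stand in for whatever txt2dw and article.dtd actually use.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# One pattern for both cases, so a URL inside an image marker is never
# linked a second time. Both sub-patterns are assumptions.
TOKEN_RE = re.compile(
    r"\{(?P<alt>[^:{}]+):\s*(?P<src>[^\s{}]+)\}"  # {ALT text: URL}
    r"|(?P<url>(?:https?|ftp)://[^\s<>{}]+)"      # bare URL
)


def _render(match):
    url = match.group("url")
    if url:
        # Trailing sentence punctuation is usually not part of the URL.
        trimmed = url.rstrip(".,;:!?)")
        tail = url[len(trimmed):]
        return "<a href=%s>%s</a>%s" % (quoteattr(trimmed), escape(trimmed), tail)
    return "<img src=%s alt=%s/>" % (
        quoteattr(match.group("src")),
        quoteattr(match.group("alt").strip()),
    )


def link_and_embed(text):
    """Link bare URLs and expand {ALT: URL} markers; escape the rest."""
    out, pos = [], 0
    for match in TOKEN_RE.finditer(text):
        out.append(escape(text[pos:match.start()]))
        out.append(_render(match))
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

Handling both markers in a single pass is a design choice of the sketch: it avoids the ordering problem that arises when one substitution rewrites text the other has already produced.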

At a paragraph level, a few types of paragraphs are allowed, and are indicated by indentation level. Headers are not indented; in addition, any header line that consists just of a row of dashes is stripped out (this helps prettify the ASCII originals). Regular text paragraphs are indented two spaces. Block quotes are indented four spaces. Code samples are indented six spaces (or more). If a code sample begins with a line that consists of a pound sign, some dashes, a title, some more dashes, then another pound sign, that line is treated as a label for the code sample (in many programming languages, it would be a comment line anyway). If not, no harm is done.
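
The indentation rules above amount to a small classifier. The following Python sketch is an assumption-laden illustration, not the txt2dw code: split_blocks and classify are hypothetical names, blocks are assumed to be separated by blank lines, and odd indentation (three or five spaces) is assumed to fall into the next lower category.

```python
import re

# Matches a label line such as "#----- Title of code sample -----#".
CODE_TITLE_RE = re.compile(r"^#-+\s*(.*?)\s*-+#$")


def split_blocks(text):
    """Split a document into blocks separated by blank lines."""
    return re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))


def classify(block):
    """Return (kind, title, body) for one block, judged by indentation."""
    lines = block.splitlines()
    first = lines[0]
    indent = len(first) - len(first.lstrip(" "))
    if indent >= 6:
        title = None
        match = CODE_TITLE_RE.match(first.strip())
        if match:
            # The label line becomes the title and leaves the body.
            title, lines = match.group(1), lines[1:]
        # Drop the six-space indent but keep any deeper nesting.
        return ("code", title, "\n".join(line[6:] for line in lines))
    if indent >= 4:
        return ("quote", None, " ".join(line.strip() for line in lines))
    if indent >= 2:
        return ("para", None, " ".join(line.strip() for line in lines))
    # Unindented: a header. Rows of dashes are decoration only.
    kept = [line.strip() for line in lines if set(line.strip()) != {"-"}]
    return ("header", None, " ".join(kept))
```

Splitting on blank lines means a code sample containing an empty line would be cut in two; a fuller version would merge adjacent code blocks.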

There are a few features of txt2dw that are more rigid than I would like. These were concessions to the fairly rigid format of article.dtd. On the plus side, the rigid constraints were exactly the conventions I had adopted anyway, so obeying them was not difficult. Moreover, none of them look odd or unnatural (but one still has to remember to use these features, or create a template that does so). A few moderately intelligent changes are made when ALLCAPS sections are encountered. A usable template is below (as a code sample):

Template for 'txt2dw' "smart ASCII" source

SERIES: Main Title
Subtitle

Author Name
Title, Affiliation
Date

    Abstract of the article (block quote indented)...

FIRST SECTION
----------------------------------------------------------

  Regular paragraph...

      #----- Title of code sample -----#
      Sample code line 1
      [...]

  Regular paragraph...

MORE SECTIONS...
----------------------------------------------------------

  [...]

  {Picture of Author: http://mysite/mypic.png}
  Author blurb...
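
To show why the fixed opening of the template helps a converter, here is a hedged Python sketch that reads the first two blocks into named fields. The function name parse_front_matter and the field names are invented for this example; the real script may treat these lines quite differently.

```python
def parse_front_matter(text):
    """Read the template's title block and byline into a dict.

    Assumes the layout shown in the template: "SERIES: Main Title" and
    a subtitle, a blank line, then author, title/affiliation, and date.
    """
    blocks = text.strip("\n").split("\n\n") + [""]
    # Pad both blocks so missing lines give empty fields, not errors.
    title_lines = blocks[0].splitlines() + ["", ""]
    byline = blocks[1].splitlines() + ["", "", ""]
    series, sep, title = title_lines[0].partition(":")
    if not sep:
        # No colon: treat the whole first line as the title.
        series, title = "", series
    return {
        "series": series.strip(),
        "title": title.strip(),
        "subtitle": title_lines[1].strip(),
        "author": byline[0].strip(),
        "role": byline[1].strip(),
        "date": byline[2].strip(),
    }
```

Fields found this way map naturally onto the required elements of a rigid DTD, which is the trade-off the fixed layout buys.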


Resources

The txt2dw.py utility can be downloaded from:

http://gnosis.cx/download/txt2dw.py

Users who want to include Python source code examples might want to pick up the supporting module dw_colorize (other languages might be supported later):

http://gnosis.cx/download/dw_colorize.py

This article, in its original "smart ASCII" form, can be found at:

http://gnosis.cx/publish/programming/txt2dw_tip.txt

Terence Parr has written a wonderful installment of his Soapbox column, called "Humans should not have to grok XML." I couldn't agree more:

http://www-106.ibm.com/developerworks/xml/library/x-sbxml.html

I looked at a number of custom XML editors in my column "XML Matters #6: A roundup of editors." Find that at:

http://www-106.ibm.com/developerworks/library/x-matters6.html

"Smart ASCII" can also be converted directly to HTML using a related utility, Txt2Html. That was discussed in "Charming Python: Converting text to HTML using Txt2Html":

http://www-106.ibm.com/developerworks/library/python3.html

The DocBook dialect of XML, and many of the reasons one would want to use XML for prose-oriented documents, were discussed in several of my XML Matters installments:

Getting started with the DocBook XML dialect: http://www-106.ibm.com/developerworks/library/xml-matters3.html
Getting comfortable with the DocBook XML dialect: http://www-106.ibm.com/developerworks/library/x-matters4.html
Transforming DocBook documents using XSLT: http://www-106.ibm.com/developerworks/library/x-matters5.html

About The Author

[Picture of Author]
David Mertz greatly welcomes feedback on ways to tweak and improve txt2dw, or any of his public domain utilities. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/.