David Mertz, Ph.D.
Floating Signifier, Gnosis Software, Inc.
December, 2002
The document format called reStructuredText has been adopted as (one of) the official source format(s) for Python documentation, but is also useful for other types of documentation. reStructuredText is an interesting hybrid of technologies--in syntax and appearance it is similar to other "almost-plaintext" formats, but in semantics and API it is very close to XML. Moreover, existing tools can transform reStructuredText into several XML dialects (Docutils, DocBook, OpenOffice), along with other useful formats like LaTeX, HTML, and PDF.
reStructuredText
Previous articles I have written for the XML Zone have looked at alternatives to XML--document formats that satisfy many of the same purposes for which you might use XML. reStructuredText continues this tradition. In contrast to YAML, which is good for data formats, reStructuredText is designed for documentation; in contrast to smart ASCII, reStructuredText is heavier, more powerful, and more formally specified. All of these formats, in contrast to XML, are easy and natural to read and edit with standard text editors. Working with XML more-or-less requires specialized XML editors, such as those I have reviewed previously.
reStructuredText--frequently abbreviated as reST--is part of the Python Docutils project. The goal of this project is to create a set of tools for manipulating plaintext documents, including exporting them to structured formats like HTML, XML, and TeX. While this project comes from the Python community, the needs it addresses extend beyond Python. Programmers and writers of all stripes frequently create documents such as READMEs, HOWTOs, FAQs, application manuals, and in Python's case PEPs (Python Enhancement Proposals). For these types of documents, requiring users to deal with verbose and difficult formats like XML or LaTeX is not generally reasonable, even if those users are programmers. But it is still often desirable to utilize these types of documents for purposes beyond simple viewing: i.e., indexing, compilation, pretty-printing, filtering, etc.
For Python programmers, the Docutils tools can serve much the same purpose that JavaDoc serves for Java programmers, or POD for Perl programmers. The documentation within Python modules can be converted to Docutils "document trees", and thence to various output formats (usually within a single script). But for this article, the more interesting use is for general documentation. For articles like this, and even for my forthcoming book, I write using smart ASCII; but I am coming to feel I would be better off with the formality of reStructuredText (and I may develop tools to convert my existing documents).
As of this writing, the Docutils project is under development, and has not released a "stable" version. The tools that exist are good, but the overall project is a mixture of promises, good intentions, partial documentation, and some actual working tools. However, progress is steady, and what you can do already is very useful.
reStructuredText
Readers will get a better sense of what reStructuredText is about from a brief example. The following text is an example from PEP 287 (part of a hypothetical PEP):
    Abstract

        This PEP proposes adding frungible doodads [1] to the core.
        It extends PEP 9876 [2] via the BCA [3] mechanism.

    ...

    References and Footnotes

        [1] http://www.example.org/

        [2] PEP 9876, Let's Hope We Never Get Here
            http://www.python.org/peps/pep-9876.html

        [3] "Bogus Complexity Addition"
This is exactly how PEPs prior to 287 were formatted. If reStructuredText is used to mark up the same PEP, it could look like:
    Abstract
    ========

    This PEP proposes adding `frungible doodads`_ to the core.  It
    *extends* PEP 9876 [#pep9876]_ via the BCA [#]_ mechanism.

    ...

    References & Footnotes
    ======================

    .. _frungible doodads: http://www.example.org/

    .. [#pep9876] PEP 9876, Let's Hope We Never Get Here

    .. [#] "Bogus Complexity Addition"
There are a few details that differ from the plaintext. But readability is really not harmed by the very light sprinkling of special characters. You would not need to look twice to read this if you saw it in a text editor or a printed page.
The reST-formatted document above can be automatically transformed into an XML dialect, such as that defined by the Docutils Generic DTD:
    <?xml version="1.0" encoding="UTF-8"?>
    <document source="test">
      <section id="abstract" name="abstract">
        <title>Abstract</title>
        <paragraph>This PEP proposes adding <reference
          refname="frungible doodads">Frungible doodads</reference>
          to the core. It <emphasis>extends</emphasis> <reference
          refuri="http://www.python.org/peps/pep-9876.html">PEP 9876</reference>
          <footnote_reference auto="1" id="id1" refname="pep9876"/>
          via the BCA <footnote_reference auto="1" id="id2"/>
          mechanism.</paragraph>
        <paragraph>...</paragraph>
      </section>
      <section id="references-footnotes" name="references &amp; footnotes">
        <title>References &amp; Footnotes</title>
        <target id="frungible-doodads" name="frungible doodads"
          refuri="http://www.example.org/"/>
        <footnote auto="1" id="pep9876" name="pep9876">
          <paragraph><reference
            refuri="http://www.python.org/peps/pep-9876.html">PEP 9876</reference>,
            Let's Hope We Never Get Here</paragraph>
        </footnote>
        <footnote auto="1" id="id3">
          <paragraph>"Bogus Complexity Addition"</paragraph>
        </footnote>
      </section>
    </document>
You can see several things in contrasting these three formats. The most dramatic difference is how much harder it is to skim the XML version. But it is also notable just how much information the reStructuredText tools have located in the reST document. References of several types are properly matched up, document sections are identified, and character-level typographic markup is added. In other examples, linked TOCs can be generated during processing, along with other special directives.
The docutils package consists of quite a few subpackages, in a fairly complicated relationship to each other. PEP 258, Docutils Design Specification, contains a chart that is useful for understanding the overall pattern:
                     +---------------------------+
                     |        Docutils:          |
                     | docutils.core.Publisher,  |
                     | docutils.core.publish_*() |
                     +---------------------------+
                      /            |            \
                     /             |             \
                    /              |              \
           +--------+       +-------------+       +--------+
           | READER | ----> | TRANSFORMER | ====> | WRITER |
           +--------+       +-------------+       +--------+
            /     \\                                  |
           /       \\                                 |
          /         \\                                |
    +-------+   +--------+                        +--------+
    | INPUT |   | PARSER |                        | OUTPUT |
    +-------+   +--------+                        +--------+
A more complete explanation of the component subpackages is contained in that PEP, but a brief explanation is worth repeating here.
The heavy work of converting a reST text into a tree of nodes is done by the docutils.parsers.rst subpackage. The reStructuredText parser treats a source in a line-oriented fashion, looking for a state transition on each line; if none of the other transition patterns are found, the text transition catches the line. Transitions consist of features like change in indentation, special leading symbols, and so on. The default just includes the next line as more text within the current node.
This structure is similar to that used in the smart ASCII parsers txt2dw and txt2html. Other parsers would live under the docutils.parsers hierarchy, but none are currently provided. There is an experimental Python source code parser, though, which treats a Python source file as a document tree.
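The line-oriented approach can be illustrated with a toy parser. What follows is only a sketch of the general technique, using the standard library alone; the transition names, the patterns, and the classify() and parse() helpers are invented for illustration and are not the ones Docutils uses. Each line is tested against an ordered list of transition patterns, and a line that matches none of them falls through to the catch-all text transition.

```python
import re

# Ordered (name, pattern) pairs; the first pattern that matches wins.
# Toy patterns for illustration only, not those used by Docutils.
TRANSITIONS = [
    ("blank", re.compile(r"^\s*$")),
    ("underline", re.compile(r"^(=+|-+|~+)\s*$")),
    ("explicit_markup", re.compile(r"^\.\.( |$)")),
    ("indent", re.compile(r"^\s+\S")),
]


def classify(line):
    """Return the name of the first matching transition, else 'text'."""
    for name, pattern in TRANSITIONS:
        if pattern.match(line):
            return name
    return "text"  # the catch-all transition


def parse(source):
    """Group the lines of source into (kind, text) nodes."""
    nodes = []
    current = None  # the node still accepting text, as [kind, text]
    for line in source.splitlines():
        kind = classify(line)
        if kind == "blank":
            current = None  # a blank line closes the current node
        elif kind == "underline" and current and current[0] == "paragraph":
            current[0] = "title"  # underlined text becomes a title
            current = None
        elif kind == "explicit_markup":
            current = ["explicit", line[3:].strip()]
            nodes.append(current)
        elif current is not None:
            # Default: the line is more text within the current node.
            current[1] += " " + line.strip()
        else:
            start = "block_quote" if kind == "indent" else "paragraph"
            current = [start, line.strip()]
            nodes.append(current)
    return [tuple(node) for node in nodes]


SAMPLE = """\
Abstract
========

This PEP proposes adding `frungible doodads`_ to the core.  It
*extends* PEP 9876 [#pep9876]_ via the BCA [#]_ mechanism.

.. [#] "Bogus Complexity Addition"
"""

for kind, text in parse(SAMPLE):
    print(kind, "|", text)
```

On the sample, this yields a title node, one paragraph node that has absorbed both of its source lines, and one explicit-markup node. A real parser keeps far more state (nesting, inline markup, directive arguments), but the shape of the loop is the same.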
Once a tree of nodes is generated for a document, the docutils.transforms subpackage is enlisted to massage the tree in various ways. For example, if you have specified a directive to include a table-of-contents, the document tree is walked to identify listed items. Also, some cleanup of references and links is performed at this stage. During the initial pass, the places in the tree where unresolved elements will go are filled with placeholders that cue the transformations.
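The placeholder idea can be sketched in a few lines. This is a toy model under stated assumptions, not Docutils code: the Node class, the "pending_contents" tag, and the resolve_contents() function are all hypothetical names chosen for the illustration. The parser is imagined to have left a placeholder where a table of contents was requested, and a later pass walks the finished tree, collects section titles, and swaps the placeholder for a list.

```python
class Node:
    """A minimal tree node: a tag, optional text, and child nodes."""

    def __init__(self, tag, text="", children=None):
        self.tag = tag
        self.text = text
        self.children = list(children or [])

    def walk(self):
        """Yield this node and all of its descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()


def resolve_contents(document):
    """Replace each 'pending_contents' placeholder with a bullet list."""
    # First pass: collect the title of every section in the tree.
    titles = [
        child.text
        for node in document.walk() if node.tag == "section"
        for child in node.children if child.tag == "title"
    ]
    # Second pass: swap placeholders for the generated list.
    for node in list(document.walk()):
        for i, child in enumerate(node.children):
            if child.tag == "pending_contents":
                items = [Node("list_item", title) for title in titles]
                node.children[i] = Node("bullet_list", children=items)
    return document


doc = Node("document", children=[
    Node("pending_contents"),  # left behind by the (imaginary) parser
    Node("section", children=[Node("title", "Abstract"),
                              Node("paragraph", "...")]),
    Node("section", children=[Node("title", "References & Footnotes")]),
])
resolve_contents(doc)
toc = doc.children[0]
print(toc.tag, [item.text for item in toc.children])
```

The point of deferring the work is that the placeholder can appear before the sections it lists; only after the whole tree exists is there enough information to fill it in.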
Of most interest to readers of this article are probably the various docutils.writers modules. Some of the more interesting writers are still kept in the experimental "sandbox" area at the time of this writing (check the Docutils website), but the principles are the same in any case. A writer module should define a Writer class that inherits from docutils.writers.Writer. This Writer class defines some settings, but mostly defines a .translate() method that might look something like:
    def translate(self):
        visitor = DocBookTranslator(self.document)
        self.document.walkabout(visitor)
        self.output = visitor.astext()
The writer, as you can see, depends on a "visitor" that knows what to do with nodes of each type. A visitor will generally inherit from docutils.nodes.NodeVisitor. Programming a visitor is quite a lot like programming a SAX, expat, REXML, or other event-oriented XML parser. However, a visitor is even closer to the programming style of Python's xmllib module. That is, a visitor will have a .visit_FOO() and .depart_FOO() method for each type of node, rather than switching on type within large .startElement() and .endElement() methods. OOP purists are likely to prefer this style. A simple example from the DocBook/XML writer is:
    class DocBookTranslator(nodes.NodeVisitor):
        [...lots of methods...]
        def visit_block_quote(self, node):
            self.body.append(self.starttag(node, 'blockquote'))
        def depart_block_quote(self, node):
            self.body.append('</blockquote>\n')
        [...lots more methods...]
Programming a custom writer/visitor is straightforward enough, and writers already exist for Docutils/XML, HTML, PEP-HTML, PseudoXML (a sort of "light" XML that combines start tags with indentation, but no closing tags), LaTeX, DocBook/XML, PDF, OpenOffice/XML, and Wiki-HTML.
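To make the visit/depart dispatch concrete, here is a self-contained sketch of the pattern that does not depend on Docutils at all. The ToyNode and TinyHTMLTranslator classes are hypothetical stand-ins written for this illustration; only the method-naming convention and the walkabout() name are borrowed from the discussion above.

```python
import html


class ToyNode:
    """A stand-in for a document tree node (not a Docutils class)."""

    def __init__(self, tag, *children):
        self.tag = tag
        self.children = children  # ToyNode instances or plain strings

    def walkabout(self, visitor):
        """Call visit_TAG, descend into the children, then call depart_TAG."""
        getattr(visitor, "visit_" + self.tag)(self)
        for child in self.children:
            if isinstance(child, ToyNode):
                child.walkabout(visitor)
            else:
                visitor.visit_text(child)
        getattr(visitor, "depart_" + self.tag)(self)


class TinyHTMLTranslator:
    """Accumulates HTML fragments, one method pair per node type."""

    def __init__(self):
        self.body = []

    def visit_text(self, text):
        self.body.append(html.escape(text))

    def visit_block_quote(self, node):
        self.body.append("<blockquote>\n")

    def depart_block_quote(self, node):
        self.body.append("</blockquote>\n")

    def visit_paragraph(self, node):
        self.body.append("<p>")

    def depart_paragraph(self, node):
        self.body.append("</p>\n")

    def visit_emphasis(self, node):
        self.body.append("<em>")

    def depart_emphasis(self, node):
        self.body.append("</em>")

    def astext(self):
        return "".join(self.body)


tree = ToyNode("block_quote",
               ToyNode("paragraph",
                       "It ", ToyNode("emphasis", "extends"),
                       " PEP 9876 & more."))
visitor = TinyHTMLTranslator()
tree.walkabout(visitor)
print(visitor.astext())
```

Supporting a new node type means adding one more pair of methods, and a node type without a handler fails with an AttributeError naming the missing method, which makes gaps in a half-finished writer easy to spot.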
You can transform a reStructuredText document into a tree of nodes that can be manipulated in a DOM-like fashion. Below is an example using the brief reST PEP shown earlier.
    >>> txt = open('pep.txt').read()
    >>> def rst2tree(txt):
    ...     import docutils.parsers.rst
    ...     import docutils.utils
    ...     parser = docutils.parsers.rst.Parser()
    ...     document = docutils.utils.new_document("test")
    ...     document.settings.tab_width = 4
    ...     document.settings.pep_references = 1
    ...     document.settings.rfc_references = 1
    ...     parser.parse(txt, document)
    ...     return document
    ...
    >>> doc = rst2tree(txt)
    >>> doc.children
    [<section "abstract": <title...><paragraph...><paragraph...>>,
     <section "references & footnotes": <title...>
     <target "frungible doodads"...><footnote "pep9 ...>]
    >>> print doc.autofootnotes
    [<footnote "pep9876": <paragraph...>>, <footnote: <paragraph...>>]
    >>> print doc.autofootnotes[0].rawsource
    PEP 9876, Let's Hope We Never Get Here
One thing to notice in contrast with DOM is that reStructuredText is already a fixed document dialect. So rather than use generic methods to search for matching nodes, you can search for nodes using attributes named for their meaning. The .children attribute is generically hierarchical, but most attributes collect nodes of a given type.
One convenient method of reST nodes is .pformat(), which produces a pseudo-XML representation of the document tree for pretty-printing. E.g.:
    >>> print doc.autofootnotes[0].pformat(' ')
    <footnote auto="1" id="pep9876" name="pep9876">
     <paragraph>
      <reference refuri="http://www.python.org/peps/pep-9876.html">
       PEP 9876, Let's Hope We Never Get Here
Node methods like .remove(), .copy(), .append(), and .insert() are useful for pruning and manipulating trees.
For XML programmers, a possibly more desirable API is actual DOM. Fortunately, this API is a single method call away:
    >>> dom = doc.asdom()
    >>> foot0 = dom.getElementsByTagName('footnote')[0]
    >>> print foot0.toprettyxml(' ')
    <footnote auto="1" id="pep9876" name="pep9876">
     <paragraph>
      <reference refuri="http://www.python.org/peps/pep-9876.html">
       PEP 9876
      </reference>
      , Let's Hope We Never Get Here
     </paragraph>
    </footnote>
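Once the tree is a DOM, any DOM-aware code can work with it. The sketch below uses only the standard library's xml.dom.minidom on a hand-copied fragment of the XML shown earlier, so it runs without Docutils installed. The FOOTNOTE_XML string and the node_text() helper are assumptions made for the example, not part of any Docutils API.

```python
from xml.dom import minidom

# A fragment of the Docutils XML shown earlier, copied by hand.
FOOTNOTE_XML = """\
<section id="references-footnotes" name="references &amp; footnotes">
  <footnote auto="1" id="pep9876" name="pep9876">
    <paragraph><reference refuri="http://www.python.org/peps/pep-9876.html">PEP 9876</reference>, Let's Hope We Never Get Here</paragraph>
  </footnote>
  <footnote auto="1" id="id3">
    <paragraph>"Bogus Complexity Addition"</paragraph>
  </footnote>
</section>
"""


def node_text(node):
    """Concatenate all of the text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


dom = minidom.parseString(FOOTNOTE_XML)
footnotes = dom.getElementsByTagName("footnote")

# Generic DOM calls: search by tag name, then read attributes.
ids = [foot.getAttribute("id") for foot in footnotes]
links = [ref.getAttribute("refuri")
         for ref in dom.getElementsByTagName("reference")]
first_text = node_text(footnotes[0]).strip()

print(ids)
print(links)
print(first_text)
```

Compared with the node attributes used earlier, such as .autofootnotes, the DOM calls are more generic and more verbose, but they are the same calls an XML programmer would use on any other document.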
Unfortunately, as of this writing, there are no tools or functions to convert a DOM tree or XML document back into reStructuredText. It would be nice, especially, to have a reader for the Docutils Generic DTD; this would let us produce a reST document tree for the corresponding XML. We could write it back out as reST with the .astext() node method. It would not be hard to write such a reader, and I am sure it will happen over time (perhaps by me or one of my readers).
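As a rough indication of what the reverse direction might involve, the following sketch converts a very small subset of the XML dialect (sections, titles, paragraphs, and emphasis) back into reST text. It is a toy under stated assumptions rather than the reader described above: it uses the standard library's xml.etree.ElementTree, the inline() and to_rest() helpers are hypothetical, every title gets the same underline regardless of nesting, and references, footnotes, and targets are ignored entirely.

```python
import xml.etree.ElementTree as ET

# A hand-written document using only the elements this toy understands.
DOCUTILS_XML = """\
<document source="test">
  <section id="abstract" name="abstract">
    <title>Abstract</title>
    <paragraph>It <emphasis>extends</emphasis> PEP 9876 via the BCA mechanism.</paragraph>
  </section>
</document>
"""


def inline(element):
    """Render mixed content, wrapping <emphasis> children in asterisks."""
    parts = [element.text or ""]
    for child in element:
        text = inline(child)
        parts.append("*%s*" % text if child.tag == "emphasis" else text)
        parts.append(child.tail or "")
    return "".join(parts)


def to_rest(element):
    """Return reST lines for the children of element."""
    lines = []
    for child in element:
        if child.tag == "title":
            text = inline(child)
            lines += [text, "=" * len(text), ""]
        elif child.tag == "paragraph":
            lines += [inline(child), ""]
        else:
            # Containers such as <section>: just recurse into them.
            lines += to_rest(child)
    return lines


rest = "\n".join(to_rest(ET.fromstring(DOCUTILS_XML)))
print(rest)
```

A usable converter would need to track section depth to choose underline characters, and to regenerate hyperlink targets and footnote labels, which is where most of the real work lies.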
The Docutils website is at the URL below. You can find extensive references both for the reStructuredText format itself, and for the docutils package.
http://docutils.sourceforge.net/
Python Enhancement Proposal 287 recommends the use of reStructuredText for inline documentation of Python code. This PEP also usefully contrasts reST with other documentation formats considered for the same purpose (XML, TeX, HTML, POD, SEText, etc.).
http://docutils.sourceforge.net/spec/pep-0287.html
The Docutils Generic XML DTD can be found at:
http://docutils.sourceforge.net/spec/docutils.dtd
You can read about the smart ASCII format, and converting it to the XML format used by developerWorks at:
http://www-106.ibm.com/developerworks/library/x-tipt2dw.html
I wrote about YAML, a data-oriented alternative to XML, at:
http://www-106.ibm.com/developerworks/library/x-matters23.html
David Mertz wishes to let a thousand flowers bloom. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/.