David Mertz, Ph.D.
Subsumer, Gnosis Software, Inc.
May, 2002
Most hitherto existing XML APIs have enforced well-formedness at a programmatic level, but hardly any can guarantee validity. This is a serious weakness in the whole field of XML processing. This installment discusses its author's gnosis.xml.validity library for enforcing validity in Python objects intended for XML serialization.
A tip I wrote previously for IBM developerWorks' XML Zone took a conceptual look at reconciling object-oriented programming techniques with XML validity constraints. This installment of XML Matters presents an early version of an actual Python module for doing it. One could create an analogous capability in other programming languages, but Python provides particularly versatile reflection mechanisms, and allows clear expression of validity constraints.
On the face of it, Python--with its extremely dynamic (albeit strict) typing--might seem like a strange choice for implementing what is, essentially, an elaborate type system. But any oddness one might perceive is superficial. In fact, the type systems of languages like Java, C++, or C#, while static, are far too impoverished to offer much meaningful help to XML validity constraints. A pure functional language like Haskell might offer type hierarchies, discriminated unions, quantification and existential types, and so on, but OOP languages typically lack these things. In statically typed OOP languages, one would have to build just as much custom validation into a library as does the currently discussed Python library.
The module gnosis.xml.validity can helpfully be contrasted with several other XML-related libraries. Two other libraries that have been incorporated into the author's gnosis.xml package were discussed in earlier articles. gnosis.xml.pickle is able to produce a specialized XML serialization of any Python object whatsoever. As with Python's standard pickle and cPickle modules, this provides a way to save and restore objects. gnosis.xml.objectify operates in the reverse direction: given an arbitrary XML document, we can generate a "Pythonic" object (with a slight loss of information about the original XML).
Python's standard library includes support for DOM and SAX processing of XML documents. Widely used third-party Python packages extend the support to include XSLT processing. DOM (specifically xml.dom.minidom) offers a rather heavy API for OOP-style manipulation of XML documents--with methods common across DOM implementations in many programming languages. SAX treats an XML document as a series of parsing events, and basically allows a procedural programming style. XSLT declares a set of rules for transforming an XML document into something else (such as a different XML document).
All of these libraries are useful, but none of them prevents an application from modifying an XML-representation object in ways that break the validity of the underlying XML. For example, deleting, adding, or moving a DOM node can easily create a DOM hierarchy that cannot be dumped into a valid XML document.
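To make the point concrete, here is a small sketch using only the standard library's xml.dom.minidom. It assumes, purely for illustration, a rule that a dissertation must contain at least one chapter; the DOM API has no knowledge of such a rule and removes the nodes without complaint.

```python
from xml.dom import minidom

SOURCE = (
    "<dissertation>"
    "<chapter><title>About Validity</title>"
    "<paragraph>It is a good thing</paragraph></chapter>"
    "</dissertation>"
)

doc = minidom.parseString(SOURCE)
root = doc.documentElement

# Remove every <chapter>. No error is raised, even though the assumed
# rule ("at least one chapter") is now violated.
for chapter in list(root.getElementsByTagName("chapter")):
    root.removeChild(chapter)

# The result is still well-formed XML, but no longer a valid dissertation.
broken_xml = root.toxml()
print(broken_xml)
```

The serialized result parses fine; only a separate validation pass against a DTD or Schema would reveal the problem.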
The basic idea of XML validity is to specify what can occur inside an element, how often it can occur, and what alternatives exist about what can occur. As well, when multiple things can occur inside an element, the order of occurrence can be specified (or left open, as needed). DTDs differ somewhat from W3C XML Schemas in what they can express, but the gist is the same. Let us look at a highly simplified hypothetical dissertation.dtd:
    <!ELEMENT dissertation (dedication?, chapter+, appendix*)>
    <!ELEMENT dedication (#PCDATA)>
    <!ELEMENT chapter (title, paragraph+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT paragraph (#PCDATA | figure | table)+>
    <!ELEMENT figure EMPTY>
    <!ELEMENT table EMPTY>
    <!ELEMENT appendix (#PCDATA)>
In other words, a dissertation may contain one dedication, must contain (one or more) chapters, and may contain (zero or more) appendixes. The various subelements occur in the listed order (if at all). Some elements contain only character data. In the case of the <paragraph> tag, it may contain either character data or a <figure> subelement or a <table> subelement, or any combination of each of them. Structures can nest, but every basic validity concept is in the example.
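One way to see what the content model (dedication?, chapter+, appendix*) means operationally is to read it as a regular expression over the names of an element's children. The sketch below does this with the standard library only; the function name children_match is made up for illustration, and the check covers just the root element's direct children, not the whole DTD.

```python
import re
import xml.etree.ElementTree as ET

# The content model (dedication?, chapter+, appendix*) rewritten as a
# regular expression over a space-terminated sequence of child tag names.
DISSERTATION_MODEL = re.compile(r"(dedication )?(chapter )+(appendix )*")

def children_match(xml_text, model=DISSERTATION_MODEL):
    """Return True if the root element's child tags satisfy the model."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return model.fullmatch(names) is not None

# One chapter alone is enough; a dedication alone is not.
print(children_match("<dissertation><chapter/></dissertation>"))
print(children_match("<dissertation><dedication/></dissertation>"))
```

The operators map directly: ? is "zero or one", + is "one or more", * is "zero or more", and the comma becomes simple concatenation, which is what enforces the order.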
What the gnosis.xml.validity module does is let you create, e.g., a dissertation Python object that can only represent a valid dissertation. Moreover, when transformed into XML--using the print command or str() function--the XML automatically matches the desired DTD.
The easiest way to understand what gnosis.xml.validity does is to see it used. In attitude, gnosis.xml.validity owes a heritage to the Spark parser. That is, "validity classes" are defined using Python reflection rather than traditional sequential programming. The symmetry is interesting inasmuch as Spark and gnosis.xml.validity in a sense do exactly opposite things--the former assumes rule-based structure in external texts, the latter enforces it in internal objects.
A validity class is based very closely on a corresponding DTD or XML Schema. A class simply inherits from a relevant validity type, then specializes (if necessary) by adding a class attribute. A convention is used that any class named with an initial underscore represents a structure that does not have a corresponding tag. For example, a <paragraph> element in a dissertation can contain a collection of PCDATA and <figure> and <table> elements. The disjunction type that is assembled into a <paragraph> collection does not itself have an XML tag. Therefore, this disjunction type is named _mixedpara in the example below:
    from gnosis.xml.validity import *
    class appendix(PCDATA):   pass
    class table(EMPTY):       pass
    class figure(EMPTY):      pass
    class _mixedpara(Or):     _disjoins = (PCDATA, figure, table)
    class paragraph(Some):    _type = _mixedpara
    class title(PCDATA):      pass
    class _paras(Some):       _type = paragraph
    class chapter(Seq):       _order = (title, _paras)
    class dedication(PCDATA): pass
    class _apps(Any):         _type = appendix
    class _chaps(Some):       _type = chapter
    class _dedi(Maybe):       _type = dedication
    class dissertation(Seq):  _order = (_dedi, _chaps, _apps)
As with a DTD, the top level of a particular object/XML document can be any tag whose rules are given. dissertation happens to be the highest level available here, but one can create documents of lower types also. Let us take a look:
    >>> from dissertation import chapter, title, _paras, paragraph, PCDATA
    >>> chap1 = chapter(( title(PCDATA('About Validity')),
    ...                   _paras([paragraph(PCDATA('It is a good thing'))])
    ...                ))
    >>> print chap1
    <chapter><title>About Validity</title>
    <paragraph>It is a good thing</paragraph>
    </chapter>
A <chapter> is initialized with a tuple containing a <title> and a _paras list. A <title>, in turn, is initialized with some PCDATA, which is itself initialized with a (Unicode) string. Likewise, a _paras list contains some <paragraph>s, which are themselves initialized with PCDATA. Once an appropriate object exists, it simply prints itself as valid XML.
All of those nested initializations, although obeying the details of the specified DTD validity rules, are rather cumbersome to bother with. Therefore gnosis.xml.validity allows a much friendlier style for initialization. Whenever a particular type is required, the initializer for that type is transparently lifted into the type itself. Moreover, when a "quantification" type would normally be initialized by a list of things of the right type, specifying just one thing lifts the thing into a length-one list of the thing. "Lifting" is recursive. One note is that Seq types that use lifting must use the factory function LiftSeq(), but other types can lift their own initialization arguments (the details have to do with "new-style" inheritance from immutable Python types). This sounds complicated, but it is enormously obvious in practice:
    >>> from dissertation import LiftSeq
    >>> chap1 = LiftSeq(chapter,('About Validity','It is a good thing'))
    >>> print chap1
    <chapter><title>About Validity</title>
    <paragraph>It is a good thing</paragraph>
    </chapter>
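The idea of lifting can be sketched in a few lines of modern Python. This is not the library's implementation: the helper names lift and lift_many are invented here, and both title and paragraph are simplified to plain string subclasses rather than real validity types.

```python
class PCDATA(str):
    """Toy stand-in for character data (not the library's class)."""

class title(PCDATA):
    pass

class paragraph(PCDATA):
    pass

def lift(required_type, value):
    """Return value as an instance of required_type, building one if needed."""
    if isinstance(value, required_type):
        return value          # already the right type: leave it alone
    return required_type(value)

def lift_many(item_type, value):
    """Lift one item, or each item of a list, into a list of item_type."""
    items = value if isinstance(value, list) else [value]
    return [lift(item_type, item) for item in items]

# A bare string becomes a title; a bare string becomes a one-item list
# of paragraphs.
t = lift(title, "About Validity")
ps = lift_many(paragraph, "It is a good thing")
print(type(t).__name__, [type(p).__name__ for p in ps])
```

A factory for sequence types would apply helpers like these position by position, which is roughly the convenience that the LiftSeq() call above provides.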
So far, we have created some valid XML/objects. But so what? We could have also just written the valid XML text by hand. The value of gnosis.xml.validity comes when you want to modify an object in either valid or invalid ways. For example, here is a valid modification:
    >>> paras_ch1 = chap1[1]
    >>> paras_ch1 += [paragraph('OOP can enforce it')]
    >>> print chap1
    <chapter><title>About Validity</title>
    <paragraph>It is a good thing</paragraph>
    <paragraph>OOP can enforce it</paragraph>
    </chapter>
What happens, to the contrary, when we try something that is not allowed? For example, a dissertation can have at most one dedication (at least as we have specified the example):
    >>> from dissertation import _dedi, dedication
    >>> Maybe_dedication = _dedi([])
    >>> print Maybe_dedication
    >>> Maybe_dedication.append(dedication("To Mom."))
    >>> print Maybe_dedication
    <dedication>To Mom.</dedication>
    >>> Maybe_dedication.append(dedication("Also to Dad."))
    Traceback (most recent call last):
      File "<pyshell#71>", line 1, in ?
        Maybe_dedication.append(dedication("Also to Dad."))
      File "validity.py", line 140, in append
        raise LengthError, self.length_message % self._tag
    LengthError: List <_dedi> must have length zero or one
Likewise, one cannot include something of the wrong type, even if the length of a quantification would be OK:
    >>> from gnosis.xml.validity import ValidityError
    >>> try:
    ...     paras_ch1.append(dedication("To my advisor"))
    ... except ValidityError, x:
    ...     print x
    Items in _paras must be of type <class 'dissertation.paragraph'> (not <class 'dissertation.dedication'>)
All the exceptions that might be raised by violating constraints are descended from ValidityError. Programming using the gnosis.xml.validity library will probably involve wrapping many operations in try/except blocks; it should not be possible to create an invalid object by attempting a disallowed operation.
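The guarantee that a failed operation leaves the object valid comes from checking before mutating. The following is a minimal sketch of that pattern in modern Python, not the library's code: the names TypeViolation, TypedList, and item_type are hypothetical, and only append() is guarded here, whereas a real implementation would have to guard every mutating method.

```python
class ValidityError(Exception):
    """Stand-in base class for all constraint violations."""

class TypeViolation(ValidityError):
    """Hypothetical subclass raised for an item of the wrong type."""

class TypedList(list):
    """A list that only accepts instances of item_type via append()."""
    item_type = object

    def append(self, item):
        # Validate first, mutate second: a rejected call leaves the
        # list exactly as it was.
        if not isinstance(item, self.item_type):
            raise TypeViolation(
                "Items in %s must be of type %r (not %r)"
                % (type(self).__name__, self.item_type, type(item)))
        super().append(item)

class paragraph(str):
    pass

class dedication(str):
    pass

class _paras(TypedList):
    item_type = paragraph

paras = _paras()
paras.append(paragraph("It is a good thing"))
try:
    paras.append(dedication("To my advisor"))
except ValidityError as err:   # catching the base class is enough
    rejected = str(err)
print(len(paras), rejected)
```

Because every specific error derives from one base class, calling code can catch ValidityError without knowing which constraint was violated.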
A first note is that gnosis.xml.validity is strictly for Python 2.2+. Although it is possible to implement it in earlier Python versions, I felt this project makes a good testing ground for some newer Python features. Specifically, the library takes advantage of the type/class unification, and new-style classes. I have some ideas about doing some tricky stuff with metaclasses in future library versions, and I might even work in properties and slots.
The design of gnosis.xml.validity relies heavily on Python's introspection/reflection capabilities. Several abstract classes comprise the main functionality. Each of these classes must have concrete children to actually do anything, although all the children need to implement is (at most) one class attribute each. When an XML tag corresponds to a class, the tag name is taken directly from the class name. As noted earlier, if a class name begins with an underscore, it has no corresponding XML tag. The basic rule here is that any "tagged" validity class serializes itself with surrounding open/close tags; a "tagless" class just serializes its raw content (which might, however, include items that themselves have tags). A limitation this scheme imposes is that gnosis.xml.validity cannot work with DTDs specifying XML tags with leading underscores; this limitation could be removed in future versions, but probably will not be unless users have a need for this.
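The tag-naming rule is easy to illustrate with reflection on the class name. The sketch below is an assumption-laden toy in modern Python: the Node base class is invented for this example and does no validation at all; it shows only how a tagged class wraps its content while an underscore-named class passes its content through.

```python
from xml.sax.saxutils import escape

class Node:
    """Hypothetical base: serializes its children, wrapped in a tag named
    after the class unless the class name starts with an underscore."""

    def __init__(self, *children):
        self.children = children

    def __str__(self):
        # Strings are escaped as character data; nested nodes serialize
        # themselves recursively.
        body = "".join(
            escape(c) if isinstance(c, str) else str(c)
            for c in self.children)
        name = type(self).__name__
        if name.startswith("_"):
            return body                      # tagless: raw content only
        return "<%s>%s</%s>" % (name, body, name)

class title(Node): pass
class paragraph(Node): pass
class _paras(Node): pass
class chapter(Node): pass

chap = chapter(title("About Validity"),
               _paras(paragraph("It is a good thing"),
                      paragraph("OOP can enforce it")))
print(chap)
```

The _paras grouping leaves no trace in the output, yet its paragraph children still appear with their own tags.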
The base abstract classes consist of the following:
PCDATA: This one may be used directly, and so is not really abstract. An XML element that contains PCDATA should inherit from this, but need not provide any further specialization. But in an alternation list for the Or type, one simply lists PCDATA. This is very closely modelled on DTD syntax. I recommend listing PCDATA first in such a list (as DTDs require), but that is not currently mandatory.
EMPTY: Also modelled on DTD syntax. As with PCDATA, this class should be inherited from, but no further specialization is required.
Or: A child of Or must add a _disjoins tuple as a class attribute. Normally, that one attribute will be the whole implementation. Listed in the tuple should be other validity classes. Conceptually, a disjunction should involve two or more things, but no error is currently raised if there are fewer disjoins.
Seq: A child of Seq must add an _order tuple as a class attribute. Normally, that one attribute will be the whole implementation. Listed in the tuple should be two or more other validity classes; as with Or, the tuple length is not currently checked. In instantiating a Seq child, it is usually safer to utilize the factory function LiftSeq().
Quantification: This abstract class is a special case, in a way. The examples in this article have not used Quantification, but have instead used (still abstract) children of it. For example, this is the implementation of the class Some:
    class Some(Quantification):
        length_message = "List <%s> must have length >= 1"
        min_length = 1
        max_length = maxint
The classes Maybe and Any have similar implementations. These three Quantification children cover all the quantification options for DTDs, but XML Schemas can allow others, e.g. Three_to_Seven, whose implementation is straightforward. I realize that a pretty good length_message could be generated from the other attributes, but I felt like the pluralization and phrasing of messages was better done by a programmer.
A concrete descendant of Quantification must add a _type class attribute, which points simply to another validity class. In principle, a concrete child could add its own min_length, max_length and length_message--but using an intermediary feels like better design.
As of this writing gnosis.xml.validity is largely a proof-of-concept. A few things are still missing. The most glaring absence is the complete lack of facility for adding XML tag attributes--let alone enforcing their validity. In structure, attributes look a lot like subelements--merely unordered ones--so a similar enforcement mechanism can be added to later versions of gnosis.xml.validity. This addition is certainly the highest priority for a next feature.
There are some other conveniences that would be nice to have in gnosis.xml.validity. It would be nice to generate a set of Python validity classes automatically from a DTD or XML Schema. Unlike in a DTD, however, a set of Python validity classes needs to be defined in a particular order--or at least in an order that defines each class earlier than it is named in an attribute of another class.
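The ordering problem is a topological sort over the references between element declarations. The sketch below, which needs Python 3.9 or later for the standard graphlib module, works that order out for the simplified dissertation DTD. The function name definition_order and the regular expression are assumptions made for this example; the parsing handles only simple <!ELEMENT ...> declarations and is nowhere near a real DTD parser.

```python
import re
from graphlib import TopologicalSorter

DTD = """
<!ELEMENT dissertation (dedication?, chapter+, appendix*)>
<!ELEMENT dedication (#PCDATA)>
<!ELEMENT chapter (title, paragraph+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | figure | table)+>
<!ELEMENT figure EMPTY>
<!ELEMENT table EMPTY>
<!ELEMENT appendix (#PCDATA)>
"""

# Captures the element name and its raw content model.
ELEMENT = re.compile(r"<!ELEMENT\s+(\w+)\s+([^>]*)>")
KEYWORDS = {"PCDATA", "EMPTY", "ANY"}

def definition_order(dtd_text):
    """List element names so each follows the elements it refers to."""
    depends_on = {}
    for name, model in ELEMENT.findall(dtd_text):
        # Every word in the content model that is not a DTD keyword is
        # a reference to another element.
        depends_on[name] = set(re.findall(r"\w+", model)) - KEYWORDS
    return list(TopologicalSorter(depends_on).static_order())

order = definition_order(DTD)
print(order)
```

A recursive content model (an element that can contain itself, directly or indirectly) makes the sorter raise graphlib.CycleError, which mirrors the difficulty of naming a class in an attribute before that class exists.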
Reading from an existing, and valid, XML document would often be useful. It is not necessarily obvious what the best way to achieve this is. Since member items need to be valid objects prior to their inclusion in larger structures, the simplest recursive descent approach would not work. But it should be possible to deserialize an XML document to corresponding validity classes.
Finally, some sort of higher level interface to the presented validity classes might ease work with them. The strategy used in the library now is to raise exceptions for every disallowed action; but there may be ways of wrapping this in more convenient APIs. Perhaps silent failure or flag return values would be useful, or maybe some other sort of fallback operations for error cases. Deciding on the right interfaces will probably require more experimentation by users (including myself).
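As one possible shape for a flag-returning interface, a thin wrapper can turn the exception into a return value. Everything in this sketch is hypothetical: attempt() is not part of the library, and AtMostOne is a toy list that merely plays the role of a Maybe-style type.

```python
class ValidityError(Exception):
    """Stand-in for the library's base exception."""

class AtMostOne(list):
    """Toy list allowing zero or one item (plays the role of a Maybe type)."""

    def append(self, item):
        if len(self) >= 1:
            raise ValidityError("List must have length zero or one")
        super().append(item)

def attempt(operation, *args):
    """Call operation(*args); report the outcome as (ok, message)
    instead of letting a ValidityError propagate."""
    try:
        operation(*args)
    except ValidityError as err:
        return False, str(err)
    return True, ""

dedi = AtMostOne()
first = attempt(dedi.append, "To Mom.")
second = attempt(dedi.append, "Also to Dad.")
print(first, second, dedi)
```

The trade-off is the usual one: flags are easy to ignore, so a wrapper like this suits batch edits where some rejections are expected, while raw exceptions remain the safer default.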
I welcome reader feedback about what direction later versions of gnosis.xml.validity should take. I believe the initial functionality will already aid a variety of XML programming tasks, but given how little similar library development has been done elsewhere, my intuitions about what is most useful are still vague.
The general goals that went into the development of the gnosis.xml.validity library were outlined in the XML Zone tip at:
http://www-106.ibm.com/developerworks/library/x-tipoop.html
The Haskell library HaXml accomplishes everything that mine does, but within the framework of a pure functional language. While this is very different, conceptually, from an object-oriented approach, readers can read about HaXml in an earlier installment of this column:
http://www-106.ibm.com/developerworks/library/x-matters14.html
XML Matters #7 (developerWorks, March 2001) compared DTDs and Schemas. For the issues with each, take a look there.
http://www-106.ibm.com/developerworks/xml/library/x-matters7.html
The most current version of Gnosis_Utils can always be found at the below URL. Make sure to download at least version 1.0.2 to obtain gnosis.xml.validity:
http://gnosis.cx/download/Gnosis_Utils-current.tar.gz
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/.