XML Matters #20: Squeezing OOP Data Into XML Rules

The gnosis.xml.validity Library


David Mertz, Ph.D.
Subsumer, Gnosis Software, Inc.
May, 2002

Most hitherto existing XML APIs have enforced well-formedness at a programmatic level, but hardly any can guarantee validity. This is a serious weakness in the whole field of XML processing. This installment discusses its author's gnosis.xml.validity library for enforcing validity in Python objects intended for XML serialization.

Implementing Constraints?

A tip I wrote previously for IBM developerWorks' XML Zone took a conceptual look at reconciling object-oriented programming techniques with XML validity constraints. This installment of XML Matters presents an early version of an actual Python module for doing it. One could create an analogous capability in other programming languages, but Python provides particularly versatile reflection mechanisms, and allows clear expression of validity constraints.

On the face of it, Python--with its extremely dynamic (albeit strict) typing--might seem like a strange choice for implementing what is, essentially, an elaborate type system. But any oddness one might perceive is superficial. In fact, the type systems of languages like Java, C++, or C#, while static, are far too impoverished to offer much meaningful help to XML validity constraints. A pure functional language like Haskell might offer type hierarchies, discriminated unions, quantification and existential types, and so on, but OOP languages typically lack these things. In statically typed OOP languages, one would have to build just as much custom validation into a library as does the currently discussed Python library.

The module gnosis.xml.validity can helpfully be contrasted with several other XML-related libraries. Two other libraries that have been incorporated into the author's gnosis.xml package were discussed in earlier articles. gnosis.xml.pickle is able to produce a specialized XML serialization of any Python object whatsoever. As with Python's standard pickle and cPickle modules, this provides a way to save and restore objects. gnosis.xml.objectify operates in a reverse direction: given an arbitrary XML document, we can generate a "Pythonic" object (with a slight loss of information about the original XML).

Python's standard library includes support for DOM and SAX processing of XML documents. Widely used third-party Python packages extend the support to include XSLT processing. DOM (specifically xml.dom.minidom) offers a rather heavy API for OOP-style manipulation of XML documents--with methods common across DOM implementations in many programming languages. SAX treats an XML document as a series of parsing events, and basically allows a procedural programming style. XSLT declares a set of rules for transforming an XML document into something else (such as a different XML document).

All of these libraries are useful, but none of them prevents an application from modifying an XML-representation object in ways that break the validity of the underlying XML. For example, deleting, adding, or moving a DOM node can easily create a DOM hierarchy that cannot be dumped into a valid XML document.
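To make the problem concrete, here is a minimal sketch using the standard library's xml.dom.minidom. It is written for Python 3, unlike the Python 2 sessions elsewhere in this article, and the sample document is invented for illustration; assume its DTD requires at least one <chapter> inside a <dissertation>.

```python
from xml.dom import minidom

# Illustrative document; assume its DTD requires at least one <chapter>.
doc = minidom.parseString(
    "<dissertation><chapter><title>About Validity</title></chapter></dissertation>"
)
root = doc.documentElement

# DOM happily removes the only chapter: the result is still well-formed,
# but it would no longer be valid against the assumed DTD.
only_chapter = root.getElementsByTagName("chapter")[0]
root.removeChild(only_chapter)

serialized = root.toxml()
print(serialized)
```

Nothing in the DOM API objects to the removal; the invalidity would only surface later, if and when the output is run through a validating parser.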

What Makes Up Validity?

The basic idea of XML validity is to specify what can occur inside an element, how often it can occur, and what alternatives exist about what can occur. As well, when multiple things can occur inside an element, the order of occurrence can be specified (or left open, as needed). DTDs differ somewhat from W3C XML Schemas in what they can express, but the gist is the same. Let us look at a highly simplified hypothetical dissertation.dtd:

A "dissertation" DTD with all basic constraints

<!ELEMENT dissertation (dedication?, chapter+, appendix*)>
<!ELEMENT dedication (#PCDATA)>
<!ELEMENT chapter (title, paragraph+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | figure | table)+>
<!ELEMENT figure EMPTY>
<!ELEMENT table EMPTY>
<!ELEMENT appendix (#PCDATA)>

In other words, a dissertation may contain one dedication, must contain (one or more) chapters, and may contain (zero or more) appendixes. The various subelements occur in the listed order (if at all). Some elements contain only character data. In the case of the <paragraph> tag, it may contain either character data or a <figure> subelement or a <table> subelement, or any combination of them. Structures can nest, but every basic validity concept is in the example.
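One way to get a feel for these quantifiers is to notice that a DTD content model reads much like a regular expression over child tag names. The Python 3 sketch below checks only the children of <dissertation>; the function name children_are_valid() is invented for this illustration, and this is not how gnosis.xml.validity itself enforces the rules.

```python
import re

# Content model of <dissertation>: (dedication?, chapter+, appendix*)
# Each child tag name is followed by a space so names cannot run together.
DISSERTATION_MODEL = re.compile(r"(dedication )?(chapter )+(appendix )*")

def children_are_valid(child_tags):
    """Return True if the sequence of child tag names matches the model."""
    encoded = "".join(tag + " " for tag in child_tags)
    return DISSERTATION_MODEL.fullmatch(encoded) is not None

print(children_are_valid(["dedication", "chapter", "chapter"]))  # True
print(children_are_valid(["chapter", "appendix", "appendix"]))   # True
print(children_are_valid(["dedication"]))                        # False: no chapter
print(children_are_valid(["chapter", "dedication"]))             # False: wrong order
```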

What the gnosis.xml.validity module does is let you create, e.g., a dissertation Python object that can only represent a valid dissertation. Moreover, when transformed into XML--using the print command or str() function--the XML automatically matches the desired DTD.

Validity In Action

The easiest way to understand what gnosis.xml.validity does is to see it used. In attitude, gnosis.xml.validity owes a heritage to the Spark parser. That is, "validity classes" are defined using Python reflection rather than traditional sequential programming. The symmetry is interesting inasmuch as Spark and gnosis.xml.validity in a sense do exactly opposite things--the former assumes rule-based structure in external texts, the latter enforces it in internal objects.

A validity class is based very closely on a corresponding DTD or XML Schema. A class simply inherits from a relevant validity type, then specializes (if necessary) by adding a class attribute. A convention is used that any class named with an initial underscore represents a structure that does not have a corresponding tag. For example, a <paragraph> element in a dissertation can contain a collection of PCDATA and <figure> and <table> elements. The disjunction type that is assembled into a <paragraph> collection does not itself have an XML tag. Therefore, this disjunction type is named _mixedpara in the below example:

dissertation.py

from gnosis.xml.validity import *
class appendix(PCDATA):   pass
class table(EMPTY):       pass
class figure(EMPTY):      pass
class _mixedpara(Or):     _disjoins = (PCDATA, figure, table)
class paragraph(Some):    _type = _mixedpara
class title(PCDATA):      pass
class _paras(Some):       _type = paragraph
class chapter(Seq):       _order = (title, _paras)
class dedication(PCDATA): pass
class _apps(Any):         _type = appendix
class _chaps(Some):       _type = chapter
class _dedi(Maybe):       _type = dedication
class dissertation(Seq):  _order = (_dedi, _chaps, _apps)

As with a DTD, the top level of a particular object/XML document can be any tag whose rules are given. dissertation happens to be the highest level available here, but one can create documents of lower types also. Let us take a look:

Creating a valid dissertation chapter

>>> from dissertation import chapter, title, _paras, paragraph, PCDATA
>>> chap1 = chapter(( title(PCDATA('About Validity')),
...                   _paras([paragraph(PCDATA('It is a good thing'))])
...                ))
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
</chapter>

A <chapter> is initialized with a tuple containing a <title> and a _paras list. A <title>, in turn, is initialized with some PCDATA, which is itself initialized with a (Unicode) string. Likewise, a _paras list contains some <paragraph>'s, which are themselves initialized with PCDATA. Once an appropriate object exists, it simply prints itself as valid XML.

All of those nested initializations, although obeying the details of the specified DTD validity rules, are rather cumbersome to bother with. Therefore gnosis.xml.validity allows a much friendlier style for initialization. Whenever a particular type is required, the initializer for that type is transparently lifted into the type itself. Moreover, when a "quantification" type would normally be initialized by a list of things of the right type, specifying just one thing lifts the thing into a length one list of the thing. "Lifting" is recursive. One note is that Seq types that use lifting must use the factory function LiftSeq(), but other types can lift their own initialization arguments (the details have to do with "new-style" inheritance from immutable Python types). This sounds complicated, but it is enormously obvious in practice:

>>> from dissertation import LiftSeq
>>> chap1 = LiftSeq(chapter,('About Validity','It is a good thing'))
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
</chapter>
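
The idea behind lifting can be sketched in a few lines. What follows is not the library's code: PCDATA, Title, Paragraph, lift() and lift_list() are stand-ins invented for this illustration, written for Python 3, and the sketch ignores the special handling that Seq types need.

```python
class PCDATA(str):
    """Stand-in for character data (hypothetical, not the library's class)."""

class Title(PCDATA):
    pass

class Paragraph(PCDATA):
    pass

def lift(value, wanted):
    """Wrap value in the wanted type unless it already is one."""
    return value if isinstance(value, wanted) else wanted(value)

def lift_list(value, item_type):
    """Lift a single thing into a length-one list, then lift each item."""
    items = value if isinstance(value, list) else [value]
    return [lift(item, item_type) for item in items]

title = lift("About Validity", Title)                # plain string -> Title
paras = lift_list("It is a good thing", Paragraph)   # string -> [Paragraph]
print(type(title).__name__, [type(p).__name__ for p in paras])
```

A value that already has the wanted type passes through untouched, which is what lets the explicit and the lifted initialization styles coexist.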


Validity Enforcement

So far, we have created some valid XML/objects. But so what? We could have also just written the valid XML text by hand. The value of gnosis.xml.validity comes when you want to modify an object in either valid or invalid ways. For example, here is a valid modification:

Adding a paragraph (valid operation)

>>> paras_ch1 = chap1[1]
>>> paras_ch1 += [paragraph('OOP can enforce it')]
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
<paragraph>OOP can enforce it</paragraph>
</chapter>

What happens, to the contrary, when we try something that is not allowed? For example, a dissertation can have at most one dedication (at least as we have specified the example):

Creating an optional dedication

>>> from dissertation import _dedi, dedication
>>> Maybe_dedication = _dedi([])
>>> print Maybe_dedication

>>> Maybe_dedication.append(dedication("To Mom."))
>>> print Maybe_dedication
<dedication>To Mom.</dedication>

>>> Maybe_dedication.append(dedication("Also to Dad."))
Traceback (most recent call last):
  File "<pyshell#71>", line 1, in ?
    Maybe_dedication.append(dedication("Also to Dad."))
  File "validity.py", line 140, in append
    raise LengthError, self.length_message % self._tag
LengthError: List <_dedi> must have length zero or one

Likewise, one cannot include something of the wrong type, even if the length of a quantification would be OK:

Attempting to add item of wrong type

>>> from gnosis.xml.validity import ValidityError
>>> try:
...     paras_ch1.append(dedication("To my advisor"))
... except ValidityError, x:
...     print x
Items in _paras must be of type <class 'dissertation.paragraph'>
(not <class 'dissertation.dedication'>)

All the exceptions that might be raised by violating constraints are descended from ValidityError. Programming using the gnosis.xml.validity library will probably involve wrapping many operations in try/except blocks; it should not be possible to create an invalid object by attempting a disallowed operation.

Some Words On The Implementation

A first note is that gnosis.xml.validity is strictly for Python 2.2+. Although it is possible to implement it in earlier Python versions, I felt this project makes a good testing ground for some newer Python features. Specifically, the library takes advantage of the type/class unification, and new-style classes. I have some ideas about doing some tricky stuff with metaclasses in future library versions, and I might even work in properties and slots.

The design of gnosis.xml.validity relies heavily on Python's introspection/reflection capabilities. Several abstract classes comprise the main functionality. Each of these classes must have concrete children to actually do anything, although all the children need to implement is (at most) one class attribute each. When an XML tag corresponds to a class, the tag name is taken directly from the class name. As noted earlier, if a class name begins with an underscore, it has no corresponding XML tag. The basic rule here is that any "tagged" validity class serializes itself with surrounding open/close tags; a "tagless" class just serializes its raw content (which might, however, include items that themselves have tags). A limitation this scheme imposes is that gnosis.xml.validity cannot work with DTDs specifying XML tags with leading underscores; this limitation could be removed in future versions, but probably will not unless users have a need for this.

The base abstract classes consist of the following:

PCDATA: This one may be used directly, and so is not really abstract. An XML element that contains PCDATA should inherit from this, but need not provide any further specialization. But in an alternation list for the Or type, one simply lists PCDATA. This is very closely modelled on DTD syntax. I recommend listing PCDATA first in such a list (as DTDs require), but that is not currently mandatory.

EMPTY: Also modelled on DTD syntax. As with PCDATA, this class should be inherited from, but no further specialization is required.

Or: A child of Or must add a _disjoins tuple as a class attribute. Normally, that one attribute will be the whole implementation. Listed in the tuple should be other validity classes. Conceptually, a disjunction should involve two or more things, but no error is currently raised if there are fewer disjoins.

Seq: A child of Seq must add an _order tuple as a class attribute. Normally, that one attribute will be the whole implementation. Listed in the tuple should be two or more other validity classes; as with Or the tuple length is not currently checked. In instantiating a Seq child, it is usually safer to utilize the factory function LiftSeq().

Quantification: This abstract class is a special case, in a way. The examples in this article have not used Quantification, but have instead used (still abstract) children of it. For example, this is the implementation of the class Some:

'Quantification' abstract child 'Some'

from sys import maxint    # Python 2 name; supplies max_length below

class Some(Quantification):
    length_message = "List <%s> must have length >= 1"
    min_length = 1
    max_length = maxint

The classes Maybe and Any have similar implementations. These three Quantification children cover all the quantification options for DTDs, but XML Schemas can allow others, e.g. Three_to_Seven, whose implementation is straightforward. I realize that a pretty good length_message could be generated from the other attributes, but I felt like the pluralization and phrasing of messages was better done by a programmer.

A concrete descendant of Quantification must add a _type class attribute, which points simply to another validity class. In principle, a concrete child could add its own min_length, max_length and length_message--but using an intermediary feels like better design.

What Remains To Be Done

As of this writing gnosis.xml.validity is largely a proof-of-concept. A few things are still missing. The most glaring absence is the complete lack of facility for adding XML tag attributes--let alone enforcing their validity. In structure, attributes look a lot like subelements--merely unordered ones--so a similar enforcement mechanism can be added to later versions of gnosis.xml.validity. This addition is certainly the highest priority for a next feature.

There are some other conveniences that would be nice to have in gnosis.xml.validity. It would be nice to generate a set of Python validity classes automatically from a DTD or XML Schema. Unlike in a DTD, however, a set of Python validity classes needs to be defined in a particular order--or at least in an order that defines each class earlier than it is named in an attribute of another class.
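The ordering requirement amounts to a dependency sort. As a rough Python 3 sketch of how a generator might pick a safe definition order, the mapping below is transcribed by hand from the dissertation DTD shown earlier; the function definition_order() is a hypothetical name, and the sketch simply gives up on recursive content models, which DTDs permit and which would need a different strategy.

```python
# Which elements each element's content model mentions (from the
# dissertation DTD shown earlier; #PCDATA and EMPTY are omitted).
REFERENCES = {
    "dissertation": ["dedication", "chapter", "appendix"],
    "chapter": ["title", "paragraph"],
    "paragraph": ["figure", "table"],
    "dedication": [], "title": [], "figure": [], "table": [], "appendix": [],
}

def definition_order(references):
    """Return element names so that each one follows everything it mentions."""
    ordered, seen = [], set()

    def visit(name, path=()):
        if name in path:
            raise ValueError("recursive content model at %r" % name)
        if name in seen:
            return
        for child in references[name]:
            visit(child, path + (name,))
        seen.add(name)
        ordered.append(name)

    for name in references:
        visit(name)
    return ordered

order = definition_order(REFERENCES)
print(order)
```

Emitting the class statements in this order ensures that every name used in a _type, _order or _disjoins attribute is already bound when the class body runs.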

Reading from an existing, and valid, XML document would often be useful. It is not necessarily obvious what the best way to achieve this is. Since member items need to be valid objects prior to their inclusion in larger structures, the simplest recursive descent approach would not work. But it should be possible to deserialize an XML document to corresponding validity classes.

Finally, some sort of higher level interface to the presented validity classes might ease work with them. The strategy used in the library now is to raise exceptions for every disallowed action; but there may be ways of wrapping this in more convenient APIs. Perhaps silent failure or flag return values would be useful, or maybe some other sort of fallback operations for error cases. Deciding the right interfaces probably will require more experimentation by users (including myself).
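As one possible shape for a flag-returning interface, consider the Python 3 sketch below. Everything in it is hypothetical: try_append() is not part of gnosis.xml.validity, and ValidityError and AtMostOne are toy stand-ins that exist only so the example runs on its own.

```python
class ValidityError(Exception):
    """Stand-in for the library's base exception."""

class AtMostOne(list):
    """Toy constrained list used only to exercise the wrapper."""
    def append(self, item):
        if len(self) >= 1:
            raise ValidityError("List must have length zero or one")
        super().append(item)

def try_append(container, item):
    """Hypothetical helper: return True on success, False if disallowed."""
    try:
        container.append(item)
    except ValidityError:
        return False
    return True

dedication = AtMostOne()
print(try_append(dedication, "To Mom."))       # True
print(try_append(dedication, "Also to Dad."))  # False
print(dedication)                              # ['To Mom.']
```

The trade-off is the usual one: a flag is convenient in simple loops, but it discards the explanatory message that the exception carries.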

I welcome reader feedback about what direction later versions of gnosis.xml.validity should take. I believe the initial functionality will already aid a variety of XML programming tasks, but given how little similar library development has been done elsewhere, my intuitions about what is most useful are still vague.

Resources

The general goals that went into the development of the gnosis.xml.validity library were outlined in the XML Zone tip at:

http://www-106.ibm.com/developerworks/library/x-tipoop.html

The Haskell library HaXml accomplishes everything that mine does, but within the framework of a pure functional language. While this is very different, conceptually, from an object-oriented approach, readers can read about HaXml in an earlier installment of this column:

http://www-106.ibm.com/developerworks/library/x-matters14.html

XML Matters #7 (developerWorks, March 2001) compared DTDs and Schemas. For the issues with each, take a look there.

http://www-106.ibm.com/developerworks/xml/library/x-matters7.html

The most current version of Gnosis_Utils can always be found at the below URL. Make sure to download at least version 1.0.2 to obtain gnosis.xml.validity:

http://gnosis.cx/download/Gnosis_Utils-current.tar.gz

About The Author

David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/.