David Mertz, Ph.D.
Data Masseur, Gnosis Software, Inc.
August 2000
XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many readers will be most familiar with SGML via one particular document type, HTML. XML documents are similar to HTML in being composed of text interspersed with and structured by markup tags in angle-brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes: magazine articles and user documentation, files of structured data (like CSV or EDI files), messages for interprocess communication between programs, architectural diagrams (like CAD formats), and many other purposes. A set of tags can be created to capture any sort of structured information one might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.
Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms.
There exist a number of techniques and tool for dealing with XML documents in Python. The Resources section provides links to two developerWorks articles by this author that discuss general techniques, as well as links to other documents on XML/Python topics. However, one thing that most existing XML/Python tools have in common is that they are much more XML-centric than Python-centric. Certain constructs and coding techniques feel "natural" in a given programming language, and others feel much more like they are imported from other domains. The latter type of construct can certainly get a job done--and in many cases a certain degree of stylistic clash between a programming tool and its problem domain is inevitable. But in an ideal environment all constructs fit intuitively into their domain, and domains merge seamlessly. When it does, programmers can wax poetic rather than merely "make it work."
This author has begun a research project of creating a more seamless and more natural integration between XML and Python. This article, and subsequent articles in this column, will discuss some of the goals, decisions, and limitations of this project; and hopefully along the way provide readers with a set of modules and techniques that are useful to them, and that point to easier ways to meet programming goals. All tools created as part of the Project will be released to the public domain.
Python is a language with a flexible object system and a rich
set of built-in types. The richness of Python is both an
advantage and a disadvantage for the Project. On the one hand,
having a wide range of native facilities in Python makes it
easier to represent a wide range of XML structures easily. On
the other hand, however, the range of native types and
structures of Python makes for more cases to worry about in
representing native Python objects in XML. As a result of
those asymmetries that exist between XML and Python, the
Project--at least initially--contains two separate modules: one
for representing arbitrary Python objects in XML
(xml_pickle
); a second one (xml_objectify
) for "native"
representation of XML documents as Python objects. This
article will address the former module.
xml_pickle
Python's standard pickle
module already provides a simple and
convenient method of serializing Python objects, which is in
turn useful for persistent storage or transmission over a
network. In some cases, however, it is desirable to perform
serialization to a format with several properties not possessed
by pickle
: (1) The format is human-readable; (2) The format
may be parsed, manipulated, and objects imported, by languages
other than Python; (3) The format supports validation of stored
serialized objects. xml_pickle
provides each of these
features, while maintaining interface compatibility with
pickle
. However, xml_pickle
is not a general-purpose
replacement for pickle
since pickle
retains several
advantages of its own, such as faster operation (especially via
cPickle
) and far more compact object representation.
This article discusses the design goals and decisions that went
into xml_pickle
, as well as thoughts on the module's likely
uses.
xml_pickle
Even though the interface of xml_pickle
is mostly the same as
as that of pickle
, it is worth illustrating the (quite
simple) usage of xml_pickle
for readers who are not familiar
with Python or with pickle
:
import xml_pickle # import the module # declare some classes to hold some attributes class MyClass1: pass class MyClass2: pass # create a class instance, and add some basic data members to it o = MyClass1() o.num = 37 o.str = "Hello World" o.lst = [1, 3.5, 2, 4+7j] # create an instance of a different class, add some members o2 = MyClass2() o2.tup = ("x", "y", "z") o2.num = 2+2j o2.dct = { "this": "that", "spam": "eggs", 3.14: "about PI" } # add the second instance to the first instance container o.obj = o2 # print an XML representation of the container instance xml_string = xml_pickle.XML_Pickler(o).dumps() print xml_string
Everything except the first line and the next to last line is generic Python for working with object instances. It might be a little contrived and a little simple, but essentially everything you do with instance data members is contained in the example (including nesting instances as container data, which is how most complex structures are built in Python). All a Python programmer needs to do to encode her objects as XML is make one method call.
Of course, once you have pickled
your objects, you will want
to also restore them later (or elsewhere). Supposing the above
few lines have already run, restoring the object representation
is as simple as:
new_object = xml_pickle.XML_Pickler().loads(xml_string)
Obviously, in real cases you would want to do something more
interesting with the created XML document than just hold it in
memory during runtime. For example, you might save the XML
document to disk (maybe using the XML_Pickler.dump()
method),
or transmit it over a communication channel. Actually, the
example does print to paper, which might well be a good
durable storage format.
Running the sample code above will produce a pretty good
example of the features of an xml_pickle
representation of a
Python object. However, the below example is a hand-coded
test case developed by the author. The test case has the
advantage of containing every XML structure, tag and attribute
allowed in document type. The specific data is invented, but
it is not hard to imagine the application the data might belong
to:
<?xml version="1.0"?> <!DOCTYPE PyObject SYSTEM "PyObjects.dtd"> <PyObject class="Automobile"> <attr name="doors" type="numeric" value="4" /> <attr name="make" type="string" value="Honda" /> <attr name="tow_hitch" type="None" /> <attr name="prev_owners" type="tuple"> <item type="string" value="Jane Smith" /> <item type="tuple"> <item type="string" value="John Doe" /> <item type="string" value="Betty Doe" /> </item> <item type="string" value="Charles Ng" /> </attr> <attr name="repairs" type="list"> <item type="string" value="June 1, 1999: Fixed radiator" /> <item type="PyObject" class="Swindle"> <attr name="date" type="string" value="July 1, 1999" /> <attr name="swindler" type="string" value="Ed's Auto" /> <attr name="purport" type="string" value="Fix A/C" /> </item> </attr> <attr name="options" type="dict"> <entry> <key type="string" value="Cup Holders" /> <val type="numeric" value="4" /> </entry> <entry> <key type="string" value="Custom Wheels" /> <val type="string" value="Chrome Spoked" /> </entry> </attr> <attr name="engine" type="PyObject" class="Engine"> <attr name="cylinders" type="numeric" value="4" /> <attr name="manufacturer" type="string" value="Ford" /> </attr> </PyObject>
A formal document type definition (DTD) is currently being
developed, and should be available in this article's source
archive by the time the article first appears (check
Resources). Informally, it is not difficult to see the
structure of a PyObjects.dtd
XML document. But the DTD may
disambiguate any issues the author has overlooked.
Looking at the sample XML document, one can see that the three
stated design goals of xml_pickle
have been met: (1) The
format is human readable; (2) The XML representations may be
manipulate by means other than xml_pickle
, whether that is
unrelated Python/XML modules, XML libraries in other
programming languages, XML-enhanced editors and utilities, or
just simply text-editors (as was used in creation of the
sample); (3) XML representations of Python objects may be
validated--at least this will be possible using standard XML
validators once PyObjects.dtd
is completed and tested. Once
the DTD is in place, all and only DTD conformant documents
will be representations of valid Python objects.
CONTENT MODEL. The content model of Python and XML are in certain respects simply different. One difference to pay heed to is that XML documents are inherently linear in form. Python object attributes--and also Python dictionaries--have no definitional order (although implementation details create arbitrary ordering, such as of hashed keys). In this respect, the Python object model is closer to the relational model: rows of a relational table have no "natural" sequence, and primary or secondary keys may or may not provide any meaningful ordering on a table (the keys are always orderable by comparison operators, but this order may be unrelated to the semantics of the keys).
An XML document always lists its tag elements in a particular
order. The order may not be significant to a particular
application, but being a linear document format the XML
document order is always there. The effect of the differing
significance of key order in Python and XML is that the XML
documents produced by xml_pickle
are not guaranteed to
maintain element order through pickle/unpickle cycles. For
example, a hand prepared PyObjects.dtd XML document like the
above may be "unpickled" into a Python object. If the
resultant object is then "pickled" the <attr> tags will most
likely occur in a different order than in the original
document. This is a feature, not a bug, but the fact should be
understood.
LIMITATIONS. Several known limitations occur in xml_pickle
as of the current version (0.2). One potentially serious flaw
is that no effort is made to trap cyclical references in
compound/container objects. If an object attribute refers back
to the container object (or some recursive version of this),
xml_pickle
will exhaust the Python stack. Cyclical
references are likely to indicate a flaw in object design to
start with, but later versions of xml_pickle
will certainly
attempt to deal with them more intelligently.
Another limitation exists in that the namespace of XML
attribute values (such as the "123" in <attr name="123">) is
larger than the namespace of valid Python variables and
instance members. If attributes are (manually) created outside
the Python namespace they will have the odd status of existing
in an instance's .__dict__
magic attribute, but being
inaccessible by normal attribute syntax (e.g. "obj.123" is a
syntax error). This is only an issue where XML documents are
created or modified by means other than xml_pickle
itself.
The author simply has not decided what the best way of handling
this (somewhat obscure) issue will be.
Not all attributes of Python objects are handled by
xml_pickle
either. All the "usual" data members (strings,
numbers, dictionaries, etc.) are pickled well. But instance
methods, and class and function objects as attributes, are not
handled. Methods are simply ignored in pickling (as with
pickle
). If class or function objects exist as attributes,
an XMLPicklingError is raised. This is probably the correct
ultimate behavior, but a final decision has not been made yet.
DESIGN CHOICES. One genuine ambiguity in XML document design
is the choice of when to use tag attributes, and when to use
sub-elements. Opinions on this design issue differ, and XML
programmers often feel strongly about their conflicting views.
This was probably the biggest issue in deciding the
xml_pickle
document structure.
The general principle decided on was that a thing that is naturally "plural" should be represented by sub-elements. For example, a Python list can contain as many items as you like, and is therefore represented by a sequence of <item> sub-elements. On the other side, a number is a singular thing (the value might be more than 1, but there is only one thing in it). In that case it seemed much more logical to use an XML attribute called "value". The real difficult case was with Python strings. In a basic way, they are sequence objects--just like lists. But representing each character in a string using a hypothetical <char> tag would destroy the goal of human-readability, as well as make for enormous XML representations. The decision was made to put strings in the XML "value" attribute, just as with numbers. However, from an aesthetic point-of-view this is probably less desirable than within a tag container, especially for multi-line strings. But this decision seemed more consistent since there was no other "naked" #PCDATA in the specification.
In part because strings are stored in XML "value" attributes--but mostly to maintain the syntacticality of the XML document, Python strings needed to be stored in a "safe" form. There are a few unsafe things that could occur in Python strings. The first type is the basic markup characters like greater-than and less-than. A second type is the quote and apostrophe characters that set off attributes. The third type is questionable ASCII values, such as a null character. One possibility considered was to encode the whole Python strings in something like base64 encoding. This would make strings "safe," but also completely unreadable to humans. The decision was made to use a mixed approach. The basic XML characters are escaped in the style of "&", ">" or """. Questionable ASCII values are escaped in Python-style, such as "\000". The combination makes for easily human-readable XML representations, but requires a somewhat mixed approach to decoding stored strings.
There are a number of things that xml_pickle
is likely to be
good for, and some user-feedback has indicated that it has
entered preliminary usage. Below are a few ideas.
- XML representations of Python objects may be indexed and
cataloged using existing XML-centric tools (not necessarily
written in Python). This provides a ready means of
indexing Python object databases (such as ZODB, PAOS, or
simply shelve
).
- XML representations of Python objects could be restored as
objects of other OOP languages, especially ones having
a similar range of basic types. This is something yet to
do. Much "heavier" protocols like CORBA, XML-RPC, and
SOAP have overlapping purpose, but xml_pickle
is pretty
"light-weight" as an object transport specification.
- Tools for printing and displaying XML documents can be used to provide convenient human-readable representations of Python objects via their XML intermediate form.
- It is possible to manually "debug" Python objects via their XML representation using XML specific editors, or simply text editors. Once hand-modified objects are unpickled, the effects of the edits on program operation can be examined. Other debuggers and wrappers exist for Python, but this provides an additional option.
If readers develop additional uses for xml_pickle
or see
enhancements that would open the module to additional uses, the
author would very much like to receive suggestions.
Charming Python #1: An Introduction to XML Tools for Python
http://gnosis.cx/cgi-bin/txt2html.cgi?source=../publish/programming/charming_python_1.txt
Charming Python #2: A Closer Look at Python's xml.dom
Module
http://gnosis.cx/cgi-bin/txt2html.cgi?source=../publish/programming/charming_python_2.txt
A friendly introduction to Python for programmers with an XML
background is the below book. McGrath uses his book largely to
argue the virtues of his pyxie
module and associated tools
and techniques as the best approach to XML processing. Whether
or not pyxie
is the best approach to your specific problem,
McGrath's is a useful introduction to Python (but less so to
XML) .
XML Processing with Python, Sean McGrath, Prentice Hall PTR, Upper Saddle River, NJ, 2000.
The Python Special Interest Group on XML:
http://www.python.org/sigs/xml-sig/
The World Wide Web Consortium's DOM page.
http://www.w3.org/DOM/
The DOM Level 1 Recommendation.
http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/
Files used and mentioned in this article:
http://gnosis.cx/download/xml_matters_1.zip
David Mertz finds there to be a hysteretic relationship between writing and comprehension. He has begun, for example, to comprehend his own doctoral work. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.