(c) Tenco Media, 2000 -- may be freely distributed if unaltered
XML MATTERS #1
On the Pythonic Treatment of XML Documents As Objects (I)
David Mertz, Ph.D.
Data Masseur, Gnosis Software, Inc.
August 2000
WHAT IS XML? WHAT IS PYTHON?
------------------------------------------------------------------------
XML is a simplified dialect of the Standard Generalized Markup
Language (SGML). Many readers will be most familiar with SGML
via one particular document type, HTML. XML documents are
similar to HTML in being composed of text interspersed with and
structured by markup tags in angle-brackets. But XML
encompasses many systems of tags that allow XML documents to be
used for many purposes: magazine articles and user
documentation, files of structured data (like CSV or EDI
files), messages for interprocess communication between
programs, architectural diagrams (like CAD formats), and many
other purposes. A set of tags can be created to capture any
sort of structured information one might want to represent,
which is why XML is growing in popularity as a common standard
for representing diverse information.
Python is a freely available, very-high-level, interpreted
language developed by Guido van Rossum. It combines a clear
syntax with powerful (but optional) object-oriented semantics.
Python is available for almost every computer platform you
might find yourself working on, and has strong portability
between platforms.
INTRODUCTION: THE PROJECT
-------------------------------------------------------------------
There exist a number of techniques and tools for dealing with
XML documents in Python. The Resources section provides links
to two developerWorks articles by this author that discuss
general techniques, as well as links to other documents on
XML/Python topics. However, one thing that most existing
XML/Python tools have in common is that they are much more
XML-centric than Python-centric. Certain constructs and
coding techniques feel "natural" in a given programming
language, and others feel much more like they are imported from
other domains. The latter type of construct can certainly get
a job done--and in many cases a certain degree of stylistic
clash between a programming tool and its problem domain is
inevitable. But in an ideal environment all constructs fit
intuitively into their domain, and domains merge seamlessly.
When they do, programmers can wax poetic rather than merely
"make it work."
This author has begun a research project of creating a more
seamless and more natural integration between XML and Python.
This article, and subsequent articles in this column, will
discuss some of the goals, decisions, and limitations of this
project; and hopefully along the way provide readers with a set
of modules and techniques that are useful to them, and that
point to easier ways to meet programming goals. All tools
created as part of the Project will be released to the public
domain.
Python is a language with a flexible object system and a rich
set of built-in types. The richness of Python is both an
advantage and a disadvantage for the Project. On the one hand,
having a wide range of native facilities in Python makes it
easy to represent a wide range of XML structures. On
the other hand, however, the range of native types and
structures of Python makes for more cases to worry about in
representing native Python objects in XML. As a result of
those asymmetries that exist between XML and Python, the
Project--at least initially--contains two separate modules: one
for representing arbitrary Python objects in XML
([xml_pickle]); a second one ([xml_objectify]) for "native"
representation of XML documents as Python objects. This
article will address the former module.
PART I: [xml_pickle]
-------------------------------------------------------------------
Python's standard [pickle] module already provides a simple and
convenient method of serializing Python objects, which is in
turn useful for persistent storage or transmission over a
network. In some cases, however, it is desirable to perform
serialization to a format with several properties not possessed
by [pickle]: (1) The format is human-readable; (2) The format
may be parsed, manipulated, and objects imported, by languages
other than Python; (3) The format supports validation of stored
serialized objects. [xml_pickle] provides each of these
features, while maintaining interface compatibility with
[pickle]. However, [xml_pickle] is not a general-purpose
replacement for [pickle] since [pickle] retains several
advantages of its own, such as faster operation (especially via
[cPickle]) and far more compact object representation.
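For readers who have not used the standard module, the sketch
below shows the 'dumps()'/'loads()' round trip of [pickle] that
[xml_pickle] mirrors. It uses only the standard library and is
written for current Python 3 rather than the Python of this
article's date; the name 'record' and its contents are invented
for illustration.

```python
#------- Standard [pickle] round trip (illustrative) -------#
import pickle

# hypothetical data, built only from basic Python types
record = {"num": 37, "str": "Hello World", "lst": [1, 3.5, 2, 4+7j]}

data = pickle.dumps(record)       # serialize to an opaque byte string
restored = pickle.loads(data)     # rebuild an equal object from the bytes

print(type(data).__name__)        # the pickled form is binary, not readable text
print(restored == record)
```

The byte string is compact and quick to produce, but it is not
something a person (or a non-Python program) can readily read
or edit, which is the gap [xml_pickle] aims to fill.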
This article discusses the design goals and decisions that went
into [xml_pickle], as well as thoughts on the module's likely
uses.
USING [xml_pickle]
-------------------------------------------------------------------
Even though the interface of [xml_pickle] is mostly the same as
that of [pickle], it is worth illustrating the (quite
simple) usage of [xml_pickle] for readers who are not familiar
with Python or with [pickle]:
#------- Python code to demonstrate [xml_pickle] -------#
import xml_pickle # import the module
# declare some classes to hold some attributes
class MyClass1: pass
class MyClass2: pass
# create a class instance, and add some basic data members to it
o = MyClass1()
o.num = 37
o.str = "Hello World"
o.lst = [1, 3.5, 2, 4+7j]
# create an instance of a different class, add some members
o2 = MyClass2()
o2.tup = ("x", "y", "z")
o2.num = 2+2j
o2.dct = { "this": "that", "spam": "eggs", 3.14: "about PI" }
# add the second instance to the first instance container
o.obj = o2
# print an XML representation of the container instance
xml_string = xml_pickle.XML_Pickler(o).dumps()
print(xml_string)
Everything except the first line and the next to last line is
generic Python for working with object instances. It might be a
little contrived and a little simple, but essentially
everything you do with instance data members is contained in
the example (including nesting instances as container data,
which is how most complex structures are built in Python). All
a Python programmer needs to do to encode her objects as XML is
make one method call.
Of course, once you have 'pickled' your objects, you will want
to also restore them later (or elsewhere). Supposing the above
few lines have already run, restoring the object representation
is as simple as:
#-- Creating object from xml_pickle'd representation ---#
new_object = xml_pickle.XML_Pickler().loads(xml_string)
Obviously, in real cases you would want to do something more
interesting with the created XML document than just hold it in
memory during runtime. For example, you might save the XML
document to disk (maybe using the 'XML_Pickler.dump()' method),
or transmit it over a communication channel. Actually, the
example *does* print to paper, which might well be a good
durable storage format.
SAMPLE PYOBJECTS.DTD DOCUMENT
-------------------------------------------------------------------
Running the sample code above will produce a pretty good
example of the features of an [xml_pickle] representation of a
Python object. However, the example below is a hand-coded
test case developed by the author. The test case has the
advantage of containing every XML structure, tag and attribute
allowed in the document type. The specific data is invented, but
it is not hard to imagine the application the data might belong
to:
[Sample 'PyObjects.dtd' XML document not reproduced in this copy.]
A formal document type definition (DTD) is currently being
developed, and should be available in this article's source
archive by the time the article first appears (check
Resources). Informally, it is not difficult to see the
structure of a 'PyObjects.dtd' XML document. But the DTD may
disambiguate any issues the author has overlooked.
Looking at the sample XML document, one can see that the three
stated design goals of [xml_pickle] have been met: (1) The
format is human readable; (2) The XML representations may be
manipulated by means other than [xml_pickle], whether that is
unrelated Python/XML modules, XML libraries in other
programming languages, XML-enhanced editors and utilities, or
just simply text-editors (as was used in creation of the
sample); (3) XML representations of Python objects may be
validated--at least this will be possible using standard XML
validators once 'PyObjects.dtd' is completed and tested. Once
the DTD is in place, *all and only* DTD conformant documents
will be representations of valid Python objects.
DESIGN FEATURES, CAVEATS AND LIMITATIONS
-------------------------------------------------------------------
CONTENT MODEL. The content models of Python and XML are in
certain respects simply different. One difference to pay heed
to is that XML documents are inherently linear in form. Python
object attributes--and also Python dictionaries--have no
definitional order (although implementation details create
arbitrary ordering, such as of hashed keys). In this respect,
the Python object model is closer to the relational model:
rows of a relational table have no "natural" sequence, and
primary or secondary keys may or may not provide any meaningful
ordering on a table (the keys are always orderable by
comparison operators, but this order may be unrelated to the
semantics of the keys).
An XML document always lists its tag elements in a particular
order. The order may not be significant to a particular
application, but being a linear document format the XML
document order is always there. The effect of the differing
significance of key order in Python and XML is that the XML
documents produced by [xml_pickle] are not guaranteed to
maintain element order through pickle/unpickle cycles. For
example, a hand prepared PyObjects.dtd XML document like the
above may be "unpickled" into a Python object. If the
resultant object is then "pickled" the tags will most
likely occur in a different order than in the original
document. This is a feature, not a bug, but the fact should be
understood.
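The point about ordering can be seen directly in Python. The
sketch below is only an illustration (the class 'Car' and its
members are invented): two instances given the same members in
opposite orders are indistinguishable by content. Current
Python implementations happen to remember insertion order in
dictionaries, but equality still ignores it.

```python
#------- Member order is not part of an object's content -------#
class Car:                        # hypothetical example class
    pass

a = Car()
a.make = "Honda"
a.doors = 4

b = Car()
b.doors = 4                       # same members, assigned in the opposite order
b.make = "Honda"

# The instance dictionaries compare equal whatever order they were built in
same_members = (a.__dict__ == b.__dict__)
# Sorting the member names is one way to obtain a repeatable order
stable_order = sorted(a.__dict__)
print(same_members, stable_order)
```

A tool that needs to compare two pickled documents should
therefore compare their parsed contents rather than their text.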
LIMITATIONS. Several known limitations occur in [xml_pickle]
as of the current version (0.2). One potentially serious flaw
is that no effort is made to trap cyclical references in
compound/container objects. If an object attribute refers back
to the container object (or some recursive version of this),
[xml_pickle] will exhaust the Python stack. Cyclical
references are likely to indicate a flaw in object design to
start with, but later versions of [xml_pickle] will certainly
attempt to deal with them more intelligently.
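One possible way to trap such references before pickling is to
remember which containers are currently being walked. The
sketch below is an assumption about how a check could work, not
the code of [xml_pickle]; the names 'Node' and 'has_cycle' are
invented for illustration.

```python
#------- Detecting cyclical references (illustrative) -------#
class Node:                       # hypothetical container class
    pass

def has_cycle(obj, _ancestors=None):
    """Return True if obj reaches itself through instance members,
    list or tuple items, or dictionary values."""
    ancestors = set() if _ancestors is None else _ancestors
    if id(obj) in ancestors:
        return True                # obj is already being walked above us
    if hasattr(obj, "__dict__"):
        children = list(vars(obj).values())
    elif isinstance(obj, dict):
        children = list(obj.values())
    elif isinstance(obj, (list, tuple)):
        children = list(obj)
    else:
        return False               # strings, numbers, None: nothing to walk
    ancestors.add(id(obj))
    found = any(has_cycle(child, ancestors) for child in children)
    ancestors.discard(id(obj))     # done with this branch of the walk
    return found

parent = Node()
parent.child = Node()
print(has_cycle(parent))          # no back-reference yet
parent.child.owner = parent       # child now refers back to its container
print(has_cycle(parent))
```

Only the objects on the current path are tracked, so an object
that is merely shared by two members is not mistaken for a
cycle.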
Another limitation exists in that the namespace of XML
attribute values (such as the "123" in <attr name="123">) is
larger than the namespace of valid Python variables and
instance members. If attributes are (manually) created outside
the Python namespace they will have the odd status of existing
in an instance's '.__dict__' magic attribute, but being
inaccessible by normal attribute syntax (e.g. "obj.123" is a
syntax error). This is only an issue where XML documents are
created or modified by means other than [xml_pickle] itself.
The author simply has not decided what the best way of handling
this (somewhat obscure) issue will be.
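The odd status of such members is easy to reproduce. The
following sketch uses only standard Python; the class 'Record'
and the helper 'is_plain_attribute' are invented for
illustration and are not part of [xml_pickle].

```python
#------- Member names that are not Python identifiers -------#
import keyword

class Record:                     # hypothetical example class
    pass

obj = Record()
obj.__dict__["123"] = "odd member"    # what a hand-edited document could yield

# "obj.123" cannot even be compiled, but getattr() still finds the value
reachable = getattr(obj, "123")

def is_plain_attribute(name):
    """True if name can be used with ordinary obj.name syntax."""
    return name.isidentifier() and not keyword.iskeyword(name)

for name in ("num", "123", "class"):
    print(name, is_plain_attribute(name))
```

A check along these lines could be applied when unpickling, for
example to warn about or rename the offending members; which
policy is best remains the open question described above.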
Not all attributes of Python objects are handled by
[xml_pickle] either. All the "usual" data members (strings,
numbers, dictionaries, etc.) are pickled well. But instance
methods, and class and function objects as attributes, are not
handled. Methods are simply ignored in pickling (as with
[pickle]). If class or function objects exist as attributes,
an XMLPicklingError is raised. This is probably the correct
ultimate behavior, but a final decision has not been made yet.
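The different treatment of methods and of function objects
follows from where each one lives. The sketch below (with
invented names 'Account', 'audit' and 'unsupported_members')
shows that a method belongs to the class and never appears in
the instance dictionary, while a function assigned as a member
does. It merely lists the problem members; how [xml_pickle]
itself detects them is not shown in this article.

```python
#------- Which members would a pickler see? (illustrative) -------#
import types

class Account:                    # hypothetical example class
    def deposit(self, amount):    # a method: stored on the class, not the instance
        self.balance += amount

def audit(account):               # hypothetical function stored as a member
    return account.balance >= 0

acct = Account()
acct.balance = 100
acct.check = audit

def unsupported_members(obj):
    """Names of instance members that hold class or function objects."""
    return [name for name, value in vars(obj).items()
            if isinstance(value, (type, types.FunctionType))]

print(sorted(vars(acct)))          # 'deposit' does not appear here
print(unsupported_members(acct))
```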
DESIGN CHOICES. One genuine ambiguity in XML document design
is the choice of when to use tag attributes, and when to use
sub-elements. Opinions on this design issue differ, and XML
programmers often feel strongly about their conflicting views.
This was probably the biggest issue in deciding the
[xml_pickle] document structure.
The general principle decided on was that a *thing* that is
naturally "plural" should be represented by sub-elements. For
example, a Python list can contain as many items as you like,
and is therefore represented by a sequence of <item>
sub-elements. On the other side, a number is a singular thing
(the value might be more than 1, but there is only one *thing*
in it). In that case it seemed much more logical to use an XML
attribute called "value". The really difficult case was with
Python strings. In a basic way, they are *sequence*
objects--just like lists. But representing each character in a
string using a hypothetical tag would destroy the goal
of human-readability, as well as make for enormous XML
representations. The decision was made to put strings in the
XML "value" attribute, just as with numbers. However, from an
aesthetic point-of-view this is probably less desirable than
placing strings within a tag container, especially for multi-line strings.
But this decision seemed more consistent since there was no
other "naked" #PCDATA in the specification.
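The principle can be made concrete with a toy encoder. The
sketch below is not the [xml_pickle] implementation: it uses
the standard [xml.etree.ElementTree] module, and the element
names ("PyObject", "attr", "item") and the "name" and "type"
attributes are assumptions made for illustration. Only the
"value" attribute is taken directly from the discussion above.

```python
#------- Sub-elements for plural things, "value" for singular ones -------#
import xml.etree.ElementTree as ET

def encode(parent, tag, value, name=None):
    """Append one element describing value to parent (a toy encoder)."""
    elem = ET.SubElement(parent, tag)
    if name is not None:
        elem.set("name", name)
    if isinstance(value, (list, tuple)):
        elem.set("type", type(value).__name__)
        for member in value:                  # plural: one sub-element each
            encode(elem, "item", member)
    else:                                     # singular: a "value" attribute
        elem.set("type", "string" if isinstance(value, str) else "numeric")
        elem.set("value", str(value))
    return elem

root = ET.Element("PyObject")
encode(root, "attr", 37, name="num")
encode(root, "attr", "Hello World", name="str")
encode(root, "attr", [1, 3.5, "x"], name="lst")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Numbers and strings end up as a single element carrying a
"value", while the list carries no "value" at all and instead
contains one child per member.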
In part because strings are stored in XML "value"
attributes--but mostly to keep the XML document
well-formed--Python strings needed to be stored in a "safe"
form. There are a few unsafe things that could occur in Python
strings. The first type is the basic markup characters like
greater-than and less-than. A second type is the quote and
apostrophe characters that set off attributes. The third type
is questionable ASCII values, such as a null character. One
possibility considered was to encode the whole Python strings
in something like base64 encoding. This would make strings
"safe," but also completely unreadable to humans. The decision
was made to use a mixed approach. The basic XML characters are
escaped in the style of "&amp;", "&gt;" or "&quot;".
Questionable ASCII values are escaped in Python-style, such as
"\000". The combination makes for easily human-readable XML
representations, but requires a somewhat mixed approach to
decoding stored strings.
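A mixed scheme of this kind might look like the sketch below.
It is an illustration under stated assumptions, not the
routine used by [xml_pickle]: the function names are invented,
the control characters chosen for escaping are a guess at what
"questionable" covers, and backslashes are doubled (an added
assumption) so that the encoding can be reversed without
ambiguity. The XML-style step uses the standard
[xml.sax.saxutils] module.

```python
#------- Mixed escaping for attribute-safe strings (illustrative) -------#
import re
from xml.sax.saxutils import escape, unescape

QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

def safe_string(text):
    """Escape backslashes and control characters Python-style,
    then escape markup and quote characters XML-style."""
    text = text.replace("\\", "\\\\")
    text = re.sub(r"[\x00-\x1f\x7f]",
                  lambda m: "\\%03o" % ord(m.group()), text)
    return escape(text, QUOTES)

def original_string(stored):
    """Reverse safe_string(): XML entities first, then backslash escapes."""
    text = unescape(stored, UNQUOTES)
    return re.sub(r"\\(\\|[0-7]{3})",
                  lambda m: "\\" if m.group(1) == "\\"
                  else chr(int(m.group(1), 8)),
                  text)

sample = 'x < y & "quoted"\x00'
stored = safe_string(sample)
print(stored)
print(original_string(stored) == sample)
```

The stored form stays readable for ordinary text, and decoding
is simply the two steps applied in the opposite order.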
ANTICIPATED USES
------------------------------------------------------------------------
There are a number of things that [xml_pickle] is likely to be
good for, and some user-feedback has indicated that it has
entered preliminary usage. Below are a few ideas.
- XML representations of Python objects may be indexed and
cataloged using existing XML-centric tools (not necessarily
written in Python). This provides a ready means of
indexing Python object databases (such as ZODB, PAOS, or
simply [shelve]).
- XML representations of Python objects could be restored as
objects of *other* OOP languages, especially ones having
a similar range of basic types. This is something yet to
do. Much "heavier" protocols like CORBA, XML-RPC, and
SOAP have overlapping purposes, but [xml_pickle] is pretty
"light-weight" as an object transport specification.
- Tools for printing and displaying XML documents can be used
to provide convenient human-readable representations of
Python objects via their XML intermediate form.
- It is possible to manually "debug" Python objects via their
XML representation using XML specific editors, or simply
text editors. Once hand-modified objects are unpickled,
the effects of the edits on program operation can be
examined. Other debuggers and wrappers exist for Python,
but this provides an additional option.
If readers develop additional uses for [xml_pickle] or see
enhancements that would open the module to additional uses, the
author would very much like to receive suggestions.
RESOURCES
------------------------------------------------------------------------
Charming Python #1: An Introduction to XML Tools for Python
http://gnosis.cx/cgi-bin/txt2html.cgi?source=../publish/programming/charming_python_1.txt
Charming Python #2: A Closer Look at Python's [xml.dom] Module
http://gnosis.cx/cgi-bin/txt2html.cgi?source=../publish/programming/charming_python_2.txt
A friendly introduction to Python for programmers with an XML
background is the book below. McGrath uses his book largely to
argue the virtues of his [pyxie] module and associated tools
and techniques as the best approach to XML processing. Whether
or not [pyxie] is the best approach to your specific problem,
McGrath's is a useful introduction to Python (but less so to
XML).
_XML Processing with Python_, Sean McGrath, Prentice Hall
PTR, Upper Saddle River, NJ, 2000.
The Python Special Interest Group on XML:
http://www.python.org/sigs/xml-sig/
The World Wide Web Consortium's DOM page.
http://www.w3.org/DOM/
The DOM Level 1 Recommendation.
http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/
Files used and mentioned in this article:
http://gnosis.cx/download/xml_matters_1.zip
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
David Mertz finds there to be a hysteretic relationship between
writing and comprehension. He has begun, for example, to
comprehend his own doctoral work. David may be reached at
mertz@gnosis.cx; his life pored over at
http://gnosis.cx/publish/. Suggestions and recommendations on
this, past, or future columns are welcomed.