(c) Tenco Media, 2001 -- may be freely distributed if unaltered
XML MATTERS #11: Revisiting [xml_pickle] and [xml_objectify]
Lessons in Open Source and Common Sense
David Mertz, Ph.D.
Revisionist, Gnosis Software, Inc.
May 2001
Since the author introduced his handy utilities for
high-level Python handling of XML documents, users and
readers have contributed a number of extremely useful
enhancements and suggestions. This column presents some of
the changes to David's module suite, as well as some tips on
advanced aspects of using and customizing the modules.
INTRODUCTION
------------------------------------------------------------------------
My IBM columns, tutorials, and articles have had a dual--or
maybe triple--purpose for your humble author. In the first
instance, I cherish the opportunity afforded me to share what
knowledge I have with other programmers/developers, and maybe
make a few people's tasks easier therein. It is also awfully
nice that I get paid money for writing these things.
Another purpose is also contained in a number of my columns. I
have had the opportunity to release to the public domain
programming code that I have written in the course of these
columns. In writing this code, I have had the goal of
illustrating general programming concepts--and have tailored
the code around that. But at the same time, I have wanted to
give the programming community code that individual developers
can utilize directly for their own purposes.
One result of releasing this code is that I have
received back from users of these modules a number of valuable
suggestions and enhancement patches. Most of the improvements
users have come up with are ones I would never have imagined on
my own; and a few are almost shocking in their insightfulness.
I'd like to use this column to present some uses of
[xml_pickle] and [xml_objectify] that were not possible when I
wrote the columns that initially discussed these modules: _XML
Matters #1_ and _XML Matters #2_.
ENHANCEMENTS TO [xml_objectify]
------------------------------------------------------------------------
One change, in particular, has been an ongoing struggle. My
timing was probably slightly unlucky. Within a short time
after I first created [xml_objectify] and [xml_pickle] (in
August 2000), the PyXML distribution went through several
incompatible versions; and not much later than that Python 2.0
came out with its own not-quite-compatible XML support. Users
contributed several patches to match then-current Python XML
support along the way, but in their current state
[xml_objectify] and [xml_pickle] both require Python 2.0+, and
its included PyXML package. Given the effective requirement
for Python 2.0 in terms of the XML packages, I also allowed in
a few other changes with Python 2 syntax. The backwards
incompatibility with Python 1.5 is unfortunate, but maintaining
compatibility would be too unwieldy in this case.
One of the features of [xml_objectify] introduced in _XML
Matters #2_ was the special '_XML' attribute that kept complete
element contents (including subelement markup of character
data). The default behavior is still to create an '_XML'
attribute of a nested object -only- when it contains
character-level markup. But you now have a choice about
changing this behavior, using the function 'keep_containers()'
and the values 'ALWAYS', 'MAYBE' and 'NEVER'. For example:
#-------- Default py_obj._XML attribute creation --------#
>>> xml_str = '''<doc><p>Spam and eggs <b>are</b> tasty</p>
... <p>The Spanish Inquisition</p>
... <foot>Our weapon is fear</foot></doc>'''
>>> open('test.xml','w').write(xml_str)
>>> from xml_objectify import *
>>> py_obj = XML_Objectify('test.xml').make_instance()
>>> py_obj.p[0].PCDATA
u'Spam and eggs  tasty'
>>> py_obj.p[0]._XML   # first <p> has markup
u'Spam and eggs <b>are</b> tasty'
>>> py_obj.p[1].PCDATA
u'The Spanish Inquisition'
>>> py_obj.p[1]._XML   # second <p> has no markup
Traceback (most recent call last):
File "", line 1, in ?
AttributeError: '_XO_p' instance has no attribute '_XML'
#------- Changing py_obj._XML attribute creation --------#
>>> _=keep_containers(ALWAYS)
>>> py_obj = XML_Objectify('test.xml').make_instance()
>>> py_obj.p[1]._XML
u'The Spanish Inquisition'
>>> _=keep_containers(NEVER)
>>> py_obj = XML_Objectify('test.xml').make_instance()
>>> py_obj.p[0]._XML
Traceback (most recent call last):
File "", line 1, in ?
AttributeError: '_XO_p' instance has no attribute '_XML'
Probably the most powerful feature of [xml_objectify] is also a
subtle one. Many users have probably never needed, or even
noticed class magic behavior. What is possible, however, is to
have special classes on hand that will determine the behaviors
of "objectified" XML nodes. The original article mentioned
this, but it is worth seeing in action.
Before the examples, a few details should be pointed out. In
order to avoid a sloppy conflict in the first module version,
[xml_objectify] now "mangles" the names of the class templates
for XML nodes. The "abstract" node class is called '_XO_', and
it has a few "magic" behaviors in itself. When concrete node
classes are created--by a programmer or dynamically--they have
the form '_XO_tagname' (where 'tagname' is a tag that occurs
in the objectified XML document).
The "magic" that '_XO_' itself provides consists of the
'__getitem__()' and '__len__()' methods. These let you treat
each node attribute as if it were a list in those contexts where
it would be nice for the attribute to behave like a list; but
at the same time, we can refer to an "only child" node without
having to subscript. For example:
#---- Node attributes as objects and lists of objects ---#
>>> print type(py_obj.p), type(py_obj.foot)
<type 'list'> <type 'instance'>
>>> print py_obj.p[1].PCDATA, '...', py_obj.foot.PCDATA
The Spanish Inquisition ... Our weapon is fear
>>> for line in py_obj.p: print line.PCDATA,
...
Spam and eggs  tasty The Spanish Inquisition
>>> for line in py_obj.foot: print line.PCDATA,
...
Our weapon is fear
>>> map(lambda line: len(line.PCDATA), py_obj.foot)
[18]
>>> map(lambda line: len(line.PCDATA), py_obj.p)
[20, 23]
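The dual behavior shown in this session can be imitated in a few
lines. What follows is not the code of [xml_objectify] itself,
only a minimal sketch (written for current Python 3 rather than
the Python 2.0 of the sessions) of how a '__getitem__()' and
'__len__()' pair lets a lone object stand in for a one-item
list. The class name 'Node' is hypothetical.

```python
class Node:
    """Hypothetical stand-in for an "only child" node object."""
    def __init__(self, pcdata):
        self.PCDATA = pcdata

    def __len__(self):
        # A lone node reports itself as a one-item sequence
        return 1

    def __getitem__(self, index):
        # Index 0 (or -1) is the node itself; anything else ends iteration
        if index in (0, -1):
            return self
        raise IndexError(index)

foot = Node('Our weapon is fear')
lengths = [len(line.PCDATA) for line in foot]   # iterates exactly once
print(foot.PCDATA, foot[0].PCDATA, lengths)
```

Because Python falls back on '__getitem__()' when an object has
no '__iter__()', the 'for' loop and the list comprehension both
work without any further methods.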
Still more magic is possible if you want to create your very
own node classes within a program. Basically, you can make an
attribute node behave in -any- way you might wish.
#------ Creating magic node behaviors for py_obj's ------#
>>> import xml_objectify
>>> xml_str = '''<buffet>
...   <plate><food>Steak</food><food>Potatos</food></plate>
...   <plate><food>Corn</food><food>Broccoli</food></plate>
... </buffet>'''
>>> open('buffet.xml','w').write(xml_str)
>>> class plate(xml_objectify._XO_):
...     def eat(self):
...         for food in self.food:
...             if food.PCDATA == 'Broccoli':
...                 return "If I liked Broccoli, I might have to eat it!"
...         return "Yum!"
...
>>> xml_objectify._XO_plate = plate
>>> py_obj = XML_Objectify('buffet.xml').make_instance()
>>> print py_obj.plate[1].eat()
If I liked Broccoli, I might have to eat it!
>>> print py_obj.plate[0].eat()
Yum!
Notice that the trick with the 'xml_objectify._XO_plate'
assignment is important. To get the proper magic behavior, the
right magic and mangled class needs to live in that namespace.
In my opinion, it is fabulously cool to be able to grab a bunch
of data from an XML file, then have a perfectly natural Python
object act on that data as its own attributes, using its own
methods.
For working with large XML documents, Costas Malamas has
contributed an invaluable enhancement. Until recently, the
only way [xml_objectify] worked was to create a DOM tree,
then recurse through that tree to generate the "Pythonic"
objects. That worked fine for small XML documents, but for
files of around 50k-100k, it starts to become absurdly slow.
There appears to be a complexity order effect going on that
renders [xml_objectify] unusable for large documents.
Fortunately, Malamas provided an alternative method for parsing
an XML document, based on the Python [expat] bindings ('expat'
is a high-performance XML library written in C). While there
are still a few wrinkles to be ironed out in the 'ExpatFactory'
class (failure for some documents with processing
instructions), for most cases, the new technique provides
speedy handling of even huge XML documents. Using the expat
technique imposes a couple of limitations by design, also: You
obviously lose the '_dom' attribute of your 'xml_obj' (if
you kept 'xml_obj' in the first place); and you also do not
have an '_XML' attribute to play with anymore. The latter
limitation might be lifted later, however.
Choosing which parsing technique to use is straightforward:
#------------- Choosing a parsing method ----------------#
>>> xml_obj = XML_Objectify('buffet.xml',EXPAT)
>>> xml_obj = XML_Objectify('buffet.xml',parser=DOM)
If no option is specified, the default is the legacy DOM
technique. But future code should specify explicitly, in case
the default changes. 'EXPAT' and 'DOM' are constants within
[xml_objectify] that simply contain matching string values.
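To give a feel for why an expat-based approach avoids the cost
of a DOM tree, here is a minimal sketch that builds objects
directly from parser events. It uses the standard-library
[xml.parsers.expat] bindings and is written for Python 3; it is
not the 'ExpatFactory' class, and the 'Node' class, the
'objectify()' function and the sample XML are all hypothetical.

```python
import xml.parsers.expat

class Node:
    """Hypothetical generic node; not the module's '_XO_' class."""
    def __init__(self):
        self.PCDATA = ''

def objectify(xml_text):
    root = Node()
    stack = [root]                      # currently open elements

    def start(tag, attrs):
        node = Node()
        node.__dict__.update(attrs)     # XML attributes become attributes
        parent = stack[-1]
        if not hasattr(parent, tag):
            setattr(parent, tag, node)          # first child with this tag
        else:
            current = getattr(parent, tag)
            if not isinstance(current, list):
                current = [current]             # promote to a list
                setattr(parent, tag, current)
            current.append(node)
        stack.append(node)

    def end(tag):
        stack.pop()

    def chars(data):
        stack[-1].PCDATA += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return root

doc = objectify('<menu><dish>Spam</dish><dish>Eggs</dish><note>Hot</note></menu>')
print([d.PCDATA for d in doc.menu.dish], doc.menu.note.PCDATA)
```

Each element is visited once as the parser streams through the
document, and no intermediate tree is kept, which is also why
nothing like a '_dom' attribute is available afterwards.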
ENHANCEMENTS TO [xml_pickle]
------------------------------------------------------------------------
In analogy with [xml_objectify], you will need to populate the
[xml_pickle] namespace when you want to retain the instance
methods of unpickled objects. That sounds confusing, but some
code makes it simple:
#---- Making sure unpickled Python objects are lively ---#
>>> import xml_pickle
>>> class MyClass:
...     def DoIt(self):
...         print "Done!"
...
>>> o1 = MyClass()
>>> o1.attr1 = 'spam'
>>> xml_str = xml_pickle.XML_Pickler(o1).dumps()
>>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
>>> o2.DoIt()
Traceback (most recent call last):
File "", line 1, in ?
AttributeError: 'MyClass' instance has no attribute 'DoIt'
>>> xml_pickle.MyClass = MyClass
>>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
>>> o2.DoIt()
Done!
Basically, if you put the classes you want to pickle into the
'xml_pickle' namespace before you start all the
pickling/unpickling, you can restore all your object behavior.
But notice that as with [pickle] and [cPickle], the methods are
not themselves pickled (just the attributes are); you use the
class that is present at runtime for the methods (which might
have been updated since the last pickling).
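The division of labor between stored attributes and runtime
classes can be sketched as follows. This is an illustration of
the general idea under stated assumptions, not the internals of
[xml_pickle]: the 'namespace' object and the 'restore()' helper
are hypothetical, and the code is written for Python 3.

```python
import types

# Hypothetical namespace playing the role of the 'xml_pickle' module
namespace = types.SimpleNamespace()

def restore(class_name, attrs, ns=namespace):
    # Use the class found in the namespace at load time, if there is one...
    klass = getattr(ns, class_name, None)
    if klass is None:
        # ...otherwise fall back to an empty class of the same name
        klass = type(class_name, (), {})
    obj = klass.__new__(klass)          # create the instance without __init__
    obj.__dict__.update(attrs)          # only the data was stored
    return obj

class MyClass:
    def DoIt(self):
        return "Done!"

o2 = restore('MyClass', {'attr1': 'spam'})
print(o2.attr1, hasattr(o2, 'DoIt'))    # data restored, method missing

namespace.MyClass = MyClass
o3 = restore('MyClass', {'attr1': 'spam'})
print(o3.attr1, o3.DoIt())              # data and behavior both present
```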
A limitation of [xml_pickle] that was pointed out in the
original article has been lifted by Joshua Macy (with some help
from Joe Kraska). In early versions, [xml_pickle] made no
efforts to check for cyclical references in pickled objects.
Furthermore--and for the same reason--every attribute was
pickled as a deep copy of its actual Python object. If you
have a Python object with many substructures containing
references to the same objects, the pickled size can get big
quickly. Moreover, unpickled objects will contain multiple
objects that, while possibly equal (i.e. 'a == b'), are not
identical (i.e. 'a is b') as were the pre-pickled originals.
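The loss of identity is easy to reproduce with any serializer
that writes each value out in full. The sketch below uses the
standard [json] module purely as a stand-in for the early
deep-copying behavior (it is not [xml_pickle]), and is written
for Python 3.

```python
import json

shared = [1, 2, 3]
original = {'lst': shared, 'lst2': shared}      # two names, one list

# JSON, like a deep-copying pickler, writes every value out in full
text = json.dumps(original)
restored = json.loads(text)

print(original['lst'] is original['lst2'])      # True: one shared object
print(restored['lst'] == restored['lst2'])      # True: equal contents
print(restored['lst'] is restored['lst2'])      # False: two separate copies
print(text.count('[1, 2, 3]'))                  # 2: the data is stored twice
```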
However, despite the gains in Macy's approach, it is desirable
to introduce a DEEPCOPY option back into the module. The main
issue with the (quite elegant) 'refid'/'id' scheme used is that
it is likely to be much harder for a generic tool to utilize.
Maybe users of languages other than Python want to be able to
easily use [xml_pickle]'d objects (maybe more as hierarchical
data stores than as full dynamic objects, but that is fine).
Or maybe XSLT transformations of pickled objects would be
useful for certain purposes. A pickled excerpt shows the
difficulty:
#-------------- Pickled Python object as XML ------------#
...
You can see that the attribute 'lst2' would be a bit of work to
figure out in a generic way (such as with developer eyeballs).
One has to pull off the 'refid', then search back for the
corresponding 'id'. Actually, the use of the 'type="ref"' XML
attribute may have been badly chosen. Given that it -has- a
'refid' XML attribute, things might be made clearer by simply
still recording 'type="list"', as with the 'lst2' referent
'lst'. But of course, once something is done, it is harder to
improve it without breaking backwards compatibility.
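For a sense of the extra work a generic tool has to do, here is
a minimal sketch of resolving references with the
standard-library [xml.etree.ElementTree] module in Python 3. The
XML fragment is invented for illustration: only 'id', 'refid',
'type="ref"', 'type="list"', 'lst' and 'lst2' come from the
discussion above, while the element names and values are
assumptions and should not be read as the actual [xml_pickle]
format.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; element names and values are invented
xml_text = '''<obj>
  <attr name="lst" type="list" id="5712">
    <item value="1"/><item value="2"/>
  </attr>
  <attr name="lst2" type="ref" refid="5712"/>
</obj>'''

root = ET.fromstring(xml_text)

# First pass: index every element that carries an 'id'
by_id = {el.get('id'): el for el in root.iter() if el.get('id') is not None}

# Second pass: follow each 'refid' back to its referent
resolved = {}
for attr in root.iter('attr'):
    target = by_id[attr.get('refid')] if attr.get('refid') else attr
    resolved[attr.get('name')] = [item.get('value') for item in target.iter('item')]

print(resolved)
```

The two passes are the point: a tool cannot interpret 'lst2' from
its own element alone, which is exactly the burden a DEEPCOPY
option would remove.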
A small caveat on references might appeal to subtle-minded
hackers. 'id'/'refid' values are invented out of the Python
'id()' of the relevant objects. The values do not mean
anything inherently, but have the nice property of being unique
at any given moment of runtime. [xml_pickle] gives no
assurance that pickling the "same" object in different runs
will produce entirely identical XML files (the 'id' values will
almost certainly change). In general, the ad hoc 'id' values
will not matter to a program, but if things like cryptographic
hashes or CRCs are used as part of a process, this could be a
gotcha.
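A small demonstration of the gotcha, in Python 3. Two equal but
distinct lists stand in here for the "same" object pickled in
two different runs, and the 'fake_pickle()' function is a
hypothetical one-liner that merely embeds 'id()' in its output;
it is not how [xml_pickle] formats anything.

```python
import hashlib

def fake_pickle(obj):
    # Hypothetical one-line "pickle" that embeds id() in the output
    return '<attr id="%d" value="%r"/>' % (id(obj), obj)

a = [1, 2, 3]
b = [1, 2, 3]           # equal to 'a', but a distinct object with its own id()

digest_a = hashlib.sha256(fake_pickle(a).encode('utf-8')).hexdigest()
digest_b = hashlib.sha256(fake_pickle(b).encode('utf-8')).hexdigest()

print(a == b)                   # True: same contents
print(digest_a == digest_b)     # False: the embedded ids differ
```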
Not too much need be described about the enhancement, but in
response to user requests, [Numeric] arrays have been added to
the set of picklable types. For scientific and mathematical
Python users, these types may make up important attributes of
their objects. [xml_pickle] makes an intelligent effort to
make sure that [Numeric] is present when supporting it; if not,
it falls back to the [array] module.
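The usual Python idiom for this kind of optional dependency is a
guarded import. The sketch below shows the idiom only, in Python
3; the 'backend' variable and 'make_array()' helper are
hypothetical and are not taken from [xml_pickle].

```python
try:
    import Numeric                      # the numerical extension, if present
    backend = 'Numeric'
    def make_array(values):
        return Numeric.array(values)
except ImportError:
    import array                        # standard-library fallback
    backend = 'array'
    def make_array(values):
        return array.array('d', values)     # 'd' = C double

data = make_array([1.0, 2.5, 4.0])
print(backend, list(data))
```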
CONCLUSION
------------------------------------------------------------------------
One lesson I have learned in developing--or maybe just
shepherding the development of--these modules is the value
of a Python truism: First get it right, then make it fast!
The latter part has now been fairly well reached. Some
optimizations to [xml_pickle] have brought its behavior from
O(N^2) to a manageable O(N), relative to pickled object size.
The trick there is that 'str = str + "more stuff"' can be
shockingly inefficient if performed often enough. With the
expat techniques, [xml_objectify] is similarly swift. I do not
think I would have got something to the world quickly, nor
received the amount of valuable contributions, had I worried
too much about optimization early.
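The string-building point is worth a concrete illustration. The
sketch below (Python 3) contrasts repeated concatenation with
the collect-then-join idiom; the function names are hypothetical
and this is not the actual patch applied to [xml_pickle].

```python
def build_slow(parts):
    out = ''
    for part in parts:
        out = out + part        # may copy everything accumulated so far
    return out

def build_fast(parts):
    pieces = []
    for part in parts:
        pieces.append(part)     # cheap: just remembers the fragment
    return ''.join(pieces)      # one pass to assemble the final string

tags = ['<item value="%d"/>' % n for n in range(1000)]
print(build_slow(tags) == build_fast(tags))
```

Both functions return the same string; the difference is that
the first may recopy the growing result on every step, while the
second touches each fragment only once.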
I look forward to learning more about the practical social
dynamics of open source software development as I am able to
create more tools and libraries like the ones discussed in this
column. It has been an interesting path, and I wonder where it
will lead.
RESOURCES
------------------------------------------------------------------------
The current home of David's XML modules is:
http://gnosis.cx/download/xml_objectify.py
And:
http://gnosis.cx/download/xml_pickle.py
For those interested in older--or pre-release--version numbers
of the modules, browse through the directory:
http://gnosis.cx/download/
A variety of versions, named with version numbers, live here.
The module that drops a version number is generally the most
recent "stable" version. Plus you can find lots of other
goodies in this directory (all public domain).
The initial articles on [xml_pickle] and [xml_objectify] can be
found at:
http://gnosis.cx/publish/programming/xml_matters_1.html
and:
http://gnosis.cx/publish/programming/xml_matters_2.html
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
David Mertz is blessed with the virtues of laziness, and
impatience, and in his wisdom wishes to warn the world that
hubris should not be confused with chutzpah or machismo. David
may be reached at mertz@gnosis.cx; his life pored over at
http://gnosis.cx/publish/. Suggestions and recommendations on
this, past, or future, columns are welcomed.