xml_pickle and xml_objectify
David Mertz, Ph.D.
Revisionist, Gnosis Software, Inc.
May 2001
Since the author introduced his handy utilities for high-level Python handling of XML documents, users and readers have contributed a number of extremely useful enhancements and suggestions. This column presents some of the changes to David's module suite, as well as some tips on advanced aspects of using and customizing the modules.
My IBM columns, tutorials, and articles have had a dual--or maybe triple--purpose for your humble author. In the first instance, I cherish the opportunity afforded me to share what knowledge I have with other programmers/developers, and maybe make a few people's tasks easier therein. It is also awfully nice that I get paid money for writing these things.
Another purpose is also contained in a number of my columns. I have had the opportunity to release to the public domain programming code that I have written in the course of these columns. In writing this code, I have had the goal of illustrating general programming concepts--and have tailored the code around that. But at the same time, I have wanted to give the programming community code that individual developers can utilize directly for their own purposes.
One result of releasing this code is that users of the modules
have sent back a number of valuable suggestions and enhancement
patches. Most of the improvements users have come up with are
ones I would never have imagined on my own, and a few are almost
shocking in their insightfulness. I'd like to use this column to
present some uses of xml_pickle and xml_objectify that were not
possible when I wrote the columns that initially discussed these
modules: XML Matters #1 and XML Matters #2.
xml_objectify
One change, in particular, has been an ongoing struggle. My
timing was probably slightly unlucky. Within a short time after
I first created xml_objectify and xml_pickle (in August 2000),
the PyXML distribution went through several incompatible
versions; not much later, Python 2.0 came out with its own
not-quite-compatible XML support. Users contributed several
patches along the way to match the then-current Python XML
support, but in their current state xml_objectify and xml_pickle
both require Python 2.0+ and its included PyXML package. Since
the XML packages effectively require Python 2.0 anyway, I also
allowed in a few other changes that use Python 2 syntax. The
loss of compatibility with Python 1.5 is unfortunate, but
maintaining it would be too unwieldy in this case.
One of the features of xml_objectify introduced in XML Matters #2
was the special _XML attribute that kept complete element
contents (including subelement markup of character data). The
default behavior is still to create an _XML attribute of a nested
object only when it contains character-level markup. But you can
now change this behavior using the function keep_containers()
and the values ALWAYS, MAYBE, and NEVER. For example:
    >>> xml_str = '''<doc><p>Spam and eggs <b>are</b> tasty</p>
    ... <p>The Spanish Inquisition</p>
    ... <foot>Our weapon is fear</foot></doc>'''
    >>> open('test.xml','w').write(xml_str)
    >>> from xml_objectify import *
    >>> py_obj = XML_Objectify('test.xml').make_instance()
    >>> py_obj.p[0].PCDATA
    u'Spam and eggs  tasty'
    >>> py_obj.p[0]._XML    # first <p> has <b> markup
    u'Spam and eggs <b>are</b> tasty'
    >>> py_obj.p[1].PCDATA
    u'The Spanish Inquisition'
    >>> py_obj.p[1]._XML    # second <p> has no markup
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: '_XO_p' instance has no attribute '_XML'
    >>> _=keep_containers(ALWAYS)
    >>> py_obj = XML_Objectify('test.xml').make_instance()
    >>> py_obj.p[1]._XML
    u'The Spanish Inquisition'
    >>> _=keep_containers(NEVER)
    >>> py_obj = XML_Objectify('test.xml').make_instance()
    >>> py_obj.p[0]._XML
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: '_XO_p' instance has no attribute '_XML'
Probably the most powerful feature of xml_objectify is also a
subtle one. Many users have probably never needed, or even
noticed, its class magic behavior. What is possible, however, is
to have special classes on hand that determine the behaviors of
"objectified" XML nodes. The original article mentioned this,
but it is worth seeing in action.
Before the examples, a few details should be pointed out. To
avoid a sloppy conflict that was possible in the first module
version, xml_objectify now "mangles" the names of the class
templates for XML nodes. The "abstract" node class is called
_XO_, and it has a few "magic" behaviors in itself. When
concrete node classes are created--by a programmer or
dynamically--they have the form _XO_tagname (where <tagname> is
a tag that occurs in the objectified XML document).
The "magic" that _XO_ itself provides consists of the
__getitem__() and __len__() methods. These let you treat each
node attribute as if it were a list in those contexts where it
would be nice for the attribute to behave like a list; at the
same time, you can refer to an "only child" node without having
to subscript. For example:
    >>> print type(py_obj.p), type(py_obj.foot)
    <type 'list'> <type 'instance'>
    >>> print py_obj.p[1].PCDATA, '...', py_obj.foot.PCDATA
    The Spanish Inquisition ... Our weapon is fear
    >>> for line in py_obj.p: print line.PCDATA,
    ...
    Spam and eggs  tasty The Spanish Inquisition
    >>> for line in py_obj.foot: print line.PCDATA,
    ...
    Our weapon is fear
    >>> map(lambda line: len(line.PCDATA), py_obj.foot)
    [18]
    >>> map(lambda line: len(line.PCDATA), py_obj.p)
    [20, 23]
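The list-like behavior of an "only child" can be imitated in a few lines. What follows is a minimal sketch in modern Python 3, not the module's actual code: the Node class is a hypothetical stand-in, and it assumes only what is described above, namely that __getitem__() and __len__() are enough to let a single node be looped over.

```python
class Node:
    """Hypothetical stand-in for an objectified XML node."""

    def __init__(self, pcdata):
        self.PCDATA = pcdata

    def __getitem__(self, index):
        # A lone node acts like a one-element list: only index 0 is valid.
        # Raising IndexError for anything else is what ends a for-loop.
        if index == 0:
            return self
        raise IndexError(index)

    def __len__(self):
        return 1


foot = Node("Our weapon is fear")          # an "only child"
paragraphs = [Node("Spam"), Node("Eggs")]  # repeated children stay a real list

# The same loop works on both, with or without a subscript.
single = [node.PCDATA for node in foot]
several = [node.PCDATA for node in paragraphs]
```

Python falls back to calling __getitem__() with 0, 1, 2, ... when an object defines no __iter__(), which is why the loop over foot stops after one item.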
Still more magic is possible if you want to create your very own node classes within a program. Basically, you can make an attribute node behave in any way you might wish.
    >>> import xml_objectify
    >>> xml_str = '''<buffet>
    ... <plate><food>Steak</food><food>Potatos</food></plate>
    ... <plate><food>Corn</food><food>Broccoli</food></plate>
    ... </buffet>'''
    >>> open('buffet.xml','w').write(xml_str)
    >>> class plate(xml_objectify._XO_):
    ...     def eat(self):
    ...         for food in self.food:
    ...             if food.PCDATA == 'Broccoli':
    ...                 return "If I liked Broccoli, I might have to eat it!"
    ...         return "Yum!"
    ...
    >>> xml_objectify._XO_plate = plate
    >>> py_obj = XML_Objectify('buffet.xml').make_instance()
    >>> print py_obj.plate[1].eat()
    If I liked Broccoli, I might have to eat it!
    >>> print py_obj.plate[0].eat()
    Yum!
Notice that the xml_objectify._XO_plate assignment is important.
To get the proper magic behavior, the correctly mangled class
needs to live in the xml_objectify namespace.
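Why the assignment matters becomes clearer with a toy version of the lookup. This is a sketch under stated assumptions rather than xml_objectify's implementation: the registry dict stands in for the module namespace, class_for() and objectify() are hypothetical helpers, and every child is stored in a list for simplicity. It is written in modern Python 3 using only the standard library.

```python
import xml.etree.ElementTree as ET


class _XO_:
    """Base class for all nodes in this sketch."""


# Hypothetical registry standing in for the module namespace.
registry = {}


def class_for(tag):
    # Prefer a programmer-supplied class under the mangled name;
    # otherwise create (and remember) an empty subclass on the fly.
    name = "_XO_" + tag
    if name not in registry:
        registry[name] = type(name, (_XO_,), {})
    return registry[name]


def objectify(elem):
    node = class_for(elem.tag)()
    node.PCDATA = (elem.text or "").strip()
    for child in elem:
        # Children are always collected in lists here.
        vars(node).setdefault(child.tag, []).append(objectify(child))
    return node


class plate(_XO_):
    def eat(self):
        if any(food.PCDATA == "Broccoli" for food in self.food):
            return "No thanks"
        return "Yum!"


registry["_XO_plate"] = plate   # must be registered before parsing

buffet = objectify(ET.fromstring(
    "<buffet>"
    "<plate><food>Steak</food><food>Potatoes</food></plate>"
    "<plate><food>Corn</food><food>Broccoli</food></plate>"
    "</buffet>"
))
results = [p.eat() for p in buffet.plate]
```

Tags without a registered class, such as food here, simply get an empty generated class, so only the nodes you care about need custom behavior.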
In my opinion, it is fabulously cool to be able to grab a bunch of data from an XML file, then have a perfectly natural Python object act on that data as its own attributes, using its own methods.
For working with large XML documents, Costas Malamas has
contributed an invaluable enhancement. Until recently, the only
way xml_objectify worked was to create a DOM tree, then recurse
through that tree to generate the "Pythonic" objects. That
worked fine for small XML documents, but for files of around
50k-100k it starts to become absurdly slow. The running time
appears to grow faster than linearly with document size, which
renders the DOM-based xml_objectify unusable for large documents.
Fortunately, Malamas provided an alternative method for parsing
an XML document, based on the Python expat bindings (expat is a
high-performance XML library written in C). While there are
still a few wrinkles to be ironed out in the ExpatFactory class
(it fails on some documents with processing instructions), in
most cases the new technique provides speedy handling of even
huge XML documents. Using the expat technique also imposes a
couple of limitations by design: you obviously lose the _dom
attribute of your xml_obj (if you kept xml_obj in the first
place), and you no longer have an _XML attribute to play with.
The latter limitation might be lifted later, however.
Choosing which parsing technique to use is straightforward:
    >>> xml_obj = XML_Objectify('buffet.xml',EXPAT)
    >>> xml_obj = XML_Objectify('buffet.xml',parser=DOM)
If no option is specified, the default is the legacy DOM
technique. But future code should specify the parser explicitly,
in case the default changes. EXPAT and DOM are constants within
xml_objectify that simply contain matching string values.
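For readers who have not used expat directly, the event-driven style behind the faster technique looks roughly like this. It is a sketch only, written in modern Python 3 against the standard xml.parsers.expat module; the Node class and parse() function are hypothetical and are not the ExpatFactory class mentioned above. The point is that objects are built as the parser reports each tag, with no intermediate DOM tree.

```python
import xml.parsers.expat


class Node:
    """Hypothetical plain container, not the module's node class."""

    def __init__(self, tag):
        self.tag = tag
        self.PCDATA = ""
        self.children = []


def parse(xml_text):
    root = Node("#document")
    stack = [root]            # currently open elements, innermost last

    def start(name, attrs):
        node = Node(name)
        stack[-1].children.append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        # expat may deliver text in several pieces, so accumulate it.
        stack[-1].PCDATA += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)   # True marks the final chunk of input
    return root.children[0]


doc = parse("<doc><p>The Spanish Inquisition</p>"
            "<foot>Our weapon is fear</foot></doc>")
```

Each element is visited exactly once as it streams past, which is consistent with the speedup described, though the sketch makes no claim about the module's actual performance.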
xml_pickle
In analogy with xml_objectify, you will need to populate the
xml_pickle namespace when you want to retain the instance
methods of unpickled objects. That sounds confusing, but some
code makes it simple:
    >>> import xml_pickle
    >>> class MyClass:
    ...     def DoIt(self):
    ...         print "Done!"
    ...
    >>> o1 = MyClass()
    >>> o1.attr1 = 'spam'
    >>> xml_str = xml_pickle.XML_Pickler(o1).dumps()
    >>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
    >>> o2.DoIt()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'MyClass' instance has no attribute 'DoIt'
    >>> xml_pickle.MyClass = MyClass
    >>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
    >>> o2.DoIt()
    Done!
Basically, if you put the classes you want to pickle into the
xml_pickle namespace before you start all the
pickling/unpickling, you can restore all your object behavior.
But notice that, as with pickle and cPickle, the methods are not
themselves pickled (just the attributes are); you use the class
that is present at runtime for the methods (which might have
been updated since the last pickling).
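The division of labor between stored attributes and the runtime class can be shown without any XML at all. The following is a minimal sketch in modern Python 3, not xml_pickle's code: restore() is a hypothetical helper, the namespace dict plays the role of the xml_pickle module namespace, and the saved tuple stands in for whatever the pickle records.

```python
def restore(class_name, attrs, namespace):
    """Rebuild an instance from saved attributes (hypothetical helper)."""
    # Use the class found in the namespace right now; if none is
    # registered, fall back to an empty class with the same name.
    cls = namespace.get(class_name) or type(class_name, (), {})
    obj = cls.__new__(cls)        # create the instance without __init__
    obj.__dict__.update(attrs)
    return obj


saved = ("MyClass", {"attr1": "spam"})   # only the name and the data

bare = restore(*saved, namespace={})     # data comes back, methods do not


class MyClass:
    def do_it(self):
        return "Done! " + self.attr1


full = restore(*saved, namespace={"MyClass": MyClass})
```

Since the method bodies come from whichever class is registered at load time, editing MyClass between saving and loading changes the behavior of every restored object.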
A limitation of xml_pickle that was pointed out in the original
article has been lifted by Joshua Macy (with some help from Joe
Kraska). In early versions, xml_pickle made no effort to check
for cyclical references in pickled objects. Furthermore--and
for the same reason--every attribute was pickled as a deep copy
of its actual Python object. If you have a Python object with
many substructures containing references to the same objects,
the pickled size can get big quickly. Moreover, unpickled
objects will contain multiple objects that, while possibly equal
(i.e. a == b), are not identical (i.e. a is b) as were the
pre-pickled originals.
However, despite the gains in Macy's approach, it is desirable
to introduce a DEEPCOPY option back into the module. The main
issue with the (quite elegant) id/refid scheme used is that it
is likely to be much harder for a generic tool to utilize.
Maybe users of languages other than Python want to be able to
easily use xml_pickle'd objects (maybe more as hierarchical data
stores than as full dynamic objects, but that is fine). Or
maybe XSLT transformations of pickled objects would be useful
for certain purposes. A pickled excerpt shows the difficulty:
    <?xml version="1.0"?>
    <!DOCTYPE PyObject SYSTEM "PyObjects.dtd">
    <PyObject class="XML_Pickler" id="1383532">
    <attr name="lst" type="list" id="1391340">
      <item type="numeric" value="1" />
      <item type="numeric" value="3.5" />
      <item type="numeric" value="2" />
      <item type="numeric" value="(4+7j)" />
    </attr>
    <attr name="lst2" type="ref" refid="1391340" />
    <attr name="num" type="numeric" value="37" />
    ...
    </PyObject>
You can see that the attribute lst2 would be a bit of work to
figure out in a generic way (such as with developer eyeballs).
One has to pull off the refid, then search back for the
corresponding id. Actually, the type="ref" XML attribute may
have been badly chosen. Given that the element has a refid XML
attribute, things might be made clearer by simply still
recording type="list", as with lst2's referent, lst. But of
course, once something is done, it is harder to improve it
without breaking backwards compatibility.
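The two-step lookup a generic tool would need (collect every id, then follow each refid back) can be sketched with nothing more than the standard library. This is an illustration in modern Python 3, not part of xml_pickle: the sample document is a shortened copy of the excerpt above without its DOCTYPE line, and resolve() is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

PICKLED = """<PyObject class="XML_Pickler" id="1383532">
  <attr name="lst" type="list" id="1391340">
    <item type="numeric" value="1" />
    <item type="numeric" value="3.5" />
  </attr>
  <attr name="lst2" type="ref" refid="1391340" />
  <attr name="num" type="numeric" value="37" />
</PyObject>"""

root = ET.fromstring(PICKLED)

# Pass 1: index every element that carries an id.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(attr):
    # Pass 2: follow a reference back to the element that defines it.
    if attr.get("type") == "ref":
        return by_id[attr.get("refid")]
    return attr


targets = {a.get("name"): resolve(a) for a in root.findall("attr")}
```

After resolution, lst and lst2 map to the very same element, which mirrors the shared identity the scheme is meant to preserve.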
A small caveat on references might appeal to subtle-minded
hackers. The id/refid values are invented out of the Python
id() of the relevant objects. The values do not mean anything
inherently, but have the nice property of being unique at any
given moment of runtime. xml_pickle gives no assurance that
pickling the "same" object in different runs will produce
entirely identical XML files (the id values will almost
certainly change). In general, the ad hoc id values will not
matter to a program, but if things like cryptographic hashes or
CRCs are used as part of a process, this could be a gotcha.
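One possible workaround, offered purely as an assumption and not as anything xml_pickle provides, is to renumber the id and refid values in order of first appearance before hashing. The sketch below is modern Python 3; canonical_digest() is a hypothetical helper, and its regular expression is deliberately rough (it would also rewrite matching text that happened to appear inside pickled string data).

```python
import hashlib
import re


def canonical_digest(xml_text):
    """Hash pickled XML after renumbering ids (hypothetical helper)."""
    renumbered = {}

    def renumber(match):
        # Map each run-specific value to 1, 2, 3... in order of first use.
        new = renumbered.setdefault(match.group(2), str(len(renumbered) + 1))
        return '%s="%s"' % (match.group(1), new)

    normalized = re.sub(r'\b(id|refid)="(\d+)"', renumber, xml_text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


TEMPLATE = ('<PyObject id="%s"><attr name="lst" type="list" id="%s" />'
            '<attr name="lst2" type="ref" refid="%s" /></PyObject>')

run_a = TEMPLATE % ("1383532", "1391340", "1391340")
run_b = TEMPLATE % ("99", "77", "77")      # same structure, another run
run_c = TEMPLATE % ("99", "77", "99")      # lst2 now refers to the root
```

With this normalization, run_a and run_b hash identically while run_c does not, because only the structure of the references survives the renumbering.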
Not much needs to be said about this enhancement, but in
response to user requests, Numeric arrays have been added to the
set of picklable types. For scientific and mathematical Python
users, these types may make up important attributes of their
objects. xml_pickle makes an intelligent effort to make sure
that Numeric is present when supporting it; if not, it falls
back to the array module.
One lesson I have learned in developing--or maybe just shepherding the development of--these modules is the value of a Python truism: First get it right, then make it fast!
The latter part has now been fairly well achieved. Some
optimizations to xml_pickle have brought its behavior from
O(N^2) to a manageable O(N), relative to pickled object size.
The trick there is that str = str + "more stuff" can be
shockingly inefficient if performed often enough. With the
expat technique, xml_objectify is similarly swift. I do not
think I would have gotten something out to the world quickly,
nor received so many valuable contributions, had I worried too
much about optimization early.
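The string-building point is easy to see side by side. This is a generic sketch in modern Python 3, not the actual change made to xml_pickle: both functions are hypothetical, and the item tags are made up for illustration. Repeated concatenation can copy everything accumulated so far on each step, whereas collecting pieces in a list and joining once touches each character a bounded number of times.

```python
def build_slow(parts):
    out = ""
    for part in parts:
        out = out + part      # may copy all text accumulated so far
    return out


def build_fast(parts):
    chunks = []
    for part in parts:
        chunks.append(part)   # amortized constant time per piece
    return "".join(chunks)    # one final pass over all the text


tags = ['<item type="numeric" value="%d" />\n' % n for n in range(1000)]
```

Both functions return the same string; only the amount of copying differs, and some interpreters soften the cost of the first form in simple cases.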
I look forward to learning more about the practical social dynamics of open source software development as I am able to create more tools and libraries like the ones discussed in this column. It has been an interesting path, and I wonder where it will lead.
The current home of David's XML modules is:
http://gnosis.cx/download/xml_objectify.py
And:
http://gnosis.cx/download/xml_pickle.py
For those interested in older--or pre-release--version numbers of the modules, browse through the directory:
http://gnosis.cx/download/
A variety of versions, named with version numbers, live here. The module that drops a version number is generally the most recent "stable" version. Plus you can find lots of other goodies in this directory (all public domain).
The initial articles on xml_pickle and xml_objectify can be found at:
http://gnosis.cx/publish/programming/xml_matters_1.html
and:
http://gnosis.cx/publish/programming/xml_matters_2.html
David Mertz is blessed with the virtues of laziness, and impatience, and in his wisdom wishes to warn the world that hubris should not be confused with chutzpah or machismo. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.