(c) Tenco Media, 2000 -- may be freely distributed if unaltered

XML MATTERS #1
On the Pythonic Treatment of XML Documents As Objects (I)

David Mertz, Ph.D.
Data Masseur, Gnosis Software, Inc.
August 2000

WHAT IS XML? WHAT IS PYTHON?
------------------------------------------------------------------------

XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many readers will be most familiar with SGML via one particular document type, HTML. XML documents are similar to HTML in being composed of text interspersed with and structured by markup tags in angle-brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes: magazine articles and user documentation, files of structured data (like CSV or EDI files), messages for interprocess communication between programs, architectural diagrams (like CAD formats), and more. A set of tags can be created to capture any sort of structured information one might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.

Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms.

INTRODUCTION: THE PROJECT
-------------------------------------------------------------------

There exist a number of techniques and tools for dealing with XML documents in Python. The Resources section provides links to two developerWorks articles by this author that discuss general techniques, as well as links to other documents on XML/Python topics. However, one thing that most existing XML/Python tools have in common is that they are much more XML-centric than Python-centric.
Certain constructs and coding techniques feel "natural" in a given programming language, and others feel much more like they are imported from other domains. The latter type of construct can certainly get a job done--and in many cases a certain degree of stylistic clash between a programming tool and its problem domain is inevitable. But in an ideal environment all constructs fit intuitively into their domain, and domains merge seamlessly. When they do, programmers can wax poetic rather than merely "make it work."

This author has begun a research project of creating a more seamless and more natural integration between XML and Python. This article, and subsequent articles in this column, will discuss some of the goals, decisions, and limitations of this project; and hopefully along the way provide readers with a set of modules and techniques that are useful to them, and that point to easier ways to meet programming goals. All tools created as part of the Project will be released to the public domain.

Python is a language with a flexible object system and a rich set of built-in types. The richness of Python is both an advantage and a disadvantage for the Project. On the one hand, having a wide range of native facilities in Python makes it easier to represent a wide range of XML structures. On the other hand, the range of native types and structures of Python makes for more cases to worry about in representing native Python objects in XML.

As a result of those asymmetries that exist between XML and Python, the Project--at least initially--contains two separate modules: one for representing arbitrary Python objects in XML ([xml_pickle]); a second one ([xml_objectify]) for "native" representation of XML documents as Python objects. This article will address the former module.

PART I: [xml_pickle]
-------------------------------------------------------------------

Python's standard [pickle] module already provides a simple and convenient method of serializing Python objects, which is in turn useful for persistent storage or transmission over a network. In some cases, however, it is desirable to perform serialization to a format with several properties not possessed by [pickle]: (1) The format is human-readable; (2) The format may be parsed, manipulated, and objects imported, by languages other than Python; (3) The format supports validation of stored serialized objects.

[xml_pickle] provides each of these features, while maintaining interface compatibility with [pickle]. However, [xml_pickle] is not a general-purpose replacement for [pickle] since [pickle] retains several advantages of its own, such as faster operation (especially via [cPickle]) and far more compact object representation.

This article discusses the design goals and decisions that went into [xml_pickle], as well as thoughts on the module's likely uses.
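
As a point of comparison, the [pickle] interface that [xml_pickle] aims to stay compatible with can be sketched in a few lines. This is only a minimal illustration, assuming a current Python 3 interpreter (newer than the Python of this article's date); the variable names are invented for the example. It shows the dumps()/loads() round trip, and that the serialized form is an opaque byte string rather than something meant for human eyes.

```python
import pickle

# Plain built-in containers, echoing the data used in this article's example.
record = {
    "num": 37,
    "str": "Hello World",
    "lst": [1, 3.5, 2, 4 + 7j],
}

data = pickle.dumps(record)        # serialize to an opaque byte string
restored = pickle.loads(data)      # rebuild an equal, independent object

print(type(data).__name__)         # bytes
print(restored == record)          # True
print(restored is record)          # False
```

The same two-call shape (serialize, then restore) is what the [xml_pickle] examples in the next section follow, with XML text taking the place of the byte string.
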

USING [xml_pickle]
-------------------------------------------------------------------

Even though the interface of [xml_pickle] is mostly the same as that of [pickle], it is worth illustrating the (quite simple) usage of [xml_pickle] for readers who are not familiar with Python or with [pickle]:

      #------- Python code to demonstrate [xml_pickle] -------#
      import xml_pickle       # import the module

      # declare some classes to hold some attributes
      class MyClass1: pass
      class MyClass2: pass

      # create a class instance, and add some basic data members to it
      o = MyClass1()
      o.num = 37
      o.str = "Hello World"
      o.lst = [1, 3.5, 2, 4+7j]

      # create an instance of a different class, add some members
      o2 = MyClass2()
      o2.tup = ("x", "y", "z")
      o2.num = 2+2j
      o2.dct = { "this": "that", "spam": "eggs", 3.14: "about PI" }

      # add the second instance to the first instance container
      o.obj = o2

      # print an XML representation of the container instance
      xml_string = xml_pickle.XML_Pickler(o).dumps()
      print(xml_string)

Everything except the first line and the next to last line is generic Python for working with object instances. It might be a little contrived and a little simple, but essentially everything you do with instance data members is contained in the example (including nesting instances as container data, which is how most complex structures are built in Python). All a Python programmer needs to do to encode her objects as XML is make one method call.

Of course, once you have 'pickled' your objects, you will want to also restore them later (or elsewhere). Supposing the above few lines have already run, restoring the object representation is as simple as:

      #-- Creating object from xml_pickle'd representation ---#
      new_object = xml_pickle.XML_Pickler().loads(xml_string)

Obviously, in real cases you would want to do something more interesting with the created XML document than just hold it in memory during runtime.
For example, you might save the XML document to disk (maybe using the 'XML_Pickler.dump()' method), or transmit it over a communication channel. Actually, the example *does* print to paper, which might well be a good durable storage format.

SAMPLE PYOBJECTS.DTD DOCUMENT
-------------------------------------------------------------------

Running the sample code above will produce a pretty good example of the features of an [xml_pickle] representation of a Python object. However, the example below is a hand-coded test case developed by the author. The test case has the advantage of containing every XML structure, tag and attribute allowed in the document type. The specific data is invented, but it is not hard to imagine the application the data might belong to:

A formal document type definition (DTD) is currently being developed, and should be available in this article's source archive by the time the article first appears (check Resources). Informally, it is not difficult to see the structure of a 'PyObjects.dtd' XML document. But the DTD may disambiguate any issues the author has overlooked.

Looking at the sample XML document, one can see that the three stated design goals of [xml_pickle] have been met: (1) The format is human-readable; (2) The XML representations may be manipulated by means other than [xml_pickle], whether that is unrelated Python/XML modules, XML libraries in other programming languages, XML-enhanced editors and utilities, or just simply text editors (as was used in creation of the sample); (3) XML representations of Python objects may be validated--at least this will be possible using standard XML validators once 'PyObjects.dtd' is completed and tested. Once the DTD is in place, *all and only* DTD-conformant documents will be representations of valid Python objects.

DESIGN FEATURES, CAVEATS AND LIMITATIONS
-------------------------------------------------------------------

CONTENT MODEL.
The content models of Python and XML are in certain respects simply different. One difference to pay heed to is that XML documents are inherently linear in form. Python object attributes--and also Python dictionaries--have no definitional order (although implementation details create arbitrary ordering, such as of hashed keys). In this respect, the Python object model is closer to the relational model: rows of a relational table have no "natural" sequence, and primary or secondary keys may or may not provide any meaningful ordering on a table (the keys are always orderable by comparison operators, but this order may be unrelated to the semantics of the keys). An XML document always lists its tag elements in a particular order. The order may not be significant to a particular application, but being a linear document format the XML document order is always there.

The effect of the differing significance of key order in Python and XML is that the XML documents produced by [xml_pickle] are not guaranteed to maintain element order through pickle/unpickle cycles. For example, a hand-prepared PyObjects.dtd XML document like the above may be "unpickled" into a Python object. If the resultant object is then "pickled" the tags will most likely occur in a different order than in the original document. This is a feature, not a bug, but the fact should be understood.

LIMITATIONS. Several known limitations occur in [xml_pickle] as of the current version (0.2). One potentially serious flaw is that no effort is made to trap cyclical references in compound/container objects. If an object attribute refers back to the container object (or some recursive version of this), [xml_pickle] will exhaust the Python stack. Cyclical references are likely to indicate a flaw in object design to start with, but later versions of [xml_pickle] will certainly attempt to deal with them more intelligently.
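
The cyclical-reference limitation can be made concrete with a short sketch. The function below is not part of [xml_pickle]: 'has_cycle' and the 'Node' class are hypothetical names invented for illustration, and the code assumes a current Python 3 interpreter. It walks instance attributes and the built-in containers while remembering the ids of the objects on the current path; a check of this general kind is one way a pickler might report a cycle rather than exhaust the stack.

```python
def has_cycle(obj, _path=None):
    """Return True if obj (directly or indirectly) refers back to itself.

    Hypothetical helper for illustration; not part of xml_pickle.
    """
    if _path is None:
        _path = set()

    # Only compound objects can take part in a cycle.
    if isinstance(obj, (list, tuple, set, frozenset)):
        children = list(obj)
    elif isinstance(obj, dict):
        children = list(obj.keys()) + list(obj.values())
    elif hasattr(obj, "__dict__") and not isinstance(obj, type):
        children = list(vars(obj).values())
    else:
        return False

    if id(obj) in _path:          # we are already inside this object
        return True
    _path.add(id(obj))
    try:
        return any(has_cycle(child, _path) for child in children)
    finally:
        _path.discard(id(obj))    # shared, non-cyclic references stay legal


class Node:
    """Hypothetical container class used only for this example."""
    pass


parent = Node()
child = Node()
parent.child = child
print(has_cycle(parent))   # False

child.parent = parent      # the attribute now refers back to the container
print(has_cycle(parent))   # True
```

Because ids are removed from the path once an object has been fully visited, an object that is merely referenced twice (without referring back to its container) is not reported as a cycle.
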

Another limitation exists in that the namespace of XML attribute values (such as a name of "123") is larger than the namespace of valid Python variables and instance members. If attributes are (manually) created outside the Python namespace they will have the odd status of existing in an instance's '.__dict__' magic attribute, but being inaccessible by normal attribute syntax (e.g. "obj.123" is a syntax error). This is only an issue where XML documents are created or modified by means other than [xml_pickle] itself. The author simply has not decided what the best way of handling this (somewhat obscure) issue will be.

Not all attributes of Python objects are handled by [xml_pickle] either. All the "usual" data members (strings, numbers, dictionaries, etc.) are pickled well. But instance methods, and class and function objects as attributes, are not handled. Methods are simply ignored in pickling (as with [pickle]). If class or function objects exist as attributes, an XMLPicklingError is raised. This is probably the correct ultimate behavior, but a final decision has not been made yet.

DESIGN CHOICES. One genuine ambiguity in XML document design is the choice of when to use tag attributes, and when to use sub-elements. Opinions on this design issue differ, and XML programmers often feel strongly about their conflicting views. This was probably the biggest issue in deciding the [xml_pickle] document structure. The general principle decided on was that a *thing* that is naturally "plural" should be represented by sub-elements. For example, a Python list can contain as many items as you like, and is therefore represented by a sequence of sub-elements. On the other side, a number is a singular thing (the value might be more than 1, but there is only one *thing* in it). In that case it seemed much more logical to use an XML attribute called "value".

The really difficult case was with Python strings. In a basic way, they are *sequence* objects--just like lists.
But representing each character in a string using a hypothetical tag would destroy the goal of human-readability, as well as make for enormous XML representations. The decision was made to put strings in the XML "value" attribute, just as with numbers. However, from an aesthetic point of view this is probably less desirable than placing strings within a tag container, especially for multi-line strings. But this decision seemed more consistent since there was no other "naked" #PCDATA in the specification.

In part because strings are stored in XML "value" attributes--but mostly to keep the XML document well-formed--Python strings needed to be stored in a "safe" form. There are a few unsafe things that could occur in Python strings. The first type is the basic markup characters like greater-than and less-than. A second type is the quote and apostrophe characters that set off attributes. The third type is questionable ASCII values, such as a null character. One possibility considered was to encode whole Python strings in something like base64 encoding. This would make strings "safe," but also completely unreadable to humans. The decision was made to use a mixed approach. The basic XML characters are escaped in the style of "&amp;", "&gt;" or "&quot;". Questionable ASCII values are escaped in Python-style, such as "\000". The combination makes for easily human-readable XML representations, but requires a somewhat mixed approach to decoding stored strings.

ANTICIPATED USES
------------------------------------------------------------------------

There are a number of things that [xml_pickle] is likely to be good for, and some user feedback has indicated that it has entered preliminary usage. Below are a few ideas.

- XML representations of Python objects may be indexed and cataloged using existing XML-centric tools (not necessarily written in Python). This provides a ready means of indexing Python object databases (such as ZODB, PAOS, or simply [shelve]).

- XML representations of Python objects could be restored as objects of *other* OOP languages, especially ones having a similar range of basic types. This remains to be done. Much "heavier" protocols like CORBA, XML-RPC, and SOAP have overlapping purposes, but [xml_pickle] is pretty "light-weight" as an object transport specification.

- Tools for printing and displaying XML documents can be used to provide convenient human-readable representations of Python objects via their XML intermediate form.

- It is possible to manually "debug" Python objects via their XML representation using XML-specific editors, or simply text editors. Once hand-modified objects are unpickled, the effects of the edits on program operation can be examined. Other debuggers and wrappers exist for Python, but this provides an additional option.

If readers develop additional uses for [xml_pickle] or see enhancements that would open the module to additional uses, the author would very much like to receive suggestions.

RESOURCES
------------------------------------------------------------------------

Charming Python #1: An Introduction to XML Tools for Python
http://gnosis.cx/cgi-bin/txt2html.cgi?source=../publish/programming/charming_python_1.txt

Charming Python #2: A Closer Look at Python's [xml.dom] Module
http://gnosis.cx/cgi-bin/txt2html.cgi?source=../publish/programming/charming_python_2.txt

A friendly introduction to Python for programmers with an XML background is the book below. McGrath uses his book largely to argue the virtues of his [pyxie] module and associated tools and techniques as the best approach to XML processing. Whether or not [pyxie] is the best approach to your specific problem, McGrath's is a useful introduction to Python (but less so to XML).

_XML Processing with Python_, Sean McGrath, Prentice Hall PTR, Upper Saddle River, NJ, 2000.

The Python Special Interest Group on XML:
http://www.python.org/sigs/xml-sig/

The World Wide Web Consortium's DOM page:
http://www.w3.org/DOM/

The DOM Level 1 Recommendation:
http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/

Files used and mentioned in this article:
http://gnosis.cx/download/xml_matters_1.zip

ABOUT THE AUTHOR
------------------------------------------------------------------------

David Mertz finds there to be a hysteretic relationship between writing and comprehension. He has begun, for example, to comprehend his own doctoral work. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future columns are welcomed.