XML MATTERS #33: XML for Word Processors
Open source embraces XML as native document format

David Mertz, Ph.D.
Bibliophile, Gnosis Software, Inc.
February 2004

In their recent versions, the three major Free Software word processing programs have all adopted XML as their native document format. The approach to XML taken by AbiWord, KOffice's KWord, and OpenOffice.org Writer differs somewhat between the applications--largely reflecting the underlying development focus of each project. But all open source word processor developers have realized the advantages of XML as a document format: componentization of parsers and writers; openness and formality of format specification; applicability of XSLT and other transformation APIs.

INTRODUCTION
------------------------------------------------------------------------

Beyond some abandoned or incomplete efforts, three word processors are available now in actively maintained states. All do an excellent job as word processors; all provide a variety of useful import/export capabilities--including for the widely used, but proprietary, MS-Word format; and all are available in both source and binary form for Linux along with other platforms (on both Free and proprietary OSs). And most interestingly for this column, AbiWord, KWord, and OpenOffice.org Writer all use XML for their native document formats.

For this column, I am not interested in comparing the features, appearance, and user interfaces of the three projects named above. Suffice it to say that all of them have attained a very nice degree of polish in look-and-feel, and all have a sufficient feature set for creating most types of business and personal documents. What I am interested in here is the design of XML document formats--the guts inside these projects.

An additional Free Software application is worth mentioning in passing here: LyX is a GUI front-end for creating LaTeX documents.
For specialized technical documents--such as those involving many equations or complex cross-referencing--LyX is a good choice, but its learning curve is steep for creating general business correspondence.

A few words on the three projects are worthwhile for those unfamiliar with them. AbiWord is a standalone word processor with an emphasis on cross-platform compatibility, moderate size, and good execution speed. OpenOffice.org is an outgrowth of Sun Microsystems' StarOffice product, which was released under a Free Software license and taken up by a developer community. OpenOffice.org Writer is just one part of a suite of interoperable applications including a spreadsheet, vector drawing program, presentation application, and some other components. Similarly, KWord is part of KOffice (which is itself part of the overall KDE project); KOffice contains even more components than does OpenOffice.org--adding flowcharting, raster image editing, charting, and other applications. In any case, for now we focus only on the word processor components of KOffice and OpenOffice.org.

TESTING DOCUMENT FORMATS
------------------------------------------------------------------------

As you would expect, new versions of these open source word processors usually tweak the document format a bit. Fortunately, XML is well suited to upward changes, which usually amount to the addition of (optional) new attributes and child elements. If done well, earlier versions of applications can even degrade relatively gracefully when they read newer saved documents--usually just by ignoring unfamiliar tags and attributes.

In the XML formats I looked at, DTDs are provided by the project developers, but they tend to be somewhat out of sync with the actual XML documents created by the same version of the applications.
Well-formedness is still respected--as you would hope--but creation and parsing seem to be rather informal matters; the final say is the source code that implements the formats, not a DTD or Schema. In other words, the samples below will -not- validate successfully.

To get a look at what documents -really- look like, I created a very simple test document, shown in the screenshot:

{Screenshot of simple document: http://gnosis.cx/publish/programming/simple_doc.gif}

Interestingly--if perhaps not surprisingly--we will see in the XML versions of this document that the representation of the identical document is not unique. Of course, being XML, issues like whitespace normalization allow non-identical files to represent the same infoset; but that is not what I mean. I found that, at least in some details, the exact same formatting can get different markup depending on the sequence of user actions that went into the document creation (and perhaps on other factors too). While this fact is not necessarily a problem--and probably applies equally to binary document formats like MS-Word's .doc format--it seems mildly unfortunate that canonicalization is not as straightforward at a semantic level as it is at the level of XML syntax.

STARTING SIMPLE: [AbiWord]
------------------------------------------------------------------------

AbiWord uses a relatively simple and straightforward XML document format in which appearance and layout are specified in CSS-like attributes. While many such attributes are taken directly from CSS, the AbiWord developers decided that CSS was insufficient for their needs, so they took it only as a starting point.

Although they are a bit long, I would like to present the entire XML source of the word processor documents I created. I have prettified these sources, but have verified that my infoset-neutral changes do not affect re-import. First the AbiWord version: