David Mertz, Ph.D.
Gesticulator, Gnosis Software, Inc.
October, 2001
This tip provides some guidance on when to use tag attributes and when to use subelement contents to represent data. What considerations go into designing a DTD, a Schema, or just an ad hoc XML format? When are attributes and contents interchangeable, and when are they not?
An odd thing about XML is that it provides two almost-but-not-quite-equivalent ways of spelling "this is the data." One way of indicating a data value is to put it inside the content of a subelement; another way is to put it in an attribute value. Inasmuch as there is not usually a clear line between where each of the two approaches is appropriate, XML is not entirely orthogonal (which is computer-science-speak for "each construct does one thing, and no other construct does the same thing").
One way of deciding exactly what data goes where is to be handed an XML dialect specification--either as a DTD, as a W3C XML Schema, or described informally or by example--and to be told that you must use that dialect. If you are not making the choices, do not worry about the suggestions in this tip. But many times one needs to design the exact XML dialect to use for a process. In the latter case, think about the issues below.
One thing to keep in mind is the difference between XML documents that need to be merely well-formed, and those that need to be valid relative to some DTD/Schema. Validity is much more rigorous; it allows you to insist on certain data being present, and structured in a certain way; but for the very same reason, it is much more work to make sure a given document production process conforms with validity requirements. There are advantages to both approaches; as well, imposing a DTD adds pros and cons to the element/attribute issues.
If you want to use a DTD, subelements listed in a sequence must appear in exactly the declared order, while attributes may appear in any order. In well-formed-only XML documents, one is free to play with order; after all, with no DTD, any tag can go inside any other tag, at any depth.
For example, you might have a list of contacts, each of whom must have a name, age, and telephone number. But there is no logical sense in which age precedes telephone number. Since attributes are unordered, they are the more intuitive choice in this case. Compare these documents:
<?xml version="1.0" ?>
<!DOCTYPE contacts SYSTEM "attrs.dtd" >
<contacts>
  <contact name="Jane Doe" age="74" telephone="555-3412" />
  <contact name="Chieu Win" telephone="555-8888" age="44" />
</contacts>

<?xml version="1.0" ?>
<!DOCTYPE contacts SYSTEM "subelem.dtd" >
<contacts>
  <contact>
    <name>Jane Doe</name>
    <age>74</age>
    <telephone>555-3412</telephone>
  </contact>
  <contact>
    <name>Chieu Win</name>
    <telephone>555-8888</telephone>
    <age>44</age>
  </contact>
</contacts>
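To see the difference from a consumer's point of view, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The function names are my own, and the DOCTYPE lines are left out so the snippet does not depend on the DTD files; it only illustrates that attribute order is invisible to the reader of a document while subelement order is not.

```python
import xml.etree.ElementTree as ET

# The two example documents, minus their DOCTYPE declarations.
ATTR_DOC = """<contacts>
  <contact name="Jane Doe" age="74" telephone="555-3412" />
  <contact name="Chieu Win" telephone="555-8888" age="44" />
</contacts>"""

SUBELEM_DOC = """<contacts>
  <contact><name>Jane Doe</name><age>74</age><telephone>555-3412</telephone></contact>
  <contact><name>Chieu Win</name><telephone>555-8888</telephone><age>44</age></contact>
</contacts>"""

def attribute_names(doc):
    """Return the set of attribute names on each contact (a set has no order)."""
    return [set(c.attrib) for c in ET.fromstring(doc).iter("contact")]

def child_order(doc):
    """Return the child tag names of each contact, in document order."""
    return [[child.tag for child in c] for c in ET.fromstring(doc).iter("contact")]

jane_attrs, chieu_attrs = attribute_names(ATTR_DOC)
jane_kids, chieu_kids = child_order(SUBELEM_DOC)
print(jane_attrs == chieu_attrs)   # same attributes, whatever order they were written in
print(jane_kids == chieu_kids)     # the child sequences differ between the two contacts
```

Code that looks values up by name is indifferent in both cases; it is only once an order is declared, as in a DTD sequence, that the second contact becomes a problem.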
Imagine the DTD that is implied by each XML format. For the attribute-oriented form, it might be:
<!ELEMENT contacts (contact*)>
<!ELEMENT contact EMPTY>
<!ATTLIST contact name      CDATA #REQUIRED
                  age       CDATA #REQUIRED
                  telephone CDATA #REQUIRED >
One might try to create a subelement-oriented DTD to do the same thing with:
<!ELEMENT contacts (contact*)>
<!ELEMENT contact (name,age,telephone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT telephone (#PCDATA)>
The obvious problem is that the simple subelement example above is invalid under this DTD (even though it has the data we want). In the second contact, the subelements are out of order: telephone comes before age.
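The failure can be mimicked without a validating parser. The following is only a sketch, assuming Python's standard xml.etree.ElementTree module (which does not validate against DTDs); the hand-written check and the names REQUIRED_ORDER and contacts_out_of_order are my own stand-ins for what the (name,age,telephone) content model demands.

```python
import xml.etree.ElementTree as ET

# Stand-in for the content model (name,age,telephone); not real DTD validation.
REQUIRED_ORDER = ["name", "age", "telephone"]

DOC = """<contacts>
  <contact><name>Jane Doe</name><age>74</age><telephone>555-3412</telephone></contact>
  <contact><name>Chieu Win</name><telephone>555-8888</telephone><age>44</age></contact>
</contacts>"""

def contacts_out_of_order(doc):
    """Return the names of contacts whose children are not exactly REQUIRED_ORDER."""
    bad = []
    for contact in ET.fromstring(doc).iter("contact"):
        if [child.tag for child in contact] != REQUIRED_ORDER:
            bad.append(contact.findtext("name"))
    return bad

print(contacts_out_of_order(DOC))
```

Only the second contact is reported, although both carry exactly the same three data points.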
One could create a DTD that made the XML document valid by including the definition:
<!ELEMENT contact (name?,age?,telephone?)+>
However, this allows far too much--you could have contact elements with no name, and ones with several ages, neither of which matches our semantic requirements. To get what we really want would require the extremely cumbersome definition:
<!ELEMENT contact ((name,age,telephone)|
                   (name,telephone,age)|
                   (age,name,telephone)|
                   (age,telephone,name)|
                   (telephone,name,age)|
                   (telephone,age,name))>
This is ugly, and it gets uglier at a factorial rate with more data points. And being stricter than is semantically necessary for data producers is also undesirable (e.g., imposing the first subelement DTD).
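Both halves of this trade-off can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a validator: the loose content model is approximated by a regular expression over comma-terminated tag names, and the helper names (tags_to_text, loose_accepts, any_order_model) are hypothetical.

```python
import re
from itertools import permutations

def tags_to_text(tags):
    """Encode a child-tag sequence as 'name,age,' so a regex can match it."""
    return "".join(tag + "," for tag in tags)

# Regex approximation of <!ELEMENT contact (name?,age?,telephone?)+>
LOOSE = re.compile(r"(?:(?:name,)?(?:age,)?(?:telephone,)?)+")

def loose_accepts(tags):
    """True if the loose content model would allow this child sequence."""
    return LOOSE.fullmatch(tags_to_text(tags)) is not None

def any_order_model(fields):
    """Build the DTD content model that spells out every ordering of fields."""
    groups = ("(" + ",".join(order) + ")" for order in permutations(fields))
    return "(" + "|".join(groups) + ")"

# The loose model takes the reordered contact, but also nonsense.
print(loose_accepts(["name", "telephone", "age"]))
print(loose_accepts(["age", "age"]))

# The exact model needs one alternative per permutation of the fields.
for n in range(3, 7):
    fields = ["field%d" % i for i in range(n)]
    print(n, any_order_model(fields).count("|") + 1)
```

For the three fields name, age, and telephone, any_order_model reproduces the six-way definition above; each additional field multiplies the number of alternatives by the new field count.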
If the same type of data occurs many times within an object, subelements win, hands down. For example, a "contacts" object contains many "contact" objects in the above example. In this case, it is clear that each "contact" should be described within a child element of the "contacts" element.
However, in real life, one is often tempted away from this design principle. Here is how it happens: First, you find that each Flazbar has a flizbam attached to it (and a flizbam is described by a datum). Good enough; it seems obvious to save the extra verbosity of a subelement, and create a "flizbam" attribute for the "Flazbar" tag. A while down the road--after you have written wonderful production code for handling Flazbars--you discover that in some situations, a Flazbar can have two flizbams. Not a problem; with very little change to your installed code, you just change the DTD to contain:
<!ATTLIST Flazbar flizbam  CDATA #REQUIRED
                  flizbam2 CDATA #IMPLIED>
All your old XML documents are still valid, and new ones work too. After a while you discover the third flizbam....
It is hard to avoid this pitfall. Data and objects evolve over time, and singular things frequently become dual or multiple. Some programmers seem to eschew attributes altogether out of this fear; but I think that goes too far.
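The cost of the numbered-attribute route shows up in the consuming code. Here is a rough sketch in Python with the standard xml.etree.ElementTree module; the repeated <flizbam> subelement form and both function names are hypothetical, invented only to contrast the two designs.

```python
import xml.etree.ElementTree as ET

def flizbams_from_attributes(flazbar):
    """Collect flizbam, flizbam2, flizbam3, ... until a numbered name is missing."""
    values = []
    if "flizbam" in flazbar.attrib:
        values.append(flazbar.get("flizbam"))
    n = 2
    while "flizbam%d" % n in flazbar.attrib:
        values.append(flazbar.get("flizbam%d" % n))
        n += 1
    return values

def flizbams_from_children(flazbar):
    """Collect every <flizbam> child; no naming scheme is needed."""
    return [child.text for child in flazbar.findall("flizbam")]

attr_style = ET.fromstring('<Flazbar flizbam="first" flizbam2="second" />')
child_style = ET.fromstring(
    "<Flazbar><flizbam>first</flizbam><flizbam>second</flizbam></Flazbar>")

print(flizbams_from_attributes(attr_style))
print(flizbams_from_children(child_style))
```

The attribute reader has to know the naming convention, and the DTD needs a new ATTLIST entry for every additional flizbam; the subelement reader, and a content model such as (flizbam+), would not change at all.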
If whitespace is important, attributes are no good. An XML parser normalizes attribute values, turning each tab and line break into a space, so the exact spacing is lost. After normalization, you can still count on every token in an attribute being whitespace-separated from its neighbors. Moreover, you can add vertical and horizontal whitespace to long attribute values to improve readability without any problem. But if you are representing something like source code or poetry, where exact spacing matters, stick to element contents.
Ideally, XML should be a format computers read, not one humans read. But--fortunately or unfortunately--programmers are humans too; and for the foreseeable future, we are going to spend a lot of time reading, writing, and debugging XML files. It is positively painful to read XML that is formatted with only machines in mind (no whitespace, or nonsensical whitespace).
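A quick way to watch the whitespace rule in action is to store the same text both ways and read it back. This sketch assumes Python's standard xml.etree.ElementTree module; the poem element and its text attribute are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the same two-line text stored as an attribute
# value and as element content (\n and \t are a real line break and tab).
DOC = '<poem text="roses\n\tare red">roses\n\tare red</poem>'

poem = ET.fromstring(DOC)
attr_value = poem.get("text")   # the parser replaced the line break and tab
content_value = poem.text       # element content keeps them verbatim

print(repr(attr_value))
print(repr(content_value))
```

The words in the attribute are still separated from one another, but the line structure survives only in the element content.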
In my own mind, it is almost always much easier to read and write attribute-oriented XML formats than subelement-oriented ones. Look at the examples above. Neither is horrible to read, but the attribute version is still better. It is better still to write, since you do not need to worry about capricious subelement ordering.
I have pointed to some cases where subelements or attributes are more desirable. Unfortunately, sometimes the real situation falls into multiple cases (pointing in opposite directions). And a lot of times, data designs change enough to invalidate previous motivations. Nonetheless, keeping the principles addressed in mind can lead to clearer and cleaner XML document formats. The main rule, as always, is "use (informed) common sense."
Everything one really needs to know about XML is in the Extensible Markup Language (XML) 1.0 W3C Recommendation. Of course, understanding exactly what this signifies requires some subtlety:
http://www.w3.org/TR/REC-xml
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/.