XML MATTERS #9
Generating DTDs and XML Documents from SQL Queries

David Mertz, Ph.D.
Bricolateur, Gnosis Software, Inc.
March 2001

    SQL queries that extract data from relational databases can
    also be used to provide very practical ad hoc document type
    information for the representation of query results in XML.
    This column discusses the public domain [sql2dtd] and
    [sql2xml] utilities that allow RDBMS-independent generation
    of portable XML result sets.


INTRODUCTION
------------------------------------------------------------------------

  The previous _XML Matters_ column discussed some of the
  advantages and theory underlying various data models.  One
  conclusion of that column was that RDBMS' are here to stay
  (with good reasons), and that XML is best seen, in this
  context, as a means of transporting data -between- various
  DBMS', rather than as something to -replace- DBMS'.  XPATH and
  XSLT are useful for certain "data querying" purposes, but their
  application is far less broad and general than RDBMS'--and SQL,
  in particular--is.

  A number of recent RDBMS' include built-in (or at least
  optional) tools for exporting XML.  These include DB2 and
  Oracle, and probably other RDBMS' also.  However, the tools
  discussed in this column are intended to be generic to any
  particular RDBMS; in particular, the DTDs generated by the
  presented tools will remain identical for the same query
  performed against different RDBMS'.  We hope this will further
  goals of data transparency.


SIMPLIFYING TOO MUCH
------------------------------------------------------------------------

  What might strike you as the most obvious way of converting
  relational database data to XML is also generally a bad idea.
  That is, it would be simple enough--conceptually and
  practically--to do a table-by-table "dump" of all the contents
  of an RDBMS into corresponding XML documents.  For example, a
  LIBRARY database (we will continue the simple example in the
  previous column) might have one table called BOOKPRICE, with
  contents like:

      SELECT * FROM BOOKPRICE;
      +------------+-------+
      | ISBN       | Price |
      +------------+-------+
      | 2994927282 | 34.99 |
      | 3920202049 | 47.50 |
      +------------+-------+

  We could straightforwardly convert this into a document
  'BOOKPRICE.xml' that looked like:

      <?xml version="1.0"?>
      <SQL>
        <row>
          <ISBN>2994927282</ISBN>
          <Price>34.99</Price>
        </row>
        <row>
          <ISBN>3920202049</ISBN>
          <Price>47.50</Price>
        </row>
      </SQL>

  Once we had created similar XML documents for every table
  within a database, we would have a complete snapshot of the
  database.


SIMPLIFYING ONLY ENOUGH
------------------------------------------------------------------------

  There are two basic problems with the approach sketched above.
  The first problem is that our "XML-dump" is hugely inefficient.
  XML, as readers well know, is an extremely verbose format, and
  an "XML-dump" of a large database would be an even larger
  collection of XML documents.  Futhermore, major DBMS' already
  provide compact and efficient flat-file or packed-record style
  dumps.  The XML sketched provides no additional functionality
  to those simple formats, while simply multiplying the size of
  transport files.

  The second problem is more interesting than the first, however.
  In a sense, "DASD is cheap," and bandwidth is also getting
  cheaper ("DASD" is an old IBM way of saying "hard disks"; and
  the phrase is a design cliche).  So mere efficiency is not
  necessarily overriding.  What is more important is that the
  data one wants to communicate with partners/departments/users/etc.
  is rarely the entire contents of a database.  Sometimes some
  contents are private; and almost all the time, most contents
  are simply not relevant to a particular recipient.

  What would be more interesting than dumping raw tables to XML
  would be dumping the results of useful SQL queries to XML.
  What would be still better would be to associate DTDs with
  these useful queries in order to regularize expectations about
  the exact parsing and processing required for XML transactions
  or feeds.  These two interesting actions are precisely what
  the public domain Python utilities [sql2dtd] and [sql2xml] let
  you accomplish.

  Suppose that A and B each have their own internal data storage
  strategies (e.g. in different RDBMS').  They each maintain all
  sort of related information which is not relevant to the
  interaction between A and B, but they also have some
  information they'd like to share.  Suppose, on these lines,
  that A needs to communicate a particular kind of data set to B
  on a recurrent basis.  One thing A and B can do is agree that A
  will periodically send to B a set of XML documents, each of
  which will conform to a DTD agreed to in advance.  The specific
  data in one transmission will vary with time, but the validity
  rules have been specified in advance.  Both A and B can carry
  out their programming, knowing what the protocol will look
  like.


GENERATING THE DTD'S
------------------------------------------------------------------------

  One way to develop the communication between A and B
  contemplated above is to develop DTDs (or Schemas) that match
  the specific needs of A and B. Then A will need to develop
  custom code to export data into the agreed upon DTDs from A's
  current RDBMS; and B will need to develop custom code to import
  the same data (into a differently structured database).  Then,
  finally, the communication channel can be opened.

  There is likely to be a quicker way, however, and a way that is
  likely to leverage existing export/import procedures.  The
  Standard Query Language (SQL) is a wonderfully compact and
  expressive means of expressing exactly what data within an
  RDBMS database you are interested in.  Trying to bolt on XML
  native techniques like XPATH or XSLT to a relational model will
  probably feel unnatural, although they can certainly express
  querying functions within XML's basically hierarchical model.

  Moreover, many organizations have already developed well-tested
  sets of SQL statements that are relevant to achieving known
  tasks.  Often, in fact, RDBMS' provide means for optimizing
  stored queries.  While there are certainly cases, as well,
  where designing rich DTDs for data exchanges makes sense, in
  many or most cases, a good solution can be to use the
  structuring information implicit in SQL queries as an
  (automatic) basis for XML data transmissions.

  While SQL queries can combine table data in complex ways, the
  -result- from any SQL query is a rather simple row-and-column
  arrangement.  Query output has a fixed number of fixed columns,
  with each row filling in values for every column.  The
  potential of XML to represent complex nesting patterns of
  elements is just simply not going to be deeply exercised in
  representing SQL results.  However, there are nonetheless
  several important aspects of an SQL query that can and should
  be represented in an XML DTD beyond simply row/column
  positions.


WHAT TO REPRESENT FROM AN SQL QUERY?
------------------------------------------------------------------------

  I would argue that in our contemplated data exchange between A
  and B, two aspects should be separated out.  On the one hand,
  there is the internal organization of A's data--its
  normalization, and denormalized optimizations, for example.  B
  does not, and need not, be concerned about the A-internal
  aspect.  On the other hand, there is meta-data that importantly
  describes what is actually transmitted.  Of course, separating
  these aspects is not necessarily easy.

  In writing [sql2dtd] (and in helping plan Scott Hathaway's
  [sql2xml]), I made several decisions about what is part of a
  data transmission, and what is internal to the sender's setup
  (and need not be represented in the DTD).  A sample XML (with
  the DTD as an internal set) output can help illustrate these
  decisions (this output is generated entirely from the SQL query
  contained as an attribute; when run against a suitable
  database, of course):

      <?xml version="1.0"?>
      <!DOCTYPE SQL [
      <!ELEMENT SQL (row)*>
      <!ATTLIST SQL
        GROUP_BY NMTOKEN #FIXED "AuthID"
        query CDATA #FIXED "SELECT AuthID AS SSN,COUNT(GroupID)
                            FROM AUTHGROUP GROUP BY AuthID
                            ORDER BY AuthID"
      >
      <!ELEMENT row (SSN, column2)>
      <!ATTLIST row num ID #IMPLIED>
      <!ELEMENT SSN (#PCDATA)>
      <!ELEMENT column2 (#PCDATA)>
      <!ATTLIST column2 CALC CDATA #FIXED "COUNT(GroupID)">
      ]>

      <SQL>
        <row num="1">
          <SSN>111-22-3333</SSN>
          <column2>1</column2>
        </row>
        <row num="2">
          <SSN>333-22-4444</SSN>
          <column2>2</column2>
        </row>
        <row num="3">
          <SSN>666-44-5555</SSN>
          <column2>1</column2>
        </row>
      </SQL>

  This simple XML document may actually contain more meta-data
  than one notices initially.  Of course, by including the SQL
  itself at an attribute of the root node, one can reconstruct
  -anything- implicit in the SQL.  But doing that requires
  re-parsing the SQL, which [sql2dtd] has represented in the
  document, so it is not generally necessary.

  The fact that the XML contains calculated elements is contained
  in the specification of the 'CALC' attribute for any such
  element.  Since a calculated expression can be long, and
  contain characters illegal for XML tags, calculated columns are
  named simply by their position.  However, the particular
  calculation that went into the element contents is contained as
  an attribute of the tag.  In order to avoid repeating the
  attribute throughout the XML body, it is specified as '#FIXED'
  in the DTD.

  Frequently, if calculated columns are used, the calculation
  reflects grouping of columns with the "GROUP BY" modifier.  Any
  such groupings are listed in the 'GROUP_BY' attribute of the root
  element.

  Furthermore, if an "ORDER BY" clause is used, each '<row>' tag
  carries a 'num' attribute that specifies the specific sequence of
  output data.  However, if a result set is unordered, no 'num'
  attribute is used.

  Let us consider what is *not* represented within the DTD, and
  see that it really does pertain to A's internal data
  representation, not to the transmitted message.

  Beyond the embedded original SQL query, no representation is
  kept of the table or tables that were queried for data (the
  "FROM" clause).  The particular table organization is simply
  not something B needs to be interested in.  In fact, A may very
  well modify their database design after the transmission
  protocol is in place; but B need not worry about that as long
  as the same fields (columns) are extracted by some means.  In
  particular, since the "AS" clause overrides the actual table
  column-name, it is possible to continue sending a XML element
  that no longer has any direct literal meaning in A's database.

  Most importantly in the design of [sql2dtd], "WHERE" and
  "HAVING" clauses are ignored (and also JOIN, DISTINCT and ALL
  modifiers).  As with table names the particular joins and
  filters necessary to get the data out of A's tables just is not
  something B should have to worry about.  If A happens to need
  to join together a few tables to get some datum, that is simply
  a normalization strategy by A. B may or may not use any
  analogous strategy (for a different data subset), and either
  way doe not care what A does.  Filters (mostly using the
  "WHERE" clause or "DISTINCT" modifier) are ignored for a
  related, but slightly different reason.  If for whatever
  business reasons, A only needs to inform B about those woozles
  whose whatzit is more than 25, from B's perspective that is
  just in the nature of woozles.  That is A may be interested in
  a subclass of woozles B does not care about, but the specific
  fact that A needed to filter to get the interesting ones (as
  opposed to not having them, or putting them in different
  tables) is not B's worry.  Sub-selects, in this respect, are
  just another kind of filter.


WRAPPING UP
------------------------------------------------------------------------

  We have not seen anything on the specific usage of [sql2dtd]
  and [sql2xml].  Not much is needed here, since both are well
  documented internally.  In general, [sql2dtd] can generate the
  DTD from an SQL query, but does not itself query any database.
  [sql2xml] peforms queries via ODBC, and optionally utilizes
  [sql2dtd] to get a DTD (or it can generate DTD-less XML
  documents).

  What these tools help with is only approximately half the
  process contemplated between A and B. They can quickly arrive
  at DTDs using these tools, and A can equally quickly generate
  the output XML documents conforming with these DTDs. But B, at
  its end, still needs to do all the work involved in parsing,
  storing and processing these received documents.  Later columns
  will discuss B's job in some more detail.


RESOURCES
------------------------------------------------------------------------

  My fellow XML Zone columnist, Uche Ogbuji has written an
  interesting article covering somewhat different elements of
  RDBMS/XML connections for _LinuxWorld_.  As well as writing a
  good article, Uche includes an extensive list of relevant
  resources:

    http://www.linuxworld.com/lw-2001-02/lw-02-xml3databases.html

  My [sql2dtd] module can be obtained at:

    http://gnosis.cx/download/sql2dtd.py

  Scott Hathaway's related [sql2xml] can be found at:

    http://members.home.net/slhath/downloads.html

  Matt Sergeant's [DBIx::XML_RDB] is a Perl module whose purpose is
  substantially similar to [sql2xml].  However, the XML documents
  generated by [DBIx::XML_RDB] lack an accompanying DTD:

    http://theoryx5.uwinnipeg.ca/CPAN/data/DBIx-XML_RDB/XML_RDB.html


ABOUT THE AUTHOR
------------------------------------------------------------------------

  {Picture of Author:  http://gnosis.cx/cgi-bin/img_dqm.cgi}
  David Mertz, as a child had a quite large collection of Legos
  (or at least so it seemed).  He donated them therefrom to
  Goodwill, perhaps just a bit earlier in childhood than he
  actually bored of them.  But therein arises his love of small
  programming components.  David may be reached at
  mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/.
  Suggestions and recommendations on this, past, or future,
  columns are welcomed.