XML MATTERS #12: The [xml2sql] Module
Generating SQL Statements from DTDs and XML Documents

David Mertz, Ph.D.
Format Manipulator, Gnosis Software, Inc.
May 2001

    An ealier column explored the generation of XML documents
    from SQL queries.  The reverse direction is equally possible,
    but with its own set of contraints and complications.  That
    is, how might one go about translating an XML document and
    DTD back into an RDBMS storage format.  The public domain
    utilities 'xml2sql' and 'dtd2sql' discussed by the author
    generate SQL statements to create and fill a database, in a
    consistent and reversible fashion.


INTRODUCTION
------------------------------------------------------------------------

  Some earlier columns have discussed various data models, and
  how XML fits into the picture.  In its simplest form, my
  earlier (and continuing) conclusion was that both XML documents
  and RDBMS's are going to be around for a good while longer.
  This column, and its discussed utilities, continue my efforts
  to make data and format transitions a little bit easier for
  developers.  By using [dtd2sql] and [xml2sql], programmers can
  -automatically- move the contents of XML documents into SQL
  databases (and get the information back out later).

  *Caveats*.  The utilities presented--here and in the
  earlier discussion of [sql2dtd] and [sql2xml]--will, hopefully,
  ease the work of developers looking for quick and ad hoc
  transformations between XML and SQL.  But do not assume that
  these utilities substitute for database analysts.  Optimization,
  normalization, and de-normalization are complex matters
  requiring knowledge and experience.  The tools presented
  provide good, workable, and easily deployed solutions; but
  these solutions are only sometimes substitutes for custom
  development.

  Moreover, [xml2sql] and [dtd2sql] work best with XML documents
  that contain table-oriented information to start with.  They do
  a far less good job of transforming prose-oriented and
  linear-oriented XML documents.  The limits imposed in this
  regard are, essentially, identical to the limits imposed by the
  [xml_objectify] library that [xml2sql] utilizes.  However, this
  particular focus is not really so much of a limitation:  there
  is really no way that -any- technique--even custom
  development--can produce a very natural representation of
  prose- and linear-data within an RDBMS framework.  The models
  are just different, and [xml2sql] does about as good job as one
  can.

  Finally, [xml2sql] is not likely to be particularly fast or
  efficient, nor is it designed to be.  The emphasis with this
  utility is with portability and generality.  Therefore, rather
  than call any database library--whether for a specific RDBMS or
  via a mechanism like ODBC--[xml2sql] and [dtd2sql] create
  simple textual SQL statements.  [dtd2sql] generates a list of
  'CREATE TABLE' statements; [xml2sql] generates a list of
  'INSERT INTO' statements.  It's up to the user or programmer to
  feed these statements into an actual RDBMS.  But a beauty of
  the arrangement is that these statements can be fed into any
  vendor's RDBMS equally well.


DIRTY HANDS
------------------------------------------------------------------------

  The modules [xml2sql] and [dtd2sql] are probably best
  introduced with their practical usage.  We'll circle back to
  some theoretical issues in their design.

  For purposes of testing, I created a simple testing script.
  For development purposes, I used the popular open-source RDBMS
  'mySQL'.  While 'mySQL' as some limitations compared to more
  sophisticated RDBMS's, it is a nice test platform.  The sample
  DTD and XML document belong to an IBM tutorial I wrote (see
  Resources for the collection).  My script is an OS/2 command
  file, but it should work the same under Windows, and with only
  slight modification under a Unix-like system:

      #------ Test script for transferring XML to SQL ---------#
      echo drop database test;       > test.sql
      echo create database test;    >> test.sql
      echo use test;                >> test.sql
      python dtd2sql.py dwtut.dtd   >> test.sql
      python xml2sql.py haskell.xml >> test.sql
      mysql -u root -pPASSWORD test  < test.sql

  The first few lines simply wipe out and restore the 'test'
  database in 'mySQL'.  In real-life, you would rarely want to
  'drop' an existing database, but merely 'INSERT' and 'DELETE'
  from tables.  But one likes to start fresh when testing.

  When [dtd2sql] is run, it takes a DTD on either STDIN or from
  the filename given on the command-line.  This DTD can be the
  internal subset of an XML document, if you wish; or it can be
  an external file.  However, the tool is limited to reading from
  a single source at a time, and will not currently handle
  complex multi-file DTD's, parametric entities, or an internal
  subset that partially overrides and external definition.  As
  mentioned, [dtd2sql] simply produces a set of 'CREATE TABLE'
  statements.

  Let's look at one output line from [dtd2sql], and pick apart
  its pieces (some wrapping added for presentation):

      #----- Sample CREATE TABLE statement from [dtd2sql] -----#
      CREATE TABLE a (
          primary_key BIGINT UNSIGNED PRIMARY KEY,
          seq INT UNSIGNED,
          href BLOB,
          PCDATA BLOB,
          _XML BLOB,
          foreign_key_p BIGINT UNSIGNED,
          foreign_key_li BIGINT UNSIGNED,
          foreign_key_prompt BIGINT UNSIGNED,
          foreign_key_response BIGINT UNSIGNED
      );

  Some of the column names are easily explained by the definition
  of the 'a' element itself; for others we need to look a little
  deeper.

      #------------ DTD entry for <a> XML element -------------#
      <!--A hyperlink to some other resource.-->
      <!ELEMENT a (#PCDATA | code)* >
      <!ATTLIST a href CDATA #REQUIRED  >

  The 'primary_key' column is created in every table.  So is the
  'seq' column, although this latter is not always used for
  anything meaningful.  The 'href' column comes straight from the
  identically named XML tag attribute.  And the 'PCDATA' and
  '_XML' columns are places to hold the actual element contents
  (either with or without any embedded character-level markup).

  The most interesting thing in the 'CREATE TABLE' statement
  shown is the several 'foreign_key_*' columns.  Let's consider
  those.


CREATING RELATIONS
------------------------------------------------------------------------

  According to the relational model, different tables are
  connected to each other via a primary key/foreign key identity.
  A JOIN in SQL is simply a way of saying that a field in one
  table must correspond to another field in a different table.  A
  -primary key- is a special thing in the relational model:  it
  must be unique to each record (row) of a table.  Most of the
  time, a database analyst will look at the deep structure of the
  data, and figure out--in consultation with application
  programmers and end users--what the best candidates are to
  serve as primary keys.  These keys can be a concatenation of
  multiple columns, and often identifiers like social-security
  numbers, employee IDs, ISBNs, or part IDs are used in these
  roles.

  Obviously enough, [dtd2sql] cannot do all the background
  research that a database analyst does.  All it has is a DTD (or
  maybe, potentially an XML document).  On that basis, there is
  no real way of determining which attributes or element contents
  will be unique (if any).  Fortunately, [dtd2sql] can take the
  road of some commercial RDBMS's, and simply eshew "natural"
  primary keys.  Instead, we are able to choose wholly artificial
  primary keys--ones that isolate the uniqueness requirement,
  while refusing any data representation role.  I would argue
  that such a pattern actually achieves better orthogonality in
  database design than does the more common strategy of
  identifying suitable "real world" data.  And the approach has
  the additional advantage of giving every table's primary key
  an identical and predictable name and format.

  The primary keys used are random 18-digit integers.  Assuming
  Python's [random] module is pretty good, the risk of collision
  is pretty small.  However, no strict guarantee of non-collision
  is created in the code (maybe in a later version).  So far, so
  good.

  The next step is making these primary keys usable for purposes
  of SQL JOINs. To do that, we need to make sure that when an XML
  element can contain a subelement, the subelement contains a
  foreign key corresponding to the parent's primary key.
  Admittedly, the most parsimonious manner of achieving the above
  goal is to create a single 'foreign_key' column for every
  non-root XML element.  In that case, the SQL user who queried
  the resultant database would need to -know- what JOINs would
  produce results (say by reading the original DTD).

  Throwing away parsimony, I opted for explicitness.  I create a
  separate 'foreign_key_*' column for every element that -might-
  be a parent of the XML element a table corresponds to.  So in
  the above 'CREATE TABLE' example, [dtd2sql] identified these
  DTD element definitions:

      #---------- Elements that might be parents of <a> -------#
      <!ELEMENT p        (#PCDATA | code | img | br | i | b | a)* >
      <!ELEMENT li       (#PCDATA | code | img | br | i | b | a)* >
      <!ELEMENT prompt   (#PCDATA | code | img | br | i | b | a)* >
      <!ELEMENT response (#PCDATA | code | img | br | i | b | a)* >

  An advantage of the explicit approach of [dtd2sql] is that the
  table structures created inherently contain a large part of the
  information in the DTD (but not all of it, since quantifiers
  are not differentiated thereby).


WALKING THE WINDING ROADS
------------------------------------------------------------------------

  Putting data into the tables created by [dtd2sql] is the job of
  [xml2sql].  Well, technically neither tool actually puts any
  data anywhere, but just specifies what the data is.  You need
  to use the tools that come with your RDBMS to actually load the
  data.

  The truth is that [xml2sql] does remarkably little.  It has
  less than 50 lines of code at its heart (the 'walkNodes()'
  function).  And even those few lines are well documented and
  make no effort to achieve conciseness through programming
  tricks.  Of course, most of what [xml2sql] does is actually
  done by [xml_objectify].  The first step for [xml2sql] is to
  create a "Pythonic" object using [xml_objectify].  After that,
  it is simple to recurse through all the nested attributes,
  outputting "INSERT INTO" SQL statements as it goes.  Users of
  an older version of [xml_objectify], however, will need to grab
  the latest one because of a slight change in the way XML
  element names are "mangled" along the way.

  Once you run [xml2sql] you get a bunch of SQL statments in
  return.  This collection of statments can be redirected and
  piped according to normal STDOUT behavior, which makes
  combining [xml2sql] with RDBMS command-line tools
  straightforward.  If you want to use [xml2sql] as a support
  module, you can get the set of SQL statements as a Python list
  (which might be handy for use with some database module).  A
  typical produced statement looks like (wrapped for
  presentation):

      INSERT INTO p
             (primary_key, seq, foreign_key_text__column, PCDATA)
      VALUES (15447926390024014, 0, 527610371062647168,
              "Navigating through the tutorial is easy:");

  As described above, we can see from the form of this "INSERT
  INTO" that a table had been created to correspond with the
  '<p>' element in our DTD.  Actually, the element is only really
  known to occur in the XML document; validation of the XML
  document against the DTD is a job that needs to be handled
  outside of these modules.  But assuming the XML document is
  valid, the right tables and column were created by [dtd2sql].

  We can also see in the "INSERT INTO" that this particular '<p>'
  tag is nested inside a '<text-column>' element (some name
  mangling is needed to get valid SQL column names).  To wit,
  that element that has the primary key '527610371062647168'.  It
  turns out also that this '<p>' element has some PCDATA
  contents; and that it has 'seq' column value of zero.  The
  meaning of that list bit is that this '<p>' element is all
  alone within its container; had multiple '<p>' elements occured
  in the same '<text-column>' they would be sequenced, beginning
  with one.


PUTTING IT TOGETHER
------------------------------------------------------------------------

  Once we have a bunch of data inside an RDBMS, we usually want
  to get it back out in structured and useful ways.  Fortunately,
  with a basic understanding of the primary key and foreign key
  strategy employed, you can find anything you want.  In fact, in
  many ways, the flexibility you have at this point is -better-
  than you would have had with an XPATH query syntax.  Let's look
  at an example:

      #------------ Selecting data from the RDBMS -------------#
      SELECT "Paragraph", p.seq, p._XML
        FROM title,panel,body,text__column TC,p
       WHERE title.foreign_key_panel = panel.primary_key
         AND body.foreign_key_panel = panel.primary_key
         AND TC.foreign_key_body = body.primary_key
         AND p.foreign_key_text__column = TC.primary_key
         AND title.PCDATA="About Haskell"
       ORDER BY p.seq
      ;

  A word or two of explanation is worthwhile.  The JOINs are all
  of the same form:  a 'foreign_key_X' field is JOIN'ed with the
  'primary_key' of some table 'X'.  Once all the joins are in
  place, you can add a more substantative condition, along with
  ORDER'ing, GROUP'ing, and the like.  In this case we want to
  look at all the paragraphs ('<p>' elements) of the '<panel>'
  whose '<title>' is "About Haskell."  The result looks something
  like the below:

      #------------- SQL query against XML tutorial -----------#
      C:\mysql2\bin>mysql -u root -pgnosis test < haskell.sql
      Paragraph  seq  _XML
      Paragraph  1    Haskell is just one of a number of functional programming...
      Paragraph  2    Among functional languages, Haskell is in many ways the...
      Paragraph  3    On a minor note, Haskell is syntactically easier to get...


CONCLUSION
------------------------------------------------------------------------

  This column has illustrated the command-line usage of [dtd2sql]
  and [xml2sql].  For quick testing and ad hoc shell usage, this
  is probably the approach you will want to use.  However, as
  with most things in Python, it is very simple to reuse the
  modules within your own code.  The self-test code (the
  command-line usage) provides a straightforward template for any
  importing module to follow.  I look forward to hearing about
  the wonderful uses readers manage to put these modules to--as
  they have done for so many others.


RESOURCES
------------------------------------------------------------------------

  The modules discussed in this article may be downloaded from:

    http://gnosis.cx/download/dtd2sql.py

  and:

    http://gnosis.cx/download/xml2sql.py

  An archive of the supporting and data files used in this
  article can be found at:

    http://gnosis.cx/download/xml_matters_12.zip

  The support module [xml_objectify] can be found at:

    http://gnosis.cx/download/xml_objectify.py

  In general, the Gnosis Software download directory contains a
  variety of software I have produced, mostly for IBM
  developerWorks columns and articles.  Often various versions of
  a particular software module are available in the directory,
  both the latest-and-greatest and earlier ones.  Take a look at:

    http://gnosis.cx/download/


ABOUT THE AUTHOR
------------------------------------------------------------------------

  {Picture of Author:  http://gnosis.cx/cgi-bin/img_dqm.cgi}
  David Mertz believes that without data, life itself would be
  impossible... well, OK, at least the internet would get less
  hype.  David may be reached at mertz@gnosis.cx; his life pored
  over at http://gnosis.cx/publish/.  Suggestions and
  recommendations on this, past, or future, columns are welcomed.