xml2sql
ModuleDavid Mertz, Ph.D.
Format Manipulator, Gnosis Software, Inc.
May 2001
An ealier column explored the generation of XML documents from SQL queries. The reverse direction is equally possible, but with its own set of contraints and complications. That is, how might one go about translating an XML document and DTD back into an RDBMS storage format. The public domain utilitiesxml2sql
anddtd2sql
discussed by the author generate SQL statements to create and fill a database, in a consistent and reversible fashion.
Some earlier columns have discussed various data models, and
how XML fits into the picture. In its simplest form, my
earlier (and continuing) conclusion was that both XML documents
and RDBMS's are going to be around for a good while longer.
This column, and its discussed utilities, continue my efforts
to make data and format transitions a little bit easier for
developers. By using dtd2sql
and xml2sql
, programmers can
automatically move the contents of XML documents into SQL
databases (and get the information back out later).
Caveats. The utilities presented--here and in the
earlier discussion of sql2dtd
and sql2xml
--will, hopefully,
ease the work of developers looking for quick and ad hoc
transformations between XML and SQL. But do not assume that
these utilities substitute for database analysts. Optimization,
normalization, and de-normalization are complex matters
requiring knowledge and experience. The tools presented
provide good, workable, and easily deployed solutions; but
these solutions are only sometimes substitutes for custom
development.
Moreover, xml2sql
and dtd2sql
work best with XML documents
that contain table-oriented information to start with. They do
a far less good job of transforming prose-oriented and
linear-oriented XML documents. The limits imposed in this
regard are, essentially, identical to the limits imposed by the
xml_objectify
library that xml2sql
utilizes. However, this
particular focus is not really so much of a limitation: there
is really no way that any technique--even custom
development--can produce a very natural representation of
prose- and linear-data within an RDBMS framework. The models
are just different, and xml2sql
does about as good job as one
can.
Finally, xml2sql
is not likely to be particularly fast or
efficient, nor is it designed to be. The emphasis with this
utility is with portability and generality. Therefore, rather
than call any database library--whether for a specific RDBMS or
via a mechanism like ODBC--[xml2sql] and dtd2sql
create
simple textual SQL statements. dtd2sql
generates a list of
CREATE TABLE
statements; xml2sql
generates a list of
INSERT INTO
statements. It's up to the user or programmer to
feed these statements into an actual RDBMS. But a beauty of
the arrangement is that these statements can be fed into any
vendor's RDBMS equally well.
The modules xml2sql
and dtd2sql
are probably best
introduced with their practical usage. We'll circle back to
some theoretical issues in their design.
For purposes of testing, I created a simple testing script.
For development purposes, I used the popular open-source RDBMS
mySQL
. While mySQL
as some limitations compared to more
sophisticated RDBMS's, it is a nice test platform. The sample
DTD and XML document belong to an IBM tutorial I wrote (see
Resources for the collection). My script is an OS/2 command
file, but it should work the same under Windows, and with only
slight modification under a Unix-like system:
echo drop database test; > test.sql echo create database test; >> test.sql echo use test; >> test.sql python dtd2sql.py dwtut.dtd >> test.sql python xml2sql.py haskell.xml >> test.sql mysql -u root -pPASSWORD test < test.sql
The first few lines simply wipe out and restore the test
database in mySQL
. In real-life, you would rarely want to
drop
an existing database, but merely INSERT
and DELETE
from tables. But one likes to start fresh when testing.
When dtd2sql
is run, it takes a DTD on either STDIN or from
the filename given on the command-line. This DTD can be the
internal subset of an XML document, if you wish; or it can be
an external file. However, the tool is limited to reading from
a single source at a time, and will not currently handle
complex multi-file DTD's, parametric entities, or an internal
subset that partially overrides and external definition. As
mentioned, dtd2sql
simply produces a set of CREATE TABLE
statements.
Let's look at one output line from dtd2sql
, and pick apart
its pieces (some wrapping added for presentation):
CREATE TABLE a ( primary_key BIGINT UNSIGNED PRIMARY KEY, seq INT UNSIGNED, href BLOB, PCDATA BLOB, _XML BLOB, foreign_key_p BIGINT UNSIGNED, foreign_key_li BIGINT UNSIGNED, foreign_key_prompt BIGINT UNSIGNED, foreign_key_response BIGINT UNSIGNED );
Some of the column names are easily explained by the definition
of the a
element itself; for others we need to look a little
deeper.
<!--A hyperlink to some other resource.--> <!ELEMENT a (#PCDATA | code)* > <!ATTLIST a href CDATA #REQUIRED >
The primary_key
column is created in every table. So is the
seq
column, although this latter is not always used for
anything meaningful. The href
column comes straight from the
identically named XML tag attribute. And the PCDATA
and
_XML
columns are places to hold the actual element contents
(either with or without any embedded character-level markup).
The most interesting thing in the CREATE TABLE
statement
shown is the several foreign_key_*
columns. Let's consider
those.
According to the relational model, different tables are connected to each other via a primary key/foreign key identity. A JOIN in SQL is simply a way of saying that a field in one table must correspond to another field in a different table. A primary key is a special thing in the relational model: it must be unique to each record (row) of a table. Most of the time, a database analyst will look at the deep structure of the data, and figure out--in consultation with application programmers and end users--what the best candidates are to serve as primary keys. These keys can be a concatenation of multiple columns, and often identifiers like social-security numbers, employee IDs, ISBNs, or part IDs are used in these roles.
Obviously enough, dtd2sql
cannot do all the background
research that a database analyst does. All it has is a DTD (or
maybe, potentially an XML document). On that basis, there is
no real way of determining which attributes or element contents
will be unique (if any). Fortunately, dtd2sql
can take the
road of some commercial RDBMS's, and simply eshew "natural"
primary keys. Instead, we are able to choose wholly artificial
primary keys--ones that isolate the uniqueness requirement,
while refusing any data representation role. I would argue
that such a pattern actually achieves better orthogonality in
database design than does the more common strategy of
identifying suitable "real world" data. And the approach has
the additional advantage of giving every table's primary key
an identical and predictable name and format.
The primary keys used are random 18-digit integers. Assuming
Python's random
module is pretty good, the risk of collision
is pretty small. However, no strict guarantee of non-collision
is created in the code (maybe in a later version). So far, so
good.
The next step is making these primary keys usable for purposes
of SQL JOINs. To do that, we need to make sure that when an XML
element can contain a subelement, the subelement contains a
foreign key corresponding to the parent's primary key.
Admittedly, the most parsimonious manner of achieving the above
goal is to create a single foreign_key
column for every
non-root XML element. In that case, the SQL user who queried
the resultant database would need to know what JOINs would
produce results (say by reading the original DTD).
Throwing away parsimony, I opted for explicitness. I create a
separate foreign_key_*
column for every element that might
be a parent of the XML element a table corresponds to. So in
the above CREATE TABLE
example, dtd2sql
identified these
DTD element definitions:
<!ELEMENT p (#PCDATA | code | img | br | i | b | a)* > <!ELEMENT li (#PCDATA | code | img | br | i | b | a)* > <!ELEMENT prompt (#PCDATA | code | img | br | i | b | a)* > <!ELEMENT response (#PCDATA | code | img | br | i | b | a)* >
An advantage of the explicit approach of dtd2sql
is that the
table structures created inherently contain a large part of the
information in the DTD (but not all of it, since quantifiers
are not differentiated thereby).
Putting data into the tables created by dtd2sql
is the job of
xml2sql
. Well, technically neither tool actually puts any
data anywhere, but just specifies what the data is. You need
to use the tools that come with your RDBMS to actually load the
data.
The truth is that xml2sql
does remarkably little. It has
less than 50 lines of code at its heart (the walkNodes()
function). And even those few lines are well documented and
make no effort to achieve conciseness through programming
tricks. Of course, most of what xml2sql
does is actually
done by xml_objectify
. The first step for xml2sql
is to
create a "Pythonic" object using xml_objectify
. After that,
it is simple to recurse through all the nested attributes,
outputting "INSERT INTO" SQL statements as it goes. Users of
an older version of xml_objectify
, however, will need to grab
the latest one because of a slight change in the way XML
element names are "mangled" along the way.
Once you run xml2sql
you get a bunch of SQL statments in
return. This collection of statments can be redirected and
piped according to normal STDOUT behavior, which makes
combining xml2sql
with RDBMS command-line tools
straightforward. If you want to use xml2sql
as a support
module, you can get the set of SQL statements as a Python list
(which might be handy for use with some database module). A
typical produced statement looks like (wrapped for
presentation):
INSERT INTO p (primary_key, seq, foreign_key_text__column, PCDATA) VALUES (15447926390024014, 0, 527610371062647168, "Navigating through the tutorial is easy:");
As described above, we can see from the form of this "INSERT
INTO" that a table had been created to correspond with the
<p>
element in our DTD. Actually, the element is only really
known to occur in the XML document; validation of the XML
document against the DTD is a job that needs to be handled
outside of these modules. But assuming the XML document is
valid, the right tables and column were created by dtd2sql
.
We can also see in the "INSERT INTO" that this particular <p>
tag is nested inside a <text-column>
element (some name
mangling is needed to get valid SQL column names). To wit,
that element that has the primary key 527610371062647168
. It
turns out also that this <p>
element has some PCDATA
contents; and that it has seq
column value of zero. The
meaning of that list bit is that this <p>
element is all
alone within its container; had multiple <p>
elements occured
in the same <text-column>
they would be sequenced, beginning
with one.
Once we have a bunch of data inside an RDBMS, we usually want to get it back out in structured and useful ways. Fortunately, with a basic understanding of the primary key and foreign key strategy employed, you can find anything you want. In fact, in many ways, the flexibility you have at this point is better than you would have had with an XPATH query syntax. Let's look at an example:
SELECT "Paragraph", p.seq, p._XML FROM title,panel,body,text__column TC,p WHERE title.foreign_key_panel = panel.primary_key AND body.foreign_key_panel = panel.primary_key AND TC.foreign_key_body = body.primary_key AND p.foreign_key_text__column = TC.primary_key AND title.PCDATA="About Haskell" ORDER BY p.seq ;
A word or two of explanation is worthwhile. The JOINs are all
of the same form: a foreign_key_X
field is JOIN'ed with the
primary_key
of some table X
. Once all the joins are in
place, you can add a more substantative condition, along with
ORDER'ing, GROUP'ing, and the like. In this case we want to
look at all the paragraphs (<p>
elements) of the <panel>
whose <title>
is "About Haskell." The result looks something
like the below:
C:\mysql2\bin>mysql -u root -pgnosis test < haskell.sql Paragraph seq _XML Paragraph 1 Haskell is just one of a number of functional programming... Paragraph 2 Among functional languages, Haskell is in many ways the... Paragraph 3 On a minor note, Haskell is syntactically easier to get...
This column has illustrated the command-line usage of dtd2sql
and xml2sql
. For quick testing and ad hoc shell usage, this
is probably the approach you will want to use. However, as
with most things in Python, it is very simple to reuse the
modules within your own code. The self-test code (the
command-line usage) provides a straightforward template for any
importing module to follow. I look forward to hearing about
the wonderful uses readers manage to put these modules to--as
they have done for so many others.
The modules discussed in this article may be downloaded from:
http://gnosis.cx/download/dtd2sql.py
and:
http://gnosis.cx/download/xml2sql.py
An archive of the supporting and data files used in this article can be found at:
http://gnosis.cx/download/xml_matters_12.zip
The support module xml_objectify
can be found at:
http://gnosis.cx/download/xml_objectify.py
In general, the Gnosis Software download directory contains a variety of software I have produced, mostly for IBM developerWorks columns and articles. Often various versions of a particular software module are available in the directory, both the latest-and-greatest and earlier ones. Take a look at:
http://gnosis.cx/download/
David Mertz believes that without data, life itself would be impossible... well, OK, at least the internet would get less hype. David may be reached at [email protected]; his life pored over athttp://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.