-------#
An advantage of the explicit approach of [dtd2sql] is that the
table structures created inherently contain a large part of the
information in the DTD (but not all of it, since quantifiers
are not differentiated thereby).
WALKING THE WINDING ROADS
------------------------------------------------------------------------
Putting data into the tables created by [dtd2sql] is the job of
[xml2sql]. Well, technically neither tool actually puts any
data anywhere, but just specifies what the data is. You need
to use the tools that come with your RDBMS to actually load the
data.
The truth is that [xml2sql] does remarkably little. It has
less than 50 lines of code at its heart (the 'walkNodes()'
function). And even those few lines are well documented and
make no effort to achieve conciseness through programming
tricks. Of course, most of what [xml2sql] does is actually
done by [xml_objectify]. The first step for [xml2sql] is to
create a "Pythonic" object using [xml_objectify]. After that,
it is simple to recurse through all the nested attributes,
outputting "INSERT INTO" SQL statements as it goes. Users of
an older version of [xml_objectify], however, will need to grab
the latest one because of a slight change in the way XML
element names are "mangled" along the way.
Once you run [xml2sql] you get a bunch of SQL statments in
return. This collection of statments can be redirected and
piped according to normal STDOUT behavior, which makes
combining [xml2sql] with RDBMS command-line tools
straightforward. If you want to use [xml2sql] as a support
module, you can get the set of SQL statements as a Python list
(which might be handy for use with some database module). A
typical produced statement looks like (wrapped for
presentation):
INSERT INTO p
(primary_key, seq, foreign_key_text__column, PCDATA)
VALUES (15447926390024014, 0, 527610371062647168,
"Navigating through the tutorial is easy:");
As described above, we can see from the form of this "INSERT
INTO" that a table had been created to correspond with the
'' element in our DTD. Actually, the element is only really
known to occur in the XML document; validation of the XML
document against the DTD is a job that needs to be handled
outside of these modules. But assuming the XML document is
valid, the right tables and column were created by [dtd2sql].
We can also see in the "INSERT INTO" that this particular '
'
tag is nested inside a '' element (some name
mangling is needed to get valid SQL column names). To wit,
that element that has the primary key '527610371062647168'. It
turns out also that this '' element has some PCDATA
contents; and that it has 'seq' column value of zero. The
meaning of that list bit is that this '
' element is all
alone within its container; had multiple '
' elements occured
in the same '' they would be sequenced, beginning
with one.
PUTTING IT TOGETHER
------------------------------------------------------------------------
Once we have a bunch of data inside an RDBMS, we usually want
to get it back out in structured and useful ways. Fortunately,
with a basic understanding of the primary key and foreign key
strategy employed, you can find anything you want. In fact, in
many ways, the flexibility you have at this point is -better-
than you would have had with an XPATH query syntax. Let's look
at an example:
#------------ Selecting data from the RDBMS -------------#
SELECT "Paragraph", p.seq, p._XML
FROM title,panel,body,text__column TC,p
WHERE title.foreign_key_panel = panel.primary_key
AND body.foreign_key_panel = panel.primary_key
AND TC.foreign_key_body = body.primary_key
AND p.foreign_key_text__column = TC.primary_key
AND title.PCDATA="About Haskell"
ORDER BY p.seq
;
A word or two of explanation is worthwhile. The JOINs are all
of the same form: a 'foreign_key_X' field is JOIN'ed with the
'primary_key' of some table 'X'. Once all the joins are in
place, you can add a more substantative condition, along with
ORDER'ing, GROUP'ing, and the like. In this case we want to
look at all the paragraphs ('' elements) of the ''
whose '' is "About Haskell." The result looks something
like the below:
#------------- SQL query against XML tutorial -----------#
C:\mysql2\bin>mysql -u root -pgnosis test < haskell.sql
Paragraph seq _XML
Paragraph 1 Haskell is just one of a number of functional programming...
Paragraph 2 Among functional languages, Haskell is in many ways the...
Paragraph 3 On a minor note, Haskell is syntactically easier to get...
CONCLUSION
------------------------------------------------------------------------
This column has illustrated the command-line usage of [dtd2sql]
and [xml2sql]. For quick testing and ad hoc shell usage, this
is probably the approach you will want to use. However, as
with most things in Python, it is very simple to reuse the
modules within your own code. The self-test code (the
command-line usage) provides a straightforward template for any
importing module to follow. I look forward to hearing about
the wonderful uses readers manage to put these modules to--as
they have done for so many others.
RESOURCES
------------------------------------------------------------------------
The modules discussed in this article may be downloaded from:
http://gnosis.cx/download/dtd2sql.py
and:
http://gnosis.cx/download/xml2sql.py
An archive of the supporting and data files used in this
article can be found at:
http://gnosis.cx/download/xml_matters_12.zip
The support module [xml_objectify] can be found at:
http://gnosis.cx/download/xml_objectify.py
In general, the Gnosis Software download directory contains a
variety of software I have produced, mostly for IBM
developerWorks columns and articles. Often various versions of
a particular software module are available in the directory,
both the latest-and-greatest and earlier ones. Take a look at:
http://gnosis.cx/download/
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
David Mertz believes that without data, life itself would be
impossible... well, OK, at least the internet would get less
hype. David may be reached at mertz@gnosis.cx; his life pored
over at http://gnosis.cx/publish/. Suggestions and
recommendations on this, past, or future, columns are welcomed.