
September 12, 2006

XML, data and regularity

Alex James warns that XML is tomorrow's COBOL: "XML is cumbersome, we do a lot of work to get our database records into XML format, we do a lot of work when we modify it, and finally we do a lot of work to get it back into a database."

XML is a victim of its own success. XML is an excellent format for documents, from traditional documents such as XHTML pages, XSL-FO renderings, DocBook manuals or even insurance policies to more abstract descriptive documents such as the CML description of a molecular structure or the XAML rendering of a heterogeneous object graph (typically a WPF visual or a WF workflow, but could be anything; I've used XAML to mark up business logic).

What these have in common is that there's not a lot of regularity to them. An XHTML page has huge freedom to intermix text runs, bold elements, headings, images and so on and so forth. The same applies to something like DocBook. My business logic XAML listed components each of which could have its own configuration schema. Try to create a pseudo-relational structure for these, as Alex does for his "list of books" XML, and it would be a nightmare. XML succeeds in these cases precisely because it provides the structure inline with the content.

Once you move out of this realm and into the realm of highly structured data, such as Alex's list of books, the inline structure becomes a weakness, not a strength. The repetition of the structural information against every element creates bloat, and more worryingly it creates duplication. I don't really want to admit the possibility that one of my books might have an "Authro" instead of an "Author". So for strictly schematised data, a representation that carries the schema only once, and implicitly associates content with schema in some way, is safer and more efficient. Alex proposes a tabular representation. This works fine for tabular data, as you would expect. When hierarchies enter the picture, however, he has to reintroduce relational complexities like keys and joins to make it work. Of course, it does still work, because it's just the relational model. Nevertheless, it isn't as natural as a true hierarchical schema would be. I'm not sure what such a schema would look like, though, and it would reintroduce the issue of mapping in and out of relational databases.
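To make the "Authro" worry concrete, here is a minimal Python sketch. The book data, element names, column names and the `BookId` key are all illustrative assumptions of mine, not taken from Alex's post. It shows the same two books once as XML, where a misspelt tag slips through silently, and once as a table whose column names are stated a single time. It then shows the key-and-join bookkeeping that appears as soon as a book can have several authors.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical data: the structure is repeated on every record, so one
# record can carry a misspelt tag without the document being ill-formed.
BOOKS_XML = """
<Books>
  <Book><Title>Book One</Title><Author>A. Writer</Author></Book>
  <Book><Title>Book Two</Title><Authro>B. Writer</Authro></Book>
</Books>
"""

# The same data as a table: the column names appear once, in the header.
BOOKS_TABLE = """Title,Author
Book One,A. Writer
Book Two,B. Writer
"""


def authors_from_xml(text):
    """Read each book's Author element; a misspelt tag yields None."""
    root = ET.fromstring(text.strip())
    return [book.findtext("Author") for book in root.findall("Book")]


def authors_from_table(text):
    """Read the Author column named by the single header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["Author"] for row in reader]


# Hierarchy (a book with several authors) needs two tables and a key.
BOOKS = [
    {"BookId": 1, "Title": "Book One"},
    {"BookId": 2, "Title": "Book Two"},
]
AUTHORS = [
    {"BookId": 1, "Author": "A. Writer"},
    {"BookId": 1, "Author": "C. Writer"},
    {"BookId": 2, "Author": "B. Writer"},
]


def join_authors(books, authors):
    """Join the two tables on BookId to rebuild the hierarchy."""
    by_book = {}
    for author in authors:
        by_book.setdefault(author["BookId"], []).append(author["Author"])
    return {b["Title"]: by_book.get(b["BookId"], []) for b in books}


xml_authors = authors_from_xml(BOOKS_XML)
table_authors = authors_from_table(BOOKS_TABLE)
joined = join_authors(BOOKS, AUTHORS)

print(xml_authors)    # ['A. Writer', None]
print(table_authors)  # ['A. Writer', 'B. Writer']
print(joined)
```

A validating XML schema would catch the misspelt tag, of course, but that is extra machinery layered on top. In the table there is simply nowhere for a per-record "Authro" to occur.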

For the time being, then, XML is being wildly overused, and many if not most applications of XML would be better served by Alex's vision of transporting mini-databases around. But it's unfair to blame XML for not being a relational database, just as it would be unfair to blame the relational model for being a lousy way to represent documents.

September 12, 2006 in Software


Comments

Aardvarks are the hatstands of dubious metaphor...
Java is the new COBOL, XML is the new CSV.

Whenever someone uses the term XML in some sort of vague marketing / architecture / nebulous bollocks sort of way, I invite you to substitute the term CSV, e.g. 'we'll solve that by sending XML down the wire' becomes 'we'll solve that by sending CSV down the wire'.
Amusingly, I used to have cause to drive past the offices of Software AG, which had the slogan 'Software AG - the XML company' on the side. Needless to say I had a chuckle every time, muttering 'Software AG, the CSV company'.

The CSV / XML isomorphism is actually quite a good one I think. If you take CSV to mean 'character separated' (as some people have it) then you can sensibly argue that XML is to the new Java / Application Server / SOAP ridden world as CSV was to the old Unix weenie 'everything is solved by piping text files together' world, with XSLT being to XML what AWK is to CSV.

This, I think, shows how far we have fallen, if you compare the elegance of AWK (and A, W and K's book which I think is a masterpiece) with XSLT and the books about it (well, there may be a good book but I have never seen one).

Have I said before that having a language that processes XML being in XML strikes me as a category error, like wanting to make programs that handle poetry rhyme?

Actually, I am being unfair to both CSV and COBOL.

Posted by: Harvey Pengwyn at Sep 13, 2006 9:38:05 AM

I currently work in a Cobol environment, and I think the XML comparison may be closer than you know. Parsing XML is remarkably similar to parsing Cobol file structures. Cobol programmers were the inventors of the phrase "self-describing" as far as I can tell.

I also used to work with FileMaker in its early days. It was essentially hierarchical in versions 2 and 3, and transitioned to being relational in version 4. You could have datafiles with hierarchical elements and relational logic. This could lead to all sorts of brokenness and head-bending work-arounds. I now get to see all the same fundamental mistakes being made with XML.

Posted by: Qarl at Sep 13, 2006 11:38:44 AM

Curiously enough, when I presented on XML lo these six years ago, "XML is the new CSV" was the exact line I used. At the time I thought it was an ingenious way to map old-skool understanding into the richer new world. Now it looks like a terrible prophecy.

Posted by: Ivan at Sep 13, 2006 6:47:31 PM

Back in the hamsterlithic era (the late 80s / early 90s) I too dealt with the traditional fixed format numbered records, but from the FORTRAN side. It was the mechanism of choice for the COBOL world to communicate with the FORTRAN world in which our scheduling / routing / monitoring systems lived.

If we are to be honest the 3 eras are probably (the triples are language of choice for commercial systems, language of choice for technical, file format of choice)
COBOL / FORTRAN 77 / fixed records
misc 4GLs / C / CSV
Java / C++ / XML

There is a long rant to be written about what a step backwards the change from FORTRAN 77 to C was.

Posted by: Harvey Pengwyn at Sep 15, 2006 5:03:38 AM