Keeping pace with James Clark

An interview (and analysis) with the leading authority on markup languages

Level: Introductory

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.

01 Jul 2002

James Clark is arguably the most accomplished developer in the world of markup languages. In his distinguished career of contributing to both SGML and XML, he has served on standards bodies, provided important practical perspectives on where markup meets traditional code, and most importantly, written many of the programs that have moved XML (and SGML before it) from the world of abstract speculation into hard practicality. In this article, Uche Ogbuji interviews James Clark, concentrating on a discussion of practical developments, current and future, in the world of XML. The author also provides his own analysis of the issues raised.

In December 2001, James Clark was awarded the inaugural XML Cup by the XML 2001 conference committee. The award honored his many accomplishments for the XML community, and no one doubted that he deserved it. If you have done anything at all with markup languages, you have probably used code written by James Clark:

  • Jade, Clark's DSSSL processor (sort of an XSLT processor for SGML), is used by Linux users who have taken advantage of the Linux Documentation Project.
  • Clark's sgmls is the most widely used SGML parser in the world.
  • expat and XT (an XML parser and an XSLT processor, respectively) are used by many XML developers. Indeed, given its prevalence in Python, Perl, and Apache projects, expat is clearly one of the most widely used XML parsers available.

Notably, all this work is open source, and has been since before open source came into high fashion.

Clark received the award for his recent contributions as well. In 2001, he pioneered TREX, an alternative XML schema language that has since merged with the RELAX language to form RELAX NG. The latter is proving extremely popular with developers despite competition from the W3C-sanctioned schema definition language.

And to prove his grasp of XML's future, Clark followed his award with a speech on the five technical challenges currently facing XML. In this speech, he highlighted many technical issues that have caused much grumbling among developers, and he struck a cautionary note about some XML developments that might cause technical difficulties down the road.

Here's an overview of Clark's five challenges for XML:

  • Make progress yet keep XML simple. XML's great strength is the diversity that can be accommodated by its simplicity. The danger with building on this strength is the temptation to build more specialized features into the core, rather than into separate, higher layers.
  • Don't neglect the foundations of XML. Now that XML has become so important, it is essential to go back to basics and solidify the foundation of XML. It is best to rip out legacy concerns (some arcane SGML features, for instance) that cause unnecessary complexity. On the other hand, important specifications such as the Infoset (a formal model of the information nodes in an XML document) should be established at the foundations.
  • Control the processing pipeline. One major omission from XML standardization is a well-defined mechanism for describing the pipeline of processing used to handle XML documents. This is especially important given the diversity of processing tools and methods available.
  • Improve XML processing models. Right now, developers are generally caught between the inefficiencies of DOM and the unfamiliar feel of SAX. An API that offers the best of both is needed.
  • Avoid premature standardization. The success of XML makes it tempting to try to lock down standards for all sorts of related technologies. But many of these technologies need more consideration and practice before standards can be established without harming innovation.

This article is based on an interview with James Clark, in which I followed up on these technical challenges. The questions and answers are presented, and in some cases, they are followed by my own explanation and analysis. Readers should be very familiar with XML and reasonably familiar with the various XML processing technologies.

The nuts and bolts: XML at its core

Q: Are there any more lessons you think we can learn from SGML before we "stab it in the back," as you suggested in your speech?
A: I think the key lesson is that the lowest layers should deal only with syntax and should be semantically neutral.

Developers often come to XML with the mistaken impression that it can automatically give meaning to their structured information, and there is often a temptation to build such intelligence into XML at the low level. Clark advocates that the lowest layers deal only with syntax, that is, with ensuring that one processor reads the same form from an XML document as another does, and that more intelligent facilities be layered on top.

Q: What are some of the things you'd like to see addressed in the developing XML 1.1 working draft?
A: I think the character entity is a small, tightly-focused problem that could be solved if there was the will to do so. The W3C XML Schema folks, who I often don't see eye to eye with, also want to see this problem solved.

XML's character representation and handling involve many complexities that have caused unpleasant surprises for developers and endless debate in the XML community. Specialized tools for addressing this problem, such as Simon St. Laurent's Gorille, highlight its prominence. There has been much discussion about what should be added or fixed in a 1.1 version of XML, now that such a version is beginning to be specified in the recent Blueberry draft for XML 1.1.
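One facet of the character entity problem, as I read Clark's remark, is that XML predefines only five named entities, so a name familiar from HTML, such as nbsp, is an error unless a DTD declares it. The following is a minimal sketch of that behavior using the expat-based ElementTree parser in Python's standard library. The sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A numeric character reference needs no declaration.
numeric = ET.fromstring("<p>a&#160;b</p>")
print(repr(numeric.text))

# A named entity other than the five predefined ones (lt, gt, amp,
# apos, quot) is rejected when no DTD declares it.
try:
    ET.fromstring("<p>a&nbsp;b</p>")
    undeclared_accepted = True
except ET.ParseError as err:
    undeclared_accepted = False
    print("parse error:", err)

# Declaring the entity in an internal DTD subset makes the same name legal.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'
)
print(repr(declared.text))
```

The workaround in the last step is exactly the kind of per-document ceremony that makes the problem irritating in practice.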

Q: Tim Bray has published XML-SW (XML Skunkworks) as an unofficial draft. It seems to be quite in line with your suggestions for XML 1.1. Do you have any comments or thoughts on this project?
A: I think it's a good proof of concept. I think it shows that XML 1.0/XML Namespaces/XML Infoset/XML Base can be integrated without an unreasonable amount of work and that the integration will result in something significantly more coherent than what we have now. I hope the W3C pursues something like this.

Q: What sorts of type safety do you think are needed for XSLT?
A: I don't think you can graft static type-safety onto XSLT (or anything else) as an afterthought. If you want a language with static type-checking, you really need to design it in from the start: It drastically affects the design of the language. I think there's a place for both languages like XSLT and more strongly typed XML processing languages. I predict XQuery will become a very important example of the latter.

Clark, who is also editor of the XSLT 1.0 specification, mentioned the need for type-safety in XSLT in his speech. Type-related mistakes are common among XSLT developers, the most frequent being the use of

 <xsl:value-of select="spam"/> 

when the developer means to use

 <xsl:value-of select="'spam'"/> 

In the first case, select="spam" yields a node set (the child elements named spam) where a string was intended, usually with results that are hard to debug. The W3C's draft XQuery language is one in a line of W3C specifications (which also includes XML Schema, XPath 2.0, and XSLT 2.0) in which typing features are being added to XML. Interestingly enough, typing is one of the areas where Clark warned against premature standardization in his speech.
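To make the difference concrete, here is a rough emulation in Python of what the two xsl:value-of forms produce. It is only a sketch: it uses ElementTree from the standard library rather than an XSLT processor, the helper functions and the sample document are hypothetical, and it assumes the XPath 1.0 rule that a node set converts to the string value of its first node (or to the empty string if the set is empty).

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the context node is the menu element.
doc = ET.fromstring("<menu><spam>eggs</spam></menu>")

def value_of_node_set(context, name):
    """Emulate select="name": the string value of the first matching child."""
    matches = context.findall(name)  # a node set, possibly empty
    if not matches:
        return ""  # an empty node set converts to the empty string
    # The string value of an element is all of its descendant text.
    return "".join(matches[0].itertext())

def value_of_string(literal):
    """Emulate select="'literal'": the quoted string itself."""
    return literal

print(value_of_node_set(doc, "spam"))       # content of the spam child
print(value_of_string("spam"))              # the literal word
print(repr(value_of_node_set(doc, "ham")))  # no such child: silently empty
```

The last call shows why the mistake is hard to debug: when no matching child exists, the output is simply empty, with no error to point at the missing quotes.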

Q: You cite the need for better-designed XML APIs. Can you think of any APIs based on open-source code that would be suitable for emulation?
A: There's an open source XML Pull Parser in Java (see "XPP" in Resources). I think it's possible to do better than both this and the XML API in Microsoft's .NET. There are several open-source Java data-binding frameworks. However, I think this is something that really needs to be a standard part of the Java platform just like the Java API for XML Processing (JAXP). Now there is also a Java Specification Request for a pull XML parser API (see "JSR173" in Resources).

The DOM builds data objects representing each XML node. This can be memory intensive, but it is easy for most developers to use: You just access object attributes in a way that's common in many mainstream programming languages. SAX sends back a stream of events based on the XML constructs, and does not require data to be in memory except for the node currently being processed. This makes it more efficient, but requires developers to master many sophisticated techniques for state management. A pull API lets the application ask the parser for the next piece of the document when it is ready for it. The result is an interface closer to DOM in simplicity that loads into memory only those nodes needed for current processing, allowing some of the efficiency of SAX.
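The three styles can be compared side by side in a short Python sketch that uses only the standard library: xml.dom.minidom for DOM, xml.sax for SAX, and xml.dom.pulldom as one example of a pull-style API. This is an illustration of the general trade-off, not of XPP or JSR 173, and the sample document and helper names are made up.

```python
import xml.sax
from xml.dom import minidom, pulldom

DOC = "<order><item>spam</item><item>eggs</item></order>"

def text_of(element):
    """Concatenate the text children of a DOM element."""
    return "".join(child.data for child in element.childNodes)

# DOM: the whole tree is built first, then navigated as objects.
tree = minidom.parseString(DOC)
dom_items = [text_of(el) for el in tree.getElementsByTagName("item")]

# SAX: the parser pushes events, and the handler must track its own state.
class ItemHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_item = True
            self.items.append("")

    def characters(self, content):
        if self.in_item:
            self.items[-1] += content

    def endElement(self, name):
        if name == "item":
            self.in_item = False

handler = ItemHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_items = handler.items

# Pull: the application asks for events and expands only the nodes it wants.
pull_items = []
events = pulldom.parseString(DOC)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "item":
        events.expandNode(node)  # build a DOM subtree for this element only
        pull_items.append(text_of(node))

print(dom_items, sax_items, pull_items)
```

All three approaches collect the same item values. The SAX version needs a flag and an accumulator to do what the other two express directly, and the pull version never holds more than one item subtree at a time.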

Q: Are there any working groups, projects, or even products that you think are providing interesting and useful options for providing a processing pipeline? Are you working on any such efforts?
A: Sun's "XML Pipeline Definition Language" (submitted as a W3C Note) looks interesting, but I don't think it's quite there yet. In particular, the processdef element, which is fairly fundamental, seems very problematic from the point of view of both security and interoperability. ISO's DSDL will also be tackling the issue of the processing pipeline.
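For readers unfamiliar with the idea, a processing pipeline is simply an explicit, ordered description of the steps applied to a document. The Python sketch below is my own illustration of that idea and has nothing to do with the syntax of Sun's pipeline language or of DSDL. Every stage name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from functools import reduce

def parse(text):
    """Stage 1: turn serialized XML into a tree."""
    return ET.fromstring(text)

def drop_drafts(root):
    """Stage 2: remove child elements marked status="draft"."""
    for child in list(root):  # copy the list, since we mutate root
        if child.get("status") == "draft":
            root.remove(child)
    return root

def serialize(root):
    """Stage 3: turn the tree back into a string."""
    return ET.tostring(root, encoding="unicode")

# The pipeline is data: an ordered list of stages that can be inspected,
# reordered, or extended without touching the stages themselves.
PIPELINE = [parse, drop_drafts, serialize]

def run_pipeline(stages, data):
    """Feed the output of each stage into the next one."""
    return reduce(lambda value, stage: stage(value), stages, data)

result = run_pipeline(
    PIPELINE, '<docs><doc status="draft"/><doc status="final"/></docs>'
)
print(result)
```

A standardized pipeline language would describe a list like PIPELINE declaratively, so that different tools could agree on which steps run and in what order.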




The fuzzy edges: Matters of more general interest

Q: You noted that XML is very diverse, and that one of its strengths is its ability to represent many kinds of information. You are an old hand from the SGML world. Has any use of XML been a particular surprise (or concern) to you?
A: I've been very surprised by how much XML has been used for RPC (as in SOAP).

Q: Of all the software projects you have worked on, which do you think has been the most significant? Any criteria you choose will do, but one might be: Which project was most crucial to helping understand, influence, or just plain implement an important standard?
A: A tough question. I think I would pick sgmls. This was the one major project I have done that I [didn't] write from scratch, but rather this was based on Charles Goldfarb's ARCSGML. sgmls made ARCSGML into a reliable, production-quality, open-source SGML parser. I think this was important for broadening the SGML community, and without the SGML community XML wouldn't have happened.




The future of XML

XML is certainly near the transition from a rough frontier-land to a mature technology run by a mature community. It will take all the experience and wisdom of that community to make this transition one that ensures the continued longevity of the technology, and its usefulness to developers. James Clark has the experience and perspective to help guide this process. Keep your eye out for the many developments that have come up in this discussion. They are certain to change the way you use XML in the very near future.



Resources

  • Benoit Marchal's SAX, the power API provides a solid introduction to SAX, the event-based API for processing XML that has become a de facto standard (developerWorks, August 2001).
  • Simon St. Laurent's Gorille is a tool for testing XML character handling.
  • Read about the hullabaloo surrounding the XML 1.1 (Blueberry) working draft in this XML Deviant column by Leigh Dodds.


About the author


Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML, RDF and knowledge-management applications. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.



