Level: Introductory. Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.
18 Apr 2006 Thinking XML author Uche Ogbuji continues with the theme of XML best practices. In the previous installment "Good advice for creating XML," you looked at XML design recommendations from experts. In this article, you'll find recommendations from the Internet Engineering Task Force (IETF), an organization whose technical papers drive most Internet protocols. The IETF's XML recommendations are gathered together in RFC 3470: "Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols."
In the last Thinking XML installment "Good advice for creating XML," I discussed a couple of resources provided by experts looking to advise XML developers. XML was designed specifically for use on the Internet (the Web to be precise), and many Internet protocols use XML to represent data and documents. XML has thus become important to all the major standards organizations developing Internet technology, and not least the Internet Engineering Task Force (IETF), an organization whose technical papers drive most Internet protocols from e-mail to Web communications, and beyond. In order to provide consistent quality of XML used with such IETF technologies, the IETF developed a set of guidelines, RFC 3470, titled "Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols" (see Resources). In this article I'll discuss and expand upon the valuable advice provided in RFC 3470.
Background and basics
A large part of the document is taken up with background material regarding XML. As it says in section 1.1, "Experienced XML practitioners will likely already be familiar with the background material here, but the guidelines are intended to be appropriate for those readers as well." Naturally, I'll fast forward to the sections providing guidelines.
Section 2, "XML Selection Considerations," describes factors for developers to be aware of when they decide whether to use XML in IETF technology. I also recommend the section as a valuable set of considerations for the use of XML in many other contexts. Section 3 discusses possible alternatives to XML, focusing on those used in IETF protocols, such as Abstract Syntax Notation 1 (ASN.1) and External Data Representation (XDR). For a fuller list of XML alternatives, see Paul Tchistopolskii's site (see Resources).
Section 4.3 discusses additional syntactic restrictions some might choose to place upon XML. An example would be a rule that says "all well-formed XML is acceptable as long as it doesn't use CDATA sections." Such additional restrictions have always been controversial. The best-known example is probably SOAP, which disallows processing instructions, document type declarations, and any internal DTD subset. RFC 3470 discourages such restrictions, and I strongly agree. They work against a core goal of XML by reducing interoperability. The RFC does acknowledge situations where syntactic variations need to be eliminated for practical reasons. A good example is digital signatures of XML documents. For such cases, the RFC very sensibly recommends a restriction to Canonical XML, a W3C specification tailored to this purpose (see Resources). The message is that you either allow all XML features, or mandate Canonical XML. Do not invent your own class of restricted XML.
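To make the point about canonicalization concrete, here is a minimal Python sketch. It assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize; note that this function implements the later C14N 2.0 specification rather than the original Canonical XML 1.0, and that the two sample messages and the helper name same_after_c14n are hypothetical. The two messages differ only in quoting, attribute order, CDATA usage, and empty-element syntax, yet they reduce to the same canonical form.

```python
from xml.etree.ElementTree import canonicalize  # assumes Python 3.8+

# Two hypothetical messages that differ only in low-level syntax:
# quote style, attribute order, a CDATA section, and an empty-element tag.
variant_a = (
    "<?xml version='1.0'?>"
    "<msg b='2' a='1'><body><![CDATA[x < y]]></body><empty/></msg>"
)
variant_b = '<msg a="1" b="2"><body>x &lt; y</body><empty></empty></msg>'


def same_after_c14n(first, second):
    """Return True if both documents have the same canonical form."""
    return canonicalize(first) == canonicalize(second)


print(same_after_c14n(variant_a, variant_b))
```

A signature computed over the canonical form therefore survives harmless re-serialization, which is why mandating canonical form is a more principled choice than banning individual syntax features.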
Section 4.4 covers the XML declaration. I think this section's recommendations are weak and non-committal. As I've prescribed in articles on developerWorks (see Resources), I strongly recommend that you include an XML declaration with a specified character encoding unless you have very good reasons not to do so. Even documents that use the default UTF-8 and UTF-16 encodings should have such a declaration. One possible reason for not doing so is hinted at in RFC 3470. If XML is embedded as little snippets within a larger format framework, it might be an unreasonable amount of overhead to require the declaration in every case.
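As an illustration of that recommendation, the following sketch shows one way to make sure a serialized document carries a declaration with an explicit encoding. It assumes Python 3.8 or later (for the xml_declaration argument of ElementTree.tostring); the element name note and the helper serialize_with_declaration are hypothetical.

```python
import xml.etree.ElementTree as ET


def serialize_with_declaration(root, encoding="utf-8"):
    """Serialize an element to bytes, always emitting an XML declaration
    that names the character encoding (assumes Python 3.8+)."""
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


# Hypothetical one-element document containing a non-ASCII character.
note = ET.Element("note")
note.text = "café"

data = serialize_with_declaration(note)
print(data.decode("utf-8"))
```

The declaration is requested explicitly here because serializers often leave it out for UTF-8 output unless asked.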
Representing content
The RFC moves on to coverage of content constructs with sections 4.5 and 4.6, which offer very sensible advice on when to use processing instructions and comments. Section 4.7 is a very interesting one. I don't think it ends up providing any really solid guidelines at all. But rather than a fault, this is a clear reflection of the state of controversy and uncertainty among XML experts as to which schema languages to recommend, and what extensibility and schema evolution techniques are best. Unfortunately the section doesn't really cover the breadth of these issues directly or by providing references.
Section 4.13 discusses general parsed entities, internal as well as external, and leans toward discouraging their use overall. To quote the RFC:
This feature adds complexity to XML processing, and seems more appropriate for use of XML in document processing than in data representation. As such, this document recommends avoiding entity declarations in protocol specifications.
This seems to imply that document processing is not an expected concern of IETF technical reports, but I think this is a strange viewpoint, and that it also comes close to contradicting section 4.3 by restricting XML syntax. A good counter-example is Atom, a format for exchanging Web feeds of information, defined in RFC 4287. Atom is certainly a document format, and entities are a suitable mechanism to use in Atom documents. I would suggest that readers be a bit more circumspect than RFC 3470 section 4.13 and make a case-by-case evaluation of the suitability of general parsed entities in XML-based specifications.
Section 4.14 addresses, almost in passing, the fact that external entities used in protocols for message-passing cause a situation where the message is not self-contained. This is certainly a valid concern to be addressed by the protocol author. There is also the complication that XML processors are not required to resolve references to external entities, so using these can present interoperability problems. The Mozilla® application suite (and thus derived applications such as Firefox® Web browser) is a good example of a very prominent XML processor that does not follow references to external entities, and silently drops such references in its processing.
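For readers who have not worked with general parsed entities, here is a small sketch of the internal kind, declared in the internal DTD subset and expanded by the parser. It uses Python's standard ElementTree module; the feed and title element names, the org entity, and the helper feed_title are hypothetical and are not taken from Atom or from RFC 3470. The sketch covers only internal entities, since, as noted above, handling of external entities varies from one processor to another.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: an internal general entity named "org" is declared
# in the internal DTD subset and referenced in the document content.
FEED = """<!DOCTYPE feed [
  <!ENTITY org "Example Org">
]>
<feed><title>&org; news</title></feed>"""


def feed_title(xml_text):
    """Parse the document and return the text of its title child."""
    return ET.fromstring(xml_text).findtext("title")


print(feed_title(FEED))
```

The application sees only the expanded text, so the receiver must be able to process the declaration that travels with the message. That is the extra processing complexity the RFC is weighing.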
Section 4.16 addresses the very confusing area of XML white space handling. Unfortunately it adds a bit to the confusion with the following assertion:
In XML instances all white space is considered significant and is by default visible to processing applications.
True, a common mistake is to assume that you can ignore any runs of whitespace in XML documents (a habit many picked up from HTML processing). As the above sentence says, whitespace is significant by default, but it is not true that all whitespace is significant. Suppose you use a DTD that declares a particular element with element-only content (that is, its content model does not include #PCDATA), and the instance includes a child node of that element comprising only whitespace. The node is considered ignorable or insignificant whitespace, and the parser might report it in a special way. For example, the popular SAX API allows parsers to report such text using an ignorableWhitespace event rather than characters, which is used for character data. I think it's important to be clear about this distinction; otherwise users might end up confused by the behavior of parsers that do recognize ignorable whitespace.
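The default behavior is easy to see in practice. The following sketch uses Python's standard ElementTree module, with no DTD involved, to list the whitespace-only strings the parser hands to the application for a small indented document. The document and the helper whitespace_only_strings are hypothetical, and the sketch does not attempt to show ignorableWhitespace reporting, which depends on a parser that reads the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, indented purely for human readability.
DOC = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"


def whitespace_only_strings(xml_text):
    """Collect every whitespace-only text or tail string the parser reports."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        for chunk in (elem.text, elem.tail):
            if chunk is not None and chunk.strip() == "":
                found.append(chunk)
    return found


print(whitespace_only_strings(DOC))  # ['\n  ', '\n  ', '\n']
```

All three indentation runs reach the application as ordinary character data, so any decision to discard them has to be made by the application or informed by a DTD.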
What not to miss from RFC 3470
Despite a few quibbles, I think RFC 3470 is an excellent document. Different parts of the RFC are suitable for different audiences, as I've tried to point out in my coverage, but I recommend some parts of it to anyone who is even looking to approach XML. These include:
- Section 2, "XML Selection Considerations". Even XML veterans should be able to articulate readily when XML is best, and when it's best avoided.
- Section 4.3, "Syntactic Restrictions". Read the rationale behind the very sensible recommendation to avoid arbitrary lexical restrictions on XML. When you design formats, either mandate Canonical XML or allow the full range of low-level syntax.
- Section 7, "Security Considerations". The Internet has always been a source of security worries and the IETF has always been keenly attentive to security considerations, including this high-level discussion with regard to XML processing. The section is certainly not comprehensive. For example, it does not discuss the danger of XPath injection attacks.
If you have some sources for advice on best practices, or have thoughts of your own, please share them on the Thinking XML discussion forum.
Resources Learn
About the author
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.