Level: Intermediate
Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
29 Apr 2005
The name of an XML file does not have to end in .xml. In fact, an XML document doesn’t have to be in a file at all. It can be a database record, a piece of a file, a transitory stream of bytes in memory that’s never written to disk, or a combination of several different files. However, many XML documents do reside on hard disks and other fixed media. When they do, it’s useful to be able to identify them quickly. This article summarizes the common file extensions and MIME media types that are used for XML documents. Sometimes, it’s just easier to go with the flow than to invent new conventions.
Three-letter extensions have been used to identify file types since at least the late 1960s
and are still used today. Some operating systems use four letters or two or even one instead
of three, but the basic convention remains filename-period-extension. When files are
moved between heterogeneous systems, the name and the extension are often the only
metadata that move with them.
If you store XML documents in a file system, use a standard file extension. Doing so
makes it much easier for everyone to find, recognize, and process XML files. By far, the
most common extension is .xml, but numerous others are used for specific subsets of XML,
as Table 1 shows.
Table 1. Common XML file extensions
| Extension | Meaning |
| --- | --- |
| .xml | Generic XML document |
| .ent | Parsed entity, document fragment |
| .dtd | Document Type Definition (DTD) |
| .rdf | Resource Description Framework (RDF) XML syntax |
| .atom | Atom syndication feed |
| .owl | Web Ontology Language |
| .xhtml | Extensible Hypertext Markup Language (XHTML) |
| .xsd | W3C XML Schema Language schema |
| .xsl | XSL Transformations |
| .fo | XSL Formatting Objects |
| .rng | RELAX NG XML syntax |
| .sch | Schematron schema |
| .svg | Scalable Vector Graphics (SVG) |
| .rss | RSS (Really Simple Syndication, Rich Site Summary, or RDF Site Summary -- depending on who defines the acronym) syndication feed |
| .plist | Apple’s property list format |
Resources served by a Web server might not be in a file system. However, if the resources are
XML documents, do still try to make sure that the URLs for these resources end with one
of the above extensions, as appropriate to their detailed type.
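As a rough illustration, a program can recognize likely XML files by extension with a simple lookup. The sketch below is in Python. The set mirrors Table 1, and the name has_xml_extension is hypothetical, chosen only for this example.

```python
import os.path

# Extensions from Table 1, lowercased and without the leading period.
XML_EXTENSIONS = {
    "xml", "ent", "dtd", "rdf", "atom", "owl", "xhtml", "xsd",
    "xsl", "fo", "rng", "sch", "svg", "rss", "plist",
}


def has_xml_extension(name: str) -> bool:
    """Return True if a file name or URL path ends in a Table 1 extension."""
    _, ext = os.path.splitext(name)
    # splitext keeps the period ("report.XML" -> ".XML"), so drop it
    # and compare case-insensitively.
    return ext[1:].lower() in XML_EXTENSIONS
```

A match only means the name follows the convention. It says nothing about whether the contents are actually well-formed XML.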
MIME media types
When a Web server transmits a file, it doesn’t just send the file name and contents. It also sends
a lot of metadata about the file in the HTTP header, as shown in Listing 1:
Listing 1. Sample metadata
HTTP/1.1 200 OK
Date: Sun, 23 Jan 2005 18:21:33 GMT
Server: Apache/2.0.52 (Unix) mod_ssl/2.0.52 OpenSSL/0.9.7d
Last-Modified: Sun, 10 Oct 2004 16:17:21 GMT
ETag: "3e06d-16a05-2dbc8640"
Accept-Ranges: bytes
Content-Length: 92677
Content-Type: application/xhtml+xml
Notice the Content-Type header in the last line. The value of this header --
in this case application/xhtml+xml -- is a MIME
media type (possibly accompanied by optional information about the character set of the document).
Web browsers and other clients use this metadata to decide how to process the file -- for example,
to determine whether they can display it natively or have to pass it to a helper application. MIME types
are also used in other contexts, including e-mail, and by a few experimental operating systems, notably
BeOS. Linux and other UNIX® systems also use MIME types, but they mostly do so by mapping file
extensions to MIME types rather than by tagging files with MIME types directly. The real, practical use
of MIME types is on the Internet.
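To make the structure of the header value concrete, here is a minimal Python sketch that separates the media type from optional parameters such as charset. The function name parse_content_type is an assumption made for this example. The parsing is deliberately simple and does not handle a quoted parameter value that itself contains a semicolon.

```python
def parse_content_type(value: str) -> tuple:
    """Split a Content-Type value into (media type, parameters dict)."""
    media_type, *params = value.split(";")
    parsed = {}
    for param in params:
        name, _, val = param.partition("=")
        if name.strip():
            # Parameter names are case-insensitive; values may be quoted.
            parsed[name.strip().lower()] = val.strip().strip('"')
    # Media type names are case-insensitive, so normalize to lowercase.
    return media_type.strip().lower(), parsed
```

For the header in Listing 1, this returns the media type application/xhtml+xml and an empty parameter dictionary, because no charset was sent.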
The basic content type for generic XML documents is application/xml. The type text/xml
is also registered, but this type has been deprecated because of some unfortunate interactions with
other parts of HTTP. (Under RFC 3023, a text/xml document served without an explicit charset parameter is treated as US-ASCII,
even if the document’s XML declaration states otherwise.) Other basic registered MIME types are:
- application/xml-dtd for DTDs
- application/xml-external-parsed-entity for document fragments
For more specific XML format types, the convention is to use a type of the form application/foo+xml,
where "foo" refers to the specific XML vocabulary -- for example, application/rdf+xml
for RDF, application/xhtml+xml for XHTML, and so forth. (A few formats use a different top-level type, such as image/svg+xml for SVG.) In
this way, generic XML processors are able to recognize the document as XML while still allowing
processors for the specific format to recognize it as well. Table 2 lists some of the media types
you might encounter.
Table 2. XML MIME media types
| Media type | Document format |
| --- | --- |
| image/svg+xml* | SVG |
| application/atom+xml* | Atom Feed Syndication Format |
| application/mathml+xml* | Mathematical Markup Language |
| application/beep+xml | Blocks Extensible Exchange Protocol |
| application/cpl+xml | Call Processing Language |
| application/soap+xml | A SOAP message |
| application/epp+xml | Extensible Provisioning Protocol |
| application/rdf+xml | RDF XML syntax |
| application/xhtml+xml | XHTML |
| application/xop+xml | XML-binary Optimized Packaging |
| application/xslt+xml* | XSLT stylesheet |
| application/xmpp+xml | Extensible Messaging and Presence Protocol |
| application/voicexml+xml* | VoiceXML |
*Registration in progress
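The +xml convention is what lets a generic processor recognize XML without knowing every vocabulary. A minimal Python sketch of that check follows. The name is_xml_media_type is hypothetical. The sketch treats only application/xml, text/xml, and any subtype ending in +xml as XML documents, and it leaves out application/xml-dtd and application/xml-external-parsed-entity because those are not complete XML documents.

```python
# Generic XML document types named in this article.
GENERIC_XML_TYPES = {"application/xml", "text/xml"}


def is_xml_media_type(media_type: str) -> bool:
    """Return True if the media type identifies an XML document."""
    # Ignore parameters such as "; charset=utf-8" and normalize case.
    essence = media_type.split(";", 1)[0].strip().lower()
    return essence in GENERIC_XML_TYPES or essence.endswith("+xml")
```

Because the test looks only at the suffix, it accepts types outside the application tree as well, such as image/svg+xml from Table 2.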
You can't just pick a new MIME media type out of the air for every new format you
create. You must publish new types in a formal specification
(often an IETF Request for Comments) and register them with
the Internet Assigned Numbers Authority (IANA). However, you can designate experimental subtypes without registration. These subtypes must begin with x-.
For example, if I needed a custom type for the television listing markup language I
invented for an example in my book, the XML 1.1 Bible, I could call it application/x-tvml+xml.
The application type tells processors to treat this file as non-ASCII
data. The +xml at the end of the subtype tells the processors that it’s XML, the
x- warns them that this is an unregistered type, and the tvml tells them
what kind of data it is.
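The same breakdown can be written as a few lines of Python. This is only a sketch of how the pieces of the name fit together. The function name describe_media_type and the keys of the returned dictionary are invented for this example.

```python
def describe_media_type(media_type: str) -> dict:
    """Break a type such as application/x-tvml+xml into its parts."""
    top_level, _, subtype = media_type.strip().lower().partition("/")
    # A trailing "+xml" marks the format as XML-based.
    is_xml = subtype.endswith("+xml")
    vocabulary = subtype[:-len("+xml")] if is_xml else subtype
    # A leading "x-" marks an experimental, unregistered subtype.
    unregistered = vocabulary.startswith("x-")
    if unregistered:
        vocabulary = vocabulary[len("x-"):]
    return {
        "type": top_level,
        "unregistered": unregistered,
        "vocabulary": vocabulary,
        "xml": is_xml,
    }
```

Applied to application/x-tvml+xml, this reports the type application, the vocabulary tvml, and both the unregistered and XML flags set.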
text/xsl
The most infamous example of just making up MIME types out of whole cloth is the
text/xsl pseudotype that Microsoft® Internet Explorer uses to identify
XSLT stylesheets. The type doesn’t exist outside Microsoft’s imagination. No such type
has ever been registered with the IANA, nor is it ever likely to be registered
because the XSL specifications now follow the lead of RFC 3023 and recommend
application/xslt+xml as the MIME type. Unfortunately, a lot of other software and
documentation have simply parroted Microsoft’s error (and make no mistake -- it is an error:
text is rarely an appropriate media type for XML documents of any kind, XSL or
otherwise) rather than checking the specs to find out what they really say.
Heuristics
The final way you can identify an XML file is to open it and look. This approach isn’t
the fastest, and perhaps I shouldn’t even discuss it in this series because it’s completely
inappropriate for large collections of XML documents. However, sometimes, it’s the only
truly reliable way to tell whether a file or stream contains XML. While you might simply
throw the file/stream into a parser and hope for the best, that’s a relatively heavyweight solution.
A few good heuristics based solely on the first few bytes will tell you if
a file or stream is likely to be XML, and therefore worth checking further with a parser.
For example, every well-formed XML document is guaranteed to begin with a less-than
sign (<), optionally preceded by initial white space. In practice, the vast majority of XML
documents begin in one of three ways:
- <?xml
- <!DOCTYPE
- <foo, where foo is any XML name
Character set issues make the detection a little trickier. All three of these may or may not
be preceded by a Unicode byte order mark in either UTF-8, big-endian UTF-16, or little-endian
UTF-16. Furthermore, any number of character sets besides Unicode can be used, including
ASCII, ISO-8859-1 (Latin-1), and EBCDIC. Still, because these sets overlap a lot within the character range of the likely initial strings, you can whittle things down to just
a few common byte sequences, shown here in hexadecimal:
- FE FF 00 3C 00 3F
- FF FE 3C 00 3F 00
- 3C 3F 78 6D
- EF BB BF 3C 3F
- 4C 6F A7 94
- 3C
These heuristics aren’t perfect -- most notably, they do identify most malformed HTML documents
as possible XML. And you can improve on them in a few corner cases by stripping off initial
white space (tab, carriage return, linefeed, and space) before the first < (3C) or checking that
the character after the first < is ?, !, or an XML name-start character. However, in practice, any document that doesn’t begin with one of the sequences above is unlikely to be XML. If you check these characters first, you can filter out a lot of the chaff and save time by parsing
only the most likely candidates.
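The byte sequences above translate directly into a prefix check. The following Python sketch is one possible reading of the heuristic, not a definitive implementation. The names XML_PREFIXES and looks_like_xml are assumptions for this example, and the function adds only the white space refinement, not the check on the character after the first <.

```python
# The six byte sequences listed above.
XML_PREFIXES = (
    bytes.fromhex("FEFF003C003F"),  # big-endian UTF-16 BOM, then "<?"
    bytes.fromhex("FFFE3C003F00"),  # little-endian UTF-16 BOM, then "<?"
    bytes.fromhex("3C3F786D"),      # "<?xm" in ASCII, UTF-8, or Latin-1
    bytes.fromhex("EFBBBF3C3F"),    # UTF-8 BOM, then "<?"
    bytes.fromhex("4C6FA794"),      # "<?xm" in EBCDIC
    bytes.fromhex("3C"),            # a bare "<"
)

# Tab, carriage return, linefeed, and space.
INITIAL_WHITESPACE = b"\t\r\n "


def looks_like_xml(head: bytes) -> bool:
    """Guess from the first few bytes whether the data might be XML."""
    if head.startswith(XML_PREFIXES):
        return True
    # Corner case: initial white space before the first "<".
    return head.lstrip(INITIAL_WHITESPACE).startswith(b"<")
```

A caller would typically read a small prefix, for example the first 64 bytes of a file opened in binary mode, pass it to looks_like_xml, and hand only the candidates that pass to a real parser.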
Summary
Another way to determine which files contain XML is simply to remember where you
put them. However, even if this method is good enough for your own applications, you might
well come across other applications that need to access the same data but don’t have
detailed knowledge of your personal file naming conventions. When you follow (or at least do not
gratuitously deviate from) the standard conventions for file names and MIME media types, your documents are more accessible to everyone, and you noticeably enhance XML’s ability to
interchange data across heterogeneous systems.
Resources
- Read "RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types," which describes the basic structure of MIME types, including the type/subtype division and the use of the x- prefix for unregistered types.
- Visit the public MIME type registry, which is maintained by the IANA and lists all registered XML types and all other types.
- Check out "RFC 3023, XML Media Types," which describes the basic set of MIME media types for XML documents and lays out the system by which new types are chosen.
- Find out why text is an inappropriate media type for XML documents in Architecture of the World Wide Web, Volume One.
- Twenty years ago, Apple Computer invented a better way of identifying file types that didn’t make the file name serve double duty. In essence, they stored an extra four-letter code with each file in its resource fork. Apple tried to abandon this scheme in Mac OS X, but reversed course after a massive developer revolt. Apple now supports both type codes and file name extensions. "Finder Interface," Chapter 7 of Inside Macintosh: Macintosh Toolbox Essentials also describes this scheme.
- Be, Inc. may be defunct, but you can explore the BeOS in the Haiku project -- MIME-based file system and all.
- Read the XML 1.1 Bible by Elliotte Rusty Harold. It's also available on Amazon.com.
- Find out more about DB2®, the IBM software solution for information management. At its core is a powerful family of relational database management system (RDBMS) servers.
- Find hundreds more XML resources on the developerWorks XML zone.
- Learn how you can become an IBM Certified Developer in XML and related technologies.
About the author
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a
decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn
with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his
mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where
he teaches Java technology and object-oriented programming.
His Cafe au Lait
Web site has become one of the most popular independent Java sites on the Internet, and his
spin-off site, Cafe con Leche, has become one
of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the
XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.