Working XML: Understand the various approaches to XML parsing

Review the pros and cons of XML parsing APIs to select the best one for your project

Level: Intermediate

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft

03 Jan 2007

Even developers who are very knowledgeable on advanced XML matters can lack a firm understanding of the fundamentals. To ensure a solid foundation, this article covers the most basic XML service: parsing. It introduces the various approaches to parsing and highlights their pros and cons.

Looking at the fundamentals

Next year XML will be nine years old. This has been a long ride for the extensible markup language. Today it is difficult to find an application that does not use XML somewhere.

And yet when I work with clients, I cannot help but notice that the fundamentals are not always understood. It's interesting because developers with a solid understanding of complex, but recent, XML subjects can have gaps in their understanding of the fundamentals such as parsing.

Yet where does all XML processing start? With parsing. Parsing is probably the most fundamental service available to developers. The parser reads the XML document, decodes the syntax, and passes meaningful objects to the application. The parser might also provide additional services such as validation (making sure the document conforms to an XML Schema or a DTD) or namespace resolution.

This article introduces the various approaches to parsing and highlights their pros and cons to help you decide on the tools for your next XML project. It includes links to more articles so when you decide on a tool, you can study the technicalities of a given API.

Why does it matter?

Why does parsing matter? Simply because all XML processing starts with parsing. It does not matter whether you use a high-level programming language, such as XSLT, or low-level Java programming: the first step is always to read the XML file, decode the structure, and retrieve the information. That is parsing.

The first choice you face when parsing a document is whether to adopt an existing parsing library (there's one for almost every programming language, even COBOL [Common Business Oriented Language]) or to roll your own. And this is a very easy choice: pick an existing library.

Granted, XML is not a very complex syntax so you can be forgiven for thinking that you can hack your way with regular expressions or other ad-hoc means. In practice it seldom works: XML syntax requires support for multiple encodings and many subtleties, such as CDATA sections or entities. Home-made implementations almost never cater to all these aspects and they create incompatibilities.

Conversely, the parsers that ship with development environments were tested with an eye towards compatibility. Because the main reason to adopt a standard syntax like XML is to be compatible with other applications and toolkits, this is one case where it really pays to use a well-tested library.
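To make the point about subtleties concrete, here is a minimal sketch that uses the parser bundled with the Java platform. The class name SubtletiesDemo and the sample strings are illustrative assumptions, not taken from any real project. The same text is written three ways (entity reference, CDATA section, numeric character reference), and the parser decodes all three to the same value, which is exactly the kind of detail ad-hoc code tends to miss.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: the name and sample data are assumptions.
class SubtletiesDemo {
    // Parses an XML string with the JDK's bundled parser and returns
    // the decoded text content of the root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Three spellings of the same content.
        System.out.println(rootText("<note>Fish &amp; chips</note>"));
        System.out.println(rootText("<note><![CDATA[Fish & chips]]></note>"));
        System.out.println(rootText("<note>Fish &#38; chips</note>"));
    }
}
```

Each call yields the string "Fish & chips". A regular expression that simply grabs the characters between the tags would return three different strings.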

Most parsers offer at least two different APIs, typically an object model API and an event API (also called stream API). The Java platform, for example, ships with both DOM (Document Object Model) and SAX (Simple API for XML).

Both sets of APIs offer the same services: decoding of the document, optional validation, namespace resolution, and more. The difference is not in the services but in the data model used by the API.

The main choice: first alternative

Object model APIs define a hierarchical object model to represent XML documents. In other words, classes are defined for every concept in the XML syntax: element, attribute, entity, document. As the parser reads the XML document, it does a one-to-one mapping between the XML syntax and the classes. For example, when it sees a tag, it instantiates the element class.

Not surprisingly, there's some discussion over which data model is the best. The W3C has standardized DOM, whose main advantage is portability: its interfaces are specified in OMG IDL (the interface definition language that comes from CORBA) and mapped to many languages. So if you know DOM with, say, JavaScript, then you know DOM in Java, C++, Perl, Python and any other language.

An alternative data model is JDOM, a Java-optimized DOM that, being Java specific, is more tightly integrated with Java but, by definition, lacks the portability.

While one can debate forever which data model is best for the XML syntax, I believe the debate completely misses the point because every object-based API shares the same basic pros and cons. On the pro side, an object model API is straightforward to understand when you are familiar with the XML syntax. Because it's a direct mapping from XML syntax to classes, it is very easy to learn, use, and debug.
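The direct mapping from syntax to objects can be sketched with the DOM API that ships with the Java platform. The invoice vocabulary below (an invoice element holding line elements with a price attribute) and the class name DomInvoice are assumptions made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example: the vocabulary and class name are assumptions.
class DomInvoice {
    static final String XML =
        "<invoice date='2007-01-03'>"
      + "<line product='pen' price='1.50'/>"
      + "<line product='ink' price='4.25'/>"
      + "</invoice>";

    // Loads the whole document as a tree, then walks the Element objects.
    static double total(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList lines = doc.getElementsByTagName("line");
            double sum = 0;
            for (int i = 0; i < lines.getLength(); i++) {
                // Every <line> tag became one generic Element object.
                Element line = (Element) lines.item(i);
                sum += Double.parseDouble(line.getAttribute("price"));
            }
            return sum;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse invoice", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(total(XML)); // 5.75
    }
}
```

The code reads almost like the document itself, which is the appeal of the approach. Note that the complete tree is built before the first price is read.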

The price to pay for simplicity is efficiency...at least for many projects. As it reads through the document, the parser creates objects based on syntactical constructs. For many applications, the XML syntax is not the most suitable:

  • The XML syntax is very detailed so the parser ends up creating many objects, even for a small document.
  • Typically the XML vocabulary is optimized for storage or efficient data transmission, not for processing, so the application may find that it needs to preprocess the data, for example, to compute partial sums or merge with other data sources, before the actual processing can start. In many cases it is necessary to copy the data from the XML object model into an application-specific object model or a database before processing can start.
  • Because the object model is generic, it includes many references between the objects (for example, backward references from children elements to their parent) that are not needed for a given application. These references further increase the memory consumption.

When working with small documents on a desktop this is not a problem, but in other configurations, like servers, the inherent inefficiencies of the object model APIs are not acceptable.
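To get a feel for how many objects a generic model creates, the following sketch counts the nodes DOM builds for a one-line document. NodeCounter and the sample order document are hypothetical names chosen for this illustration; the count includes the document node, elements, attributes, and text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: the class and method names are assumptions.
class NodeCounter {
    // Counts this node, its attributes, and all of its descendants.
    static int countNodes(Node node) {
        int count = 1;
        NamedNodeMap attrs = node.getAttributes(); // null unless node is an element
        if (attrs != null) {
            count += attrs.getLength();
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    static int countNodesIn(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return countNodes(doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order id=\"7\"><item>pen</item><item>ink</item></order>";
        System.out.println(countNodesIn(xml)); // 7
    }
}
```

Seven objects for two data values: one document, three elements, one attribute, and two text nodes. Each of them also carries references to its parent and siblings, whether or not the application ever follows them.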

Second alternative

The second alternative is an event API, such as SAX. The concept is almost the mirror image of the object model approach: rather than define a generic data model based on the XML syntax, the parser relies on the application programmer to build a tailor-made data model.

The parser, therefore, can be made thinner because it needs only to deliver a minimal set of information. More importantly, the overall processing can be made more efficient: rather than rely on a one-size-fits-all object model (irrespective of how good that object model is), the programmer can tailor the data model to the needs of the application.

It's with the edge cases that the benefits are the most obvious:

  • Statistical applications, and any application that summarizes information, benefit because their data model is reduced to running totals as opposed to a copy of the entire document.
  • Likewise, applications that process documents on the fly (for example, to load a document into a database) with no or little processing also benefit as they do not need to store data at all.

Because their memory requirement is reduced, event APIs enable processing documents of any size, even documents greater than the available memory. For the same reason, they are also particularly efficient on servers where many processes run concurrently and memory is shared.
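The running-total case can be sketched with SAX as follows. The invoice vocabulary (line elements with a price attribute) and the names SaxTotal and TotalHandler are assumptions for illustration only. The entire data model is a single double, no matter how long the document is.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: the vocabulary and class names are assumptions.
class SaxTotal {
    // The application's tailor-made data model: one running total.
    static class TotalHandler extends DefaultHandler {
        double total;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("line".equals(qName)) {
                total += Double.parseDouble(atts.getValue("price"));
            }
        }
    }

    static double total(String xml) {
        try {
            TotalHandler handler = new TotalHandler();
            // The parser drives the reading loop and calls back the handler.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.total;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse invoice", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<invoice><line price='1.50'/><line price='4.25'/></invoice>";
        System.out.println(total(xml)); // 5.75
    }
}
```

Nothing is retained once an element has been seen, which is why the same handler would work on a document far larger than the available memory.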

What you gain in efficiency, you lose in simplicity. Event APIs have the reputation of being difficult to use, simply because the application programmer is in charge of more operations. While this is true in the short-term, it has been my experience that in the medium- to long-term, the gain in efficiency more than makes up for the slight increase in complexity.

There are two variations of stream API: the push and pull versions. Historically, push has been more popular because it was the model adopted by SAX. The pull variant is being standardized and will be integrated in the Java platform shortly as StAX.

What's the difference? Who controls the reading loop. Like any software reading a file, the parser is built around a reading loop: a loop that progresses through the file.

In the push mode (SAX), the parser controls the loop. In practice when the application calls the parser, it will not return until the end of the file. The parser issues callbacks to the application to build the data model, as discussed before, but the parser is in control.

In the pull mode, the application controls the loop. It is the responsibility of the application to call the parser repetitively in a loop until the end of the document is reached.
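A pull-mode reading loop might look like the following sketch, which uses the StAX cursor API (javax.xml.stream). The class name PullLoop and the sample feed are assumptions for illustration. Notice that the while loop lives in the application code, not inside the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: the class name and sample data are assumptions.
class PullLoop {
    // Collects the names of all elements, in document order.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            // The application owns the loop and asks for the next event.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<feed><entry/><entry/></feed>"));
        // [feed, entry, entry]
    }
}
```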

The push mode is best suited for processing XML documents on the go, like reading an RSS feed and displaying it as an HTML Web page. For most applications that store their data in XML, it is more efficient as "reading a document" is implemented as one call to the parser.

The pull mode is more effective for applications that process different XML vocabularies. Typically they need to sniff the input (reading the first few lines) to decide on the subroutine to call based on the vocabulary.

For those applications, controlling the parsing loop helps because the application can easily stop reading after sniffing the first few lines.
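Sniffing can be sketched with a pull parser as shown below. The Sniffer class, the routine labels it returns, and the choice of RSS and Atom as the two vocabularies are all assumptions for this illustration. The method returns as soon as it has seen the root element, so the rest of the input is never requested from the parser.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: class, method, and label names are assumptions.
class Sniffer {
    // Returns the local name of the root element, or null if none is found.
    static String rootElement(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName(); // stop reading here
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Picks a processing routine based on the vocabulary.
    static String routineFor(String xml) {
        String root = rootElement(xml);
        if ("rss".equals(root)) {
            return "RSS routine";
        } else if ("feed".equals(root)) {
            return "Atom routine";
        }
        return "unknown vocabulary";
    }

    public static void main(String[] args) {
        System.out.println(routineFor("<rss version='2.0'><channel/></rss>"));
        System.out.println(routineFor("<feed xmlns='http://www.w3.org/2005/Atom'/>"));
    }
}
```

A production sniffer would likely check the namespace as well as the element name; this sketch looks at the local name only to stay short.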

Third alternative

This article would not be complete if it failed to mention an alternative to parsing in the form of XML marshaling libraries such as Castor. This approach is midway between an object model and event-based approach.

The idea is to generate an object model from the XML Schema so instead of using a generic model (like DOM), the parser generates an object model that is more specific to the vocabulary used. For example, if the vocabulary deals with invoices, you should expect it to contain elements for sender, recipient, date, product lines, product identification, price, and total. DOM maps those elements to one generic element class. A marshaling library creates specialized classes for sender, recipient, date, product lines, product identification, price, total, and any other elements defined in the vocabulary.

A marshaling library offers some of the benefits of event API, in terms of working with a data model that is tailored to the vocabulary (which might or might not be the same thing as tailored to the needs of the application) instead of a generic one.
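To show what "specialized classes" means in practice, here is a hand-written stand-in for the kind of class a marshaling library could generate. It does not use Castor or any other marshaling library; the Invoice class, its fields, and the unmarshal method are assumptions made up for this sketch, and DOM is used internally only to keep the code short.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical, hand-written approximation of a generated class.
// A real marshaling library would produce and populate it for you.
class Invoice {
    final String sender;
    final String recipient;
    final double total;

    Invoice(String sender, String recipient, double total) {
        this.sender = sender;
        this.recipient = recipient;
        this.total = total;
    }

    // Turns XML into a typed, vocabulary-specific object.
    static Invoice unmarshal(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return new Invoice(
                text(doc, "sender"),
                text(doc, "recipient"),
                Double.parseDouble(text(doc, "total")));
        } catch (Exception e) {
            throw new IllegalStateException("Could not read invoice", e);
        }
    }

    // Returns the text of the first element with the given name.
    private static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        Invoice inv = unmarshal("<invoice><sender>Pineapplesoft</sender>"
            + "<recipient>ACME</recipient><total>5.75</total></invoice>");
        System.out.println(inv.sender + " -> " + inv.recipient + ": " + inv.total);
    }
}
```

The application works with inv.sender and inv.total, typed fields named after the vocabulary, instead of navigating generic element objects.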

What about writing?

A parser reads and decodes XML documents, bringing them from disk to memory. What about the other way around? What if your application needs to store data to an XML file?

Although I advise that you refrain from decoding XML documents with ad hoc routines, I have no qualms about writing. When reading you must make sure to implement all the rules, including the obscure ones. But when writing, you can limit yourself to a small, workable subset of the syntax.
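As a minimal sketch of such a subset, the following writer produces only elements with text content and escapes the characters that would otherwise break the markup. TinyXmlWriter and its methods are hypothetical names for this illustration, and the sketch assumes element names are already valid XML names.

```java
// Hypothetical example: the class and method names are assumptions.
class TinyXmlWriter {
    // Escapes the characters that have a special meaning in XML text.
    // The ampersand must be replaced first so that the entities added
    // by the later replacements are not escaped a second time.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // Writes one element with text content. Assumes name is a valid XML name.
    static String element(String name, String text) {
        return "<" + name + ">" + escape(text) + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(element("note", "Fish & chips < 5"));
        // <note>Fish &amp; chips &lt; 5</note>
    }
}
```

The writer never emits CDATA sections, entities of its own, or processing instructions, and it does not need to: any conforming parser can read what it produces.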

Still, most object model APIs do double duty and include the option to write an object tree to disk as well as read. When you work with event APIs, it is practical to generate writing events from the data structure (see Resources).
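Generating writing events from a data structure can be sketched with the StAX XMLStreamWriter, one of several event-style writers available on the Java platform. The EventWriter class and the invoice vocabulary are assumptions for this illustration. The application walks its own data (here a plain array) and emits one event per construct, while the writer takes care of escaping.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example: the class name and vocabulary are assumptions.
class EventWriter {
    static String writeInvoice(String date, String[] products) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("invoice");
            writer.writeAttribute("date", date);
            for (String product : products) {
                // One start, text, and end event per product line.
                writer.writeStartElement("line");
                writer.writeCharacters(product); // escaped by the writer
                writer.writeEndElement();
            }
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write invoice", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            writeInvoice("2007-01-03", new String[] {"Fish & chips", "ink"}));
    }
}
```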

Decision time

So what's the conclusion? The API you use to read an XML document has a significant impact on the overall performance of your application, so take time to familiarize yourself with the options and choose the best option for your platform, programming language and, more importantly, your project.

Generally speaking, event APIs consume fewer resources and, therefore, can be more efficient, but if you store the whole document in memory anyway, then an object API is a better choice because it saves a lot of coding.

Review the list of resources and you will find plenty of options to work with XML parsers. Most importantly, never attempt to decode XML documents with ad-hoc code. The risk that you'll create incompatibilities if you fail to implement the standard completely is just too high.



Resources

Learn
  • "Understanding DOM" (Nicholas Chase, developerWorks, July 2003): Learn about DOM, the standard Document Object Model API. Find info about the structure of a DOM document plus how to use Java technology to create a Document from an XML file, make changes to it, and retrieve the output.

  • "SAX, the power API" (Benoît Marchal, developerWorks, August 2001): Learn about when to use the SAX API instead of DOM, plus get an overview of commonly used SAX interfaces and detailed examples in a Java-based application with many code samples.

  • "SAX filters for flexible processing" (Uche Ogbuji, developerWorks, March 2003): Work with SAX filters for more maintainable code as you construct complex XML processing behaviors from simple, independent modules.

  • "Simplify XML programming with JDOM" (Wes Biggs and Harry Evans, developerWorks, May 2001): Explore an alternate object model API that is optimized for the Java language.

  • "Data binding with Castor" (Bruce Snyder, developerWorks, August 2002): Learn more about XML data binding with Castor.

  • "Transcending the limits of DOM, SAX, and XSLT" (David Mertz, developerWorks, October 2001): Review the limitations of each model and possible alternative patterns.

  • "Implement XMLReader" (Benoît Marchal, developerWorks, November 2003): Learn to write XML documents through a SAX API.

  • "Converting from DOM" and "Converting from SAX" (Brett McLaughlin, developerWorks, April 2001): Convert between DOM and SAX to communicate with apps.

  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

  • XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.

  • developerWorks technical events and webcasts: Stay current with technology in these sessions.


Get products and technologies
  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.


About the author

Benoît Marchal is a consultant and writer based in Namur, Belgium. He is the author of XML by Example, Applied XML Solutions, and XML and the Enterprise. He produces the Declencheur podcast on photography.
