Level: Intermediate
Brett McLaughlin (brett@newInstance.com), Author/Editor, O'Reilly Media, Inc.
11 Oct 2005
The latest version of the Java™ programming language -- Java 5.0 -- includes an improved, expanded version of the Java API for XML Processing (JAXP). A major addition to JAXP is the new validation API, which allows greater interactivity, support for XML Schema and RELAX NG, and the ability to make on-the-fly changes while validating. All of these improvements finally give Java developers an industrial-strength solution for XML validation. This article details the new API, from its basics to the more advanced features.
For several years, the Java API for XML Processing (JAXP) remained a stable, somewhat boring API. That's not a bad thing. Being boring often translates to being reliable -- always a plus for software. But JAXP's dullness has lulled developers away from looking to it for new features. JAXP didn't evolve much from Java 1.3 to 1.4, aside from supporting the newest versions of the SAX and DOM specifications (see Resources). But in Java 5.0 and JAXP 1.3, Sun has added to JAXP significantly. Along with support for XPath, validation is the most notable addition. This article gives you an in-depth look at JAXP 1.3's validation features, implemented in the javax.xml.validation package.
A brief history lesson
Schema, schema, everywhere
In this article (and in general), the term schema refers to any constraint model that follows an XML format. So an XML Schema is a schema, but a schema is not always an XML Schema (as in the W3C spec). For example, the term schema can also apply to RELAX NG schemas. A generic term like schema makes it easier to refer to a particular approach (XML-based constraint models) without the restrictions of a specific implementation.
Before you tackle the specifics of the validation API, you need a solid understanding of how validation worked prior to JAXP 1.3. Also, it appears that Sun will continue to provide the older approach for DTD validation, favoring the new API for schema-based validation. So even if you want to move exclusively to the javax.xml.validation package, you still need to understand the old methodology for validating documents that use DTDs.
Create parser factories
In normal JAXP processing, everything begins with a factory. SAXParserFactory is used for SAX parsing, and DocumentBuilderFactory for DOM parsing. You create both with the static newInstance() method, as demonstrated in Listing 1.
Listing 1. Create a SAXParserFactory and a DocumentBuilderFactory
// Create a new SAX Parser factory
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
// Create a new DOM Document Builder factory
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
Turn on validation
One factory, many parsers
Options set on a factory affect all the parsers created with that factory. When you invoke setValidating() with a value of true, you effectively tell the factory that all parsers it creates must be validating. Keep this in mind: It's easy to turn on validation in a factory, forget about that setting 100 lines of code later, and generate a parser that you don't remember is validating.
While SAXParserFactory and DocumentBuilderFactory differ in features and properties specific to SAX and DOM, for validation purposes they share a single method: setValidating(). As you might expect, to turn on validation, you supply the value true for this method. However, you use the factory to create parsers, not to parse documents directly. Once you have a factory, you call either newSAXParser() (for SAX) or newDocumentBuilder() (for DOM). Listing 2 shows both in action, along with validation being turned on.
Listing 2. Turn on validation (for DTDs)
// Create a new SAX Parser factory
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
// Turn on validation
saxFactory.setValidating(true);
// Create a validating SAX parser instance
SAXParser parser = saxFactory.newSAXParser();
// Create a new DOM Document Builder factory
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
// Turn on validation
domFactory.setValidating(true);
// Create a validating DOM parser
DocumentBuilder builder = domFactory.newDocumentBuilder();
In both cases, you are left with an object (either a SAXParser or DocumentBuilder) that can parse XML and validate that XML as it is parsed. Keep in mind, though, that this is strictly for DTD validation. The setValidating(true) call has absolutely no effect on schema-based validation.
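To make the DTD-based approach concrete, here is a minimal sketch of a validating DOM parse. The class name, the inline DTD, and the sample documents are all assumptions made up for illustration; the point is only that a validating parser reports DTD violations to an ErrorHandler, which this sketch collects into a list.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {

    // Hypothetical internal DTD: a <note> must contain exactly one <to>
    static final String DOCTYPE =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";

    // Parses the XML with a validating DOM parser and returns every
    // DTD validation error that the parser reported
    static List<String> dtdErrors(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    // Warnings are ignored in this sketch
                }
                public void error(SAXParseException e) {
                    errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e; // The document is not well-formed, so give up
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(dtdErrors(DOCTYPE + "<note><to>Brett</to></note>"));
        System.out.println(dtdErrors(DOCTYPE + "<note><from>Brett</from></note>"));
    }
}
```

The first document satisfies the DTD, so its error list should be empty; the second uses an undeclared element, so the list should not be.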
Introducing javax.xml.validation
Five years ago, having a nifty method to turn on DTD validation was enough. Even two years ago, schema languages such as XML Schema and RELAX NG were still coming into their own and working out their bugs. Today, though, documents are validated against schemas as often as they are against DTDs. This evenness in approach is largely due to legacy documents that still use DTDs. A few years from now, DTDs might end up much like Lisp -- more an artifact of history than a mainstream technology.
Support for schema validation in JAXP 1.3 -- through the introduction of the javax.xml.validation package -- has garnered kudos from developers. The package is simple to use, free from code bloat, and now standard with the Java language. And, perhaps best of all, if you already work with SAX and DOM through JAXP, then it's even easier to learn how to deal with validation. The model is similar, and you'll find this run through the API a breeze.
Work with a SchemaFactory
As you know from A brief history lesson, when you work with SAX, your first step is to create a new SAXParserFactory. When you work with DOM, you start with a DocumentBuilderFactory. It shouldn't come as any surprise, then, that you begin schema validation with an instance of the SchemaFactory class, as shown in Listing 3.
Listing 3. Create a SchemaFactory
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;
...
SchemaFactory schemaFactory =
    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
This resembles how you create other factories, with the addition of a parameter sent to the newInstance() method. You must pass into this method one of the constants defined in another class -- a class that you need to get familiar with: javax.xml.XMLConstants. This class defines all sorts of constants that are useful in JAXP applications, but for now you just need to know about two:
- XMLConstants.RELAXNG_NS_URI, for RELAX NG schemas
- XMLConstants.W3C_XML_SCHEMA_NS_URI, for W3C XML Schema
Because the SchemaFactory is tied to a specific constraint model, you must supply this value at factory construction.
Several other options are available on the SchemaFactory class. I'll deal with those later, in Going deeper with validation. For normal XML validation, you'll find that the preconfigured factory works just fine.
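Because newInstance() throws an IllegalArgumentException when no implementation of the requested schema language is available, it can be worth probing before you rely on a particular language. The following is a small sketch of that check; the class and method names are assumptions, and whether RELAX NG is reported as available depends entirely on the JAXP implementation on your classpath.

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

class SchemaLanguageProbe {

    // Returns true if a SchemaFactory can be created for the given
    // schema language URI, false if no implementation is available
    static boolean isSupported(String schemaLanguage) {
        try {
            SchemaFactory.newInstance(schemaLanguage);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown when no implementation of the schema language is found
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("W3C XML Schema: "
            + isSupported(XMLConstants.W3C_XML_SCHEMA_NS_URI));
        System.out.println("RELAX NG: "
            + isSupported(XMLConstants.RELAXNG_NS_URI));
    }
}
```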
Validate against a schema
Use the Source, Luke
Despite the awful pun that passes for this sidebar's title, the Source interface is pretty important throughout JAXP. Originated for XML transformation processing, it's become the standard for input to various JAXP constructs, at least when you don't use the Java language's IO classes directly. If you've never used Source and its implementations before, check out the documentation and articles on XML transformations in Resources.
Once your factory is ready, you just need to load the constraint set you want to use. You do this through the newSchema() method on the factory. But that method takes a javax.xml.transform.Source implementation as input, so an intermediary step is required: converting your schema to a Source representation. This is a simple matter, demonstrated in Listing 4.
Listing 4. From constraints to Schema
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Schema;
...
// Create a factory for W3C XML Schema
SchemaFactory schemaFactory =
    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
// Wrap the schema file in a Source
Source schemaSource =
    new StreamSource(new File("constraints.xml"));
// Compile the constraints into a Schema
Schema schema = schemaFactory.newSchema(schemaSource);
This should be fairly intuitive to you if you're at all familiar with JAXP. In Listing 4, I loaded a file called constraints.xml. You can use any method you want to get the data into a Source, including pulling constraints using SAX or DOM (taking advantage of SAXSource and DOMSource, respectively), or even a URL.
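As a rough illustration of that flexibility, the sketch below loads the same schema twice: once from a StreamSource wrapped around a Reader, and once from a SAXSource wrapped around an InputSource. The inline schema and the class and method names are assumptions chosen to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SchemaSourceSketch {

    // Hypothetical schema: a single <note> element holding a string
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note' type='xs:string'/>"
      + "</xs:schema>";

    // Compiles a Schema from whatever kind of Source it is handed
    static Schema load(Source schemaSource) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(schemaSource);
        } catch (SAXException e) {
            throw new IllegalArgumentException("Schema could not be loaded", e);
        }
    }

    public static void main(String[] args) {
        // The same constraints, delivered through two Source implementations
        Schema fromStream = load(new StreamSource(new StringReader(XSD)));
        Schema fromSax =
            load(new SAXSource(new InputSource(new StringReader(XSD))));
        System.out.println("Loaded: " + fromStream + " and " + fromSax);
    }
}
```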
Once you have the Source implementation, feed it to your factory's newSchema() method. You'll get a Schema back as a result. At this point, it's simple to validate a document. Check out Listing 5 for the details.
Listing 5. Validate XML
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.Validator;
...
SchemaFactory schemaFactory =
    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source schemaSource = new StreamSource(new File("constraints.xml"));
Schema schema = schemaFactory.newSchema(schemaSource);
// Get a Validator tied to the schema
Validator validator = schema.newValidator();
// Parse and validate the target document
validator.validate(new StreamSource("my-file.xml"));
Again, there's nothing revolutionary here. If you know the classes to use and the methods to call, then this is easy. Because you want to validate, you need to use the Validator class. You can obtain it from your Schema with the aptly named newValidator() method. Finally, you can call validate(), again passing in a Source implementation -- this time representing the XML to parse and validate.
With that call, the target XML is parsed and validated. Keep in mind that even if you supply the XML as a DOMSource, which is an already parsed representation of XML, parsing might well need to reoccur. Validation is still something that's tightly linked to parsing, so expect a little bit of time to pass during the validation process.
If errors occur, you'll get exceptions indicating what went wrong. Most implementations of JAXP include a line number and sometimes a column number to help you know exactly which part of the XML offended the constraint model. Of course, just throwing out an exception isn't a very robust way to deal with problems. I'll give you information on a better approach in the next section.
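To show what that exception carries, here is a minimal sketch that catches the SAXParseException a Validator throws when no ErrorHandler is set, and reads the location from it. The schema, the sample documents, and the class and method names are assumptions; the exact line and column values depend on the parser in use.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationFailureReport {

    // Hypothetical schema: an <order> containing one integer <qty>
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='order'><xs:complexType><xs:sequence>"
      + "<xs:element name='qty' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns null if the document is valid; otherwise a description
    // of the first problem, including its location when known
    static String describeFailure(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("The schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber()
                + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
        } catch (SAXException e) {
            return e.getMessage(); // No location information available
        } catch (IOException e) {
            throw new IllegalStateException("Could not read document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure("<order>\n<qty>3</qty>\n</order>"));
        System.out.println(describeFailure("<order>\n<qty>many</qty>\n</order>"));
    }
}
```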
It might seem like this is a rather lengthy chain of events: Get the factory, get the schema, get the validator. It's certainly true that JAXP could provide a method on the factory to do all this for you, perhaps something like validate(Source schema, Source xmlDocument). But the modularity here is good; you'll see in the next section that with classes for both Schema and Validator, you can handle some pretty odd edge cases in your XML processing. And, if you must have that utility method, write it yourself -- it's a nice exercise!
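If you'd like a head start on that exercise, one possible shape for the utility is sketched below. It is only one of many reasonable designs: the class name is an assumption, and the choice to return a boolean (rather than let the SAXException propagate) is mine, made so that callers can treat an invalid document as an ordinary outcome.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class ValidationUtil {

    // Hypothetical sample schema used by main(): one integer <count>
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='count' type='xs:int'/>"
      + "</xs:schema>";

    // Chains factory, schema, and validator into a single call.
    // Returns true if the document is valid against the schema.
    static boolean validate(Source schemaSource, Source xmlDocument) {
        Schema schema;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(schemaSource);
        } catch (SAXException e) {
            // A schema that cannot be compiled is a caller error
            throw new IllegalArgumentException("Invalid schema", e);
        }
        try {
            schema.newValidator().validate(xmlDocument);
            return true;
        } catch (SAXException e) {
            return false; // Not well-formed, or not valid
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(validate(
            new StreamSource(new StringReader(XSD)),
            new StreamSource(new StringReader("<count>3</count>"))));
        System.out.println(validate(
            new StreamSource(new StringReader(XSD)),
            new StreamSource(new StringReader("<count>three</count>"))));
    }
}
```

Note that this version recompiles the schema on every call. If you validate many documents against the same constraints, it is probably better to keep the Schema around and create a new Validator for each document.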
Going deeper with validation
For many applications, what I've shown you so far will be everything you need, and more. You can send an input document and a schema off to a method and let it validate. A simple Exception tells you if anything went wrong and even gives you some basic information on how to fix the problem. For applications that use XML as a data format, or perhaps just to pass some information around, this is probably all you will ever need to learn about JAXP's validation functionality.
However, we live in a world where XML editors, file and code generators, and Web services abound. For these applications -- where XML isn't so much ancillary as much as it is the application -- basic validation is often not enough. For those applications JAXP offers a lot more, which I'll focus on now.
Deal with errors
First and foremost, an Exception is supposed to indicate exceptional behavior. For these XML-based applications, though, failure of a file to validate might not be exceptional at all; it might just be one of several possible outcomes. Take, for example, an XML-capable editor or IDE. In these situations, invalid XML shouldn't crash and halt the system. Further, having code that knows something went wrong only in the form of an Exception is pretty heavy-handed.
Of course, this is nothing new if you're a JAXP veteran; you're already used to supplying an org.xml.sax.ErrorHandler to your SAXParser or DocumentBuilder. That interface, with its three methods -- warning(), error(), and fatalError() -- simplifies how you handle errors in parsing. Fortunately, this same facility is available when you validate XML. Even better, you use the same interface. That's right -- the ErrorHandler interface is usable in validation just as it is in parsing. Check out Listing 6 for a simple example.
Listing 6. Handle validation errors
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
...
SchemaFactory schemaFactory =
    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source schemaSource = new StreamSource(new File("constraints.xml"));
Schema schema = schemaFactory.newSchema(schemaSource);
Validator validator = schema.newValidator();
// MySchemaErrorHandler is your own implementation of ErrorHandler
ErrorHandler mySchemaErrorHandler = new MySchemaErrorHandler();
validator.setErrorHandler(mySchemaErrorHandler);
validator.validate(new StreamSource("my-file.xml"));
Just as when you work with SAX, you can use this interface to customize how errors are handled. This lets your application gracefully exit validation, print error messages, or even to try to recover from the error and continue validating. And if you are already familiar with this interface, there's no additional learning curve!
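The article does not show the body of MySchemaErrorHandler, so here is one hypothetical stand-in: a handler that records warnings and errors and lets validation continue, stopping only on fatal errors. The class name, the sample schema, and the static helper are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CollectingErrorHandler implements ErrorHandler {

    private final List<String> problems = new ArrayList<String>();

    public void warning(SAXParseException e) { record("warning", e); }

    // Recoverable: record the problem and let validation continue
    public void error(SAXParseException e) { record("error", e); }

    // Not recoverable: record the problem, then stop validation
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatal", e);
        throw e;
    }

    private void record(String level, SAXParseException e) {
        problems.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    List<String> getProblems() { return problems; }

    // Hypothetical schema: a <pair> of two integers, <a> and <b>
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='pair'><xs:complexType><xs:sequence>"
      + "<xs:element name='a' type='xs:int'/>"
      + "<xs:element name='b' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Validates the XML against XSD and returns everything the handler saw
    static List<String> validate(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
            validator.setErrorHandler(handler);
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Validation stopped early; a fatal error rethrown by the
            // handler has already been recorded in its list
        } catch (IOException e) {
            throw new IllegalStateException("Could not read document", e);
        }
        return handler.getProblems();
    }

    public static void main(String[] args) {
        for (String problem : validate("<pair><a>x</a><b>y</b></pair>")) {
            System.out.println(problem);
        }
    }
}
```

Because error() does not throw, the validator keeps going after the first bad value, so a single pass can report problems in both a and b.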
Load multiple schemas
The other, other setErrorHandler()
If you read the JavaDoc for the javax.xml.validation package, you might notice a setErrorHandler() method on SchemaFactory, as well as on the Validator class. When you set a handler for the SchemaFactory class, you're allowing -- and handling -- errors that might occur when parsing a schema during the invocation of newSchema(). So this is part of the validation API, but it's intended for schema-parsing errors rather than schema-validation errors.
In some rare cases, you might want to construct a Schema object from multiple schemas. This bends the mind a bit; a single Schema does not map to a single schema or file. Instead, the object represents a set of constraints. Those constraints can come from a single file or even multiple files. For this reason, you can supply the newSchema() method an array of Source implementations -- representing multiple constraints -- through the method newSchema(Source[] sourceList). You still get back a single Schema object, which represents the combination of the supplied schemas.
As you might expect, a lot can go wrong in this situation. You're well-advised to set an ErrorHandler on your SchemaFactory (see Deal with errors for more information), so that you're prepared to handle problems as they arise.
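A minimal sketch of combining two schemas follows. The two schemas, their urn:example namespaces, the system IDs, and the class and method names are all assumptions invented for this example; the system IDs serve only as labels, because the content comes from the Readers.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class MultiSchemaSketch {

    // Two hypothetical schemas, each with its own target namespace
    static final String XSD_A =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:a' elementFormDefault='qualified'>"
      + "<xs:element name='a' type='xs:string'/></xs:schema>";
    static final String XSD_B =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:b' elementFormDefault='qualified'>"
      + "<xs:element name='b' type='xs:int'/></xs:schema>";

    // Builds one Schema object that represents both sets of constraints
    static Schema combined() {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // This handler sees problems found while the schemas are parsed
        factory.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                System.err.println("Schema warning: " + e.getMessage());
            }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        Source[] sources = {
            new StreamSource(new StringReader(XSD_A), "a.xsd"),
            new StreamSource(new StringReader(XSD_B), "b.xsd")
        };
        try {
            return factory.newSchema(sources);
        } catch (SAXException e) {
            throw new IllegalStateException("Schemas could not be combined", e);
        }
    }

    // Validates a document against the combined constraints
    static boolean isValid(String xml) {
        try {
            combined().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("Could not read document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<a xmlns='urn:example:a'>hello</a>"));
        System.out.println(isValid("<b xmlns='urn:example:b'>42</b>"));
    }
}
```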
Integrate validation into parsing
So far, you've seen validation largely as a totally separate function from parsing. However, this doesn't need to be the case. Once you have a Schema object, you can assign it to either a SAXParserFactory or a DocumentBuilderFactory, in both cases through the setSchema() method (see Listing 7).
Listing 7. Integrate validation into parsing
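import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
...
// Create the factory that will build the parser
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// XML Schema validation relies on namespace processing
factory.setNamespaceAware(true);
// Set up an XML Schema validator, using the supplied schema
Source schemaSource = new StreamSource(new File(args[1]));
SchemaFactory schemaFactory = SchemaFactory.newInstance(
    XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(schemaSource);
// Instead of explicitly validating, assign the Schema to the factory
factory.setSchema(schema);
// Parsers from this factory will automatically validate against the
// associated schema
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(args[0]));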
Note that you do not need to use the setValidating() method to turn validation on explicitly. Any time a factory has a non-null Schema, parsers (or builders) created from that factory will validate against that Schema. As you might expect, any errors in validation are reported to the ErrorHandler set on the parsers.
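The SAX side works the same way. The sketch below assigns a Schema to a SAXParserFactory and collects validation errors through the error() callback of a DefaultHandler. The schema and the class and method names are assumptions, and the call to setNamespaceAware(true) reflects my understanding that XML Schema validation needs namespace processing, rather than anything the listing above requires.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SchemaAwareSaxSketch {

    // Hypothetical schema: a single integer <count> element
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='count' type='xs:int'/>"
      + "</xs:schema>";

    // Parses the XML with a schema-aware SAX parser and returns the
    // validation errors reported while parsing
    static List<String> parseAndCollect(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));

            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema); // No setValidating(true) needed
            SAXParser parser = factory.newSAXParser();

            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(parseAndCollect("<count>3</count>"));
        System.out.println(parseAndCollect("<count>three</count>"));
    }
}
```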
The big warning
As good as all this sounds -- and it does sound pretty great to me -- JAXP's new validation API has some significant problems. First and foremost, even in the production release of Java 5.0 and JAXP 1.3, I found schema validation to be buggy and quirky. Parsers are still adding support for the new API, and that means that edge cases -- lesser-used features of the API -- are often only partially implemented (and sometimes not at all). I found that in many cases, documents validated with a standalone validator like xmllint (see Resources), but failed validation with JAXP.
It also seems much more reliable to use the Validator class and the validate() method directly, rather than assigning a Schema to a SAXParserFactory or DocumentBuilderFactory. I recommend that you take the same cautious approach. I'm not urging you to ditch the API, but merely to double-check your results on as many sample documents as possible, and be diligent about your error handling.
Summary
To be perfectly fair, there's nothing patently new in JAXP's validation API. You could always use SAX or DOM to parse and validate XML, and -- using SAX's ErrorHandler interface -- with clever programming you could even respond to validation errors on the fly. However, this all requires fairly intimate knowledge of SAX, time to test and debug, and careful memory management (if you're eventually creating a DOM Document object). It's here that the JAXP validation API shines. It provides a well-tested, ready-to-use solution that offers far more than a giant on/off switch for schema-based validation. It also integrates easily with your existing JAXP code, making the addition of schema validation a snap. I can't imagine any Java developer getting along with XML for long without finding at least a few great uses for JAXP validation.
Resources
Learn
- "All about JAXP, Part 1" and "All about JAXP, Part 2" (developerWorks, May 2005): This two-part series by Brett McLaughlin introduces JAXP, showing how to take advantage of the API's parsing and validation features and its support for XSL transformations.
- "What's new in JAXP 1.3? Part 1" and "What's new in JAXP 1.3? Part 2" (developerWorks, November and December 2004): Get an in-depth look into the new features in JAXP 1.3.
- "Tip: Validation and the SAX ErrorHandler interface" (developerWorks, June 2001): Learn more about SAX's validation capabilities and the ErrorHandler interface.
- "XML Schema validation in Xerces-Java 2" (developerWorks, July 2002): This tutorial by Nicholas Chase guides you through the process of using schema validation with Xerces-J.
- Sun's Java technology and XML headquarters: This is a great place to get started with JAXP.
- Java 2 Platform Standard Edition 5.0 API Specification: The JAXP JavaDoc is now integrated into the Java 5.0 core API documentation.
- Simple API for XML (SAX): Find out more about the APIs under the covers of JAXP. Start with SAX 2 for Java.
- W3C Document Object Model (DOM): For another view of XML supported by JAXP, take a look at DOM.
- Apache Xerces2 Java Parser: Sun uses the Xerces parser in its JDK 5.0 implementation.
- developerWorks' New to XML page: Need a more basic introduction to XML? This page has lots of useful resources, including Doug Tidwell's popular tutorial "Introduction to XML" (developerWorks, August 2002).
- IBM XML certification: Find out how you can become an IBM Certified Developer in XML and related technologies.
Get products and technologies
- Java 2 Platform Standard Edition 5.0: If you're new to Java programming, you can get JAXP along with a complete JDK.
- Libxml2: Libxml2 is the XML C parser and toolkit developed for the Gnome project. It includes the xmllint validation program.
About the author
Brett McLaughlin has worked in computers since the Logo days. (Remember the little triangle?) In recent years, he's become one of the most well-known authors and programmers in the Java and XML communities. He's worked for Nextel Communications, implementing complex enterprise systems; at Lutris Technologies, actually writing application servers; and most recently at O'Reilly Media, Inc., where he continues to write and edit books that matter. His most recent book, Java 5.0 Tiger: A Developer's Notebook, is the first book available on the newest version of Java technology, and his classic Java and XML remains one of the definitive works on using XML technologies in the Java language.