Level: Intermediate
Peter Nehrer (pnehrer@ecliptical.ca), Software consultant, Consultant
12 Dec 2006
In addition to a low-level cursor-based API, StAX provides a powerful iterator-based method to process XML that uses event objects to communicate information about the parsed stream. Part 2 explored this API in detail and provided some examples of its use. In this article, you'll examine customization techniques that use application-defined events. In particular, you'll see how to create custom event classes and use them to process XML with the event iterator-based API. Last but not least, you'll review the serialization API provided by StAX for writing XML as a stream of tokens as well as event objects.
Creating custom events
When you develop complex applications, it is often useful to build the application using a layered approach -- a lower layer provides the necessary abstractions to the application's upper layers, and so on. For instance, you might group together all code that processes XML into a higher-level object model specific to your application. This technique not only allows for the reuse of common concepts and solutions, but it also speeds up development time and results in more maintainable code.
Because the pull-based approach used by StAX leaves the application in control of the parsing process, nothing precludes you from converting the parsed events into application-specific model objects (such as proprietary messages or other structural building blocks). However, you might find it more convenient to stay in the realm of events and simply create customized events to represent more complex structures in your XML content. By superimposing your custom types over their underlying XML data structures, you can simplify the development of your application code while allowing the lower layers to still work with these types as event objects (and, for instance, write them out into an output stream as events).
The StAX event hierarchy is open-ended. You can extend the existing events and define your own event types that are new altogether. Because event objects are defined as Java™ interfaces rather than classes, you have a lot of freedom in how you implement them. For example, you can subclass your existing object model and represent each type as an event. You can also do the same through composition, delegation, and so on. Listing 1 shows an example of extending the Characters event to represent values of a specific data type (a Java Date, in this case). All the subclass does is convert the text data into a date value. Note that because there are no public implementations of the standard event interfaces provided with StAX, you use the Decorator pattern to wrap an existing Characters instance and delegate all method invocations to it.
Listing 1. Custom extension of the Characters event to represent Date values
final DatatypeFactory tf = DatatypeFactory.newInstance();

class DateTime implements Characters {
    private final Characters d;
    private final Date value;

    DateTime(Characters d) {
        this.d = d;
        // Parse the element text as an XML Schema date/time value
        XMLGregorianCalendar cal = tf.newXMLGregorianCalendar(d.getData());
        value = cal.toGregorianCalendar().getTime();
    }

    Date getValue() { return value; }

    // Everything else is delegated to the wrapped Characters event
    public Characters asCharacters() { return d.asCharacters(); }
    public EndElement asEndElement() { return d.asEndElement(); }
    public StartElement asStartElement() { return d.asStartElement(); }
    public String getData() { return d.getData(); }
    public int getEventType() { return d.getEventType(); }
    public Location getLocation() { return d.getLocation(); }
    public QName getSchemaType() { return d.getSchemaType(); }
    public boolean isAttribute() { return d.isAttribute(); }
    public boolean isCData() { return d.isCData(); }
    public boolean isCharacters() { return d.isCharacters(); }
    public boolean isEndDocument() { return d.isEndDocument(); }
    public boolean isEndElement() { return d.isEndElement(); }
    public boolean isEntityReference() { return d.isEntityReference(); }
    public boolean isIgnorableWhiteSpace() { return d.isIgnorableWhiteSpace(); }
    public boolean isNamespace() { return d.isNamespace(); }
    public boolean isProcessingInstruction() { return d.isProcessingInstruction(); }
    public boolean isStartDocument() { return d.isStartDocument(); }
    public boolean isStartElement() { return d.isStartElement(); }
    public boolean isWhiteSpace() { return d.isWhiteSpace(); }

    public void writeAsEncodedUnicode(Writer writer)
            throws XMLStreamException {
        d.writeAsEncodedUnicode(writer);
    }
}
Completely new events are typically used to represent application-specific XML structures. For example, you might represent certain tags with their attributes and/or text content by a custom event class that defines an API for accessing this data (thus freeing the application from having to use the XML API to get to it). Even if a custom event is a composition of several standard events (such as StartElement, Characters, and EndElement), it is still a separate type of event because it cannot be mapped to any one of these events in particular.
The XMLEvent interface
Any custom event class must implement the XMLEvent interface. In the current version of StAX, there is no abstract base class that one can extend to implement custom events. Fortunately, you can implement most methods in XMLEvent rather trivially. For instance, the downcasting methods (asStartElement(), asEndElement(), and asCharacters()) can each simply throw a ClassCastException (since the custom event is not any one of those). Likewise, the boolean query methods, shorthands for determining the event type (methods isAttribute(), isCharacters(), and so on) can simply return false. If the custom event is a composition of other events, the getLocation() method can simply return the last event's location. The getSchemaType() method can return the name of the outer element's complex type from the associated XML Schema (if any). You might write the writeAsEncodedUnicode(Writer) method to delegate to each contained event in turn (e.g., for a custom event representing a whole element with simple content, first the StartElement is written, then the Characters, followed by the EndElement). Finally, the getEventType() method must return a value that uniquely represents the new event type throughout the application. Values 0 through 256 are reserved for use by the StAX provider, so you must select something outside of that range (and document it well, so that there are no conflicts with other custom events used throughout the application).
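Because StAX does not supply an abstract base class, an application that defines several custom events might factor the trivial methods described above into one of its own. The following is a minimal sketch under that assumption. The names AbstractCustomEvent and NoteEvent, and the type code 1000, are hypothetical and not part of StAX.

```java
import java.io.Writer;
import javax.xml.namespace.QName;
import javax.xml.stream.Location;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical convenience base class; StAX itself does not provide one. */
abstract class AbstractCustomEvent implements XMLEvent {
    private final int eventType;
    private final Location location;

    /** eventType should lie outside the range reserved for the StAX provider. */
    protected AbstractCustomEvent(int eventType, Location location) {
        this.eventType = eventType;
        this.location = location;
    }

    public int getEventType() { return eventType; }
    public Location getLocation() { return location; }
    public QName getSchemaType() { return null; }

    // A custom event is none of the standard events, so every downcast fails.
    public Characters asCharacters() { throw new ClassCastException("not a Characters event"); }
    public EndElement asEndElement() { throw new ClassCastException("not an EndElement event"); }
    public StartElement asStartElement() { throw new ClassCastException("not a StartElement event"); }

    // Likewise, every standard type query answers false.
    public boolean isAttribute() { return false; }
    public boolean isCharacters() { return false; }
    public boolean isEndDocument() { return false; }
    public boolean isEndElement() { return false; }
    public boolean isEntityReference() { return false; }
    public boolean isNamespace() { return false; }
    public boolean isProcessingInstruction() { return false; }
    public boolean isStartDocument() { return false; }
    public boolean isStartElement() { return false; }

    // Subclasses decide how the event is serialized.
    public abstract void writeAsEncodedUnicode(Writer writer) throws XMLStreamException;
}

/** Illustrative subclass: a text-only event with the made-up type code 1000. */
class NoteEvent extends AbstractCustomEvent {
    static final int NOTE = 1000;
    private final String text;

    NoteEvent(String text, Location location) {
        super(NOTE, location);
        this.text = text;
    }

    String getText() { return text; }

    public void writeAsEncodedUnicode(Writer writer) throws XMLStreamException {
        // Delegate serialization to a standard Characters event.
        XMLEventFactory.newInstance().createCharacters(text).writeAsEncodedUnicode(writer);
    }
}
```

With such a base class, each concrete event only has to supply its own accessors and its serialization logic. A real event would normally pass the location of one of the standard events it was built from.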
Listing 2 shows an example of a custom event that represents the icon element in an Atom feed. This element is defined by the Atom Syndication Format specification (version 1.0) to contain a URL of the feed's icon.
Listing 2. Custom event to represent Atom feed icons (a text-only element)
class Icon implements XMLEvent {
    private final StartElement startElement;
    private final String url;
    private final EndElement endElement;

    Icon(StartElement startElement, String url, EndElement endElement) {
        this.startElement = startElement;
        this.url = url;
        this.endElement = endElement;
    }

    String getURL() { return url; }

    // An Icon is not any one of the standard events
    public Characters asCharacters() { throw new ClassCastException(); }
    public EndElement asEndElement() { throw new ClassCastException(); }
    public StartElement asStartElement() { throw new ClassCastException(); }

    // Application-defined event type, outside the reserved range
    public int getEventType() { return 257; }
    public Location getLocation() { return endElement.getLocation(); }
    public QName getSchemaType() { return null; }

    public boolean isAttribute() { return false; }
    public boolean isCharacters() { return false; }
    public boolean isEndDocument() { return false; }
    public boolean isEndElement() { return false; }
    public boolean isEntityReference() { return false; }
    public boolean isNamespace() { return false; }
    public boolean isProcessingInstruction() { return false; }
    public boolean isStartDocument() { return false; }
    public boolean isStartElement() { return false; }

    public void writeAsEncodedUnicode(Writer writer)
            throws XMLStreamException {
        startElement.writeAsEncodedUnicode(writer);
        // ef is the shared XMLEventFactory instance declared in Listing 3
        ef.createCharacters(url).writeAsEncodedUnicode(writer);
        endElement.writeAsEncodedUnicode(writer);
    }
}
Thus far, you have learned how to define custom event classes, but nothing has been said about how to make StAX use them during the parsing process. The following section explores the techniques for plugging your custom event classes into the framework.
XML parsing using custom events
Once you define a custom event, you can use it in your event processing code. You are free to translate any sequence of standard events into your custom event (and pass that on to other components that act as consumers of your events), just like you would if you convert events into an application-specific object model. However, StAX provides a more generic mechanism for plugging custom event implementations into the framework.
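The first option mentioned above, translating a run of standard events inside your own processing code, can be sketched as follows. This is only an illustration: the class name IconExtractor and the sample feed are made up, and the element text is returned as a plain String where an application might instead build a custom event such as Icon.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class IconExtractor {
    static final QName ICON = new QName("http://www.w3.org/2005/Atom", "icon");

    /** Returns the text of the first Atom icon element, or null if there is none. */
    static String findIconUrl(String xml) throws XMLStreamException {
        XMLEventReader r = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        try {
            while (r.hasNext()) {
                XMLEvent event = r.nextEvent();
                if (event.isStartElement()
                        && ICON.equals(event.asStartElement().getName())) {
                    // Consumes the text events and the matching EndElement.
                    return r.getElementText().trim();
                }
            }
            return null;
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String feed = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<title>Example</title>"
                + "<icon>http://example.org/icon.png</icon></feed>";
        System.out.println(findIconUrl(feed));
    }
}
```

The translation logic here lives in the consumer, so every component that needs the icon must repeat it or share this helper.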
The XMLEventAllocator API
As you already know, the event iterator-based API is a layer that is logically above the cursor-based API. In fact, a canonical implementation of XMLEventReader might wrap an XMLStreamReader and create event objects based on its current state. To support application-defined events, XMLEventReader uses an application-supplied instance of XMLEventAllocator whenever it needs to create a concrete event. This is configurable through the XMLInputFactory; in order to use a custom XMLEventAllocator, the application should provide an instance of it by calling the XMLInputFactory's setEventAllocator(XMLEventAllocator) method.
XMLEventAllocator is essentially a bridge between the cursor-based and the event iterator-based API. In addition to serving as its own factory (the XMLInputFactory calls the allocator's newInstance() method for each new XMLEventReader instance), the interface defines two event allocation methods. Method allocate(XMLStreamReader) is required to return an XMLEvent that represents the stream reader's current state, which cannot be modified by this method. This method should basically implement a cursor-to-event-object mapping. The other method, allocate(XMLStreamReader, XMLEventConsumer), should use the supplied stream reader to create one or more event objects and pass them on to the supplied event consumer. The application is free to modify the stream reader's state in the process; for instance, it can split a single state into multiple event objects, or conversely, coalesce several states into one event (for example, by reading a sequence of tokens and representing that as one event object).
The use of XMLEventAllocator is optional and there is no default instance of it (that is, by default the XMLEventReader uses an implementation-specific method of creating event objects). This is somewhat unfortunate, because it means that developers cannot simply extend an existing allocator (for example, by decorating it), but must build their own from the ground up. However, a custom XMLEventAllocator is not required to provide its own implementation of each standard event type. It can (and is even encouraged to) use the XMLEventFactory for that purpose.
Listing 3 shows an implementation of a simple allocator that converts the text content of Atom feed elements "published" and "updated" into a DateTime event (an extension of the Characters event). These elements have text-only content whose text should represent a date/time value. This customization is performed in the allocate(XMLStreamReader, XMLEventConsumer) method; when an appropriate StartElement is encountered, any element text (up to the corresponding EndElement) is converted into a single Characters event, which is in turn decorated with the custom DateTime event. This result is then fed into the event consumer.
Listing 3. Simple event allocator that creates DateTime events from the text of elements whose content is known to represent a date/time value
final String ATOM_NS = "http://www.w3.org/2005/Atom";
final QName PUBLISHED = new QName(ATOM_NS, "published");
final QName UPDATED = new QName(ATOM_NS, "updated");
final XMLEventFactory ef = XMLEventFactory.newInstance();

class CustomAllocator implements XMLEventAllocator {

    // Maps the reader's current state to a standard event object
    public XMLEvent allocate(XMLStreamReader r)
            throws XMLStreamException {
        switch (r.getEventType()) {
        case XMLStreamConstants.CDATA:
            return ef.createCData(r.getText());
        case XMLStreamConstants.CHARACTERS:
            if (r.isWhiteSpace())
                return ef.createSpace(r.getText());
            return ef.createCharacters(r.getText());
        case XMLStreamConstants.COMMENT:
            return ef.createComment(r.getText());
        case XMLStreamConstants.DTD:
            return ef.createDTD(r.getText());
        case XMLStreamConstants.END_DOCUMENT:
            return ef.createEndDocument();
        case XMLStreamConstants.END_ELEMENT:
            return ef.createEndElement(r.getName(), allocateNamespaces(r));
        case XMLStreamConstants.PROCESSING_INSTRUCTION:
            return ef.createProcessingInstruction(r.getPITarget(), r.getPIData());
        case XMLStreamConstants.SPACE:
            return ef.createIgnorableSpace(r.getText());
        case XMLStreamConstants.START_DOCUMENT:
            String encoding = r.getEncoding();
            String version = r.getVersion();
            if (encoding != null && version != null &&
                    r.standaloneSet())
                return ef.createStartDocument(encoding, version, r.isStandalone());
            if (encoding != null && version != null)
                return ef.createStartDocument(encoding, version);
            if (encoding != null)
                return ef.createStartDocument(encoding);
            return ef.createStartDocument();
        case XMLStreamConstants.START_ELEMENT:
            return ef.createStartElement(r.getName(), allocateAttributes(r),
                    allocateNamespaces(r));
        default:
            return null;
        }
    }

    private Iterator allocateNamespaces(XMLStreamReader reader) {
        ArrayList namespaces = new ArrayList();
        for (int i = 0, n = reader.getNamespaceCount(); i < n; ++i) {
            Namespace namespace;
            String prefix = reader.getNamespacePrefix(i);
            String uri = reader.getNamespaceURI(i);
            if (prefix == null)
                namespace = ef.createNamespace(uri);
            else
                namespace = ef.createNamespace(prefix, uri);
            namespaces.add(namespace);
        }
        return namespaces.iterator();
    }

    private Iterator allocateAttributes(XMLStreamReader r) {
        ArrayList attributes = new ArrayList();
        for (int i = 0, n = r.getAttributeCount(); i < n; ++i)
            attributes.add(ef.createAttribute(r.getAttributeName(i),
                    r.getAttributeValue(i)));
        return attributes.iterator();
    }

    // Adds the standard event, then a DateTime event for date-valued elements
    public void allocate(XMLStreamReader reader,
            XMLEventConsumer consumer) throws XMLStreamException {
        XMLEvent event = allocate(reader);
        if (event != null) {
            consumer.add(event);
            if (event.isStartElement()) {
                QName name = event.asStartElement().getName();
                if (PUBLISHED.equals(name) || UPDATED.equals(name)) {
                    String text = reader.getElementText();
                    Characters delegate = ef.createCharacters(text);
                    DateTime dateTime = new DateTime(delegate);
                    consumer.add(dateTime);
                }
            }
        }
    }

    public XMLEventAllocator newInstance() {
        return new CustomAllocator();
    }
}
Once the input factory is set up with a custom event allocator, every event reader created from it will use it to create event objects. The application can use the event reader just like it normally would, but it should expect the custom events to show up in the parsed event stream (see Listing 4).
Listing 4. Iterating over events created using a custom allocator
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setEventAllocator(new CustomAllocator());
XMLEventReader r = factory.createXMLEventReader(uri, input);
try {
    while (r.hasNext()) {
        XMLEvent event = r.nextEvent();
        if (event instanceof DateTime)
            System.out.println(((DateTime) event).getValue());
    }
} finally {
    r.close();
}
In addition to letting you create application-specific abstractions of your XML content, this approach can create a more memory-efficient way of using event objects. By default, the XMLEventReader will create a new instance of every event it returns. While this might be convenient to the application -- it can cache events and use them even after parsing has completed -- it might have negative performance implications in resource-constrained environments or when large XML documents are processed. For applications that do not cache event objects, it might be more efficient to reuse a single instance of each event type, rather than create a new one each time (and thus potentially cause frequent garbage collections).
Such a static allocator might lazily create each type of event the first time it is needed and subsequently return the same instance each time, updated with the new information. However, the application would have to know of this policy change and handle events accordingly. For instance, it would have to be really careful not to pull two events in sequence, or even peek at events, as this could cause the loss of previously returned event data.
Writing XML the StAX way
No introduction to StAX is complete without discussing its serialization support. Just like on the input side, there are two styles of API for outputting XML. At one level, StAX supports writing XML as a stream of low-level tokens without ensuring the output's well-formedness (that is, it is up to the application to call the right methods at the right time). The other, higher-level API style supports the addition of entire event objects into the output stream.
Regardless of the desired API style, the application must first use the XMLOutputFactory to create the appropriate writer object -- either an XMLStreamWriter to use the low-level API, or an XMLEventWriter to use the event object-based API. To obtain the default XMLOutputFactory, call its static newInstance() method (just like any other JAXP factory).
XMLOutputFactory supports several types of output, essentially the output counterparts of the inputs supported by XMLInputFactory; in addition to standard Java I/O OutputStream and Writer, JAXP Result is supported as well.
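As a rough illustration of these output types, the sketch below writes the same small element to a Writer, to an OutputStream with an explicit encoding, and to a JAXP StreamResult. The class and method names are made up for this example, and which Result implementations are accepted can vary between StAX providers.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringWriter;
import java.io.UnsupportedEncodingException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.transform.stream.StreamResult;

class OutputTargets {
    // Writes <greeting>hi</greeting> and releases the writer.
    private static void writeGreeting(XMLStreamWriter w) throws XMLStreamException {
        w.writeStartElement("greeting");
        w.writeCharacters("hi");
        w.writeEndElement();
        w.flush();
        w.close();
    }

    static String toWriter() throws XMLStreamException {
        StringWriter out = new StringWriter();
        writeGreeting(XMLOutputFactory.newInstance().createXMLStreamWriter(out));
        return out.toString();
    }

    static String toStream() throws XMLStreamException, UnsupportedEncodingException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeGreeting(XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8"));
        return out.toString("UTF-8");
    }

    static String toResult() throws XMLStreamException {
        StringWriter out = new StringWriter();
        // Support for a given Result type is provider-dependent.
        writeGreeting(XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new StreamResult(out)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toWriter());
        System.out.println(toStream());
        System.out.println(toResult());
    }
}
```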
Methods defined in the XMLStreamWriter interface are roughly analogous to those in the standard Java I/O DataOutputStream, but for XML rather than simple Java data types. While this writer does not perform well-formedness checks on its input, it does ensure that data in the CHARACTERS event as well as attribute values are properly escaped.
To start writing an XML document, the application calls one of the overloaded versions of the writeStartDocument method. One version takes an explicit document encoding and XML version, another only the XML version, and the one with no arguments uses default values for both encoding and XML version (UTF-8 and 1.0, respectively). For documents with a Document Type Declaration (DTD), the application can call writeDTD(String) with the entire DTD production as the argument.
Three versions of the writeStartElement method are available to write an element's opening tag. One version that takes a prefix, local name, and namespace URI binds the prefix to the namespace in the new context created by the element's start tag. Another version takes only a namespace URI and local name; this one writes the element using the prefix currently bound to that URI in the context (that is, without a prefix if the URI is bound as the default namespace). Finally, the version that only takes a local name starts elements in the default namespace, whatever it may be. Note that in all cases, the user is responsible for writing out the namespace declaration (that is, binding a prefix to a URI does not write out the declaration).
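A small sketch of that escaping behavior follows. The class name, element name, and sample strings are made up; the exact choice of character references is left to the StAX provider, so the comments describe the result only in general terms.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class EscapingDemo {
    static String render() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("note");
        // Markup characters in attribute values are escaped by the writer.
        w.writeAttribute("title", "Tom & Jerry");
        // So are '<' and '&' in character data.
        w.writeCharacters("1 < 2 & 3");
        w.writeEndElement();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(render());
    }
}
```

The application passes raw text and never builds entity references such as &amp;lt; by hand.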
In addition to these methods, there are three equivalent writeEmptyElement methods for writing empty elements (that is, without a separate closing tag). These methods open and close the element at the same time.
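The element-writing variants described above might be combined as in the following sketch. The namespace URI, prefix, and element names are invented for illustration. The prefix is bound with setPrefix before it is used, since a writer may reject the two-argument writeStartElement for a namespace URI that has no binding.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class ElementVariants {
    static final String NS = "urn:example:list"; // hypothetical namespace

    static String render() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.setPrefix("p", NS);                 // bind the prefix; writes nothing
        w.writeStartElement("p", "list", NS); // prefix, local name, namespace URI
        w.writeNamespace("p", NS);            // the declaration is written explicitly
        w.writeStartElement(NS, "item");      // namespace URI, local name
        w.writeCharacters("one");
        w.writeEndElement();                  // closes item
        w.writeEmptyElement(NS, "separator"); // opened and closed in one call
        w.writeStartElement("plain");         // local name only
        w.writeEndElement();                  // closes plain
        w.writeEndElement();                  // closes list
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(render());
    }
}
```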
When writing XML elements, the application is responsible for managing their namespaces. StAX provides several methods for this purpose. You set and retrieve the entire namespace context using setNamespaceContext(NamespaceContext) and getNamespaceContext(), respectively. Note that setting a namespace context does not cause the namespace declarations to be written out -- it just defines the prefix bindings. These are then used when your code writes a START_ELEMENT event. The setPrefix(String, String) method binds a prefix to a namespace URI in the current context, and getPrefix(String) retrieves the prefix bound to the given URI. The setDefaultNamespace(String) method sets the default namespace in the current context.
To actually write out a namespace declaration, the application can call writeNamespace(String, String) with a prefix and a URI, or writeDefaultNamespace(String) to declare the default namespace on the current element.
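The difference between binding a prefix and declaring it can be seen in a sketch like the one below, which writes the same element with and without the writeNamespace call. The class name and the prefix "a" are made up for this example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class NamespaceBinding {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Writes an empty feed element; the declaration is emitted only on request. */
    static String render(boolean declare) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.setPrefix("a", ATOM_NS);            // binding only; nothing is written
        w.writeStartElement(ATOM_NS, "feed"); // uses the prefix bound above
        if (declare) {
            w.writeNamespace("a", ATOM_NS);   // writes the xmlns:a declaration
        }
        w.writeEndElement();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(render(false)); // prefix used but never declared
        System.out.println(render(true));  // prefix used and declared
    }
}
```

The first result uses the prefix without declaring it, which a namespace-aware parser would reject; this is the kind of mistake the low-level writer does not prevent.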
Writing attributes is similar to writing element start tags, except you also must supply the attribute's value. The three writeAttribute methods have signatures similar to those of the writeStartElement methods (except for the addition of a String argument for the attribute value) and follow similar rules with respect to namespaces.
Unless you use one of the writeEmptyElement methods, you must close each element by calling writeEndElement() after its content is written out.
You can write text content with the writeCharacters and writeCData methods. The latter wraps the text in a CDATA block. For better memory usage, a version of writeCharacters that takes a char buffer with offset and length (rather than a String object) is provided.
Methods writeEntityRef, writeProcessingInstruction, and writeComment can write entity references, processing instructions, and comments, respectively. To mark the end of the document, call the writeEndDocument() method.
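These content-writing methods can be exercised together as in the following sketch. All names and sample strings are invented, and the entity reference assumes that an entity named copy would be declared in the document's DTD.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class ContentWrites {
    static String render() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        char[] buffer = "xxHello, StAXxx".toCharArray();
        w.writeStartDocument();
        w.writeProcessingInstruction("xml-stylesheet",
                "type=\"text/css\" href=\"style.css\"");
        w.writeComment(" generated by a sketch ");
        w.writeStartElement("doc");
        w.writeCharacters(buffer, 2, 11); // a slice of the buffer: "Hello, StAX"
        w.writeCData("a < b && c");       // wrapped in a CDATA section, not escaped
        w.writeEntityRef("copy");         // assumes the entity is declared in a DTD
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(render());
    }
}
```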
The typical output stream-like methods, flush and close, force the writer to write any cached data to the underlying output, and close the writer object, respectively. Note that the close method only releases any resources that the writer acquired, but it does not close the underlying output.
Listing 5. Writing a simple XHTML document using XMLStreamWriter
final String XHTML_NS = "http://www.w3.org/1999/xhtml";
XMLOutputFactory f = XMLOutputFactory.newInstance();
XMLStreamWriter w = f.createXMLStreamWriter(System.out);
try {
    w.writeStartDocument();
    w.writeCharacters("\n");
    w.writeDTD("<!DOCTYPE html " +
        "PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">");
    w.writeCharacters("\n");
    // Bind the default namespace first; some implementations reject
    // writeStartElement(uri, localName) when the URI has no binding
    w.setDefaultNamespace(XHTML_NS);
    w.writeStartElement(XHTML_NS, "html");
    w.writeDefaultNamespace(XHTML_NS);
    w.writeAttribute("lang", "en");
    w.writeCharacters("\n");
    w.writeStartElement(XHTML_NS, "head");
    w.writeStartElement(XHTML_NS, "title");
    w.writeCharacters("Test");
    w.writeEndElement();   // title
    w.writeEndElement();   // head
    w.writeCharacters("\n");
    w.writeStartElement(XHTML_NS, "body");
    w.writeCharacters("This is a test.");
    w.writeEndElement();   // body
    w.writeEndElement();   // html
    w.writeEndDocument();
} finally {
    w.close();
}
The event object-based serialization API embodied by XMLEventWriter consists of fewer methods than the low-level API. Just like its input side counterpart XMLEventReader, this writer uses event objects to represent pieces of the underlying XML InfoSet. The add(XMLEvent) method is all that a typical application requires to write out the entire document. Method add(XMLEventReader) is a convenience method that writes out all events that it obtains from the reader. Using this method, the application can effectively pipe the contents of an entire XML stream into another XML stream, unmodified.
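The piping use of add(XMLEventReader) might look like the following sketch, which copies a document from a string into a StringWriter. The class name and sample input are made up, and the exact form of the XML declaration in the output depends on the StAX provider.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

class EventPipe {
    /** Copies every event from the input document to a new output document. */
    static String copy(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        try {
            writer.add(reader); // drains the reader, adding each event in turn
        } finally {
            writer.close();
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(copy("<a><b x=\"1\">hi</b></a>"));
    }
}
```

To transform rather than copy, an application could instead loop over the reader itself and call add(XMLEvent) only for the events it wants to keep or replace.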
A set of methods for namespace management is equivalent to those defined in XMLStreamWriter. Methods getNamespaceContext() and setNamespaceContext(NamespaceContext) provide access to the entire namespace context. Methods setPrefix(String, String) and getPrefix(String) can bind and discover namespace prefix bindings, respectively. Lastly, the setDefaultNamespace(String) method can set the default namespace for the current namespace context.
Finally, just like XMLStreamWriter, this writer provides methods flush() and close() to flush any cached data to the underlying output and close the writer, respectively.
Listing 6. Writing a simple XHTML document using XMLEventWriter
final String XHTML_NS = "http://www.w3.org/1999/xhtml";
final QName HTML_TAG = new QName(XHTML_NS, "html");
final QName HEAD_TAG = new QName(XHTML_NS, "head");
final QName TITLE_TAG = new QName(XHTML_NS, "title");
final QName BODY_TAG = new QName(XHTML_NS, "body");
XMLOutputFactory f = XMLOutputFactory.newInstance();
XMLEventWriter w = f.createXMLEventWriter(System.out);
XMLEventFactory ef = XMLEventFactory.newInstance();
try {
    w.add(ef.createStartDocument());
    w.add(ef.createIgnorableSpace("\n"));
    w.add(ef.createDTD("<!DOCTYPE html " +
        "PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"));
    w.add(ef.createIgnorableSpace("\n"));
    w.add(ef.createStartElement(HTML_TAG, null, null));
    w.add(ef.createNamespace(XHTML_NS));
    w.add(ef.createAttribute("lang", "en"));
    w.add(ef.createIgnorableSpace("\n"));
    w.add(ef.createStartElement(HEAD_TAG, null, null));
    w.add(ef.createStartElement(TITLE_TAG, null, null));
    w.add(ef.createCharacters("Test"));
    w.add(ef.createEndElement(TITLE_TAG, null));
    w.add(ef.createEndElement(HEAD_TAG, null));
    w.add(ef.createIgnorableSpace("\n"));
    w.add(ef.createStartElement(BODY_TAG, null, null));
    w.add(ef.createCharacters("This is a test."));
    w.add(ef.createEndElement(BODY_TAG, null));
    w.add(ef.createEndElement(HTML_TAG, null));
    w.add(ef.createEndDocument());
} finally {
    w.close();
}
Summary
This series introduced you to StAX and its possibilities for use in a wide range of applications that need to either read or write XML, or both. Representing XML streams as a series of event objects is in particular a powerful approach that offers both flexibility and efficiency. With StAX as part of the next release of Java Standard Edition, it will finally be at every Java developer's disposal.
Resources
Learn
- StAX'ing up XML, Part 1: An introduction to Streaming API for XML (StAX) (Peter Nehrer, developerWorks, November 2006): Start this three part series with an overview of StAX and its cursor-based API for processing XML.
- StAX'ing up XML, Part 2: An introduction to Streaming API for XML (StAX) (Peter Nehrer, developerWorks, December 2006): Part 1 of this series introduced Streaming API for XML (StAX) and its cursor-based API. Learn about the event iterator-based API and explore its benefits to Java developers.
- JSR 173: Streaming API for XML: Read the Java Specification Request proposing a Java-based API for pull-parsing XML.
- XML Pull Parsing: Explore this site dedicated to promotion and education of pull-based XML parsing.
- BEA Dev2Dev Online: StAX: Visit BEA's Web page on StAX with link to WebLogic's StAX implementation.
- XML programming in Java technology, Part 1 (Doug Tidwell, developerWorks, January 2004): Take a tutorial about XML programming in Java, including the common APIs for XML and how to parse, create, manipulate, and transform XML documents.
- All about JAXP, Part 1 (Brett McLaughlin, developerWorks, May 2005): Learn how to work with the JAXP API's parsing and validation features.
- Tip: Use XML streaming parsers (Berthold Daum, developerWorks, November 2003): In this tip, learn to use the streaming parsers provided by StAX for more efficient XML parsing.
- Tip: Write XML documents with StAX (Berthold Daum, developerWorks, December 2003): Serialize XML using StAX -- parse XML documents and write XML documents to an output stream.
- An XSLT style sheet and an XML dictionary approach to internationalization (Laura Menke, developerWorks, April 2001): Read an example of how XSLT can be useful in real-world problems: Minimize the files you need to edit when content on your site changes as you enable dynamic internationalization of your Web pages through a dictionary-driven approach.
- How an XSLT processor works (Benoît Marchal, developerWorks, March 2004): Learn more of the theory behind the XSLT processor and be a more productive XSLT programmer.
- Planning to upgrade XSLT 1.0 to 2.0, Part 1: Improvements in XSLT (David Marston and Joanne Tong, developerWorks, October 2006): Compare major XSLT 2.0 features and the XSLT 1.0 (the version currently in use) shortcomings they address.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- Learn all about XML at the developerWorks XML zone.
About the author
Peter Nehrer is a software consultant specializing in Eclipse-based enterprise solutions and Java EE applications. He is the founder of Ecliptical Software Inc. and a contributor to several Eclipse-related Open Source projects. He holds an M.S. in Computer Science from the University of Massachusetts at Amherst, MA.