Tip: Screen XML documents efficiently with StAX

Retrieve the information you want, then stop the parsing process

Level: Intermediate

Berthold Daum (berthold.daum@bdaum.de), President, BDaum Industrial Communications

11 Dec 2003

With the Streaming API for XML (StAX), you can screen XML documents efficiently without the drawbacks of traditional push parsers. This tip shows you how to retrieve specific information from XML documents and how to stop the parsing process once this information is collected.

The screening or classification of XML documents is a common problem, especially in XML middleware. Routing XML documents to specific processors may require analysis of both the document type and the document content. The challenge is to obtain the required information from the document with the least possible overhead. Traditional parsing approaches such as DOM or SAX are not well suited to this task. A DOM parser, for example, parses the whole document and constructs a complete document tree in memory before it returns control to the client. Even DOM parsers that employ deferred node expansion, and thus are able to parse a document partially, have high resource demands because the document tree must be at least partially constructed in memory. This is simply not acceptable for screening purposes.

Like DOM parsers, SAX parsers control the complete parsing process. By default, a SAX parser starts parsing at the beginning of a document and continues until the end, informing client event handlers of parsing events through callbacks. To avoid unnecessary overhead during document screening, such an event handler may want to stop the parsing process once it has gathered the required information. A common technique for achieving this in SAX is to throw an exception from the event handler, which causes the parser to stop; it is discussed in the developerWorks tip "Stop a SAX parser when you have enough data" by Nicholas Chase. The information gathered by the event handler must then be encoded in an error message that is wrapped in an exception object and passed to the parser's client. A special error handler in the client receives this exception and must parse the error message to retrieve the required information. This may be a solution to the screening problem, but it's a complicated one.
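
For comparison, the SAX workaround described above might look roughly like the following sketch. The class names SaxScreener and StopParsing and the method kindOf() are hypothetical and not taken from the cited tip. The sketch also makes two assumptions: it carries the value in a field of the exception in addition to the message text, and it relies on the JAXP parser bundled with the JDK passing the handler's exception through to the caller unchanged.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxScreener {

   // Hypothetical marker exception: signals "enough data", not a parse error
   static final class StopParsing extends SAXException {
      final String kind;

      StopParsing(String kind) {
         super("kind=" + kind);
         this.kind = kind;
      }
   }

   // Returns the "kind" attribute of the root element, or null if absent
   static String kindOf(String xml) {
      DefaultHandler handler = new DefaultHandler() {
         @Override
         public void startElement(String uri, String localName,
               String qName, Attributes atts) throws SAXException {
            // The first start tag is the root element: abort parsing here
            throw new StopParsing(atts.getValue("kind"));
         }
      };
      try {
         SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
         // Only reached if the handler never saw a start tag
         return null;
      } catch (StopParsing stop) {
         // The "error" is really the result
         return stop.kind;
      } catch (SAXException | ParserConfigurationException | IOException e) {
         // Genuine failures are rethrown unchecked to keep the sketch short
         throw new IllegalStateException(e);
      }
   }
}
```

Even in this reduced form, the regular result travels along the error path, and the client has to tell the "stop" exception apart from genuine parse errors.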

Enter StAX

StAX offers a pull parser that gives client applications full control over the parsing process. A client application may decide at any time to discontinue the parsing process, and no tricks are required to stop the parser. This is ideal for screening purposes.

Listing 1 shows what a simple document classifier might look like. I use the cursor-based StAX API for this example. At the very first start tag of the document (the root element tag), I retrieve the kind attribute from this element. The value of this attribute is then passed back to the client and the parsing process is discontinued. The client may now act upon this returned value.


Listing 1. Screening documents
import java.io.*;

import javax.xml.stream.*;

public class Classifier {

   // Holds factory instance
   private XMLInputFactory xmlif;

   public static void main(String[] args)
      throws FileNotFoundException, XMLStreamException {
      Classifier router = new Classifier();
      String kind1 = router.getKind("somefile.xml");
      String kind2 = router.getKind("otherfile.xml");
   }

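   /**
    * Return the document kind
    * @param filename - the name of the XML file to screen
    * @return the value of the "kind" attribute of the root element,
    *         or null if the root element has no such attribute
    */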
   private String getKind(String filename)
      throws FileNotFoundException, XMLStreamException {
      // Create input factory lazily
      if (xmlif == null) {
         // Use reference implementation
         System.setProperty(
            "javax.xml.stream.XMLInputFactory",
            "com.bea.xml.stream.MXParserFactory");
         xmlif = XMLInputFactory.newInstance();
      }
      // Create stream reader
      XMLStreamReader xmlr =
         xmlif.createXMLStreamReader(new FileReader(filename));
      // Main event loop
      while (xmlr.hasNext()) {
         // Process single event
         switch (xmlr.getEventType()) {
            // Process start tags
            case XMLStreamReader.START_ELEMENT :
               // Check attributes for first start tag
               for (int i = 0; i < xmlr.getAttributeCount(); i++) {
                  // Get attribute local name (getAttributeName() returns a QName)
                  String localName = xmlr.getAttributeLocalName(i);
                  if (localName.equals("kind")) {
                     // Return value
                     return xmlr.getAttributeValue(i);
                  }
               }
               return null;
         }
         // Move to next event
         xmlr.next();
      }
      return null;
   }
}

Note that I use an instance field to hold the XMLInputFactory instance. This is done to improve efficiency. Compared to the actual parsing process (which is blazingly fast), the execution of XMLInputFactory.newInstance() and xmlif.createXMLStreamReader() causes considerable overhead. While createXMLStreamReader() must be executed once for each new document, you may reuse the XMLInputFactory instance and thus avoid the repeated execution of XMLInputFactory.newInstance().
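
The factory reuse described here can also be sketched in a more compact form. The following is a minimal sketch, not the code from the download: the class name ReusingClassifier and the method kindOf() are hypothetical, the documents are passed as strings rather than file names, and it assumes a Java runtime that bundles a StAX implementation (Java 6 or later), so no system property is set. It also assumes single-threaded use of the shared factory.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ReusingClassifier {

   // Created once and shared by all calls to kindOf()
   private final XMLInputFactory factory = XMLInputFactory.newInstance();

   // Returns the "kind" attribute of the root element, or null if absent
   String kindOf(String xml) {
      XMLStreamReader reader = null;
      try {
         // A new stream reader is still required for every document
         reader = factory.createXMLStreamReader(new StringReader(xml));
         while (reader.hasNext()) {
            // Skip comments and the like until the first start tag
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
               // A null namespace URI means the namespace is not checked
               return reader.getAttributeValue(null, "kind");
            }
         }
         return null;
      } catch (XMLStreamException e) {
         // Rethrown unchecked to keep the sketch short
         throw new IllegalArgumentException("Could not read root element", e);
      } finally {
         if (reader != null) {
            try {
               reader.close();
            } catch (XMLStreamException ignored) {
               // Nothing useful can be done if closing fails
            }
         }
      }
   }
}
```

A client would create one ReusingClassifier and call kindOf() for each incoming document, so that XMLInputFactory.newInstance() runs only once.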





Next steps

This tip demonstrated the use of StAX parsers for screening and classification of XML documents. In the next tip, I will show how XML documents can be created through the StAX API.






Download

Name: x-tipstx3screening.zip
Size: 2KB
Download method: HTTP





About the author

Berthold Daum is a consultant and writer based in Lützelbach, Germany. For information on his recent books, System Architecture with XML and Modeling Business Objects with XML Schema (both from Morgan Kaufmann), see http://www.bdaum.de. You can contact Berthold at berthold.daum@bdaum.de.



