Level: Intermediate
Tinny Ng (tng@ca.ibm.com), System House Business Scenarios Designer, IBM Toronto Laboratory
15 Jul 2003
IBM developer Tinny Ng shows you how to serialize XML data to a DOMString with different encodings. You'll also find examples that demonstrate how to use the MemBufFormatTarget, StdOutFormatTarget, and LocalFileFormatTarget output streams in XML4C/Xerces-C++.
Xerces-C++ is an XML parser written in C++ and distributed by the open source Apache XML project.
Since early last year, Xerces-C++ has included an experimental implementation of a subset of the
W3C Document Object Model (DOM) Level 3, as specified in the DOM Level 3
Core Specification and the DOM Level 3 Load and Save Specification (see Resources).
The DOM Level 3 Load and Save Specification defines a set of interfaces that allow users to load
and save XML content from different input sources to different output streams.
This article uses examples to show you how to save XML data in this way.
Users can stream the output data into a string, an internal buffer, the standard output, or
a file. In the following sections, I will show you how to serialize XML data to a
DOMString with different encodings, and also how to use MemBufFormatTarget, StdOutFormatTarget, and LocalFileFormatTarget in Xerces-C++.
Note: IBM XML for C++ (XML4C) integrates Xerces-C++ with International Components for Unicode (ICU)
to provide support for over 100 different encodings. In this document, I will use Xerces-C++
to represent the XML parser for C++. However, the behavior described should apply to both
XML4C and Xerces-C++, unless otherwise specified.
Serializing XML data
The DOMBuilder class provides an API for parsing XML documents and building the corresponding
DOM document tree, while the DOMWriter class provides an API for serializing (writing) a DOM
document out as an XML document. To serialize XML data, first load the XML data into a DOM tree
using a DOMBuilder, and then use a DOMWriter to write out the DOM tree. For example:
Listing 1. Serializing XML data
// DOMImplementationLS contains factory methods for creating objects
// that implement the DOMBuilder and the DOMWriter interfaces
static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
DOMImplementation *impl =
    DOMImplementationRegistry::getDOMImplementation(gLS);
// construct the DOMBuilder
DOMBuilder* myParser = ((DOMImplementationLS*)impl)->
    createDOMBuilder(DOMImplementationLS::MODE_SYNCHRONOUS, 0);
// parse the XML data, assume it is saved in a local file
// called "theXMLFile.xml"
// the DOMBuilder will parse the data and return it as a DOM tree
DOMNode* aDOMNode = myParser->parseURI("theXMLFile.xml");
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();
// optionally, set some DOMWriter features
// set the format-pretty-print feature
if (myWriter->canSetFeature(XMLUni::fgDOMWRTFormatPrettyPrint, true))
    myWriter->setFeature(XMLUni::fgDOMWRTFormatPrettyPrint, true);
// set the byte-order-mark feature
if (myWriter->canSetFeature(XMLUni::fgDOMWRTBOM, true))
    myWriter->setFeature(XMLUni::fgDOMWRTBOM, true);
// serialize the DOMNode to a UTF-16 string
XMLCh* theXMLString_Unicode = myWriter->writeToString(*aDOMNode);
// release the memory
XMLString::release(&theXMLString_Unicode);
myWriter->release();
myParser->release();
Both DOMBuilder and DOMWriter
are constructed using
factory methods from DOMImplementationLS.
When finished, they both need to be released
explicitly to relinquish any associated resources. Also, the returned string from
writeToString is owned by the caller, who is responsible
for releasing the allocated memory.
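Because every exit path has to call release(), it can help to tie that call to scope. The following is a minimal sketch, assuming a C++11 compiler; ReleaseDeleter, ReleasePtr, and FakeWriter are hypothetical names introduced for this illustration and are not part of Xerces-C++. FakeWriter is only a stand-in that counts calls so that the behavior can be checked without the parser library.

```cpp
#include <cassert>
#include <memory>

// Deleter that calls release() instead of delete.
template <typename T>
struct ReleaseDeleter
{
    void operator()(T* object) const { object->release(); }
};

// Owning pointer for any type that exposes a release() member function.
template <typename T>
using ReleasePtr = std::unique_ptr<T, ReleaseDeleter<T>>;

// Hypothetical stand-in for a parser or writer object; it only counts how
// often release() is called.
struct FakeWriter
{
    int* releaseCount;
    void release() { ++*releaseCount; }
};

// Shows that release() runs exactly once when the owner goes out of scope.
inline void releasePtrExample()
{
    int count = 0;
    FakeWriter writer{&count};
    {
        ReleasePtr<FakeWriter> owner(&writer);
        assert(count == 0);   // still owned, nothing released yet
    }
    assert(count == 1);       // released when the scope ended
}
```

The same template should work with any class that has a release() member function, which, going by the listing above, includes DOMBuilder and DOMWriter. The string returned by writeToString is released differently (through XMLString::release), so it would need its own deleter.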
You can also opt to set some features that control the behavior of the
DOMWriter.
Xerces-C++ has implemented a number of DOMWriter features that are specified in the
W3C DOM Level 3 Load and Save Specification.
A complete list can be found in the Xerces-C++ programming guide,
DOMWriter Supported Features (see Resources). A couple of them are worth highlighting:
- format-pretty-print -- This formats the output by adding newlines and indentation
whitespace to produce a pretty-printed, human-readable form. The exact form of
the transformations is not specified in the
W3C DOM Level 3 Load and Save Specification, so the parser has its
own interpretation. In releases prior to Xerces-C++ 2.2 (or XML4C 5.1), the parser
pretty-prints only the prologue and the epilogue and does not touch the content within
the root element. From Xerces-C++ 2.2 (or XML4C 5.1) onwards, turning on this feature
also causes the content within the root element to be formatted.
- byte-order-mark -- This is a non-standard extension added in Xerces-C++ 2.2
(or XML4C 5.1) to enable the writing of the Byte-Order-Mark (BOM) in the resultant XML
stream. The BOM is written at the beginning of the resultant XML stream, if and only if
a
DOMDocumentNode is rendered for serialization, and the
output encoding is one of the following:
- UTF-16
- UTF-16LE
- UTF-16BE
- UCS-4
- UCS-4LE
- UCS-4BE
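To make the list above concrete, here is a small sketch of the byte sequences that a BOM consists of for these encodings, as defined by Unicode (UCS-4 uses the same marks as UTF-32). The function name bomFor and its bigEndian parameter are assumptions of this sketch; in particular, how the parser picks a byte order for the plain names "UTF-16" and "UCS-4" is not modeled here, so the caller supplies it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the byte-order mark for an encoding name, or an empty vector when
// no BOM applies. For names without an explicit byte order ("UTF-16",
// "UCS-4"), the caller states which order is in use (sketch assumption).
std::vector<unsigned char> bomFor(const std::string& encoding,
                                  bool bigEndian = false)
{
    if (encoding == "UTF-16BE" || (encoding == "UTF-16" && bigEndian))
        return {0xFE, 0xFF};
    if (encoding == "UTF-16LE" || encoding == "UTF-16")
        return {0xFF, 0xFE};
    if (encoding == "UCS-4BE" || (encoding == "UCS-4" && bigEndian))
        return {0x00, 0x00, 0xFE, 0xFF};
    if (encoding == "UCS-4LE" || encoding == "UCS-4")
        return {0xFF, 0xFE, 0x00, 0x00};
    return {};  // any other encoding, for example Big5: not in the list
}

// A few checks that double as usage examples.
inline void bomExample()
{
    assert((bomFor("UTF-16LE") == std::vector<unsigned char>{0xFF, 0xFE}));
    assert(bomFor("UCS-4BE").size() == 4);
    assert(bomFor("Big5").empty());
}
```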
Output streams supported by Xerces-C++
DOMWriter provides an API for writing a DOM node into various types of output streams.
Xerces-C++ supports four types of output streams:
- DOMString
- MemBufFormatTarget
- StdOutFormatTarget
- LocalFileFormatTarget
DOMString
Users can serialize a DOMNode into a
DOMString (that is, XMLCh* in Xerces-C++) using the
DOMWriter method writeToString. This method ignores any encoding information
that is available; the returned string is always encoded in UTF-16. As mentioned above,
the string returned from writeToString
is owned by the caller, who is responsible for releasing the allocated memory.
For example:
Listing 2. Serializing a DOMNode to a UTF-16 string
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();
// serialize a DOMNode to a UTF-16 string
XMLCh* theXMLString_Unicode = myWriter->writeToString(*aDOMNode);
// release the memory
XMLString::release(&theXMLString_Unicode);
myWriter->release();
If you would like to receive a string encoded in something other than UTF-16, you can
transcode the string manually using an XMLTranscoder. Construct your XMLTranscoder for a specific encoding using
XMLPlatformUtils::fgTransService->makeNewTranscoderFor,
then call transcodeTo to transcode
the UTF-16 string to your specified encoding. For example:
Listing 3. Serializing a DOMNode to a Big5 string
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();
// serialize a DOMNode to a UTF-16 string
XMLCh* theXMLString_Unicode = myWriter->writeToString(*aDOMNode);
// construct a transcoder for Big5
XMLTransService::Codes resCode;
XMLTranscoder* aBig5Transcoder = XMLPlatformUtils::fgTransService->
    makeNewTranscoderFor("Big5", resCode, 16*1024,
                         XMLPlatformUtils::fgMemoryManager);
// transcode the string into Big5; transcodeTo returns the number of
// bytes written and does not null-terminate the output
unsigned int charsEaten;
char resultXMLString_Encoded[16*1024+4];
unsigned int bytesWritten =
    aBig5Transcoder->transcodeTo(theXMLString_Unicode,
                                 XMLString::stringLen(theXMLString_Unicode),
                                 (XMLByte*) resultXMLString_Encoded,
                                 16*1024,
                                 charsEaten,
                                 XMLTranscoder::UnRep_Throw);
// terminate the encoded string explicitly
resultXMLString_Encoded[bytesWritten] = '\0';
// release the memory
XMLString::release(&theXMLString_Unicode);
delete aBig5Transcoder;
myWriter->release();
This assumes that the underlying transcoder that is integrated with the parser supports
the encoding you've specified. Xerces-C++ has intrinsic support for ASCII, UTF-8, UTF-16
(big/little endian), UCS-4 (big/little endian), EBCDIC code pages IBM037 and IBM1140, ISO-8859-1
(aka Latin1), and Windows-1252. If you need support for more encodings, such as Shift-JIS or
Big5, you may wish to use XML4C, which integrates the Xerces-C++ parser with IBM's
International Components for Unicode (ICU) and extends the support to over 100 different encodings.
However, the XMLTranscoder does not alter the encoding
information stored in the XML declaration
of the input string that was generated by writeToString.
Thus the encoding attribute of the manually transcoded XML string is
still "UTF-16" instead of "Big5". This can be misleading if you are serializing the entire DOMDocumentNode,
where the encoding information is included in the XML declaration.
In that case, to receive a string in an encoding other than UTF-16, it is better to use
MemBufFormatTarget instead.
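One way to notice the mismatch described above is to read the encoding name back out of the XML declaration and compare it with the encoding you transcoded to. The following is a simplified sketch, not a full XML parser: it assumes the declaration starts at the first byte and that the string is in an ASCII-compatible encoding (which holds for a Big5 result, but not for UTF-16). The function name declaredEncoding is hypothetical.

```cpp
#include <cassert>
#include <string>

// Returns the value of the encoding pseudo-attribute in a leading XML
// declaration, or an empty string if there is none.
std::string declaredEncoding(const std::string& xml)
{
    if (xml.compare(0, 5, "<?xml") != 0)
        return "";
    const std::string::size_type end = xml.find("?>");
    if (end == std::string::npos)
        return "";
    const std::string decl = xml.substr(0, end);
    const std::string::size_type key = decl.find("encoding");
    if (key == std::string::npos)
        return "";
    // the value is enclosed in matching single or double quotes
    const std::string::size_type open = decl.find_first_of("\"'", key);
    if (open == std::string::npos)
        return "";
    const std::string::size_type close = decl.find(decl[open], open + 1);
    if (close == std::string::npos)
        return "";
    return decl.substr(open + 1, close - open - 1);
}

// The declaration still names UTF-16 even if the bytes were transcoded.
inline void declaredEncodingExample()
{
    const std::string xml =
        "<?xml version=\"1.0\" encoding=\"UTF-16\" standalone=\"no\" ?><root/>";
    assert(declaredEncoding(xml) == "UTF-16");
    assert(declaredEncoding(xml) != "Big5");
}
```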
MemBufFormatTarget
MemBufFormatTarget saves the XML data to an internal buffer.
On construction, MemBufFormatTarget allocates a memory buffer of 1023 bytes, which grows as needed.
It returns a null-terminated XMLByte stream upon request through the
getRawBuffer() method. Users should make their own copy of the returned
buffer if they intend to keep it independent of the state of the MemBufFormatTarget. Otherwise,
the buffer is deleted when the MemBufFormatTarget is destroyed, or is reset when the
reset() function is called.
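Making that copy can be sketched as follows. The helper name copyRawBuffer is hypothetical, and the sketch takes an explicit length instead of relying on the null terminator, because output in encodings such as UTF-16 can contain zero bytes in the middle of the data. I believe MemBufFormatTarget offers a getLen() accessor for that length, but treat this as an assumption and check the API documentation for your release.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies `length` bytes starting at `raw` into a vector owned by the caller,
// so that the data no longer depends on the lifetime of the original buffer.
std::vector<unsigned char> copyRawBuffer(const unsigned char* raw,
                                         std::size_t length)
{
    assert(raw != nullptr || length == 0);
    return std::vector<unsigned char>(raw, raw + length);
}
```

With the parser, the arguments would presumably be the pointer returned by getRawBuffer() and the byte count of the serialized data.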
The encoding of the returned XMLByte stream is determined in the following order:
- The encoding setting in the
DOMWriter is used
- If that is null, then the encoding attribute of the DOM stream to be written is used
- If neither of the above provides an encoding name, then a default encoding of UTF-8 is used
The DOMWriter will store the correct encoding information --
which matches the actual
encoding of the string -- in the encoding attribute of the XML declaration.
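The lookup order above amounts to a simple fallback chain. Here is a minimal sketch of that logic in plain C++; the function and parameter names are hypothetical, and an empty string stands in for "not set" (the text above uses null).

```cpp
#include <cassert>
#include <string>

// Picks the output encoding in the order described above.
std::string resolveEncoding(const std::string& writerEncoding,
                            const std::string& documentEncoding)
{
    if (!writerEncoding.empty())
        return writerEncoding;    // 1. encoding set on the DOMWriter
    if (!documentEncoding.empty())
        return documentEncoding;  // 2. encoding attribute of the DOM data
    return "UTF-8";               // 3. default
}

// The writer setting wins, then the document, then the UTF-8 default.
inline void resolveEncodingExample()
{
    assert(resolveEncoding("Big5", "UTF-16") == "Big5");
    assert(resolveEncoding("", "UTF-16") == "UTF-16");
    assert(resolveEncoding("", "") == "UTF-8");
}
```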
Listing 4 illustrates how to receive an XML string encoded in Big5:
Listing 4. Using MemBufFormatTarget
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();
// construct the MemBufFormatTarget
XMLFormatTarget *myFormatTarget = new MemBufFormatTarget();
// set the encoding to be Big5
XMLCh tempStr[100];
XMLString::transcode("Big5", tempStr, 99);
myWriter->setEncoding(tempStr);
// serialize a DOMNode to an internal memory buffer
myWriter->writeNode(myFormatTarget, *aDOMNode);
// get the string, which is encoded in Big5, from the MemBufFormatTarget;
// the pointer is valid only while myFormatTarget exists and is not reset
char* theXMLString_Encoded = (char*)
    ((MemBufFormatTarget*)myFormatTarget)->getRawBuffer();
// release the memory
myWriter->release();
delete myFormatTarget;
Again, this also depends on the underlying transcoding capability supported by the
parser. If the encoding you've specified is not supported, DOMWriter issues a fatal error.
Besides serializing the XML data to an internal buffer, there are two other types of output streams -- StdOutFormatTarget and LocalFileFormatTarget.
StdOutFormatTarget
StdOutFormatTarget saves the XML data to the standard output. For example:
Listing 5. Using StdOutFormatTarget
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();
// construct the StdOutFormatTarget
XMLFormatTarget *myFormatTarget = new StdOutFormatTarget();
// serialize a DOMNode to the standard output
myWriter->writeNode(myFormatTarget, *aDOMNode);
// release the memory
myWriter->release();
delete myFormatTarget;
LocalFileFormatTarget
LocalFileFormatTarget saves the XML data to a physical local file.
Users need to pass a local file name as a parameter when constructing the
LocalFileFormatTarget. For example:
Listing 6. Using LocalFileFormatTarget
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();
// construct the LocalFileFormatTarget
XMLFormatTarget *myFormatTarget = new LocalFileFormatTarget("myXMLFile.xml");
// serialize a DOMNode to the local file "myXMLFile.xml"
myWriter->writeNode(myFormatTarget, *aDOMNode);
// optionally, you can flush the buffer to ensure all contents are written
myFormatTarget->flush();
// release the memory
myWriter->release();
delete myFormatTarget;
The file is created automatically if it doesn't already exist. Optionally, you can flush the target
after writing to ensure that all the content has been written out to the file.
Conclusion
You should now have a good understanding of how to serialize XML data into different types
of output streams with different encodings. Again, for more details please refer to the
W3C DOM Level 3 Load and Save Specification and the complete API documentation
in Xerces-C++.
Resources
About the author
Tinny Ng is an Advisory Software Developer currently working as a Business Scenario Solution Designer in the WebSphere System House at the IBM Toronto Lab. Previously she was the team lead of the XML for C++ parser development team, and has led the team to deliver nine Apache Xerces-C++ releases and seven IBM XML4C releases within two years. Tinny also led the architectural design work for the C++ XML parser, including redesigning the DOM implementation in the parser for faster performance, and defining a new DOM C++ binding which was then referenced by the W3C. She is an active Committer to the Apache XML Open Source Project, Xerces-C++.