 | Level: Intermediate Matthias Nicola (mnicola@us.ibm.com), DB2/XML Performance, IBM Silicon Valley Laboratory Vitor Rodrigues (vrodrig@us.ibm.com), DB2 pureXML Technical Enablement, IBM
07 Dec 2006 Like other databases, the DB2® V8 XML Extender offers two storage and access models for XML: XML documents can be stored intact as unparsed text in CLOB columns, or they can be mapped and shredded to a set of relational tables. Both options have known performance limitations. The new pureXML™ technology in DB2® 9 seeks to overcome these limitations by storing and querying XML in its inherent hierarchical format. This article describes a series of measurements to characterize the cases in which pureXML does or doesn't provide a performance benefit, and to quantify the performance difference to CLOB or shredded storage.
Introduction
The pureXML™ technology in DB2® 9 is designed to provide
the highest level of performance for XML data management. This article compares
its performance with that of character large object (CLOB) and shredded XML storage. Many database systems
allow you to store XML data as CLOBs or "shred" the data into
relational tables. These two options are also supported in DB2® V8
through the XML Extender, which is still available unchanged in DB2 9 for backwards
compatibility. However, they are superseded by the pureXML features.
The DB2 XML Extender consists of a set of stored procedures, user-defined functions
(UDFs), and user-defined data types (UDTs) which add XML capabilities on top of the
core DB2 engine. Equipped with an XML parser and XML-specific logic in its
procedures and UDFs, the XML Extender performs XML storage and retrieval (supported
by traditional DB2 engine features). You can use the XML Extender with either
an XML Extender Column or an XML Extender Collection.
An XML Extender Column allows you to store XML documents intact as
plain unparsed text. This process is quite simple, but ignores the internal structure of the
XML documents. You can choose to make the underlying storage a CLOB column, a VARCHAR
column, or files within the file system. In a VARCHAR column, the XML Extender only stores
documents up to 3KB in size -- a limit which is hard to guarantee in many
applications. Knowledgeable database administrators (DBAs) can change the limit to 32k, but often even this maximum
cannot be guaranteed. External file storage is more flexible but does not benefit
from database-managed persistency and integrity. This is why CLOBs, which can store
documents up to 2GB in size, have emerged as the most common choice for XML Extender
Columns. Later, this article examines the performance of XML Extender CLOB columns.
An XML Extender Collection allows you to convert XML data to a relational
format. This requires a fixed mapping from the expected XML structure to a
collection of relational tables in your database schema. Based on that mapping, a
stored procedure extracts atomic data values from XML documents and inserts them
into traditional relational rows and columns. This process is known as
"shredding" or decomposition. It involves XML parsing and
translates a single logical XML document insert into a series of SQL row inserts.
In real-world applications, it can easily take several dozen relational tables to
represent all one-to-many relationships in the original XML structure. Thus, the
mapping becomes complex very quickly, and XML insert performance suffers
correspondingly. Once the data is available in relational format, plain SQL can be
used for data access and manipulation. However, reconstruction of the original XML
documents is also expensive. It requires multi-way joins and the generation of proper XML
tagging. The tagging can be defined by the standardized SQL/XML publishing
functions, to reconstruct the original or new and different documents. However, the XML
Extender Collection won't preserve any digital signatures associated with the
original XML document.
Making XML data available in relational format continues to be an important
requirement. The most common reason is the need to feed legacy SQL applications,
packaged business applications and business intelligence (BI) tools that consume only relational data. Therefore DB2 9 offers a new "decomp" solution, also known as "annotated schema shred" or "new shred", which can be 7 to
8 times faster than XML Extender Collection shredding. Later in this article, the performance of this new high-speed shredder is compared to the
IBM pureXML™ support in IBM DB2 9.
The new pureXML technology in DB2 9 is significantly different from CLOB or shredded XML storage.
It does not store documents as plain text and does not map XML to relational or
object-relational tables. Instead, it stores XML in its inherent hierarchical
format, which matches the XML data model. Any XML document is a well-defined tree of elements and
attributes, and XML queries are expressed in terms of tree traversal. Therefore,
it seems intuitive that a corresponding hierarchical storage and processing format allows for
efficient XML data management. To get to the bottom of this claim, the
performance of pureXML in DB2 9 is compared with CLOB-based and shredded XML processing.
Test setup
Table 1 summarizes the comparisons made in this article. Basically, key XML operations on CLOB and shredded storage are compared to corresponding pureXML operations.
Table 1: Comparing CLOB to shredded XML processing and pureXML
|
XML in CLOBs
|
DB2 9 pureXML
|
|---|
|
Insert XML into an XML Extender CLOB column
|
Insert XML into an XML column
| |
Full document retrieval from CLOBs
|
Full document retrieval from an XML column
| |
Query XML in CLOBs using XML Extender "extract" functions (XML parsing
at query time)
|
XQuery over an XML column
| | | |
XML shredded to relational tables
|
DB2 9 pureXML
|
|---|
|
Shred XML to relational tables, using the DB2 9 new shred
|
Insert XML into an XML column
| |
SQL/XML publishing to construct XML documents from relational data
(such as previously shredded XML)
|
XQuery over an XML column
|
All tests were performed with the following data and settings:
- A 4-CPU pSeries system with IBM® AIX®
5.2 (64bit) and a single DB2 9 instance
- Between 1,000 and 100,000 CustAcc documents (4kb to 20kb in size) were used, taken
from the financial scenario in the "DB2 9 XML performance
characteristics" article
- Database managed (DMS) table spaces with a page size of 32kb
- All table spaces were defined with the
no file system caching option, except where
otherwise noted (for some of the CLOB storage tests)
- All tables spaces were distributed over 10 physical disks, with the database
log on a separate striped volume
- Identical database configuration and tuning was used in all tests to ensure fair and
valid comparisons
Comparing CLOB to pureXML columns
This comparison is interesting because for a certain class of today's XML
applications, CLOB columns are a commonly choice for XML storage. Before DB2 9, there wasn't a better alternative. A fundamental difference between CLOB
storage and pureXML processing lies in XML parsing and its significant impact on
insert and query performance.
If XML documents are inserted into CLOB columns, they are inserted as unparsed
text objects. Avoiding XML parsing at insert time can be a performance benefit,
especially in CPU-bound systems. However, without XML parsing, the structure of the
XML documents is entirely ignored. This precludes the database from performing
intelligent and efficient search and extract operations on the stored text objects.
The only remedy is to invoke the XML parser at query execution time to "look into"
the XML documents such that search conditions can be evaluated. The significant CPU
consumption of XML parsing often leads to low search and extract performance.
Only a blind full document retrieval, which again ignores the internal XML
structure, can quickly read XML documents from CLOB columns.
The pureXML technology in DB2 9 parses XML documents at insert time and never at
query time. The XML documents are stored and queried in a parsed format, denoted in
DB2 9 by the use of a new data type, "XML". This parsed format is a tree structure of nodes,
which is different from the textual representation of XML documents. Search and extract operations can be performed
without XML parsing, a significant performance benefit, because the XML
parsing overhead is incurred at insert time instead. Similarly, document retrieval from an
XML column requires serialization, that is, converting the parsed XML
format back into its original textual representation. This overhead does not exist
when reading full XML documents from a CLOB where the XML is already stored in text
form.
In summary, CLOB storage offers good performance for insert and full document
retrieval, typically at the cost of poor search and extract performance. The XML
data type in DB2 9 sacrifices some of the insert and retrieval performance for much higher search and extract performance. This is a reasonable trade-off
because business data is more often searched and analyzed than inserted. Typically, there is one insert and multiple searches. Additionally, the potential overhead of
XML columns is often outweighed by the fact that XML columns are buffered in the
buffer pool while CLOB columns are not.
The next section reviews performance measurements to quantify these
trade-offs.
Comparing CLOB inserts to XML inserts
In the first test, we sequentially inserted 100,000 documents with and without
indexes, committing after every document as many OLTP applications would. The
result in Figure 1 shows the relative elapsed time between XML and CLOB column
inserts (lower is better). Setting the elapsed time for XML inserts as the
comparison baseline (100%), maintaining six XML indexes incurs only
little overhead (5% in our scenario).
Figure 1: Comparing the single-user insert performance of XML to CLOB columns
Inserting the same XML data into a CLOB column takes approximately half the
time (53%). This is because XML parsing and deeper processing of the XML data is
avoided. Loosely speaking, XML Extender "side tables" are for CLOB columns what real XML indexes are
for XML columns. Maintaining side tables at insert time requires
XML parsing and additional relational inserts because selected XML element and
attribute values are extracted and stored separately. Consequently, the insert
elapsed time is at least as high as for pureXML inserts, in our scenario even 23%
higher.
The experiments in Figure 1 used the no file system caching option for the DB2
table space. If you allow file system caching for the DMS table space containers
(Figure 2), the XML insert performance improves slightly while the CLOB inserts
suffer. Since CLOB inserts use direct writes, the file system cache is not needed
and is pure overhead. However, file system caching can help
read operations on CLOB columns if no XML parsing is involved.
Figure 2: The impact of file system caching on XML and CLOB column inserts (no indexes or side tables)
Single-user database applications are very rare, so it is important to look at
the performance behavior of a concurrent workload. The insert
test was modified to use 10 concurrent threads with 10 separate connections to the database,
each inserting 10,000 documents without think time. Figure 3 shows that the
increased workload intensity hurts CLOB performance drastically. While CLOB inserts
were about twice as fast as XML inserts in our single-user tests (Figure 1), they
are more than twice as slow in our multi-user test (207%). This is because CLOB
inserts do not benefit as much from parallelism as XML column inserts. With
increasing degrees of concurrency, buffered inserts for XML columns are far more
scalable than concurrent direct writes for CLOBs.
Figure 3: Multi-user insert performance of XML compared to CLOB columns
Figure 4 clarifies that the parallelism speed-up is much higher for XML columns
than for CLOB columns. Taking single-user XML inserts as the 100% baseline, our
system and configuration allowed 10 concurrent insert streams to insert the same
100,000 documents in 18% of the time, which is more than five times faster. CLOB inserts didn't
benefit nearly as much from the parallelism and still took 37% of the baseline,
which is only 1.4 times faster than non-parallel CLOB inserts.
Figure 4: Parallelism speed-up for XML and CLOB column inserts
All of these insert tests were also conducted with 1000, 5000, 10,000, and 50,000
documents. As expected, the elapsed insert time for both CLOB and XML columns is
approximately linear in the number of documents inserted. Therefore, the
corresponding charts are omitted for brevity.
XML queries on CLOB compared to XML columns
To assess the difference in query performance between XML and CLOB columns, five queries were designed to cover the following common search and retrieval cases:
- Full document retrieval of all documents, no predicate
- Full document retrieval of one document matching certain criteria (one
predicate)
- Full document retrieval of documents matching certain criteria (multiple
predicates)
- Partial retrieval of all documents
- Partial retrieval of all documents matching certain criteria
These operations were implemented in the five queries as follows:
Q1 (Select*): Selects all XML documents (select * from <table>)
Q2 (1Pred1Doc): Returns the Customer document for a given account number
Q3 (5PredSome): Returns Customer documents for all female customers with a primary
address in California who have an account in USD and have not yet reached Premium
customer status
Q4 (PartialAll): For each Customer, returns the name, and the sum of the balances
of all his/her accounts
Q5 (PartialSome): Gets the primary email address for all customers who are
holding IBM stock in any of their accounts.
For CLOBs, these queries were expressed in SQL with XML Extender extract functions.
For XML columns, they were expressed in XQuery notation. Whether the XQueries
were embedded in SQL or executed stand-alone did not make a performance difference
in our tests. All queries and some sample data are available in pureXMLvsCLOBvsShredded.zip in the Downloads section.
Figure 5 shows the query performance (elapsed time) of all five test queries for
both pureXML and CLOBs. You can see that pureXML queries can easily be 20, 30 or 40
times faster than queries over XML in CLOB columns. These results are for our
default setup with no file system caching for table spaces. Figure 6 contains the
same results with file system caching. File system caching has a
significant impact only on query Q1 which retrieves all documents without
any predicate evaluation. Without file system caching, Q1 performs similarly
on XML and CLOB columns, with a slight disadvantage (10%) for the CLOB column which
is not buffered in the buffer pool. File system caching significantly improves CLOB
retrieval performance (see Figure 6) such that query Q1 can be more than twice as fast
as on the XML column. This is because reading from the XML column requires
serialization, or the conversion of parsed XML back to text format. Without file
system caching, this overhead is mitigated by the fact that XML columns are
buffered in the DB2 buffer pool and CLOB columns are not.
Figure 5: Query performance, no indexes, no file system caching
For queries 2 through 5, file system caching does not play a significant role.
These queries always require sub-document level access to evaluate predicates
and to extract document fragments. This is where the real benefit of pureXML
kicks in: XML is stored in a parsed format, so no parsing is required at query
execution time. In our tests, this yields a performance speed-up between 7 times and
44 times.
Figure 6: Query performance, no indexes, with file system caching
It is important to note that the difference in query response time between XML
columns and CLOB columns increases drastically with the amount of data that needs
to parsed (for CLOBs) or traversed (for pureXML). Figure 7 shows the query response times as
a function of the number of documents in the table, ranging from 1,000 to 100,000
documents (when no indexes or side tables are used).
Figure 7: Query 2 performance as a function of the data volume
The XML Extender offers the concept of side tables to speed up the search for
XML documents in order to avoid XML parsing for predicate evaluation. At insert time,
specific elements and attributes are extracted into relational tables. You already know that this adds significant overhead to CLOB inserts, but the side tables can be
efficiently searched and joined with the main table containing the CLOBs. Three of
our five test queries (q2, q3, and q5) contain filtering predicates that can
benefit from side table look-ups. Side tables can avoid much of the XML parsing for
CLOBs and can often make CLOB queries 100 times faster or more.
In Figure 8, let's compare this to pureXML with real XML indexes that provide
similar benefits. All of the six bars in Figure 8 represent elapsed times of about
one second or less. And yet pureXML with indexes is 6 to 35 times faster
than CLOB columns with side tables. There are multiple reasons for this. A pureXML
index points directly to the row with the corresponding document. With side tables,
DB2 first performs an index lookup on the side table and then joins the
corresponding row with the main table that contains the CLOBs. Query Q3
(5PredSome) has multiple predicates and uses three side tables plus the main
table, so it needs to compute a four-way join. Query Q5 uses side tables for the
predicates, but requires an extract function (with XML parsing) to retrieve the
customers email addresses.
Figure 8: Predicate evaluation with indexes (pureXML) and side tables (CLOBs)
Figure 8 clarifies that although side tables can reduce the number of CLOBs that
must be parsed, XML indexes in DB2 9 similarly reduce the number of documents
that must be traversed for predicate evaluation. Thus, the use of indexes and
side tables typically reduces the absolute elapsed times for both XML and CLOBs, but
does not erase the large relative performance advantage of XML over CLOB
columns.
Multi-user query tests were executed with 10 users on both CLOB and XML
columns. The results are omitted because the relative performance difference
between CLOB and pureXML storage was similar to the single-user tests shown
above.
Shredded storage compared to XML columns
Shredding XML data to relational tables is still a common requirement. A typical
reason is that existing applications may not yet understand XML and require a
relational format. This includes legacy SQL applications and packaged business
applications as well as BI and reporting tools. The annotated XML Schema decomposition in DB2 9 ("new shred") resolves the functional and performance limitations of the older XML
Extender shredding. The following section examines how the new DB2 9 shredding performance
stacks up against the pureXML technology in DB2 9.
For these tests, the same Customer data was used as before. Due to various
repeating elements, a total of 87 columns in 12 tables were required to represent the
XML data in a traditional relational schema. The schema is shown in Figure 9,
indicating the number of columns and the number of rows in each table after
shredding 100,000 of our customer documents. In total, the tables contain more than
3.5 million relational rows to represent the 100,000 XML documents. The arrows in
Figure 9 indicate one-to-many relationships. Indexes are defined on all primary
and foreign key columns to support efficient joins between the 12 tables.
Figure 9: Relational schema to hold shredded XML data
Comparing shredding to relational tables with pureXML insert
Shredding the 100,000 documents in this relational schema with the new annotated
schema shred in DB2 9 is about 1.75 times slower than inserting the same XML documents
into an XML column (see Figure 10). This is with commits after every document, as is
typical of OLTP applications.
Figure 10: pureXML inserts compared to DB2 9 shredding, commit after every document
If you commit less frequently, you can increase the performance advantage of an
XML column over shredding. Committing after every other document makes XML
inserts 2.56 times faster than shredding (in our scenario and configuration). For bulk
insert or import operations, you would only commit every 50 or 100 documents and
find XML column inserts 4 to 5 times faster than shredding (see Figure 11). XML column
inserts benefit from larger commit intervals more than shredding does, with a
larger number of log pages per log I/O request.
Figure 11: pureXML insert compared to DB2 9 shredding, varying commit interval
For a commit interval of 50, Figure 12 compares the new annotated schema shred
in DB2 9 to the XML Extender shred in V8 and pureXML inserts. In our tests, the new
shred is 7 times faster than the XML Extender shred, and pureXML inserts can be even 30 times
faster than V8 shredding.
Figure 12: DB2 V8 XML Extender shredding compared to DB2 9 technology (commit interval = 50)
Comparing XML queries on shredded data to pureXML columns
Assume that you need to serve an XML-oriented application or
web service that requires query results in XML format. If the XML data is shredded
to relational tables, relational query results must be converted back to XML.
For this purpose, SQL/XML publishing functions can be used in the SELECT clause of
SQL queries to construct the required XML tagging for the result set.
Five SQL queries were defined over the relational schema in Figure 9 that are logically equivalent to our five XQueries over XML columns. These SQL
queries use traditional relational predicates, join some or all of the tables, and
use SQL/XML publishing functions to return the same XML results as the XQueries.
The relative performance results are shown in Figure 13.
Query Q2 (1Pred1Doc) returns only one document, based on an account number
lookup. In both cases (pureXML and publishing), there are indexes on the account
number to improve the performance of this query. Although the SQL/XML query is
quite fast, it's still 75 times slower than an XQuery on an XML column. This is because
an XQuery only needs to locate the document through the index and serialize it, while
SQL/XML publishing must join all 12 relational tables and construct the XML
document.
Queries Q1 (Select*) and Q3 (5PredSome) return many documents, so the cost
for SQL/XML document construction from relational data is multiplied. Query Q4
(PartialAll) reads a few values from each XML document and constructs a whole
new document out of them. With XQuery over pureXML storage the values are
read from the XML column and the construction is expressed in XQuery. With SQL/XML
publishing over shredded data, the values are read from relational columns, and the
construction is expressed in SQL/XML. In both cases, the construction is the
bottleneck. This is why both versions of Q4 perform approximately identically.
Figure 13: pureXML XQuery compared to SQL/XML publishing over shredded data
Q5 (PartialSome) uses multiple predicates to return very few XML documents
with only one element. In this case, SQL/XML is not very expensive because
it constructs just one element per result row and joins only 3 of the 12 tables.
Since searching over relational data can be somewhat faster than searching over XML
data, you see better performance for this query over shredded data (20 times faster
than XQuery in our setup).
Summary and conclusions
While a single-user workload may find CLOB inserts faster than pureXML inserts,
the intensity of concurrent inserts in typical business scenarios can make CLOB inserts 2 or 2.5 times slower than
inserts into an XML column. If inserts are committed after every document, pureXML inserts are about 60% to 70% faster than the new shredding solution in DB2 9.
For bulk inserts or import with less frequent commits, pureXML can ingest XML data
even 4 to 5 times faster than shredding. These tests used the new shredding solution in
DB2 9, which is 7 to 8 times faster than XML Extender shredding in V8.
XQuery over XML data in XML type columns can be 40 times faster than corresponding
queries over CLOB columns which require XML parsing at query time. The absolute performance difference between "pureXML queries" and "CLOB queries" increases (linearly) with the amount of data that is queried.
The relative performance difference between pureXML XQueries and SQL/XML
Publishing over shredded data depends heavily on the amount of relational data that
needs to be tagged as XML, but also on the number of joins required.
XQuery clearly outperforms publishing queries when complex data must be
retrieved. In some of our tests, retrieving XML data from DB2 pureXML storage can be 50 to 100 times faster than constructing XML data from relational tables.
However, simple search queries without complex joins that return results with
little or no XML tags can be much faster in SQL. Still, an application always needs
to weigh query performance against insert performance and evaluate the mix of both.
XML insert performance is summarized in Figure 14.
Figure 14: Summary of XML insert performance (commitcount =1)
Note that all performance results in this article were obtained in an isolated
lab environment with specific hardware, operating, and database configurations.
Different tests in different environments may show bigger or smaller performance
differences than the results described in this article.
Although all of our measurements used DB2 only, many of the performance issues
with CLOB and shredded XML storage are inherent in the general concepts of storing
XML as text or converting it to the relational data model, respectively. Therefore,
you may see similar performance characteristics in other DBMS environments that
support these concepts. Readers interested in such matters are encouraged to
explore such issues in their own environments, perhaps using the materials in this
article as a basis for developing their own tests.
Acknowledgement
For their reviews of this paper, we'd like to thank Henrik Loeser, Cindy
Saracco and Nikolaj Richers.
Download | Description | Name | Size | Download method |
|---|
| Sample scripts for this article | pureXMLvsCLOBvsShredded.zip | 14KB | HTTP |
|---|
Resources Learn
-
"What's new in DB2 Viper - XML to the core" (developerWorks, February 2006), Cynthia M. Saracco.
-
"DB2 9 XML Performance Characteristics" (developerWorks, June 2006), Irina Kogan, Matthias Nicola, and Berni Schiefer.
-
"15 Best Practices for pureXML Performance in DB2 9" (developerWorks, October 2006), Matthias Nicola.
-
"pureXML in DB2 9: Which way to query your XML data?" (developerWorks, June 2006), Matthias Nicola and Fatma Ozcan.
-
"An introduction to the SQL/XML publishing functions" (developerWorks, November 2005), Roman Melnyk.
-
"From DAD to annotated XML schema decomposition" (developerWorks, April 2006), Mayank Pradhan.
-
DB2 XML Extender, WORF, and WebSphere Application Developer (developerWorks, September 2004), Owen Cline.
-
"Exploit XML indexes for XML query performance in DB2 9" (developerWorks, November 2006), Matthias Nicola
-
Visit the DB2 XML (pureXML) wiki to stay up-to-speed on DB2 XML technology.
-
Browse the technology bookstore for books on these and other technical topics.
-
developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
-
Stay current with developerWorks
technical events and webcasts.
Get products and technologies
-
DB2 v9: Get DB2 9 software.
-
Find useful information about the DB2 family of products and features as well as related WebSphere(R) Information Integration products and features at the DB2 Information Center.
-
DB2 Express-C: Free to develop, deploy, and distribute.
-
Build your next development project with
IBM
trial software, available for download directly from developerWorks.
Discuss
About the authors  | 
|  | Dr. Nicola is the technical lead for XML database performance at IBM's Silicon Valley Lab. His work focuses on all aspects of XML performance in DB2, including XQuery, SQL/XML, and all native XML features in DB2. Dr. Nicola works closely with the DB2 XML development teams as well as with customers and business partners who are using XML, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Dr. Nicola worked on data warehousing performance for Informix Software. He also worked for four years in research and industry projects on distributed and replicated databases. He received his doctorate in computer science in 1999 from the Technical University of Aachen, Germany. |
 | 
|  | Vitor Rodrigues is a software developer at the IBM Silicon Valley Lab. He graduated from University of Minho, Portugal, in Computer Science and Systems Engineering. He joined IBM in 2005 as an intern working on DB2 Everyplace and DB2 9 pureXML. Vitor was a part of the DB2 9 QA team for pureXML, where he acquired deep knowledge of the XML features in DB2 9. After his internship, Vitor became a regular employee, now working for the DB2 9 XML Enablement team. |
Rate this page
|  |