Performance Optimizations for the XML Data Type in SQL Server 2005

Microsoft SQL Server 9.0 Technical Articles

Shankar Pal, Babu Krishnaswamy, Vasili Zolotov, and Leo Giakoumakis
Microsoft Corporation

December 2005

Applies to:
Microsoft SQL Server 2005
XML Data Type

Summary: This paper explores several ideas to improve the query and data modification performance of the XML data type in Microsoft SQL Server 2005. To get the most value from this paper, you need to be familiar with the XML features in SQL Server 2005. For background material, see XML Support in Microsoft SQL Server 2005 and XML Best Practices for Microsoft SQL Server 2005 on the Microsoft Development Network (MSDN). (26 printed pages)

Click here for the Word version of this article [ http://download.microsoft.com/download/f/1/c/f1cf7b8d-7fb9-4b71-a658-e748e67f9eba/perfoptforxml.doc ] .

Introduction
Data Modeling with the XML Data Type
Bulk Loading XML Data
Indexing XML Data
Query and Data Modification
Conclusion

Introduction

Enterprise applications are increasingly using XML for modeling semi-structured and unstructured data. Microsoft SQL Server 2005 provides extensive support for XML data processing to help develop such applications. XML data can be stored natively in an XML data type column, which can be typed according to a collection of XML schemas or left untyped. Fine-grained data manipulation is supported using XQuery, an emerging W3C recommendation currently in Last Call, and an XML data modification language. The XML column can be indexed to improve query performance. Enterprise applications are increasingly using XML for modeling semi-structured and unstructured data and will benefit from the XML support in SQL Server 2005.

This paper provides suggestions for optimizing the storage, queries, and data modification of applications that use the XML data type. The ideas are illustrated with code samples. For a discussion of best practices for XML data modeling and usage, see the related white paper, XML Best Practices for Microsoft SQL Server 2005 [ http://msdn2.microsoft.com/en-us/library/ms345115.aspx ] , on the Microsoft Developer Network (MSDN). For information about optimizations for XML view technology using mapping, see Optimizing SQLXML Performance [ http://msdn2.microsoft.com/en-us/library/aa902661.aspx ] in the MSDN Library.

In this paper, we first consider the data modeling guidelines using XML, including database design principles, and then we provide query and data modification guidelines for optimizing the performance of applications.

Data Modeling with the XML Data Type

The XML data type provides data modeling capabilities for semi-structured and unstructured data within an enterprise. The performance of XML storage and query processing depends on the database schema design and includes factors such as the structure and granularity of the XML data and property promotion from XML columns.

The first decision to make is whether an application needs the features of the XML data model. Structured data is best modeled as relational and stored in tables with relational columns. Your interests are best served with the XML data model if you have semi-structured or markup data that need to preserve the document order and containment hierarchy, and perhaps has a recursive structure.

Sometimes, it is beneficial to store structured data in an XML data type column such as when the data has a flexible structure or the structure is not known a priori.

This scenario occurs in property management where metadata information about objects is modeled as XML and stored in an XML data type column. Properties of different types of objects, even with different structures and content models, can be stored in the same XML column and queried across. Properties queried most often are promoted into columns of the same table or a different table. The promoted properties can be indexed and queried, and query plans are simpler than querying the XML column.

Alternatively, the incoming XML data can be decomposed into tables and queried by using the SQL language. If XML generation is a significant part of the query workload, it may be beneficial to store a redundant copy of the XML data in an XML data type column. The redundant copy avoids the run-time cost of XML generation.

There are no absolute rules for modeling data with the XML data type and the pros and cons must be carefully weighed in each modeling situation. Equally important is the choice between typed and untyped XML columns, and even the way XML markup is introduced into the data. These and some other considerations are discussed in the remainder of this section.

Structure of XML Data

The same data can be marked up in different ways (element-centric, attribute-centric, and combined). This choice is dictated by the perception of what constitutes content (element value) and what constitutes meta-information (attribute value), and the cardinality of the markup (multi-occurrence of elements). Introducing XML markup into semi-structured and unstructured data in one way can be more efficient than some other ways for storage and query processing.

Using Specific Markups

Sometimes it is convenient to use generic element names and distinguish between different types of elements by using additional attributes. This does not perform well in queries because it does not allow XML index lookups to be done efficiently. For more information about XML indexing, see Indexing XML Data later in this paper.

Specific, semantically rich element names on the one hand yield markups that are more humanly readable and help to generate more efficient query plans. On the other hand, very verbose markups increase storage cost. The following example illustrates these points.

Example: Generic Versus Specific Markups

Suppose you want to use XML markup for book and DVD information. One choice is to have a generic element called <item> with an attribute @type that has one of the two values, book and DVD, to distinguish between the two types of items. A book and a DVD can be represented as follows:

<item type="book"><title>Writing Secure Code</title></item>
<item type="DVD"><title>The Godfather</title></item>

Path expressions for book and DVD, respectively, can be written as /item[@type = "book"] and /item[@type = "DVD"].

On the other hand, <book> and <DVD> are more direct XML markups:

<book><title>Writing Secure Code</title></book>
<DVD><title>The Godfather</title></DVD>

This representation yields simpler path expressions /book and /DVD. Query plans are also simpler and more efficient, because the predicate on the attribute @type is eliminated.

Furthermore, it cuts down the number of rows in the primary XML index for untyped XML from four (one row for <item>, one for @type and its value, one for <title>, and one for the title's value) to three (one row for <book> or <DVD>, one for <title>, and one for the title's value). For more information about indexing XML data, see Indexing XML Data later in this paper.

For typed XML in which <title> is a simple-valued element, the value of the <title> element is stored in the same row as the element itself. This reduces the storage overhead from three to two rows and is a significant savings.

Attribute-Centric Markups

An attribute's value is stored with the attribute markup in a single row of the primary XML index for both typed and untyped XML. By comparison, the value of a simple-valued element in untyped XML is stored in a separate row from the element markup. Thus, less storage is required by using attribute values within untyped XML.

Furthermore, evaluation of predicates is more efficient because the attribute's value is obtained from the same row as its markup in the primary XML index. This eliminates the need to access another row for the value. This is illustrated in the following example.

Example: Attribute-Centric Markup

In the example above under "Generic versus Specific Markups," title can be modeled as an attribute instead of as an element as follows:

<book title="Writing Secure Code"/>
<DVD title="The Godfather"/>

For untyped XML, this cuts down the number of rows in the primary XML index from three (a row each for <book>, <title>, and the title's value) to two (one row for <book> and a second one for the attribute @title). The case is similar for DVD.

The path expression /DVD[title = "The Godfather"] finds the DVD whose title is "The Godfather" with element-centric markup for title. With attribute-centric markup, the same query is written as /DVD[@title = "The Godfather"] and requires one less JOIN.

Typed or Untyped XML

Element and attribute values in untyped XML (XML data not described by XML schemas) are stored internally as Unicode strings. Operations on them require data conversion to the appropriate type. For example, when the path expression (/book/price)[1] > 19.99 is evaluated, the string value of <price> is converted to decimal for the numeric comparison. A large number of such comparisons can become costly.

Type information provided by XML schemas is used by the database engine in several ways. Inserted and updated XML data are verified for conformance with the XML schemas and stored in a binary representation (XML blob). Element and attribute values are stored as typed values within XML instances. This allows XML blobs to be parsed more efficiently than the corresponding textual form. Typed values are stored in XML indexes and allow index usage whenever data conversions are eliminated. Query compilation uses type information to check the static type correctness of XQuery expressions and data modification statements. Type mismatch errors are detected at compilation time and can be avoided by using explicit type casts.

Query optimizations based on type inference are also performed (if <price> of <book> is of type xs:decimal, conversion of (/book/price)[1] to xs:decimal is eliminated). This can have a positive effect on XML index lookup. A range predicate such as (/book/price)[1] < 19.99 performs a range scan on secondary XML index of type VALUE. For more information about indexing XML data, see Indexing XML Data in this paper.

The data conversion required for untyped XML prevents such range scans. Furthermore, the ordinal [1] in (/book/price)[1] is unnecessary if the XML schema specifies a single <price> element and only allows a single <book> element in each XML instance.

Typed XML requires validation during XML data insertion and modification. The cost of validation may be non-trivial and depends on factors such as the complexity of the schema definitions and the number of tags occurring in the XML data.

Property Promotion

During query processing, structural information, such as document order and containment hierarchy, are preserved in XML instances. Consequently, the query plans tend to be complex. The plan can be simplified for some queries by promoting scalar values from an XML column into relational columns of the same or different table, and writing queries directly against these columns. The promoted properties may be indexed. Materializing and indexing the property values yields better performance than using XQuery on the XML column in the same way that pre-computed values speed up query performance.

Property promotion improves performance when the property value is retrieved or the property value is used as a filter to retrieve the corresponding XML blob. In the latter case, the selectivity of the property value is an important factor.

Single-valued properties can be promoted into columns of the same table as computed columns. Both single-valued and multi-valued properties can be promoted into columns of a different table and maintained by using triggers. These two ways of promoting properties are considered in the following section.

Using a Computed Column

A Transact-SQL user-defined function is first created to extract a scalar value using XML data type methods. A computed column defined by the user-defined function is then appended to the table. These two steps are repeated for each promoted property, and relational indexes are created on those columns as needed.

The XQuery expression on the XML column must be rewritten as an SQL statement that uses the computed columns, and the XML instances are retrieved from the matching rows. Indexes on computed columns are selected by the query optimizer based on the costing of the query. Promoted properties yield faster performance than querying the XML column directly because the computed columns are pre-computed.

Indexing the computed column can be avoided when the column is used only in SELECT lists and not for evaluating predicates. In such cases, persistence of the computed column is sufficient for performance benefits. When the computed column is indexed, it needs to be persisted if the computed column expression is imprecise or non-deterministic.

The following example illustrates the use of computed column for property promotion.

Example: Using a Computed Column for Property Promotion

Suppose your workload typically looks up books given their ISBN number so that promoting the ISBN number into a computed column is worthwhile. Define a user-defined function to retrieve the ISBN number as follows:

CREATE FUNCTION udf_get_book_ISBN (@xData xml) RETURNS varchar(20)
WITH SCHEMABINDING
BEGIN
   RETURN @xData.value('(/book/@ISBN)[1]', 'varchar(20)')
END

Add a computed column to the table docs for ISBN:

CREATE TABLE docs (id int PRIMARY KEY, xCol XML)
ALTER TABLE docs ADD ISBN AS dbo.udf_get_book_ISBN(xCol)

Create a non-clustered index on the ISBN column:

CREATE INDEX COMPUTED_IDX ON docs (ISBN)

Rewrite your query as:

SELECT xCol 
FROM   docs
WHERE  xCol.exist ('/book/@ISBN[. = "0-2016-3361-2"]') = 1

to use the computed column as follows:

SELECT xCol
FROM   docs
WHERE  ISBN = '0-2016-3361-2'

The rewritten query generates a simpler query plan, because the extraction of the ISBN value is pre-computed.

Using a Property Table

A separate property table requires setting up insert, delete, and update triggers for its maintenance. It is suitable for multi-valued properties where each row in the property table contains a property value (unpivoted representation). An example illustrating the creation and maintenance of property tables can be found in XML Best Practices for Microsoft SQL Server 2005 [ http://msdn.microsoft.com/xml/default.aspx?pull=/library/en-us/dnsql90/html/sql25xmlbp.asp ] .

A sequence number column is desirable in the property table if the relative order of siblings is important for the application. This, however, complicates property table maintenance for XML subtree insertion and deletion.

Single-valued property columns can be added to the table for convenience. It introduces redundancy in the column, but eliminates a JOIN when both properties are required.

If the maximum cardinality N of the promoted property is small and known ahead of time, it may be convenient to create N computed columns instead of a separate property table and have the query processor maintain those column.

Bulk Loading XML Data

XML data can be bulk loaded into XML data type columns by using the bulk load capabilities of SQL Server. This includes the BCP IN, BULK INSERT, and OPENROWSET methods.

BCP input has been optimized to avoid intermediate copies of XML data wherever possible. Thus, if no (row or column) constraint exists on the XML column, BCP has the best performance among the three alternatives.

Using OpenRowset

OPENROWSET is a convenient way of loading XML data from files into XML columns, variables, and parameters. Querying the XML data in variables or parameters multiple times may retrieve the data that many times from the file. It is better to read the XML data once into an XML variable and to query it multiple times, as shown in the following example.

Example: Querying the Output of OPENROWSET

In the following query, the XML data is read from the file into the column [Contents] of the table expression XmlFile. The nodes() method finds the <author> elements in the XML instance. Each value() method evaluates a path expression relative to an <author> element, which loads the XML data from the file each time.

WITH XmlFile ([Contents]) AS (
SELECT CONVERT (XML, [BulkColumn]) 
FROM OPENROWSET (BULK N'C:\temp\Filedata.xml', SINGLE_BLOB) AS [XmlData]
)
SELECT nref.value('first-name[1]', 'nvarchar(32)') FirstName,
       nref.value('last-name[1]', 'nvarchar(32)') LastName
FROM    [XmlFile] CROSS APPLY [Contents].nodes('//author') AS p(nref)

The file data can be loaded once as shown in the following rewrite for better performance. The file content is read only once into the XML variable @xmlData and reused in the SELECT statement:

DECLARE @xmlData XML;
SELECT @xmlData = CONVERT (XML, [BulkColumn]) 
FROM OPENROWSET (BULK N'C:\temp\Filedata.xml', SINGLE_BLOB) AS [XmlData];

SELECT nref.value('first-name[1]', 'nvarchar(32)') FirstName,
       nref.value('last-name[1]', 'nvarchar(32)') LastName
FROM    @xmlData.nodes ('//author') AS p(nref)

Storage Space Consideration

In the presence of primary XML index, bulk loading data into an XML column and inserting very large XML instances takes up a large amount of transaction log space. This issue can be avoided by delaying the creation of the XML indexes on the XML column and using the SIMPLE recovery model, as shown in the following:

Run the following command where <database_name> is the name of the database into which the XML data will be loaded:
ALTER DATABASE <database_name> SET RECOVERY SIMPLE
Create the XML column in a new or existing table, but not the XML indexes.
Insert the XML data into the XML column.
Create the XML indexes on the XML column.
Run the following command to reset the recovery model to full:
ALTER DATABASE <database_name> SET RECOVERY FULL

Alternatively, you can use BCP to load the XML data into the database and use the BULK_LOGGED recovery model instead of SIMPLE. For more information about the SIMPLE and BULK LOGGED recovery models, see SQL Server 2005 Books Online.

In these recovery models, if the data file has been damaged since the last backup, the database must be restored from the backup and the operations redone.

In either case, pre-allocating the database file before inserting the XML data is faster than dynamically growing the database file.

Indexing XML Data

XML Indexes

For fine-grained queries into an XML column, it is advisable to create the primary XML index on the XML column. A primary XML index can be created on both untyped and typed XML columns, and indexes all paths and values within the entire XML column. Primary XML indexes create a B⁺tree based on a shredded representation of the XML instances in the XML column. This B⁺tree is created in addition to the XML blobs in the XML column and is larger in size than the combined size of the XML blobs in the XML column. The B⁺tree is used for querying the XML data using XML data type methods. The XML blob is used to optimize the cases in which the entire XML blob is retrieved from the base table, such as in SELECT * FROM docs. This is faster than serializing the XML content from the primary XML index owing to its smaller size and the serialization cost.

Secondary XML indexes provide further options for the query optimizer to come up with a better plan. Your application can get a further boost by using secondary XML indexes of type PATH, PROPERTY, and VALUE.

The PATH index is useful whenever path expressions such as /book[@ISBN = "0-2016-3361-2"] occur on an XML data type. The benefits are greater for longer path expressions. The PATH index provides good, overall speed-up.
The PROPERTY index is useful when multiple properties of an XML instance are retrieved within a SELECT statement. Clustering the properties of each XML instance together can yield better performance.
The VALUE index is useful for path expressions containing the descendant axes (the //-operator) and wildcards (/book[@* = "novel"]).

Analysis of the query workload is required to determine whether one or more of the secondary XML indexes are helpful. The index maintenance cost should also be taken into account in measuring the overall benefit of indexing the XML data.

Many applications know the expected query workload and will benefit by indexing only the paths occurring in the queries. Those paths can be promoted as properties as discussed in Property promotion later in this paper.

Partial XML Update

In-place update of the XML data type yields significant performance improvement during fine-grained data modification. The difference between the new (after the data modification) and the old (before the data modification) states is computed and applied to the XML column storage as well as the primary XML index. Changes in the primary XML index are propagated to the secondary XML indexes as well. The performance benefits come from the smaller amounts of data updated in storage and the corresponding savings in the transaction log. These savings offset the cost of comparing the new and the old states in most cases.

The best case scenario is the modification of the value of an attribute or an element using the "replace value of" statement in XML DML. This requires updating a single row in each of the primary and secondary XML indexes on the XML column. The update is also local to the on-disk page of the XML blob containing the updated attribute or the element. Of course, replacing the old value with a large value causes new disk pages to be written. The following is an example where the update is very efficient.

Example: Updating the Value of an Attribute

Modifying the <price> of a <book> as shown in this example performs in-place update of the XML instance and the XML indexes:

UPDATE docs
SET    xCol.modify ('replace value of (/book/price/text())[1] with 29.99')

For insertion of an attribute, an element, or a subtree, the newly inserted node and the siblings following it, together with their subtrees, are updated or inserted. Similar changes occur in the XML blob. The case is similar for node deletions and the sibling beyond the point of deletion are updated.

The worst case scenario occurs during the insertion of a node as the leftmost fragment of an XML data type instance or the insertion of the leftmost child of the root element. This updates the entire XML instance. This situation can be avoided by inserting the node as the rightmost fragment in the XML instance or the rightmost child of the root element.

Deletion of the leftmost fragment or the leftmost child of the root element has similar costs. If an element is inserted and deleted often, it is better to insert it as the rightmost fragment or the rightmost child of the root element. The following example illustrates an expensive case.

Example: Costly Update

The <publisher> element is inserted as the leftmost child of the <book> element and causes updating all the sub-elements of <book>.

UPDATE docs
SET    xCol.modify ('
 insert <publisher>Microsoft Press</publisher>
 before (/book/title)[1]')

Inserting the <publisher> as the rightmost child of <book> is more efficient:

UPDATE docs
SET    xCol.modify ('
 insert <publisher>Microsoft Press</publisher> into (/book)[1]')

XML schema constraints may determine the insertion point, and inserting the new node in the rightmost permissible position yields the best performance.

Optimizations Prevented by Union of Types

A value of a union type requiring an implicit cast prevents lookup of secondary XML indexes for the value, although secondary XML indexes may be used for matching paths. Thus, it prevents range scans from occurring on the VALUE secondary XML index. For more information, see Range Conditions in this white paper. The same reasoning applies to <xs:anyAttribute>, as well.

Model groups (<xs:choice> and <xs:all>), substitution groups, and wildcard sections (xs:any) have as a content model a union of types. When the exact type is not known during query compilation and optimization, run-time type casts may be needed for operating on their values. This slows down the query. Thus, such XML schema structures and data types should be avoided, if possible, for performance reasons.

Using schema structures that indicate singleton occurrence of elements helps with query optimization. For this reason, a <xs:choice> structure is preferred over a <sequence> with optional elements.

Disabling XML Index Selection

XML index selection is disabled in check constraints because the query optimizer does not guarantee that the XML index is modified before evaluating the constraint,or vice versa. Sufficient care must be exercised to ensure that the constraint can be evaluated efficiently on XML blobs by following the performance guidelines in this paper. Furthermore, XML index selection is disabled in views with CHECK OPTION.

Full-Text Index on the XML Column

A full-text index can be created on an XML column independently of XML indexes on the column. It indexes element content, ignores the XML markup tags and attribute values, and uses the markup tags as token boundaries.

The XQuery function fn:contains() has the semantics of a literal, substring match that is case-sensitive in the implementation. On the other hand, a full-text search using CONTAINS() uses a token match with stemming. Thus, their semantics are different. To illustrate the differences, a search for the word "data" matches the word "database" in Xquery, but not with full-text semantics. On the other hand, a search for the word "drove" matches the word "driving" with full-text semantics, but not in XQuery. Furthermore, full-text search cannot be used for searching over attribute values, while XQuery expressions need to use the aggregate function fn:string() to search over mixed content.

When full-text index exists on an XML column, it is advisable to do the following:

Filter the XML values of interest using full-text search.
Query the selected XML instances using XML data type methods. XML indexes on the XML column get used during this step.

This leads to the use of both full-text and XML indexes. The high selectivity of the search word or phrase in a full-text search narrows down further processing for the XQuery search to a relatively small number of rows in the base table. This can significantly speed up the query. This approach can be used when the search phrase consists of word stems that match keyword boundaries.

The XQuery search specifies the context (the node set) for the search. A part of the XML data can be promoted to a computed XML column XC on which a full-text index is created. This defines XC as the full-text search context.

Example: Combining Full-text with XQuery Matches

The following query performs a full-text search for the keyword, data, and verifies that the word "data" occurs in the context of the <title> element of a <book>. It uses the full-text contains() method to locate the XML instances containing the search word. The XML data type method exist() verifies that the XML instances contain the substring in the correct context.

select * 
from   docs 
where  contains(xCol, 'data')
AND    xCol.exist('/book/title/text()[contains(.,"data")]') = 1

Example: Using Prefix Search in Full-text

It is possible to perform prefix searches in full-text indexes. For a query that will match any keyword that starts with data, such as database, the previous query can be rewritten as follows. The XQuery search matches database as well.

select * 
from   docs 
where  contains(xCol, '"data*"')
and    xCol.exist('/book/title/text()[contains(.,"data")]') = 1

Note the use of the double quotes in the full-text contains() method.

Snapshot Isolation and XML Indexes

XML data modification updates old XML instances with the new ones. These changes are propagated to the primary and secondary XML indexes. Modified rows in the base table and the XML indexes are locked, and the row and page locks may escalate to table locks at the discretion of the query optimizer. Concurrency suffers owing to lock escalation, especially when modification is common in the workload.

In SQL Server 2005, snapshot-based isolations introduce a new isolation level called snapshot and a new implementation of read-committed isolation level. More information about these can be found in SQL Server Books Online. These are based on an internal versioning mechanism that eliminates lock contention between readers and writers when the database is enabled for snapshot isolation. The reduced lock contention may yield higher throughput.

A read operation under snapshot-based isolations can access the versioned data without getting blocked on a concurrent update. This reduced blocking can potentially improve transaction throughput in concurrent workloads.

With snapshot isolation, XML column values and the corresponding primary and secondary XML index rows are versioned upon update. This avoids unnecessary versioning of the XML columns when modifications in non-XML columns cause the containing row to be versioned. This optimization makes snapshot isolation very useful for XML processing.

Query and Data Modification

Merging Multiple value() Method Executions for Indexed XML

In the indexed case, the execution of multiple value() methods on the same typed XML column in a SELECT list may be combined for faster execution. The decision to combine the executions is made by the query optimizer based on its costing of the query. This can yield significant speed-up. Following is an example.

Example: Combining Multiple value() Method Executions

Suppose that the content model for <book> element is defined by the XML schema namespace http://www.microsoft.com/book in an XML schema collection bookCollection. Furthermore, a table TypedBooks is created with an XML column xBook typed using bookCollection, and the primary XML index is created on this column. The XML schema definition is shown in the following:

CREATE XML SCHEMA COLLECTION bookCollection AS 
'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
   xmlns="http://www.microsoft.com/book" 
  targetNamespace="http://www.microsoft.com/book">
  <xsd:element name="book" type="bookType" />
  <xsd:complexType name="bookType">
   <xsd:sequence>
     <xsd:element name="title" type="xsd:string" />
     <xsd:element name="author" type="authorName" 
      maxOccurs="unbounded"/>
   <xsd:element name="price" type="xsd:decimal" />
   </xsd:sequence>
   <xsd:attribute name="subject" type="xsd:string" />
   <xsd:attribute name="releasedate" type="xsd:integer" />
   <xsd:attribute name="ISBN" type="xsd:string" />
  </xsd:complexType>
  <xsd:complexType name="authorName">
   <xsd:sequence>
     <xsd:element name="first-name" type="xsd:string" />
     <xsd:element name="last-name" type="xsd:string" />
   </xsd:sequence>
  </xsd:complexType>
</xsd:schema>'
GO
CREATE TABLE TypedBooks 
(id int PRIMARY KEY, xBook XML(DOCUMENT bookCollection))
CREATE PRIMARY XML INDEX idx_priXML_xBook ON TypedBooks (xBook)

In the following query, the execution of the value() methods are combined because they are invoked on the same XML column, and the singleton cardinality of the <title> and <price> elements are statically inferred from the XML schema:

WITH XMLNAMESPACES ('http://www.microsoft.com/book' AS "bk")
SELECT xBook.value ('/bk:book/title', 'nvarchar(128)') Title,
    xBook.value ('/bk:book/price', 'decimal(5,2)') Price
FROM    TypedBooks

The following conditions must hold for the optimization to occur:

The XML column must be typed so that singleton cardinality of nodes can be inferred from the XML schema collection typing the column. Wherever appropriate, the XML column should be declared with the column option DOCUMENT. Otherwise, the nodes() method must be used to generate single node references. The optimization works also for untyped XML when the nodes() method is used, thus ensuring singleton elements, and the value() method extracts attribute values from those elements.
Full paths must be specified for the optimization to occur. Paths containing wildcards (*), descendant axes (//-operator), ancestor axis (..), XPath functions, and node tests (node()) prevent this optimization.
Path expressions in the value() methods may not contain predicates or ordinals.
The optimization occurs for path expressions relative to the context items obtained from the nodes() method. In this case, the first argument of the value() methods must be the full relative path with the previously stated restrictions.
The value() method calls must appear contiguously in a SELECT list to be merged. The execution of non-contiguous value() methods are not combined.

The optimization may also occur with value() methods in the Transact-SQL WHERE clause for predicates such as xCol.value(...) = xCol.value(...). It does not occur in predicates of the type xCol.value(...) = constant.

Using the exist() Method for Checking Existence

For better performance, use the exist() method on the XML data type whenever possible, instead of the value() method. The exist() method is most helpful when used in the SQL WHERE clause and utilizes XML indexes more effectively than the value() method. This is true even when you use sql:variable() and sql:column() in XQuery expressions.

For example, consider the following query that retrieves books having the title "Writing Secure Code" using the exist() method:

SELECT * 
FROM   docs 
WHERE    xCol.exist('(/book/title/text())[.="Writing Secure Code"]') = 1

The PATH or VALUE secondary XML indexes are used to evaluate the path expression ((/book/title/text())[.="Writing Secure Code"]), including value lookup ("Writing Secure Code" in this example) in those indexes, to yield the XML instances to return. If the path and the search value are highly selective, the resulting execution can be much faster than evaluating the path expression over all the XML blobs in the column. The search value can be supplied by using sql:variable() or sql:column(). For more information, see Parameterize Your XQuery and XML DML Expression in this paper.

Writing the query using the value() method as shown in the following evaluates all book titles first and then applies the filter "Writing Secure Code":

SELECT * 
FROM   docs 
WHERE  xCol.value('(/book/title)[1]', 'varchar(50)') = 'Writing Secure Code'

This yields a less efficient query execution, because the filter value "Writing Secure Code" is not used in XML index lookup. A filter value specified by using a SQL variable or another value() method exhibits similar behavior.

Example: Use of sql:column()

The following query finds books written by more authors than the id of the book:

SELECT *
FROM     docs 
WHERE    xCol.exist('/book [count(author) > sql:column("id")]') = 1

Using the nodes()-value() Combination

The nodes() method generates a rowset of internal node references that can be used within a value() method to extract scalar values from those nodes. These methods together can be used to represent XML data in a relational form.

Each row in the output of the nodes() method represents a single reference so that the ordinal predicate used for selecting attributes of the context node in the value() method can be eliminated, as shown in the following example. Furthermore, if the nodes() method yields exactly one reference, removing the nodes() method altogether makes the query perform faster. These optimizations are most useful for XML variables and parameters.

Example: Eliminating Ordinal Predicate with the nodes() Method

This query extracts the ISBN attribute from each book instance in column xCol of table docs. The nodes() method emits a separate reference to each distinct <book> element (the context node), and there can be at most one @ISBN attribute on the context node.

SELECT ref.value('@ISBN', 'nvarchar(32)')
FROM   docs CROSS APPLY xCol.nodes('/book') AS node(ref)

If no more than one <book> element occurs in each XML instance, the following rewrite is faster:

SELECT xCol.value('(/book/@ISBN)[1]', 'nvarchar(32)')
FROM   docs

Optimizations for XML blobs

Multiple tempDB Files for Better Scalability of XML Variables and Parameters

XML variables and parameters use main memory as storage as long as their values are small. Large values, however, are backed by tempdb storage. In a multi-user scenario, if many large XML blobs occur, tempdb contention may become a bottleneck for good scalability. Creating multiple tempdb files reduces the storage contention and yields significantly better scalability. The next example illustrates how multiple tempdb files can be created.

Example: Creating Multiple tempdb Files

This example creates two additional data files for tempdb, each with an initial size of 8 MB, and two log files with an initial size of 1 MB.

USE TEMPDB
GO

ALTER DATABASE tempdb ADD FILE 
   (NAME = 'Tempdb_Data1', 
      FILENAME = 'C:\temp\Tempdb_Data1.MDF', SIZE = 8 MB),  
   (NAME = 'Tempdb_Data2', 
      FILENAME = 'C:\temp\Tempdb_Data2.MDF', SIZE = 8 MB) 
GO

ALTER DATABASE tempdb ADD log FILE 
   (NAME = 'Tempdb_Log1',
      FILENAME = 'C:\temp\Tempdb_Log1.LDF', SIZE = 1 MB),  
   (NAME = 'Tempdb_Log2',
      FILENAME = 'C:\temp\Tempdb_Log2.LDF', SIZE = 1 MB)
GO

These files can be removed by using the ALTER DATABASE tempdb REMOVE FILE command. For more information, see SQL Server Books Online.

Eliminating Extra Casts to XML data type

In an inlinable function with an input argument of type XML, the caller can supply a text or binary value that is implicitly converted to XML data type. Each use of the XML argument in the body of the callee casts the input value to XML data type. This cost can be avoided by copying the argument into an XML data type variable, which causes one conversion of the argument's value to the XML data type, and the XML variable is used multiple times in the body of the function or the stored procedure. The following example illustrates this point.

Example: Removing Conversions

Consider the following function GetTitleAndIsbnOfBook() that returns the title and the ISBN of a book:

CREATE FUNCTION GetTitleAndIsbnOfBook (@book XML) 
RETURNS TABLE AS
RETURN
   SELECT @book.value ('(/book/@ISBN)[1]', 'nvarchar(32)') ISBN,
            @book.value ('(/book/title)[1]', 'nvarchar(128)') title

If the function is invoked with a string value, a conversion to XML data type occurs for each value() method invocation. The function can be rewritten as follows to have only one conversion of its argument to XML data type. However, the table variable @retTab required for a multi-statement, table-valued function introduces additional cost. This can be offset by a sufficient number of accesses to the XML variable when the XML data size is large.

CREATE FUNCTION GetTitleAndIsbnOfBookOpt (@book varbinary(max)) 
RETURNS @retTab TABLE (ISBN nvarchar(32), title nvarchar(128)) AS 
BEGIN
     DECLARE @xbook XML
     SET @xbook = @book
     INSERT INTO @retTab
     SELECT @xbook.value ('(/book/@ISBN)[1]', 'nvarchar(32)'),
            @xbook.value ('(/book/title)[1]', 'nvarchar(128)')
     RETURN
     END

Specifying Singleton Elements

A singleton cardinality estimate removes the need to specify ordinals in queries and data modification statements. This simplifies the query plan and produces efficient JOIN operations. It typically involves making the proper choices for the inner and outer loops in nested loop joins.

In typed XML, elements by default have singleton cardinality in XML schema structures unless overridden by using the values of minOccurs and maxOccurs. In addition, the DOCUMENT constraint on a typed XML column, variable, and parameter guarantees exactly one top-level element in the XML data type instance.

For untyped data or when multiple sibling elements are allowed in a schema, singleton cardinality of nodes can be indicated in path expressions using an ordinal value that chooses exactly one node satisfying the path expression, as shown in the following example. The ordinal [1] is evaluated using the Transact-SQL TOP 1 ascending while the ordinal last() is evaluated as TOP 1 descending. The nodes() method also sets a singleton context item for each of the resulting XML instances.

If the choice of a single node is omitted, the query optimizer uses a default cardinality estimate which can be much too high. This can, for example, lead to suboptimal choices for the inner and outer loops in nested loop joins when predicates have to be computed. The effect is more pronounced in the case of XML blobs where no XML indexes exist and no statistical information is available for better estimation of the cardinality.

Example: Specifying Singleton Cardinality for Untyped XML

Suppose each XML instance in the xCol column contains a single top-level <book> element that has a single <title> sub-element. Consider the query:

SELECT   xCol.query ('/book/title')
FROM      docs

The query optimizer uses a default cardinality estimate for <title> element. Each <book> has a single title so that <title> is a singleton element, but the optimizer's estimate is much higher than that. The reformulated query conveys the correct cardinality to the optimizer:

SELECT   xCol.query ('(/book/title)[1]')
FROM      docs

The semantic difference between the similar-looking path expressions (/a/b)[1] and /a/b [1] is discussed in the paper XML Best Practices for Microsoft SQL Server 2005 [ http://msdn2.microsoft.com/en-us/library/ms345115.aspx ] on MSDN.

Eliminating Multiple Executions of XML data type Methods

A query such as:

SELECT case isnumeric (xCol.value ('(/book/price)[1]', 'nvarchar(32)'))
    when 1 then xCol.value ('(/book/price)[1]', 'decimal(5,2)')
        else 0
    end
FROM   docs

computes the <price> of a boo and converts the price to decimal (5, 2) if it is a numeric type. This logic is useful when the application might encounter non-numeric price values.

This query computes the value() method twice (this can be avoided by using a subquery as shown in the following) in which the value() method is computed in a subquery and reused in an outer SELECT:

SELECT case isnumeric(Price)
    when 1 then CAST (Price AS decimal(5,2))
        else 0
    end   
FROM    (SELECT xCol.value ('(/book/price)[1]', 'nvarchar(32)') Price
     FROM   docs) T

The same optimization can be used in other places as well such as in NULLIF:

SELECT NULLIF (Title, '')
FROM    (SELECT xCol.value ('(/book/title)[1]', 'nvarchar(64)') Title
     FROM   docs) T

Using the value() method in NULLIF() would compute the value() method twice when it returns a non-empty string.

Data(), text(), and string() Accessors

XQuery provides a function fn:data() to extract atomic, typed values from nodes, a node test text() to return text nodes, and the function fn:string() to return the string value of a node. However, their usage can be confusing. Guidelines for their proper use in SQL Server 2005 are illustrated in the following by using the XML instance <age>12</age>:

Untyped XML: The path expression /age/text() returns the text node under <age>, whose value is 12. The function fn:data(/age) returns the string value 12 as does fn:string(/age).
Typed XML: The expression /age/text() returns static error for any simple typed <age> element in SQL Server 2005. On the other hand, if <age> has a simple integer content, fn:data(/age) returns the integer 12, while fn:string(/age[1]) yields the string 12.

These functions have different performance characteristics. The fn:string() function recursively aggregates all text nodes under the context node. This is overkill when the context node is single-valued. Thus, fn:data() and text() not only suffice, but also are more efficient.

For untyped XML, when the value of a node is desired, returning a text node using text() is faster than fn:data(). The path expression /book/text() returns the text node children of a <book> element. Within the query() method, these text nodes are serialized out and appear to be a concatenation of the values of the text nodes. On the other hand, fn:data() aggregates all values in the subtree of the <book> element. This aggregation makes the computation of fn:data() more expensive than text() even for elements with simple content.

Text Aggregation in Untyped XML

According to XQuery semantics, a query such as the following on untyped XML:

SELECT xCol.value ('(/book/title[.="Writing Secure Code"])[1]',
        'nvarchar(64)')
FROM    docs

SELECT xCol.value ('(/book/title
[fn:string()="Writing Secure Code"])[1]'), 'nvarchar(64)')
FROM    docs

requires all text nodes under the <title> element to be aggregated to evaluate the predicate. This inhibits XML index lookup for the search string.

If the <title> element has only one text node, a more efficient way of writing the query is to evaluate the predicate on the text node, as shown in the following:

SELECT xCol.value ('(/book/title/text())[1]
[. = "Writing Secure Code"]', 'nvarchar(64)')
FROM    docs

XML index lookup for the value "Writing Secure Code" can occur in this case.

Parameterize Your XQuery and XML DML Expression

XQuery and XML DML expressions are not auto-parameterized. Thus, it is preferable to use sql:column() or sql:variable() to supply parameter values to your XQuery or XML DML expressions, if two XQuery expressions differ only in the values of parameters, instead of using dynamic SQL statements. Using these functions auto-parameterizes the query.

The following example shows a stored procedure execution. The technique can be applied to parameterization of any query, function or method call, or data modification statement.

For example, the following stored procedure finds books with a lower price than the input argument:

CREATE PROC sp_myProc
   @Price decimal
AS
   SELECT * 
   FROM   docs 
   WHERE  1 = xCol.exist('(/book/price)[. < sql:variable("@Price")]')

In ADO.NET and OLEDB, bind the input value of @Price to a parameter. This avoids query recompilation when the parameter is bound to a different value. Using sql:column() yields similar benefits.

The following Microsoft Visual Basic .NET code shows parameter binding in the stored procedure invocation:

'myConn is the connection string
SqlCommand cmd = New SqlCommand("sp_myProc", myConn)
cmd.CommandType = CommandType.StoredProcedure

'Parameter binding
Dim myParm As SqlParameter = cmd.Parameters.Add("@Price", _ 
SqlDbType.Decimal)
myParm.Direction = ParameterDirection.Input
myParm.value = 2

'Invoke the stored procedure
SqlDataReader myReader = cmd.ExecuteReader()

'Invoke the stored procedure a second time
myParm.value = 49.99
SqlDataReader myReader = cmd.ExecuteReader()

For more information, see the Microsoft Visual Studio .NET documentation.

Example: Using sql:variable() in Data Modification

Suppose the <price> of a <book> whose ISBN is "0-2016-3361-2" is discounted by 10 percent. Both the discount and the ISBN can be passed into the XML data modification statement as parameters, and the statement remains the same for a different book or a different discount.

DECLARE @discountFactor float, @sqlisbn nvarchar(32)
SET     @discountFactor = 0.9
SET     @sqlisbn = N'0-7356-1588-2'

UPDATE  docs
SET     xCol.modify('replace value of (/book/price/text())[1] with 
      sql:variable("@discountFactor")*(/book/price/text())[1]')
WHERE   xCol.exist('/book[@ISBN = sql:variable("@sqlisbn")]') = 1

Example: Using sql:variable() in Element Construction

The modify() method shown in the following illustrates the use of sql:variable() for supplying a value within a constructed element:

DECLARE @name nvarchar(64)
SET   @name = 'Microsoft Press'
UPDATE docs
SET    xCol.modify ('
 insert <publisher Name = "{sql:variable("@name")}"></publisher>
 into (/book/title)[1]')

Optimizations for Predicates and Ordinals

Full paths (absolute location paths from the root node to selected nodes containing only child and self axes) without node tests or branching (without predicates or ordinals on intermediate nodes in the path) can be evaluated more efficiently than path expressions with branching. In the indexed case, full paths can be used in index seeks. For XML blobs, parsing is faster for such paths than for paths with branching or wildcards (*).

Node tests and predicates at the end of a full path are used as filters on the selected nodes. Indexes are used. For XML blobs, parsing is efficient. Following is an example.

Example: Full-path Evaluation

Consider the path expression that selects books written by authors whose first name is Davis:

SELECT   xCol.query ('/book[author/first-name = "Davis"]')
FROM      docs

Although the predicate is not directly on the <book> element, the <first-name> nodes located using the collapsed path /book/author/first-name are filtered by the value "Davis." The returned <book> elements are those satisfying the given predicate.

Path-based lookup is efficient even for partially specified paths without predicates or ordinals, such as /book//first-name. The query compiler uses the LIKE operator to match such paths in the XML indexes. Thus, specifying as much of the path as possible contributes to more efficient processing.

Branching (node tests and predicates in the middle of a path expression) as in /book[@ISBN = "1-8610-0157-6"]/author/first-name evaluates the path expressions /book[@ISBN = "1-8610-0157-6"] and /book/author/first-name, and takes the intersection between the two sets of <book> elements. Consequently, execution is slower than path expressions without branching. The use of node tests and predicates in the middle of path expressions should be avoided as much as practical. This is sometimes possible with careful data modeling, as discussed in the example, Generic versus Specific Markups.

Moving Ordinals to the End of Paths

Ordinals used in path expressions for static type correctness are good candidates for placement at the end of path expressions. The path expression /book[1]/title[1] is equivalent to (/book/title)[1] if every <book> element has <title> children. The latter can be evaluated faster for both the XML indexed case and the XML blob case by determining the first <title> element under a <book> element in document order. Similarly, the path expression (/book/@ISBN)[1] yields faster execution than /book[1]/@ISBN.

Evaluating Predicates by Using the Context Node

In addition to moving predicates, ordinals, and node tests to the end of path expressions, evaluating these conditions by using the context node yields still better performance. Following is an example of this rewrite.

Example: Predicate Evaluation Using the Context Node

The following query finds books on the subject of "security." It requires the evaluation of two path expressions, namely /book and /book/@subject, and a check for the value security for the latter path.

SELECT *
FROM   docs
WHERE  xCol.exist ('/book[@subject = "security"]') =  1

The following rewrite evaluates a single path /book/@subject and checks whether this path has the value security. This yields a simpler query plan than the previous one and is much faster.

SELECT *
FROM   docs
WHERE  xCol.exist ('/book/@subject[. = "security"]') =  1

Range Conditions

Range conditions benefit from the use of typed XML. The data stored in XML columns and XML indexes are typed according to the type definitions specified in XML schemas. Value comparisons avoid run-time conversion of data and permit range scans on the VALUE secondary XML index. This also requires specifying the context node (.) in the range condition for efficient access, as the following example illustrates.

Example: Context Node in Range Conditions

Consider the query to find books in the typed XML column xBook of table TypedBooks whose price is in the range of $9.99 to $49.99:

SELECT xBook
FROM    TypedBooks
WHERE    xBook.exist ('
     declare default element namespace "http://www.microsoft.com/book";
   /book[price > 9.99 and price < 49.99]') =  1

The path expressions /book/price > 9.99 and /book/price < 49.99 are evaluated separately. The query optimizer does not know that the <price> elements are the same, because multiple <price> elements can exist under the <book> element. This inhibits range scan on the VALUE secondary XML index. The following rewrite ensures that the same context node for <price> is used, and range scan of the VALUE secondary XML index occurs for values between 9.99 and 49.99. This yields better performance:

SELECT xBook
FROM    TypedBooks
WHERE    xBook.exist ('
     declare default element namespace "http://www.microsoft.com/book";
   /book/price[. > 9.99 and . < 49.99]') =  1

Parent Axis

Use of the parent axis in path expressions blocks certain optimizations. The XQuery compiler concatenates segments of path expressions without branching into a longer path, which can be evaluated more efficiently than the segments separately. This optimization is referred to as path collapsing.

This technique, however, does not work for the parent axis. Furthermore, XML indexes cannot be used to evaluate parent axes. Instead, using paths for forward traversal only yields better performance. For example, instead of writing /book/title[../@ISBN = "0-7356-1588-2"], it is better to write the path expression as /book[@ISBN = "0-7356-1588-2"]/title.

Navigating to the parent of a node in typed XML loses type information and returns an element of the most general type xs:anyType. Further operations on the node may need explicit cast and slows down query processing. Also, the cast may prevent XML index use. It is advantageous to navigate down from the parent node instead of climbing up to the parent.

Dynamic Querying

XQuery expressions are specified as literals within XML data type methods. Their evaluation uses XML indexes as available and as chosen by the query optimizer.

Application development is convenient when XQuery expressions can be dynamically specified instead of literals. This is possible in the following ways:

Query Construction
Create the query as a string and use sp_executesql for its execution. Unlike EXEC, this caches the compiled query plan and the optimizer may reuse the compiled plan. The query can be parameterized, because it is formed as a string and may contain embedded parameters. Adequate care should be taken to avoid SQL injection attacks.
Use XPath Functions
Replace each location step in an XPath expression with the name() function, or local-name() and namespace-URI() functions. This yields a query to which you can pass in node names and search values. You can parameterize further as described in the Parameterize Your XQuery and XML DML Expression example. Such parameterized queries are convenient for developing applications. However, the query plan generated for it ignores XML indexes, because specific paths are not known at compilation time.

Although the query construction approach performs better than parameterizing the path expressions, it includes the cost of run-time query compilation, which makes it slower than specifying the full query as a literal. The actual query passed in by the user must be validated to avoid SQL injection attacks. Otherwise, this approach should be avoided in favor of parameterization of the query. For more information, see the Parameterize Your XQuery and XML DML Expression example. The following example illustrates this approach.

The second approach specifying node tests using node names avoids the SQL injection problem. However, the query plan becomes complex and performs much worse than the original query. This is shown in the second example that follows.

Example: Query Using sp_executesql

Suppose you want to create the following query dynamically and pass in the search value of @subject using a parameter:

SELECT  *
FROM     docs 
WHERE   xCol.exist('/book/@subject [. = "security"]') = 1

The dynamic query can be created and executed as shown in the following. The query string is created in the variable @SQLString and contains an embedded variable @bksubj used in the exist() method. The variable @subj supplies the run-time value of the parameter. The dynamic query passed in using @SQLString should be validated (not shown) to avoid SQL injection attacks.

DECLARE @SQLString NVARCHAR(500)
DECLARE @subj NVARCHAR(64)
DECLARE @ParmDefinition NVARCHAR(500)
--- Build the SQL string once 
SET @SQLString =
    N'SELECT * 
      FROM   docs 
   WHERE  xCol.exist(''/book/@subject[. =sql:variable("@bksubj")]'')=1'
SET @ParmDefinition = N'@bksubj NVARCHAR(64)'
--- Execute the string with the first parameter value
SET @subj = 'security'
EXECUTE sp_executesql @SQLString, @ParmDefinition,
                      @bksubj = @subj

Example: Query Using local-name()

The previous query can be rewritten to use tag names as literals as follows:

DECLARE @elemName nvarchar(4000), @attrName nvarchar(4000)
DECLARE @subjValue nvarchar(4000)
SET   @elemName = N'book'
SET   @attrName = N'subject'
SET   @subjValue = N'security'
SELECT  *
FROM     docs 
WHERE   xCol.exist('/*[local-name() = sql:variable("@elemName") and 
@*[local-name() = sql:variable("@attrName") and 
. = sql:variable("@subjValue")]]') = 1

The rewritten query contains wildcards (*) and node tests using node names and is hard to optimize well. Consequently, it performs much worse than the original query and the query construction approach.

Rowset Generation from XML Data

Some applications need to generate a rowset from XML data by promoting one or more properties into columns of the rowset. For example, an application may query for the authors of books and display the result as a table containing two columns for the first and last names. Such rowset generation can be done both at the server and at the client with different performance characteristics:

At the server, use one of the following mechanisms:
- Combination of nodes() and value() methods on XML data type
- OpenXML
- Streaming table-valued function in common language runtime (CLR)
Alternatively, the XML result is returned to the client, which uses client-side programming (DataSet) to convert the data to a rowset.

Client-side rowset generation offloads the server and is useful when almost the entire data sent from the server to the client is mapped into the rowset. Otherwise, the cost of shipping the data may outweigh the benefits of client-side processing.

Server-side rowset generation is useful for rowset generation from incoming XML data at the server. It is generally preferable when a small fraction of XML data stored at the server is promoted into columns of the rowset. More discussions regarding the relative merits and demerits among the server-side approaches can be found in XML Best Practices for Microsoft SQL Server 2005 [ http://msdn2.microsoft.com/en-us/library/ms345115.aspx ] on MSDN.

Conclusion

This paper discusses several ideas for optimizing the performance of applications using XML data type. These ideas range from data management aspects and XML schema design to the style for writing queries on XML data type. Employing these techniques can yield significant performance boost. It is helpful to study the Showplan output to see how the queries benefit from the XML indexes and to experiment with query rewrites to see how the query plans change.

For More Information:

http://msdn.microsoft.com/sql/ [ http://msdn.microsoft.com/sql/ ]