Bob Beauchemin
SQLskills.com
August 2005
Summary:
Use the relational query engine in SQL Server 2005 to make a single
query plan for the SQL and XQuery parts of your queries, and make the
implementation of XML queries fast and easy to predict and tune. (12
printed pages)
Contents
Introduction
XQuery and the XML Data Type
Types of XML Indexes
How the Indexes Help
XQuery and Schema-Validated Columns
Index and Workload Analysis
Other Tips to Speed Up Your XML Queries
Wrap-up
Introduction
Although
not everyone would agree, one of the main reasons for the success of
the relational database has been the inclusion of the SQL language. SQL
is a set-based declarative language. As opposed to COBOL (or most
.NET-based languages for that matter), when you use SQL, you tell the
database what data you are looking for, rather than how to obtain that
data. The SQL query processor determines the best plan to get the data
you want and then retrieves the data for you. As query-processing
engines mature, your SQL code will run faster and better (less I/O and
CPU) without the developer making changes to the code. What development
manager wouldn't be pleased to hear that programs will run faster and
more efficiently with no changes to source code because the query
engine gets better over time?
One of the ways to allow the
query processor the choice of optimized access is to create indexes
over the data. Creating the correct index can dramatically change how
the query engine evaluates the query. You decide which indexes to
create by analyzing which queries you actually perform and figuring up
how the engine could optimize those queries. A tool to analyze query
workloads and suggest indexes comes with SQL Server. In SQL Server
2005, this tool is Database Tuning Advisor.
In the early days
of XML, imperative programming (navigation through the XML DOM) was all
the rage. The XQuery language in general and XQuery inside the database
in particular make it possible for the query engine writers to approach
the task of optimizing queries against XML. The chances of success are
good because these folks have 20 years or so of practical experience
optimizing SQL queries against the relational data model. The SQL
Server 2005 implementation of XQuery over the built-in XML data type
holds the same promise of a declarative language, with optimization
through a query engine. And the query engine that SQL Server 2005
XQuery uses is the one built-in to SQL Server. SQL Server 2005 XQuery
uses the relational engine, with XQuery-specific enhancements. As an
example, XQuery mandates that the results be returned in document
order, even if you don't use "order by" in the query.
XQuery and the XML Data Type
You
use XQuery in SQL Server 2005 through a series of built-in SQL methods
on the XML data type. The XML data type is a new native type that can
be used as a column in a table, procedure parameters, or as T-SQL
variables. The built-in methods may be used with any instance of the
XML data type. Table 1 contains a list of the five methods, their
signatures, and what they do.
Table 1. XML Data Type Functions
Name | Signature | Usage |
exist | bit = X.exist(string xquery) | Checks for existence of nodes, returns 1 if any output returned from query, otherwise 0 |
value | scalar = X.value(
string xquery, string SQL type) | Returns a SQL scalar value from a query cast to specified SQL data type |
query | XML = X.query(string xquery) | Returns an XML data type instance from query |
nodes | X.nodes(string xquery) | Table-value function used for XML to relational decomposition. Returns one row for each node that matches the query. |
modify | X.modify(string xml-dml) | A mutator method that changes the XML value in place |
Note
that each of these methods require XQuery (or XML DML in the case of
modify) as a (n)varchar SQL input parameter. Each of these methods is
used as a part of an "ordinary" SQL statement as in the following
examples.
-- SQL query: return a one-column rowset containing an XML data type
SELECT invoice.query('
(: XQuery program :)
declare namespace inv="urn:www-develop-com:invoices";
declare namespace pmt="urn:www-develop-com:payments";
for $invitem in //inv:Invoice
return
<pmt:Payment>
<pmt:InvoiceID> {data($invitem/inv:InvoiceID)} </pmt:InvoiceID>
<pmt:CustomerName>
{data($invitem/inv:CustomerName)}
</pmt:CustomerName>
<pmt:PayAmt>
{data(sum($invitem/inv:LineItems/inv:LineItem/inv:Price))}
</pmt:PayAmt>
</pmt:Payment>
') AS xmldoc
FROM xmlinvoice
-- Extract a value from XML data type and use in a SQL predicate
SELECT id
FROM xmlinvoice
-- XML.value must return a scalar value (XML singleton or empty sequence)
WHERE invoice.value('
(: XQuery program :)
declare namespace inv="urn:www-develop-com:invoices";
sum(//inv:Invoice/inv:LineItems/inv:LineItem/inv:Price)
',
-- SQL data type
'money') > 100
When the query processor evaluates the SQL query, it uses the
SQL (relational) query engine. This applies to the XML portion of the
query as well as the SQL portion. Because the same query processor is
used for the entire query, the query produces a single query plan, as
SQL queries always do. And that's where indexes come in. When the XML
instance occurs as a column, that column can be indexed. The query
processor can use XML indexes to optimize XQuery, just as SQL indexes
can be used to optimize SQL queries.
Types of XML Indexes
SQL
Server 2005 supports four different types of XML indexes. Since an XML
index is somewhat different than a relational index, it is necessary to
know their implementation before we approach how to use them for
maximum effectiveness. There is a single "primary XML index" and three
different flavors of "secondary XML index". And it turns out that the
primary XML index isn't strictly an index on the original form of the
XML.
The primary XML index on an XML column is a clustered index
on an internal table known as the node table that users cannot use
directly from their T-SQL statements. The primary XML index is a B+tree
and its usefulness is due to the way that the optimizer creates a plan
for the entire query. Although the optimizer can operate on the entire
XML column as though it is a blob, when you need to execute XML
queries, it is more often useful to decompose the XML into relational
columns and rows. The primary XML index essentially contains one row
for each node in the XML instance. By creating an example primary XML
index by executing the following DDL, you can see the columns that the
primary XML index contains.
-- create the table
-- the clustering key must be the primary key of the table
-- to enable XML index creation
CREATE TABLE xmlinvoice (
invoiceid INT IDENTITY PRIMARY KEY,
invoice XML
)
GO
-- create the primary XML index
CREATE PRIMARY XML INDEX invoiceidx ON xmlinvoice(invoice)
GO
-- display the columns in the primary XML index (node table)
SELECT * FROM sys.columns c
JOIN sys.indexes i ON i.object_id = c.object_id
WHERE i.name = 'invoiceidx'
AND i.type = 1
Here are the columns that this statement produces. Some terms that I'm using require further explanation later in this article.
Table 2. Columns in the node table
Column Name | Column Description | Data Type |
id | node identifier in ordpath format | varbinary(900) |
nid | node name (tokenized) | int |
tagname | tag name | nvarchar(4000) |
taguri | tag uri | nvarchar(4000) |
tid | node data type (tokenized) | int |
value | first portion of the node value | sql_variant |
lvalue | long node value (pointer) | nvarchar(max) |
lvaluebin | long node value in binary (pointer) | varbinary(max) |
hid | path (tokenized) | varchar(900) |
xsinil | is it NULL (xsi:nil) | bit |
xsitype | does it use xsi:type | bit |
pk1 | primary key of the base table | int |
There
are 11 columns in the primary XML index besides the base table’s
primary key, which can be a multi-column key; it contains enough data
to execute any XQuery. The query processor uses the primary XML index
to execute every query except for the case where the entire document
will have to be output. In that case it's quicker to retrieve the XML
blob itself. Although having the primary XML index is a vast
improvement over creating it afresh during each query, the size of the
node table is usually around three times that of the XML data type in
the base table. The actual size depends upon the XML instances in the
XML column—if they contain many tags and small values, more rows are
created in the primary XML index and the index size is relatively
larger; if there are few tags and large values, then few rows are
created in the primary XML index and the index size is closer to the
data size. Take this into consideration when planning disk space. This
is because the node table contains explicit representations of
information (such as the path and node number) that is a different
representation of information inherent in the structure of the XML
document itself.
The primary XML index is clustered on the primary key of the base table (the pk1 column in the example above) and a node identifier (id). However, it is not a clustered index on the base table xmlinvoice.
It is necessary to have a primary key on the base table to create the
primary XML index. That primary key is used in a join of the XQuery
results with the base table. The XML data type itself cannot be used as
a primary key of the base table and so the invoiceid column was included in the base table definition to satisfy the requirement.
The
node identifier is represented by a node numbering system that is
optimized for operations on the document structure (such as
parent-child relationship and the relative order of nodes in the
document) and insertion of new nodes. This node numbering system is
known as ordpath. Some of the reasons for numbering all the
nodes are to maintain document order and structural integrity in the
query result. These are not requirements of relational systems, but are
requirements in XML. Using the ordpath numbering system makes
satisfying these requirements easier for the query engine; the ordpath format contains document order and structure information. See the paper
ORDPATHs: Insert-Friendly XML Node Labels [ http://www.cs.umb.edu/~poneil/ordpath.pdf ] by Patrick and Elizabeth O'Neil et al., for all of the details on ordpath.
Once
the primary XML index has been created, an additional three kinds of
secondary XML index can be created. The secondary XML indexes assist in
certain types of XQuery processing. These are called the PATH,
PROPERTY, and VALUE indexes. For example, you can create a PATH
secondary index using the primary XML index created above like this:
CREATE XML INDEX invpathidx ON xmlinvoices(invoice)
USING XML INDEX invoiceidx FOR PATH
Secondary
XML indexes are actually indexes on the node table. The PATH index, for
example in a normal non-clustered index on the (HID, VALUE) columns of
the node table. To see the key columns for all the indexes on the node
table in index order, you can execute this query:
select i.name as indexname, c.name as colname, ic.*
from sys.index_columns ic
join sys.columns c on ic.column_id = c.column_id
and ic.object_id = c.object_id
join sys.indexes i on ic.object_id = i.object_id
and ic.index_id = i.index_id
where ic.object_id =
(select object_id from sys.indexes
where name = 'invoiceidx' and type = 1)
order by index_id, key_ordinal
How the Indexes Help
Now
that we've seen what the XML indexes consist of in terms of columns and
rows, let's see how they are useful with particular kinds of XML
queries. We'll begin with a discussion of how SQL Server 2005 actually
executes a SQL query that contains an XML data type method.
When
an SQL-with-XML query is executed against a table containing an XML
data type column, the query must process every XML instance in every
row. At the top level, there are two ways that such a query can be
executed:
- Select the rows in the base table (that is, the
relational table that contains the XML data type column) that qualify
first, and then process each XML instance using XQuery. This is known
as top-down query processing.
- Process all XML instances using
the XQuery first, and then join the rows that qualify to the base
table. This is known as bottom-up query processing.
The
SQL Server query optimizer analyzes both the XQuery pieces and
relational pieces of the query as a single entity, and creates a single
query plan that encompasses and best optimizes the entire SQL statement.
If
you've created only the primary XML index it is almost always used in
each step of the XQuery portion of the query plan. It is better to use
the primary XML index to process the query in almost every case.
Without a primary XML index, a table-valued function is used to
evaluate the query, as can be seen in this query plan fragment:

Once
the primary XML index is in place, the optimizer chooses which indexes
to use. If you have all three secondary indexes, there are actually
four choices:
- Index scan or seek on the primary XML index
- Index scan or seek on node table's PATH index
- Index scan or seek on node table's PROPERTY index
- Index scan or seek on node table's VALUE index
The
primary XML index is clustered (data is stored) in XML document order;
this makes it ideal for processing subtrees. Much of the work in XML
queries consists of processing subtrees or assembling an answer by
using subtrees, so the clustered index on the node table will be
frequently used. Here's an example of how the same query plan looks
after only the primary XML index is created.

The
PATH, PROPERTY, and VALUE index are more special purpose and are meant
to help specific queries. We'll continue with examples that use the
exist() and query() methods on the XML data type.
The PATH XML
index is built on the Path ID (HID) and Value columns of the primary
XML index. Because it contains both paths and values, if you need the
value (for comparison) by using the path, it is a good "covering"
index, as shown in this query:
-- this only uses path index if value index is not available
select * from xmlinvoice
where invoice.exist('/Invoice/@InvoiceID[. = "1003"]') = 1
You
need to have two conditions for the PATH index to be useful. You'll
need the path to the node you're using and the path should not contain
predicates or wildcards. Knowing both the path and value enables index
seeks into the PATH index. The following example uses the PATH to
determine which rows contain InvoiceID 1003 and the primary XML index
to find the Invoice node serialize its value as output:
select invoice.query('/Invoice[@InvoiceID = "1003"]')
from xmlinvoice
Changing the query to contain both a predicate and wildcard in the path does not use the PATH index:
select invoice.query('/Invoice[@InvoiceID = "1003"]//LineItem')
from xmlinvoice
The
PROPERTY index contains the primary key of the base table (pk1), Path
ID (HID), and Value, in that order. Because it also contains the
primary key of the base table, it helps for searching multi-valued
properties in the same XML instance. Even though all the Invoice
documents have the same specific structure, this is not known to the
XQuery processor and therefore every attribute and subelement is
considered part of a property bag. We'll see later that typing the XML
by using an XML schema lessens the number of unknown property bags the
processor has; the structure is known through the schema. In the
preceding example, PROPERTY index is used to scan for CustomerName
elements under Invoice; CustomerName is considered part of a property
bag of subelements. Even when attributes are used in predicates,
property CustomerName is useful. In the example below, the PROPERTY
index is used to search by Invoice elements anywhere in the document
they occur.
select * from xmlinvoice
where invoice.exist('//Invoice/@InvoiceID[. = "1003"]') = 1
In
queries like this, the PROPERTY index is preferred over the path index
if both are available because the PATH is not very selective. If you
change the selectivity of the comparison predicate:
select * from xmlinvoice
where invoice.exist('//Invoice/@InvoiceID[. > "1003"]') = 1
then the PROPERTY index will be used.
The
VALUE index contains the same index columns as the PATH index, Value
and Path ID (HID), but in the reverse order. Because it contains the
value before the path, it’s useful for expressions that contain both
path wildcards and values, such as:
-- uses value index if the search value "Mary Weaver" is more selective than the path
select * from xmlinvoice
where invoice.exist('/Invoice/CustomerName/text()[. = "Mary Weaver"]') = 1
-- uses value index due to path wildcard and attribute wildcard
//Invoice/LineItems/LineItem/@*[. = "special"]
Note
that, if the preferred type of secondary XML index is not available, an
alternate secondary index or the primary XML index may be used. In the
example above, if the VALUE secondary index is not available the query
processor might decide to use the primary XML index. If the PROPERTY
secondary index is not available the processor often uses a two-step
process combining PATH and the primary XML index; sometimes a two-step
process is used even with the PROPERTY index. Adding another step
(i.e., JOIN) to the query plan almost always results in a slower query.
So
far, we've only been using the exist() method on the XML data type
using a single path and predicate. Things work approximately the same
way with the other XML methods. The query method may use node
construction in addition to selection. Construction is optimized by
using a special tag "Desc" that can be seen in the query plan. Any part
of the XQuery that requires selection, however, will use the same (sub)
plan as we've been seeing. Bear in mind that any index observations are
made with specific sets of data; your results may vary.
XQuery and Schema-Validated Columns
When
an XML Schema Collection validates the XML data type column, the order
and structure of the documents and the cardinality of each subelement
may be known at query compilation time. This allows the query optimizer
more chances to optimize the query. We can specify an XML schema for
Invoices in a schema collection named invoice_xsd and restrict the XML
column to contain only documents (the XML data type can ordinarily
contain documents or fragments), and it would look like this:
-- create the table, must have primary key
CREATE TABLE xmlinvoice2(
invoiceid INTEGER IDENTITY PRIMARY KEY,
invoice XML(DOCUMENT invoice_xsd)
)
GO
When
we issue the same queries against a schema-valid column, there seem to
be three major changes in query plan and index usage.
- More bottom-up type queries. Because of the XML
schema, the number of nodes that need to be searched for a specific
document is known, and sometimes fewer than the number of documents
(rows) in the table. When this occurs, a bottom-up query will filter
away more of the data.
- Greater use of the VALUE secondary
index, as opposed to PROPERTY and PATH. Because of the schema, the
processor knows that a specific element occurs in only one place in the
document, and also that the type of values that the VALUE index is more
important and useful and filtering can be done in one step instead of
two.
- If an element is defined as a numeric or integral data
type, scans for a numeric range (e.g., LineItems priced between $20 and
$30) can be done more efficiently. No separate step consisting of data
type conversion is required.
As an example of the greater
usage of VALUE index, the following query changes from a top-down query
with a two-step (sub)plan using PROPERTY index and clustered node table
index to a bottom-up query with a one-step (sub)plan using the VALUE
index.
select *
from xmlinvoice
where invoice.exist('/Invoice/CustomerName[. = "Mary Weaver"]') = 1
The
DOCUMENT qualifier is used to infer the cardinality of 1 for the
top-level element. DOCUMENT means that the column must contain a
document with a single XML root element (no fragments); this is used
for data validation and static type inference. However, a predicate
expression that starts with //Invoice is optimized differently (uses
VALUE index) than one that starts with /Invoice (uses PATH index). The
performance of the two will likely be close.
Index and Workload Analysis
Given
the fact that the primary XML index is taking up three times the space
of the XML content in the data type, if you could choose only one
secondary XML index, which one would you choose? It really depends on
your workload. The good news is that, because SQL and XQuery are
combined to yield a single query plan, ordinary plan analysis, via any
of the showplan methods including graphic showplan in SQL Server
Enterprise Manager, will work just as well for XML indexes as with
relational indexes. You create the index and observe the effect on the
query plan. There are a few caveats, however. First, you cannot force
index query hints on XML indexes for the purpose of comparing different
index strategies for performance. Also, although all four XML indexes
on an XML column are used for query optimization and are "ordinary"
relational indexes, Database Tuning Advisor does not suggest XML
indexes.
When reading a showplan for a SQL/XQuery query, there are a couple of new XQuery specific items to recognize:
- Table-Valued Function XML Reader UDF with XPath
Filter—this item refers to the on-the-fly creation of a rowset having
the node table format (the node table is not actually created) for the
XQuery portion of the query. You'll only see this when doing queries on
an XML column when no XML indexes exist.
- UDX—this item refers
to internal operators for XQuery processing. There are five such
operators; the name of the operator can be found in the "Name" property
if you bring up the Properties window (note: this does not show up in
the "hover-over" query step information). The operators are:
- Serializer UDX—serializes the query result as XML
- TextAdd UDX—evaluates the XQuery string() function
- Contains UDX—evaluates the XQuery contains() function
- Data UDX—evaluates the XQuery data() function
- Check UDX—validates XML being inserted
Other Tips to Speed Up Your XML Queries
Use specific XQuery query styles:
You might notice that using the dot (.) in a predicate produced a
different (and simpler and faster) query plan than using the attribute
name in the predicate. In the examples above, compare the two queries:
select * from xmlinvoice
where invoice.exist('/Invoice/@InvoiceID[. = "1003"]') = 1
and
select * from xmlinvoice
where invoice.exist('/Invoice[@InvoiceID = "1003"]') = 1
Although
the result is the same the latter form usually requires one more
evaluation step. This is because the query processor is evaluating only
one node in the first form (using the PATH index if its present) and is
using two evaluation steps (one for /Invoice, one for
/Invoice/InvoiceID) in the second form. Although looking at the plan
for two "equivalent" queries might seem strange for XML aficionados,
SQL query tuners have been doing this for years. Note that the two
queries above only produce the same results when using the XML data
type exist method, they produce different results when used with the
query method.
Avoid wildcards in your queries if possible:
Wildcards in a path expression containing elements (e.g.,
/Invoice//Sku/*) are only useful if you don't know the exact structure
of the document, or if the Sku element can occur at different levels of
hierarchy. In general, you should structure your document to avoid
this, although this is not possible when your data structure uses
recursion.
Hoist often searched XML values to relational values:
If a given attribute is used frequently in predicates, you can save
query-processing time by making this a computed column or redundant
column in your relational table. If you always find yourself search on
InvoiceID, making it a column allows top-down queries to work more
effectively. You might not even have to use the XML instance in the
query, if you want the entire document. Refer to
Performance Optimizations for the XML Data Type
[ http://msdn2.microsoft.com/en-us/library/ms345118.aspx ] by Shankar
Pal et al., for examples of how to do this with both single and
multi-valued attributes.
Use full-text search in conjunction with XQuery:
To search XML documents or do content-sensitive queries on text, you
can combine SQL Server full-text search with XQuery. Full-text search
will index the text nodes of XML documents, though not the elements or
attributes. You can use the FULLTEXT contains verb to do
stem-based or context-based queries over an entire collection of
documents (this is most often the top-down part of the query) to select
individual documents to operate on, then use XQuery to do structural
element and attribute-sensitive structures. Remember that the XQuery contains verb is not at all the same as the FULLTEXT contains verb. XQuery contains is a substring-based function and the SQL Server 2005 implementation uses a binary collation. See
XML Support in Microsoft SQL Server 2005
[ http://msdn2.microsoft.com/en-us/library/ms345117.aspx ] by Shankar
Pal et al., for an example of combining fulltext and XQuery.
Wrap-up
I
hope you've enjoyed the tour through the XML indexes and other hints to
make your XML queries run faster. Remember that, as with any index,
excessive use of XML indexes can make insert and modification methods
run slower, because the index is maintained along with the raw data.
This is especially true of the node table, because the entire document
must be shredded during each insert, although modification does not
require replacing the entire document. XML indexes should be managed
like other indexes with respect to dropping and recreating the indexes
in conjunction with bulk loading, index defragmenting, and other
database administration techniques.
Using the mature relational
query engine to produce a single query plan for both the SQL and XQuery
parts of the query should make the SQL Server 2005 implementation of
XML queries one of the fastest and easiest to predict and tune. Use
this power to your advantage.
About the author Bob Beauchemin
is a database-centric application practitioner and architect,
instructor, course author, writer, and Director of Developer Skills for
SQLskills. Over the past two years he's been teaching his SQL Server
2005 course to premier customers worldwide through the Ascend program.
He is lead author of the book "A First Look at SQL Server 2005 For
Developers", author of "Essential ADO.NET" and written articles on SQL
Server and other databases, ADO.NET, and OLE DB for MSDN, SQL Server
Magazine, and others. Bob can be reached at bobb@sqlskills.com.