Author: Shankar Pal, Vishesh Parikh, Vasili Zolotov, Leo Giakoumakis, Michael Rys
Microsoft Corporation
April 2004
Applies to:
Microsoft® SQL Server™
Summary:
Learn about the guidelines for XML data modeling and usage in SQL
Server 2005, and see illustrative examples. To get the most from this
article, you should have a basic understanding of XML features in SQL
Server; for background material, see XML Support in Microsoft SQL Server 2005 [ http://msdn2.microsoft.com/en-us/library/ms345117.aspx ] on the Microsoft Development Network.
Contents
Introduction
Data modeling
Data modeling using XML data type
Usage
Data binding in queries
Catalog views for native XML support
Introduction
SQL
Server 2000 and SQLXML Web Releases provide powerful XML data
management capabilities. These features focus on the mapping between
relational and XML data. XML views of relational data can be defined
using annotated XSD (AXSD) to provide an XML-centric approach that
supports bulkload of XML data, and query and update capabilities on XML
data. Transact-SQL extensions provide SQL-centric approach for mapping
relational query results to XML (using FOR XML) and generating
relational views from XML (using OpenXML).
Microsoft® SQL
Server™ 2005 provides extensive support for XML data processing. XML
values can be stored natively in an XML data type column, which can be
typed according to a collection of XML schemas, or left untyped. You
can index the XML column. Furthermore, fine-grained data manipulation
is supported using XQuery and XML DML, the latter being an extension
for data modification.
In addition, the SQLXML, FOR XML and
OpenXML features have been extended in SQL Server 2005. Together with
the newly added native XML support, SQL Server 2005 provides a powerful
platform for developing rich applications for semi-structured and
unstructured data management.
With all the added
functionality, the users have more design choices for their data
storage and application development. This article provides guidelines
for XML data modeling and usage in SQL Server 2005. It is divided into
the following two topics:
- Data modeling
XML data can be
stored in multiple ways in SQL Server 2005— for example, using native
XML data type and XML shredded into tables. This topic provides
guidelines for making the appropriate choices for modeling your XML
data. It also covers indexing of XML data, property promotion, and
typing of XML instances.
- Usage
This topic discusses usage-related
topics, such as loading XML data into the server and type inference
during query compilation, explains and differentiates closely-related
features, and suggests appropriate use of these features. The ideas are
illustrated with examples.
Data modeling
This
section outlines reasons for using XML in SQL Server 2005, provides
guidelines for choosing between native XML storage and XML view
technology, and gives data modeling suggestions.
Relational or XML data model
If
your data is highly structured with known schema, relational model is
likely to work the best for data storage. Microsoft SQL Server provides
the necessary functionality and the tools you may need. On the other
hand, if the structure is flexible (semi-structured or unstructured) or
unknown, you have to give due considerations to modeling such data.
XML
is a good choice if you want a platform-independent model to ensure
portability of the data using structural and semantic markup.
Furthermore, it is an appropriate option if some of the following
properties are satisfied:
- Your data is sparse, you do not know the
structure of the data, or the structure of your data may change
significantly in the future.
- Your data represents containment hierarchy (as opposed to references among entities) and may be recursive.
- Order of markups and values is inherent in your data.
- You want to query into the data or update parts of it based on its structure.
If
none of these conditions are met, you should use relational data model.
For example, if your data is in XML format but your application merely
uses the database to store and retrieve the data, an [n]varchar(max)
column is all you need.
Storing the data in an XML column
brings additional benefits—the engine checks that the data is
well-formed or valid according to a pre-specified XML schema, and
supports fine-grained query and update of the XML data.
Reasons for storing XML data in SQL Server 2005
Here are some reasons for using native XML features in SQL Server 2005 as opposed to managing your XML data in the file system:
- You want to use administrative functionality
of the database server for managing your XML data (for example, backup,
recovery and replication).
- You want to share, query, and
modify your XML data in an efficient and transacted way. Fine-grained
data access is important to your application. For example, you may want
to extract some of the sections within an XML document, or you may want
to insert a new section without replacing your whole document.
- You
have relational data and SQL applications, and you want
interoperability between relational and XML data within your
application. You need language support for query and data-modification
for cross-domain applications.
- You want the server to guarantee well-formed data, and optionally validate your data according to XML schemas.
- You want indexing of XML data for efficient query processing and good scalability, and the use a first-rate query optimizer.
- You want SOAP, ADO.NET and OLE DB accesses to the XML data.
If
none of these conditions are satisfied, you may be better off storing
your data as a non-XML, large object type, such as [n]varchar(max) or
varbinary(max).
XML storage options
The storage options for XML in SQL Server 2005 are as follows:
- Native storage as XML data type:
The data is stored in an internal representation that preserves the
XML content of the data, such as containment hierarchy, document order,
element and attribute values, and so on. Specifically, the InfoSet
content of the XML data is preserved (for more information on InfoSet,
see http://www.w3.org/TR/xml-infoset
[ http://www.w3.org/tr/xml-infoset ] ). It may not be an exact copy of
the text XML, since the following information are not retained:
insignificant white spaces, order of attributes, namespace prefixes,
and XML declaration.
For typed XML data type (that is, XML data type bound to XML
schemas), the type-relevant information of the post-schema validation
Infoset (PSVI), which adds type information to the Infoset, is encoded
in the internal representation. This improves parsing speed
significantly. (For more information, see the W3C XML Schema
specifications at http://www.w3.org/TR/xmlschema-1 [ http://www.w3.org/tr/xmlschema-1 ] and
http://www.w3.org/TR/xmlschema-2 [ http://www.w3.org/tr/xmlschema-2 ] and the XQuery 1.0 and XPath 2.0 Data Model working draft at
http://www.w3.org/TR/2005/WD-xpath-datamodel-20050211 [ http://www.w3.org/tr/2005/wd-xpath-datamodel-20050211 ] .)
- Mapping between XML and relational storage:
Using annotated schema (AXSD), the XML is decomposed into columns in
one or more tables, preserving fidelity of the data at the relational
level—hierarchical structure is preserved, while order among elements
is ignored. The schema cannot be recursive.
- Large object storage ([n]varchar(max) and varbinary(max)):
An exact copy of the data is stored. This is useful for
special-purpose applications such as legal documents. Most applications
do not require an exact copy, and are satisfied with the XML content
(Infoset fidelity).
In general, you may need to use a combination of
these approaches. For example, you may want to store your XML data in
an XML data type column and promote properties from it into relational
columns. Or, you may want to use mapping technology and store
non-recursive parts in non-XML columns and only the recursive parts in
XML data type columns.
Choice of XML technology
The choice of XML technology (native XML versus XML view) generally depends upon the following factors:
- Storage options:
Your XML data may be more suitable for large object storage (for
example, a product manual), or more amenable to storage in relational
columns (for example, line items converted to XML). Each storage option
preserves document fidelity to a different extent.
- Query capabilities:
You may find one storage option more suitable than others, based on
the nature of your queries and on the extent to which you query your
XML data. Fine-grained query of your XML data—for example, predicate
evaluation on XML nodes—is supported to varying degrees in the two
storage options.
- Indexing XML data:
You may want to index the XML data to speed up XML query
performance. Indexing options vary with the storage options; you need
to make the appropriate choices to optimize your workload.
- Data modification capabilities:
Some workloads involve fine-grained modification of XML data (for
example, adding a new section within a document), while others do not
(for example, Web content). Data modification language support may be
important for your application.
- Schema support:
Your XML data may be described by a schema, which may or may not be
an XML schema document. The support for schema-bound XML depends upon
the XML technology.
Needless to say, different choices have different performance characteristics.
Native XML storage
You can store your XML data in an XML data type column at the server. This is a suitable choice if:
- You want a straightforward way of storing your XML data at the server while preserving document order and document structure.
- You may or may not have a schema for your XML data.
- You want to query and modify your XML data.
- You want to index the XML data for faster query processing.
- Your application needs system catalog views to administer your XML data and XML schemas.
Native
XML storage is useful when you have XML documents with a wide range of
structures, or XML documents conforming to different or complex schemas
that are too hard to map to relational structures.
Example: modeling XML data using XML data type
Consider
a product manual in XML format, consisting of a separate chapter for
each topic, and multiple sections within each chapter. A section can
contain sub-sections, so that <section> is a recursive element.
Product manuals contain plenty of mixed content, diagrams and technical
material; the data is semi-structured. Users may want to perform
contextual search for topics of interest (e.g. the section on
"clustered index" within the chapter on "indexing"), and query
technical quantities.
A suitable storage model for your XML
documents is an XML data type column. This preserves the Infoset
content of your XML data. Indexing the XML column benefits query
performance.
Example: retaining exact copies of XML data
Suppose
government regulations require you to retain exact textual copies of
your XML documents (for example, signed documents, legal documents, or
stock transaction orders). You may want to store your documents in an
[n]varchar(max) column.
For querying, convert the data to XML
data type at runtime and execute XQuery on it. The runtime conversion
may be expensive especially when the document is large. If you query
often, you can redundantly store the documents in an XML data type
column and index it, while you return exact document copies from the
[n]varchar(max) column.
The XML column may be a computed column
based on the [n]varchar(max) column. However, you cannot create an XML
index on a computed, XML column, nor can an XML index be built on
[n]varchar(max) or varbinary(max) columns.
XML view technology
By
defining a mapping between your XML schemas and the tables in your
database, you create an "XML view" of your persistent data. XML bulk
load can be used to populate the underlying tables using the XML view.
You can query the XML view using XPath 1.0; the query is translated
into SQL queries on the tables. Similarly, updates are propagated to
those tables as well.
This technology is useful when:
- You want to have an XML-centric programming model using XML views over your existing relational data.
- You have a schema (XSD, XDR) for your XML data, which an external partner may have provided.
- Order is not important in your data, your queryable data is not recursive, or the maximal recursion depth is known in advance.
- You want to query and modify the data through the XML view using XPath 1.0.
- You want to bulkload XML data and decompose them into the underlying tables using the XML view.
Examples
include relational data exposed as XML for data exchange and web
services, and XML data with fixed schema. For more information, see
http://msdn.microsoft.com/SQLXML [ http://msdn.microsoft.com/sqlxml ] .
You can also publish XML from relational and XML data stored at the server using FOR XML. For more information, refer to Using FOR XML to Generate XML from Rowsets in this article.
Example: modeling data using Annotated XML Schema (AXSD)
Suppose
you have existing relational data (e.g. customers, orders and line
items) that you would like to manipulate as XML. Define an XML view
using AXSD over the relational data. The XML view allows you to bulk
load XML data into your tables, and query and update the relational
data using the XML view. This model is useful if you need to exchange
data with XML markup with other applications while your SQL
applications work uninterrupted.
Hybrid model
Quite
often, a combination of relational and XML data type columns is
appropriate for data modeling. Some of the values from your XML data
can be stored in relational columns, and the rest, or the entire XML
value, stored in an XML column. This may yield better performance (for
example, you have full control over the indexes created on the
relational columns) and locking characteristics. However, you undertake
greater responsibility for managing your data storage.
The
values to store in relational columns depend on your workload. For
example, if you retrieve entire XML values based on the path expression
/Customer/@CustId, then promoting the value of the CustId
attribute into a relational column and indexing it may yield faster
query performance. On the other hand, if your XML data is extensively
and non-redundantly decomposed into relational columns, the re-assembly
cost may be significant.
For highly structured XML data (for example, the content of a table has been converted into XML), you can map all values to relational columns, possibly using XML view technology.
Data modeling using XML data type
This
section discusses data modeling topics for native XML storage. These
include indexing XML data, property promotion, and typed XML data type.
Same or different table
An
XML data type column can be created in a table containing other
relational columns, or in a separate table with a foreign key
relationship to a main table.
Create an XML data type column in the same table when one of the following conditions is true:
- Your application performs data retrieval on the XML column and does not require an XML index on the XML column, or.
- You
want to build an XML index on the XML data type column, and the primary
key of the main table is the same as its clustering key. See the
section on Indexing an XML Data Type Column for more details.
Create the XML data type column in a separate table if the following conditions are true:
- You want to build an XML index on the XML data
type column, but the primary key of the main table is not the same as
its clustering key, or the main table does not have a primary key, or
the main table is a heap (that is, no clustering key). This may be true
if the main table already exists.
- You do not want table scans
to slow down due to the presence of the XML column in the table, which
takes up space whether it is stored in-row or out-of-row.
Granularity of XML data
The
granularity of the XML data stored in an XML column is critical for
locking and update characteristic. SQL Server employs the same locking
mechanism for both XML and non-XML data. Thus, row-level locking causes
all XML instances in the row to be locked. When the granularity is
large, locking large XML instances for updates causes throughput to
decline in a multi-user scenario. To improve concurrency in a high
update scenario, the XML data can be shredded into relational rows in
one or more tables. Severe decomposition of this kind may lose object
encapsulation and the structure of the XML data, and raises re-assembly
cost.
Updates to an XML instance are performed in-place and
incrementally, i.e. without replacing the entire XML instance in most
cases. Thus, updating the value of a single attribute is efficient and
fairly independent of the size of the XML instance.
A balance between data modeling requirements and locking characteristics is important for good design.
Untyped, typed, and constrained XML data type
The
SQL Server 2005 XML data type implements the ISO SQL-2003 standard XML
data type. As such, it can store well-formed XML 1.0 documents as well
as so-called XML content fragments with text nodes and an arbitrary
number of top-level elements in an untyped XML column. The
system checks for the well-formedness of the data, does not require the
column to be bound to XML schemas, and rejects data that is not
well-formed in the extended sense. This is true also of untyped XML
variables and parameters.
If you have XML schemas describing your XML data, you can associate the schemas with the XML column to yield typed
XML. The XML schemas are used to validate the data, perform more
precise type checks during compilation of query and data modification
statements than untyped XML, and optimize storage and query processing.
Use untyped XML data type under the following conditions:
- You do not have a schema for your XML data.
- You
have schemas but you do not want the server to validate the data. This
is sometimes the case when an application performs client-side
validation before storing the data at the server, or temporarily stores
XML data invalid according to the schema, or uses XML schema features
not supported at the server (e.g. key/keyref).
Use typed XML data type under the following conditions:
- You have schemas for your XML data and you want the server to validate your XML data according on the XML schemas.
- You want to take advantage of storage and query optimizations based on type information.
- You want to take better advantage of type information during compilation of your queries such as static type errors.
Typed
XML columns, parameters and variables can store XML documents or
content, which you have to specify as a flag (DOCUMENT or CONTENT,
respectively) at the time of declaration. Furthermore, you have to
provide one or more XML schemas. Specify DOCUMENT if each XML instance
has exactly one top-level element; otherwise, use CONTENT. The query
compiler uses DOCUMENT flag in type checks during query compilation to
infer singleton top-level elements.
In addition to typing an
XML column, you can use relational (column or row) constraints on typed
or untyped XML data type columns. Use constraints under the following
conditions:
- Your business rules cannot be expressed in
XML schemas. For example, the delivery address of a flower shop must be
within 50 miles of its business location, which can be written as a
constraint on the XML column. The constraint may involve XML data type
methods within scalar (as opposed to table-valued) user-defined
functions.
- Your constraint involves other XML or non-XML columns in the table. An example is the enforcement of the ID of a Customer (/Customer/@CustId) found in an XML instance to match the value in a relational CustomerID column.
Document Type Definition (DTD)
XML
data type columns, variables, and parameters can be typed using XML
schema but not using DTD. You can convert DTDs to XML schema documents
using third-party tools, and load the XML schemas into the database.
Inline
DTD can be used for both untyped and typed XML instances to supply
default values and to replace entity references with their expanded
form.
Internal storage of XML data
The
XML data supplied by a user is stored internally in a binary format,
which can be parsed faster than the textual representation of the XML
data. This binary format yields some compression in the general case,
and is limited to 2GB per instance. The amount of compression depends
upon the length and number of repeating tags, and the types of the
values occurring in the XML data. The following example shows how the
size of the stored XML data can be computed.
Example: computing stored XML size
We use table docs (pk INT PRIMARY KEY, xCol XML)
with an untyped XML column in most of our examples—these can be
extended to typed XML in a straightforward way (see SQL Server 2005
Books Online for information on the use of typed XML).
CREATE TABLE docs (pk INT PRIMARY KEY, xCol XML)
For ease of exposition, queries are described for XML data instances such as the following:
INSERT INTO docs VALUES (1,
'<book genre="security" publicationdate="2002" ISBN="0-7356-1588-2">
<title>Writing Secure Code</title>
<author>
<first-name>Michael</first-name>
<last-name>Howard</last-name>
</author>
<author>
<first-name>David</first-name>
<last-name>LeBlanc</last-name>
</author>
<price>39.99</price>
</book>')
The stored size in bytes of the XML instances in the XML column can be found using the DATALENGTH() function:
SELECT DATALENGTH (xCol)
FROM docs
In-row and out-of-row storage
Small XML
data type instances are stored within the rows of a table. Larger
values that cannot be accommodated within a disk page are stored out of
row with an in-row pointer of 16 bytes.
Storing XML values
in-row reduces the record density and slows down table scans over the
non-XML columns in the table. In such cases, the ''large value types
out of row'' option can be specified in the system stored procedure sp_tableoption to store all large data types off-row.
Indexing an XML data type column
XML
indexes can be created on XML data type columns. It indexes all tags,
values and paths over the XML instances in the column and benefits
query performance. Your application may benefit from an XML index under
the following conditions:
- Queries on XML columns are common in your
workload. XML index maintenance cost during data modification must be
taken into account.
- Your XML values are relatively large and
the retrieved parts are relatively small. Building the index avoids
parsing the whole data at runtime and benefits index lookups for
efficient query processing.
The first index on an XML
column is the "primary XML index". Using it, three types of secondary
XML indexes can be created on the XML column to speed up common classes
of queries, as described below.
Primary XML index
This
indexes all tags, values and paths within the XML instances in an XML
column. The base table (i.e. the table in which the XML column occurs)
must have a clustered index on the primary key of the table; the
primary key is used to correlate index rows with the rows in the base
table. Full XML instances are retrieved from the XML columns (e.g. SELECT *). Queries use the primary XML index, returning scalar values or XML subtrees using the index.
Example: creating primary XML index
The following statement creates a primary XML index called idx_xCol on the XML column xCol of the table docs:
CREATE PRIMARY XML INDEX idx_xCol on docs (xCol)
Secondary XML indexes
Once
the primary XML index has been created, you may want to create
secondary XML indexes to speed up different classes of queries within
your workload. Three types of secondary XML indexes—PATH, PROPERTY and
VALUE—benefit path-based queries, custom property management scenarios,
and value-based queries, respectively.
The PATH index builds a B+-tree
on (path, value) pair of each XML node in document order over all XML
instances in the column. The PROPERTY index creates a B+-tree
clustered on the (PK, path, value) pair within each XML instance, where
PK is the primary key of the base table. Finally, the VALUE index
creates a B+-tree on (value, path) pair of each node in document order across all XML instances in the XML column.
Here are some guidelines for creating one or more of these indexes:
- If your workload uses path expressions heavily
on XML columns, the PATH secondary XML index is likely to speed up your
workload. The most common case is the use of exist() method on XML columns in WHERE clause of Transact-SQL.
- If
your workload retrieves multiple values from individual XML instances
using path expressions, clustering paths within each XML instance in
the PROPERTY index may be helpful. This scenario typically occurs in a
property bag scenario when properties of an object are fetched and its
relational primary key value is known.
- If your workload
involves querying for values within XML instances without knowing the
element or attribute names that contain those values, you may want to
create the VALUE index. This typically occurs with descendant axes
lookups, such as //author[last-name="Howard"], where
<author> elements can occur at any level of the hierarchy and the
search value ("Howard") is more selective than the path. It also occurs
in "wildcard" queries, such as /book [@* = "novel"], where the query looks for <book> elements with some attribute having the value "novel".
Example: path-based lookup
Suppose the query below is common in your workload:
SELECT pk, xCol
FROM docs
WHERE xCol.exist ('/book[@genre = "security"]') = 1
The path expression /book/@genre and the value
"security" correspond to the key fields of the PATH index.
Consequently, secondary XML index of type PATH is helpful for this
workload:
CREATE XML INDEX idx_xCol_Path on docs (xCol)
USING XML INDEX idx_xCol FOR PATH
Example: fetching properties of an object
Consider the query below that retrieves the first names of authors of a book from each row in table docs:
SELECT ref.value ('first-name', 'nvarchar(64)'),
ref.value ('last-name', 'nvarchar(64)')
FROM docs CROSS APPLY xCol.nodes ('/book/author) R(ref)
The property index is useful in this case and is created as follows:
CREATE XML INDEX idx_xCol_Property on docs (xCol)
USING XML INDEX idx_xCol FOR PROPERTY
Example: value-based query
In
the following query, a partial path is specified using //, so that the
lookup based on the value of ISBN benefits from the use of the VALUE
index:
SELECT xCol
FROM docs
WHERE xCol.exist ('//book/@ISBN[. = "0-7356-1588-2"]') = 1
The VALUE index is created as follows:
CREATE XML INDEX idx_xCol_Value on docs (xCol)
USING XML INDEX idx_xCol FOR VALUE
XML index on multiple filegroups
XML
indexes are collocated with the base table, i.e., XML index rows are
stored in the same file groups and table partitions as the
corresponding base table rows. This may sometimes require large
filegroups for XML blobs and their collocated XML indexes. The
TEXTIMAGE_ON <filegroup> specification in the CREATE TABLE
statement stores the XML blobs in the specified filegroup—the XML index
rows are still collocated with the base table, while large XML node
values are in the same filegroup as the XML blobs. This reduces the
size of the individual filegroups and provides more convenience for
data management. For example, when the non-XML data in the row is small
relative to the size of the XML data, this technique can distribute the
storage more evenly.
Full-text index on XML column
You
can create a full-text index on XML columns; this indexes the content
of the XML values while ignoring the XML markup. Attribute values are
not full-text indexed (since they are considered part of the markup)
and element tags are used as token boundaries. You can combine
full-text search with XML index usage in some scenarios:
- Filter the XML values of interest using SQL full-text search.
- Query those XML instances, which uses XML index on the XML column.
Example: combining full-text search with XML querying
The
steps for creating full-text index on an XML column are identical to
those for other SQL type columns. The DDL statements are as follows, in
which PK__docs__023D5A04 is the single-column primary key index of the
table:
CREATE FULLTEXT CATALOG ft AS DEFAULT
CREATE FULLTEXT INDEX ON dbo.docs (xCol) KEY INDEX PK__docs__023D5A04
Once the full-text index has been created on the XML column,
the following query checks that an XML instance contains the word
"Secure" in the title of a book:
SELECT *
FROM docs
WHERE CONTAINS(xCol,'Secure')
AND xCol.exist('/book/title/text()[contains(.,"Secure")]') =1
The CONTAINS() method uses the full-text index to subset the XML instances that contain the word "Secure" anywhere in the document. The exist() clause ensures that the word "Secure" occurs in the title of a book.
Full-text search using CONTAINS() and XQuery contains()
have different semantics. The latter is a substring match, while the
former is a token match using stemming. Thus, if the search is for the
string "run" in the title, then "run", "runs" and "running" all match,
since both the full-text CONTAINS() and the XQuery contains() are satisfied. However, the query above does not match the word "UnSecured" in the title (the full-text CONTAINS() fails but the XQuery contains() is satisfied). Furthermore, full-text search employs word stemming, while XQuery contains() is a literal match. In general, for a pure substring match, the full-text CONTAINS() clause should be removed. This difference is illustrated in the next example.
Example: full-text search on XML values using stemming
The XQuery contains() check in Example: combining full-text search with XML querying cannot be eliminated in general. Consider the query:
SELECT *
FROM docs
WHERE CONTAINS(xCol,'run')
The word "ran" in the document matches the search condition
owing to stemming. Furthermore, the search context is not checked using
XQuery.
When XML is decomposed using AXSD into relational
columns that are full-text indexed, XPath queries over the XML view do
not perform full-text search on the underlying tables.
Support for different languages in full-text index on XML column
Unlike
nvarchar or varchar columns that can have only one word breaker for the
entire column, an XML data type column supports multiple language word
breakers using the xml:lang attribute on XML elements. The word breaker
for the specified language is used on the content of that element. A
sub-element can specify a different language in an xml:lang attribute.
Thus, not only can different XML instances but also a single XML
instance can involve multiple word breakers.
This gives rise
to interesting possibilities. For example, a Word 2003 document may
contain sections in different languages. The document in WordML XML
representation can be stored in an XML data type column, and the
appropriate language word breakers are used for full-text indexing.
A full-text query can specify the language to use, as shown in the example below.
Example: full-text search specifying a language
The query below specifies that the full-text search should be performed for the German language.
SELECT * FROM docs
WHERE contains (xCol, 'Visionen', LANGUAGE 'German')
Property promotion
If queries are made
principally on a small number of element and attribute values (for
example, find customers based on customer id—i.e., the value of /Customer/@CustId
is specified), you may want to promote those values into relational
columns. This is helpful when queries are issued on a small part of the
XML data while the entire XML instance is retrieved. Creating XML index
on the XML column is overkill; instead, the promoted column can be
indexed. Queries must be written to use the promoted column (that is,
the query optimizer does not re-target queries on the XML column to the
promoted column).
The promoted column can be a computed column
in the same table or a separate, user-maintained column in a table.
This is adequate when singleton values (that is, single-valued
properties) are promoted from each XML instance. However, for
multi-valued properties, you have to create a separate table for the
property, as described below.
Computed column based on XML data type
A
computed column can be created using a user-defined function (UDF) that
invokes XML data type methods. The type of the computed column can be
any SQL type, including XML. This is illustrated in the following
example.
Example: computed column based on XML data type method
Create the user-defined function for ISBN of books:
CREATE FUNCTION udf_get_book_ISBN (@xData xml)
RETURNS varchar(20)
BEGIN
DECLARE @ISBN varchar(20)
SELECT @ISBN = @xData.value('/book[1]/@ISBN', 'varchar(20)')
RETURN @ISBN
END
Add a computed column to the table for ISBN:
ALTER TABLE docs
ADD ISBN AS dbo.udf_get_book_ISBN(xCol)
The computed column can be indexed in the usual way.
Example: queries on computed column based on XML data type methods
To obtain the <book> whose ISBN is 0-7356-1588-2, the query
SELECT pk, xCol
FROM docs
WHERE xCol.exist ('/book[@ISBN = "0-7356-1588-2"]') = 1
on the XML column can be rewritten to use the computed column as follows:
SELECT pk, xCol
FROM docs
WHERE ISBN = '0-7356-1588-2'
You can create a user-defined function to return XML data
type and create a computed column using the UDF. However, you cannot
create an XML index on the computed, XML column.
Creating property tables
You
may want to promote some of the multi-valued properties from your XML
data into one or more tables, create indexes on those tables, and
retarget your queries to use them. A typical scenario is one in which a
small number of properties cover most of your query workload. You can
do the following:
- Create one or more tables to hold the
multi-valued properties. You may find it convenient to store one
property per table, and to duplicate the primary key of the base table
in the property tables for back join with the base table.
- If you want to maintain the relative order of the properties, you need to introduce a separate column for the relative order.
- Create triggers on the XML column to maintain the property table(s). Within the triggers, do one of the following:
- Use XML data type methods, such as nodes() and value(), to insert and delete rows of the property table(s). (See the section Value(), nodes(), and OpenXML() for more discussion of the nodes() method.)
- Create streaming table-valued function(s) in CLR to insert and delete rows of the property table(s).
- Write
queries for SQL access to the property tables and XML access to the XML
column in the base table, with joins between the tables using their
primary key.
Example: create property table
Suppose
you want to promote first name of authors. Books have one or more
authors, so that first name is a multi-valued property. Each first name
is stored in a separate row of a property table. The primary key of the
base table is duplicated in the property table for back join.
CREATE TABLE tblPropAuthor (propPK int, propAuthor varchar(max))
Example: create user-defined function to generate a rowset from XML instance
The
table-valued user-defined function udf_XML2Table below accepts a
primary key value and an XML instance. It retrieves the first name of
all authors of <book> elements and returns a rowset of (primary
key, first name) pairs. Indexing a computed column based on XML data
type methods without using a wrapper user-defined function is not
supported in SQL Server 2005.
CREATE FUNCTION udf_XML2Table (@pk int, @xCol xml)
RETURNS table WITH SCHEMABINDING
AS RETURN(
select @pk as PropPK, nref.value('.', 'varchar(max)') as propAuthor
from @xCol.nodes('/book/author/first-name') R(nref)
)
Example: create triggers to populate property table
Insert trigger: inserts rows into the property table
CREATE TRIGGER trg_docs_INS on docs FOR INSERT
AS
BEGIN
insert into tblPropAuthor
select p.*
from inserted as I CROSS APPLY
dbo.udf_XML2Table(I.pk, I.xCol) as P
END
Delete trigger: deletes rows from the property table based on the primary key value of deleted rows
create trigger trg_docs_DEL on docs for delete
as
begin
delete tblPropAuthor where propPK IN
(select p.PropPK
from deleted as D CROSS APPLY
dbo.udf_XML2Table(D.pk, D.xCol) as P
)
end
Update trigger: deletes existing rows in property table
corresponding to the updated XML instance and inserts new rows into
property table
create trigger trg_docs_UPD
on docs
for update
as
if update(xCol) or update(pk)
begin
delete tblPropAuthor where propPK IN
(select p.PropPK
from deleted as D CROSS APPLY
dbo.udf_XML2Table(D.pk, D.xCol) as P
)
insert into tblPropAuthor
select p.*
from inserted as I CROSS APPLY
dbo.udf_XML2Table(i.pk, i.xCol) as P
end
Example: find XML instances whose authors have the first name "David"
The
query can be formulated on the XML column. Alternatively, it can search
the property table for first name "David" and perform a back join with
the base table to return the XML instance, as shown below.
SELECT xCol
FROM docs JOIN tblPropAuthor ON docs.pk = tblPropAuthor.propPK
WHERE tblPropAuthor.propAuthor = 'David'
Example: solution using CLR streaming table-valued function
This solution consists of the following steps:
- Define a CLR class CXmlStreamingTVF that
implements IEnumerator and contains a method InitMethod to generate a
streaming table-valued output by applying a simple path expression on
an XML instance.
- Create an assembly and a Transact-SQL user-defined function (UDF) to invoke the CLR class.
- Define insert, update and delete triggers using the UDF to maintain the property table(s).
First,
create the streaming CLR function shown below. XML data type is exposed
as a managed class SqlXml in ADO.NET; it supports the method CreateReader() that returns an XmlReader.
using System;
using System.Xml;
using System.IO;
using System.Data;
using System.Data.Sql;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;
public class CXmlStreamingTVF : IEnumerator {
private XmlReader m_reader;
private SqlXml m_doc;
private string m_name;
private string[] m_path;
private int m_pathLoc;
public CXmlStreamingTVF (SqlXml doc, string simplePath) {
m_doc = doc;
m_reader = m_doc.CreateReader();
m_path = simplePath.Split(new char[]{'/'});
m_pathLoc = m_path.Length-1;
}
//Three IEnumerator methods.
//Custom code for Navigating the document for a simple path.
public bool MoveNext () {
bool new_row = false;
while (!new_row && !m_reader.EOF) {
m_reader.Read();
if (m_reader.LocalName==m_path[m_pathLoc] &&
m_pathLoc==m_path.Length-1 &&
m_reader.NodeType==XmlNodeType.Element) {
m_name = m_reader.ReadString();
new_row = true;
}
else if (m_reader.LocalName==m_path[m_pathLoc] &&
m_reader.NodeType==XmlNodeType.Element &&
m_reader.IsEmptyElement==false) {
if (m_pathLoc==1 && m_reader.Depth!=0)
continue;
m_pathLoc++;
}
else if (m_pathLoc!=1 &&
m_reader.LocalName==m_path[m_pathLoc-1] &&
m_reader.NodeType==XmlNodeType.EndElement) {
m_pathLoc--;
}
}
return new_row;
}
public object Current { get { return this; } }
public void Reset () {
m_reader.Close();
m_reader = m_doc.CreateReader();
}
[SqlFunctionAttribute (FillRowMethodName="CLROpenXml")]
public static IEnumerator InitMethod (SqlXml doc, string simplePath)
{
return new CXmlStreamingTVF(doc, simplePath);
}
public static void CLROpenXml(Object obj, out string name) {
CXmlStreamingTVF stream = (CXmlStreamingTVF) obj;
name = stream.m_name;
}
}
Next, create an assembly and a Transact-SQL user-defined function SQL_streaming_xml_tvf corresponding to the CLR method InitMethod().
CREATE ASSEMBLY CLRXML
FROM 'C:\temp\StreamingTVF.dll'
WITH PERMISSION_SET = SAFE
CREATE FUNCTION SQL_streaming_xml_tvf (
@xData XML, @xPath nvarchar(max))
RETURNS table (FirstName nvarchar(max))
AS
EXTERNAL NAME [CLRXML].[CXmlStreamingTVF].[InitMethod]
The UDF is used to define the table-valued function CLR_udf_XML2Table for rowset generation:
create function CLR_udf_XML2Table (@pk int, @xCol xml)
returns @ret_Table table (FK int, FirstName varchar(max))
with schemabinding
as
begin
insert into @ret_Table
select @pk, FirstName
FROM SQL_streaming_xml_tvf (@xCol, '/book/author/first-name')
return
end
Finally, define triggers as in Example "Create triggers to
populate property table" with the function CLR_udf_XML2Table replacing
udf_XML2Table. Thus, the insert trigger is as follows:
create trigger CLR_trg_docs_INS on docs for insert
as
begin
insert into tblPropAuthor
select p.*
from inserted as I CROSS APPLY
dbo.CLR_udf_XML2Table(I.pk, I.xCol) as P
end
The delete and the update triggers are similar to the non-CLR ones and are obtained by merely replacing the function udf_XML2Table() with CLR_udf_XML2Table().
Pros and cons of these two alternatives
When the function udf_XML2Table()
used to generate, remove and modify the rows in the property table is
CPU intensive, the CLR-based approach is generally faster. This
includes XML data with very complex structure so that the XML parsing
is computationally expensive. When the evaluation of the function udf_XML2Table()
is cheap, the difference diminishes. For small XML sizes and simple
path expressions, the cost of context switching may hurt the CLR-based
solution more.
Unlike the Transact-SQL and XQuery-based
solution, the path expression in the CLR-based solution is hard-coded.
This works well as long as the path expressions are known ahead of
time. In all other cases, the Transact-SQL and XQuery-based solutions
are the only viable ones.
XML schema collections
An
XML schema collection is a meta-data entity, scoped by a relational
schema, which contains one or more XML schemas that may be related
(e.g. through <xs:import>) or unrelated. Individual XML schemas
within an XML schema collection are identified using their target
namespace.
An XML schema collection is created using CREATE XML
SCHEMA COLLECTION syntax and providing one or more XML schemas. More
XML schema components can be added to an existing XML schema, and more
schemas can be added to an XML schema collection using ALTER XML SCHEMA
COLLECTION syntax. XML schema collections can be secured like any SQL
object using SQL Server 2005's security model.
Multi-typed column
An
XML schema collection C types an XML column xCol according to multiple
XML schemas. Additionally, the flag DOCUMENT or CONTENT specifies
whether XML trees or fragments, respectively, can be stored in column
xCol.
For DOCUMENT, each XML instance specifies the target
namespace of its top-level element in the instance, according to which
it is validated and typed. For CONTENT, on the other hand, each
top-level element can specify any one of the target namespaces in C.
The XML instance is validated and typed according to all the target
namespaces occurring in an instance.
Schema evolution
XML
schema collection is used to type XML columns, variables and
parameters. It provides a mechanism for XML schema evolution. Suppose
you add an XML schema with target namespace BOOK-V1 to an XML schema
collection C. An XML column xCol typed using C can store XML data
conforming to BOOK-V1 schema.
Suppose an application wants to
extend the XML schema with new schema components, such as complex type
definitions and top-level element declarations. These new schema
components can be added to BOOK-V1 schema and do not require
re-validation of the existing XML data in column xCol.
Suppose
later the application wants to provide a new version of the XML schema,
for which it chooses the target namespace BOOK-V2. This XML schema can
be added to C. The XML column can store instances of both BOOK-V1 and
BOOK-V2, and execute queries and data modification on XML instances
conforming to these namespaces.
Lax validation disallowed in wildcard sections
The
XML schema processor does not support lax validation in wildcard
sections (xs:any and xs:anyAttribute) and xs:anyType. For wildcard
sections, the XML schema can specify either processContents = "strict" or processContents = "skip". For xs:anyType, only strict validation is supported.
Strict
validation ensures that more precise type information regarding the XML
nodes instantiating these schema components is known during validation
and used during query compilation. Skip semantics loses the typing
information and the corresponding nodes are treated as untyped
(xdt:untyped in the case of elements and xdt:untypedAtomic in the case
of attributes).
If skip semantics for xs:anyType is desired, then introduce a new complex type that uses xs:any and xs:anyAttribute with processContents = "skip" as shown below:
<xs:complexType name="skipAnyType" mixed="true">
<xs:sequence>
<xs:any processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:anyAttribute processContents="skip"/>
</xs:complexType>
Using xs:datetime, xs:date and xs:time
Values
of type xs:datetime, xs:data and xs:time must be specified in ISO 8601
format and include a time zone. Otherwise the data validation for these
values fails. Thus, 2005-05-27T14:11:00.943Z is valid as a value of
type xs:datetime but the following are not: 2005-05-27 14:11:00.943Z
(missing date and time separator "T"), 2005-05-27T14:11:00.943 (missing
time zone) and 2005-05-27 14:11:00.943 (missing time separator and time
zone). Similarly, 2005-05-27Z is a valid xs:date value but 2005-05-27
is not since no time zone is specified.
Untyped XML data may
contain date, time and datetime values that an application may wish to
convert to the SQL types dateTime or smallDateTime. These date, time
and datetime values may not conform to ISO 8601 format or contain a
time zone. Similarly, typed XML may contain such values as types other
than xs:date, xs:time and xs:dateTime (e.g., xs:string). In both cases,
the values should be converted first to [n]varchar and then to SQL
datetime or smalldatetime, as the following example illustrates.
Example: extracting datetime value from untyped XML
To obtain the value of the CreationTime attribute from the following data
declare @var xml
select @var =
'<QueryExecutionStats>
<GeneralStats ExecutionCount="1"
LastExecutionTime="2005-05-19 14:11:00.943"
CreationTime="2005-05-19 14:11:00.913"/>
<WorkerTime Total="3361" Last="3361" Min="3361" Max="3361"/>
<PhysicalReads Total="0" Last="0" Min="0" Max="0"/>
<PhysicalWrites Total="0" Last="0" Min="0" Max="0"/>
<LogicalReads Total="0" Last="0" Min="0" Max="0"/>
</QueryExecutionStats>'
a value() method is used to retrieve the value as nvarchar(64), which is then cast to SQL datetime type:
select cast (@var.value(
'(/QueryExecutionStats /GeneralStats/@CreationTime)[1]',
'nvarchar(max)') AS datetime) as creation_time
Usage
Loading XML data
Transfering XML data from SQL Server 2000 to SQL Server 2005
You can transfer XML data to SQL Server 2005 in multiple ways. We discuss a few options.
- If you have your data in an [n]text or image
column in a SQL Server 2000 database, import the table using, say, DTS,
into a SQL Server 2005 database. Change the column type to
[n]varchar(max) or varbinary(max), respectively, and then to XML using
ALTER TABLE statement.
- You can bulk copy your data from SQL Server 2000 using bcp out, and bulk insert into the SQL Server 2005 database using bcp in.
- If
you have data in relational columns in a SQL Server 2000 database,
create a new table with an [n]text column and optionally a primary key
column for a row identifier. Use client side programming to retrieve
XML generated at the server with FOR XML, and write it into the [n]text
column. Then use the above-mentioned techniques to transfer data to a
SQL Server 2005 database. You may choose to write the XML into an XML
column in the SQL Server 2005 database directly.
Example: changing column Type to XML
Suppose
you want to change the type of an [n]text, [n]varchar, varbinary, or
XML column XYZ in table R to XML typed using the XML schema collection
bookCollection. The following statement performs this type change:
ALTER TABLE R ALTER COLUMN XYZ XML (bookCollection)
The target is untyped XML if no XML schema collection is specified.
Text encoding
SQL
Server 2005 stores XML data in Unicode (UTF-16). XML data retrieved
from the server comes out in UTF-16 encoding; if you want a different
encoding for data retrieval, your application needs to perform the
necessary conversion on the retrieved UTF-16 data.
When
converting a string type to XML data type, SQL Server 2005 uses the
code page of the collation of the source string to determine the
encoding. If XML encoding information is specified using the "encoding"
attribute in the XML declaration (e.g., <?xml version="1.0" encoding="windows-1256"?>),
the encoding must be compatible with the string's code page. The string
data can be parsed correctly by the XML parser as long as these two
collations are compatible. Otherwise, an error may be raised or invalid
data may be loaded. The same behavior also occurs when a client
application sends a string value to the server for conversion to XML
data type.
Sometimes, you may have XML data in different
encodings, or have no advance knowledge about the encodings. The
recommendation in such situations is to provide the XML data as a
binary data type (e.g., varbinary(max)). The server derives
the encoding from the byte-order mark of the data stream (0xFFFE
indicates UTF-16) or, if present, the XML declaration. Consequently,
the easiest way of avoiding XML encoding mismatch in an XML parameter
is to send the XML data from the client as native XML (using the SqlXml
class in ADO.NET) or binary type, or to convert from [var]binary data
type to XML at the server.
To summarize, the rules are:
- If your text XML is in Unicode (UCS-2,
UTF-16), assigning it to an XML column, variable or parameter does not
pose any problems.
- If the encoding is not Unicode and is
implicit (due to the source code page), the string code page in the
database should be the same as or compatible with the code points that
you want to load (use COLLATE if necessary). If no such server code
page exists, you have to add an explicit XML declaration to specify the
proper encoding.
- To use an explicit encoding, either use
varbinary type, which has no interaction with code pages, or use a
string type of the appropriate code page. Then assign the data to XML
column, variable or parameter.
Thus, if you want to pass
UTF-8, it is safest to pass it in as varbinary(max). UTF-16 data can be
passed in as nvarchar(max) where no byte order mark is required, or as
varbinary(max) with the byte order mark 0xFFFE as the first two byte to
indicate UTF-16 encoding.
Bulk loading XML data
You
can bulk load XML data into the server using SQL Server's bulk loading
capabilities, such as BCP, OPENROWSET and BULK INSERT. OPENROWSET
allows you to load data into an XML column from files. The following
example illustrates this point.
Example: loading XML from files
This
example shows how to insert a row in table docs. The value of the XML
column is loaded from file C:\temp\xmlfile.xml as binary LOB (BLOB),
and the pk column is supplied the value 10. The file is loaded as a
BLOB (instead of a CLOB or NCLOB) to accept any encoding that the XML
document may be encoded in.
INSERT INTO docs
SELECT 10, xCol
FROM (SELECT *
FROM OPENROWSET (BULK 'C:\temp\xmlfile.xml', SINGLE_BLOB)
AS xCol) AS R(xCol)
Non-binary collations
The XML
collation used for XML data type is a binary collation and is
case-sensitive (the so-called Unicode code point collation).
Applications may have a different requirement, such as case insensitive
searches. This can be achieved by promoting the appropriate string
values into a computed column of type varchar with the appropriate
collation. Query the computed column for collation-dependent
operations. Furthermore, suppose the XML column contains German and
Chinese data strings. You can use operations specific to each of these
collations on two computed columns, one for each of these languages.
XQuery and type inference
XQuery (http://www.w3.org/TR/xquery/
[ http://www.w3.org/tr/xquery/ ] ) embedded in Transact-SQL is the
language supported for querying XML data type. The language is under
development (currently in last call) by the World Wide Web Consortium
(W3C) with the participation of all major database vendors including
Microsoft. It includes XPath 2.0 as navigation language. Language
constructs for data modification are available on XML data type as
well. See books online for information on the XQuery constructs,
functions and operators supported in SQL Server 2005.
Error model
Compilation
errors are returned from syntactically incorrect Xquery expressions and
XML DML statements. The compilation phase checks static type
correctness of XQuery expressions and DML statements, and uses XML
schemas for type inferences in case of typed XML. It raises static type
errors if an expression could fail at runtime due to type safety
violation. Examples of static error are addition of a string to an
integer and querying for a non-existent node for typed data.
As
a deviation from the W3C standard, XQuery runtime errors are converted
into empty sequences, which may propagate as empty XML or NULL to the
query result depending upon the invocation context.
Explicit
casting to the proper type allows users to work around static errors
although runtime cast errors will be transformed to empty sequences.
The following subsections discuss type checking in greater detail.
Singleton checks
Location
steps, function parameters and operators (e.g. eq) requiring singletons
return an error if the compiler cannot determine whether a singleton is
guaranteed at runtime. The problem arises often with untyped data and
sometimes with typed data. For example, lookup of an attribute requires
a singleton parent element; an ordinal selecting a single parent node
is adequate. Evaluation of nodes()-value() combination (see the section Value(), nodes(), and OpenXML()) to extract attribute values may not require the ordinal specification, as shown in the next example, since the nodes() method emits singleton context items.
Example: known singleton
In this example, the nodes() method generates a separate row for each <book> element. (See the section Value(), nodes(), and OpenXML() for a more detailed description of the nodes() method.) The value() method evaluated on a <book> node extracts the value of @genre, which, being an attribute, is a singleton.
SELECT nref.value('@genre', 'varchar(max)') Genre
FROM docs CROSS APPLY xCol.nodes('//book') AS R(nref)
XML schema is used for type checking of typed XML. If a node
is specified as singleton in the XML schema, the compiler uses that
information and no error occurs. Otherwise, an ordinal selecting a
single node is required. In particular, the use of descendant axis,
such as in /book//title, loses singleton cardinality
inference for the <title> element even if the XML schema
specifies it to be so. Rewrite it as (/book//title)[1].
It is important to keep the distinction between //first-name[1] and (//first-name)[1]
in mind for type checking. The former returns a sequence of
<first-name> nodes in which each node is the leftmost
<first-name> node amongst its siblings. The latter returns the
first, singleton <first-name> node in document order in the XML
instance.
Example: use of value()
The query below on untyped XML column results in static, compilation error since value()
expects a singleton node as the first argument and the compiler cannot
determine whether only one <last-name> node will occur at runtime:
SELECT xCol.value('//author/last-name', 'nvarchar(50)') LastName
FROM docs
It is tempting to try the following fix:
SELECT xCol.value('//author/last-name[1]', 'nvarchar(50)') LastName
FROM docs
However, this does not rectify the error since multiple
<author> nodes may occur in each XML instance. The following
rewrite works:
SELECT xCol.value('(//author/last-name)[1]', 'nvarchar(50)') LastName
FROM docs
This query returns the value of the first <last-name> element in each XML instance.
Parent axis
If
the type of a node cannot be determined, it becomes xs:anyType, which
is not implicitly cast to any other type. This occurs most notably
during navigation using parent axis (e.g. xCol.query('/book/@genre/../price'));
the parent node type is determined to be xs:anyType. An element may
also be defined as xs:anyType in an XML schema. In both cases, the loss
of more precise type information often leads to static type errors, and
requires explicit cast of atomic values to their specific type.
Data(),text() and string() accessors
XQuery has a function fn:data() to extract scalar, typed values from nodes, a node test text() to return text nodes, and the function fn:string()
that returns the string value of a node. Their usages are sometimes
confusing. Guidelines for their proper use in SQL Server 2005 are as
follows. Consider the XML instance <age>12</age>.
- Untyped XML: The path expression /age/text() returns the text node "12". The function fn:data(/age) returns the string value "12" and so does fn:string(/age).
- Typed
XML: In SQL Server 2005, the expression /age/text() returns static
error for any simple typed <age> element. On the other hand, fn:data(/age) returns integer 12, while fn:string(/age) yields the string "12".
Used within a query()
method, the result of each of these functions and nodes tests are
converted to text nodes and serialized as a single XML data type
instance. Text nodes are visually represented by their string value,
and the serialization of multiple text nodes appears as a concatenation
of their string values. However, a search for the concatenated string
or a part of it yields an empty result whenever it does not come from a
single text node. This difference is illustrated in the example below.
Example: serialization of text nodes
In the following query, the text nodes under all the <author> elements are retrieved within a query() method. There are four such text nodes in the example used in Example: creating primary XML index, and the serialized output appears as "MichaelHowardDavidLeBlanc" in SQL Server Management Studio.
SELECT xCol.query ('//author/*/text()') LastName
FROM docs
A search for the value "MichaelHowardDavidLeBlanc" in the
query below returns an empty result since the search value does not
equal that of any single text node under an <author> element:
SELECT xCol.query('//author/*/text()[. = "MichaelHowardDavidLeBlanc"]')
FROM docs
Functions and operators over union types
Union types require careful handling owing to type checking. Two of the problems are illustrated in the following examples.
Example: function over union type
Consider an element definition for <r> of a union type
<xs:element name="r">
<xs:simpleType>
<xs:union memberTypes="xs:int xs:float xs:double"/>
</xs:simpleType>
</xs:element>
Within XQuery context, the "average" function fn:avg (//r)
returns a static error since the XQuery compiler cannot sum values of
different types (xs:int, xs:float or xs:double) for the <r>
elements in the argument of fn:avg(). To get around this, rewrite the function invocation as fn:avg(for $r in //r return xs:double ($r)).
Example: operator over union type
The addition operation '+' requires precise types of the operands, so that the expression (//r)[1] + 1
returns a static error with the above type definition for element
<r>. One rewrite to fix the problem is xs:int( (//r)[1]) +1.
Value(), nodes(), and OpenXML()
You can use multiple value() methods on XML data type in a SELECT clause to generate a rowset of extracted values. The nodes() method yields an internal reference for each selected node which can be used to query further. The nodes() method can operate over an XML column. The combination of nodes() and value()
methods can be more efficient in generating the rowset when it has many
columns and perhaps the path expressions used in its generation are
complex.
The nodes() method yields instances of a
special XML data type, each of which has its context set to a different
selected node. Such an XML instance supports query(), value(), nodes() and exist() methods, and can be used in count(*) aggregations and IS NULL checks. All other uses result in error.
Example: use of nodes()
Suppose
you want to extract first and last names of authors, whose first name
is not "David", as a rowset consisting of two columns, FirstName and
LastName. Using nodes() and value() methods, you can achieve this as follows:
SELECT nref.value('first-name[1]', 'nvarchar(50)') FirstName,
nref.value('last-name[1]', 'nvarchar(50)') LastName
FROM docs CROSS APPLY xCol.nodes('//author') AS R(nref)
WHERE nref.exist('.[first-name != "David"]') = 1
In this example, nodes('//author') yields a rowset
of references to <author> elements for each XML instance. The
first and last names of authors are obtained by evaluating value() methods relative to those references.
SQL Server 2000 provides a facility for generating a rowset from an XML instance using OpenXml(). You can specify the relational schema for the rowset and how values inside the XML instance map to columns in the rowset.
Example: use of OpenXml() on XML data type
We can rewrite the query from the previous example using OpenXml() as shown below, by creating a cursor, reading each XML instance into an XML variable, and applying OpenXML to it:
DECLARE name_cursor CURSOR
FOR
SELECT xCol
FROM docs
OPEN name_cursor
DECLARE @xmlVal XML
DECLARE @idoc int
FETCH NEXT FROM name_cursor INTO @xmlVal
WHILE (@@FETCH_STATUS = 0)
BEGIN
EXEC sp_xml_preparedocument @idoc OUTPUT, @xmlVal
SELECT *
FROM OPENXML (@idoc, '//author')
WITH (FirstName varchar(50) 'first-name',
LastName varchar(50) 'last-name') R
WHERE R.FirstName != 'David'
EXEC sp_xml_removedocument @idoc
FETCH NEXT FROM name_cursor INTO @xmlVal
END
CLOSE name_cursor
DEALLOCATE name_cursor
OpenXml() creates an in-memory representation and uses work tables instead of the query processor. Its parsing procedure sp_xml_preparedocument requires a well-formed XML document and does not accept XML fragments. OpenXML()
relies on the XPath 1.0 processor of MSXMLSQL, which is a private
version of the MSXML 3.0 processor used by the database engine, instead
of the XQuery engine. The work tables are not shared among multiple
calls to OpenXml() even on the same XML instance. This limits its scalability. OpenXml() allows you to access an edge table format for the XML data when the WITH clause is not specified. Also, it allows you to use the remainder of the XML value in a separate, "overflow" column.
The combination of nodes() and value() functions use XML indexes effectively. Thus, this combination can exhibit greater scalability than OpenXml.
Example: use of OpenXml on a single XML instance
OpenXml
is often used to shred a single XML instance into a relational form,
for example, when the XML data is received on the wire. In this case,
no cursor is required. This example shows a stored procedure that
accepts a single XML instance for shredding the XML the same way as
that using the cursor example above.
CREATE PROCEDURE SHRED_SINGLE_XML @xmlVal nvarchar(max)
AS
BEGIN
DECLARE @idoc INT
EXEC sp_xml_preparedocument @idoc OUTPUT, @xmlVal
SELECT *
FROM OPENXML (@idoc, '//author')
WITH (FirstName varchar(50) 'first-name',
LastName varchar(50) 'last-name') R
WHERE R.FirstName != 'David'
EXEC sp_xml_removedocument @idoc
END
The stored procedure can be invoked as shown below:
DECLARE @xVal XML
SET @xVal = (SELECT xCOl FROM docs WHERE pk=1)
EXEC SHRED_SINGLE_XML @xVal
Using FOR XML to generate XML from rowsets
You can generate an XML data type instance from a rowset using FOR XML with the new TYPE directive.
The result can be assigned to an XML data type column, variable or parameter. Furthermore, FOR XML can be nested to generate any hierarchical structure. This makes nested FOR XML much more convenient to write than FOR XML EXPLICIT,
but it may not perform as well for deep hierarchies. FOR XML also
introduces a new PATH mode that specifies the path in the XML tree
where a column's value should appear.
The new FOR XML TYPE
directive can be used to define read-only XML views over relational
data with SQL syntax. The view can be queried with SQL statements and
embedded XQuery, as the following example shows. For instance, you can
refer to such SQL views in stored procedures. More information can be
found in the MSDN article
What's New in FOR XML in Microsoft SQL Server 2005 [ http://msdn2.microsoft.com/en-us/library/ms345137.aspx ] .
Example: SQL view returning generated XML data type
The
following SQL view definition creates an XML view over a relational
column (pk) and book authors retrieved from an XML column:
CREATE VIEW V (xmlVal) AS
SELECT pk, xCol.query('/book/author')
FROM docs
FOR XML AUTO, TYPE
The view V contains a single row with a single column xmlVal of XML type. It
can be queried like a regular XML data type instance. For example, the
following query returns the author whose first name is "David":
SELECT xmlVal.query('//author[first-name = "David"]')
FROM V
The query execution materializes the XML instance before executing the query() method on it. Hence, this approach does not perform or scale well except when the aggregated XML instance is small.
SQL
view definitions are somewhat analogous to XML views created using
annotated schemas. However, there are important differences. The SQL
view definition is read-only and must be manipulated with embedded
XQuery; not so for XML views using annotated schema. Furthermore, the
SQL view materializes the XML result before applying the XQuery
expression, while XPath queries on XML views evaluate SQL queries on
the underlying tables.
Adding business logic
Your business logic can be added to XML data in several ways:
- You can write row or column constraints to
enforce domain-specific constraints during insertion and modification
of XML data. Constraints using XML data type methods are allowed only
within a scalar user-defined function.
- You can write a trigger
on the XML column that fires when you insert or update values in the
column. The trigger can contain domain-specific validation rules or
populate property tables.
- You can write SQLCLR functions in
managed code to which you pass XML values, and use XML processing
capabilities provided by System.Xml namespace. An example is to apply
XSL transformation to XML data, as shown below. Alternatively, you can
deserialize the XML into one or more managed classes and operate on
them using managed code.
- You can write Transact-SQL stored procedures and functions that invoke processing on the XML column for your business needs.
Example: applying XSL transformation
Consider
a CLR function TransformXml() that accepts an XML data type instance
and an XSL transformation stored in a file, applies the transformation
to the XML data and returns the transformed XML in the result. A
skeleton function written in C# is as follows:
using System;
using System.Data.SqlTypes;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;
public class TransformXml
{
public static SqlXml ApplyXslTransform (SqlXml XmlData, string xslPath) {
// Load XSL transformation
XslCompiledTransform xform = new XslCompiledTransform();
xform.Load (xslPath);
// Load XML data
XPathDocument xDoc = new XPathDocument (XmlData.CreateReader());
XPathNavigator nav = xDoc.CreateNavigator ();
// Apply the transformation
// using makes sure that we flush the writer at the end
using (XmlWriter writer = nav.AppendChild())
{
xform.Transform(XmlData.CreateReader(), writer);
}
// Return the transformed value
SqlXml retSqlXml = new SqlXml (nav.ReadSubtree());
return (retSqlXml);
}
}
Once the assembly is registered and a corresponding user-defined Transact-SQL function SqlXslTransform() corresponding to ApplyXslTransform() is created , the function can be invoked from Transact-SQL as in the following query:
SELECT SqlXslTransform (xCol, 'C:\temp\xsltransform.xsl')
FROM docs
WHERE xCol.exist('/book/title/text()[contains(.,"Secure")]') =1
The query result contains a rowset of the transformed XML.
SQLCLR
opens up a whole new world that can be used for decomposing XML data
into tables or property promotion, and querying XML data using managed
classes in the System.Xml namespace. More information can be found in
SQL Server 2005 and Visual Studio "Whidbey" books online.
Data binding in queries
When
your data resides in a combination of relational and XML data type
columns, you may want to write queries that combine relational and XML
data processing. For example, you can convert the data in relational
and XML columns into an XML data type instance using FOR XML and query it using XQuery. Conversely, you can generate a rowset from XML values (see Usage) and query it using Transact-SQL.
A
more convenient and efficient way of writing cross-domain queries is to
use the value of a SQL variable or column within XQuery or XML DML
expressions:
- You can use sql:variable() to use the value of a SQL variable in your XQuery or XML DML expression.
- You can use sql:column() to use values from a relation column in your XQuery or XML DML expression.
This
approach allows applications to parameterize queries, as shown in the
example below. However, XML and user-defined type are not permitted in sql:variable() and sql:column().
Example: data binding using sql:variable()
The query below is a modified version of the one shown in Example: queries on computed column based on XML data type methods. In this version, the ISBN of interest is passed in using a SQL variable @isbn. By replacing the constant with sql:variable(), the query can be used to search for any ISBN, not just the one whose ISBN is 0-7356-1588-2.
DECLARE @isbn varchar(20)
SET @isbn = '0-7356-1588-2'
SELECT xCol
FROM docs
WHERE xCol.exist ('/book[@ISBN = sql:variable("@isbn")]') = 1
Sql:column()
can be used in a similar way and provides additional benefits. Indexes
over the column may be used for efficiency as decided by the cost-based
query optimizer. Furthermore, the computed column may store a promoted
property, as discussed in Computed column based on XML data type.
Catalog views for native XML support
Catalog views exist to provide meta-data information regarding XML usage. A few of these are discussed below.
XML indexes
XML
index entries appear in the catalog view sys.indexes with the index
"type" 3. The "name" column contains the name of the XML index.
XML
indexes are also recorded in the catalog view sys.xml_indexes, which
contains all the columns of sys.indexes and a few special ones
meaningful for XML indexes. The value NULL in the column
"secondary_type" indicates a primary XML index; the values 'P', 'R' and
'V' stand for PATH, PROPERTY and VALUE secondary XML indexes,
respectively.
Space usage of XML indexes can be found in the table-valued function sys.dm_db_index_physical_stats()
. It provides information such as the number of disk pages occupied,
average row size in bytes, number of records and other information for
all index types, including XML indexes. This information is available
for each database partition; XML indexes use the same partitioning
scheme and partitioning function of the base table.
Example: space usage of XML indexes
SELECT sum (page_count)
FROM sys.dm_db_index_physical_stats (db_id(), object_id('docs'),
DEFAULT, DEFAULT, 'DETAILED') SDPS
JOIN sys.xml_indexes SXI ON (SXI.index_id = SDPS.index_id)
WHERE SXI.name = 'idx_xCol_Path'
This yields the number of disk pages occupied by the XML index idx_xCol_Path in table docs across all partitions. Without the sum() function, the result would return the disk page usage per partition.
Retrieving XML schema collections
XML
schema collections are enumerated in the catalog view
sys.xml_schema_collections. The XML schema collection "sys" is defined
by system and contains pre-defined namespaces that can be used in all
user-defined XML schema collections without having to load them
explicitly. This list contains the namespaces for xml, xs, xsi, fn and
xdt. Two other catalog views worth mentioning are:
sys.xml_schema_namespaces, which enumerates all namespaces within each
XML schema collection; and sys.xml_schema_components, which enumerates
all XML schema components within each XML schema.
The built-in function XML_SCHEMA_NAMESPACE(schemaName, XmlSchemacollectionName, namespace-uri)
yields an XML data type instance containing XML schema fragments for
schemas contained in an XML schema collection, except for the
pre-defined XML schemas.
You can enumerate the contents of an XML schema collection in the following ways:
- Write Transact-SQL queries on the appropriate catalog views for XML schema collections.
- Use the built-in function XML_SCHEMA_NAMESPACE(). You can apply XML data type methods on the output of this function. However, you cannot modify the underlying XML schemas.
These are illustrated in the examples below.
Example: enumerate XML namespaces in XML schema collection
Use the following query for XML schema collection "myCollection":
SELECT XSN.name
FROM sys.xml_schema_collections XSC JOIN sys.xml_schema_namespaces XSN
ON (XSC.xml_collection_id = XSN.xml_collection_id)
WHERE XSC.name = 'myCollection'
Example: enumerate contents of an XML schema collection
The following statement enumerates the contents of the XML schema collection "myCollection" within (relational) schema dbo.
SELECT XML_SCHEMA_NAMESPACE (N'dbo', N'myCollection')
Individual XML schemas within the collection can be obtained
as XML data type instances by specifying the target namespace as the
third argument to XML_SCHEMA_NAMESPACE(), as shown below.
Example: output a specified schema from an XML schema collection
The
following statement outputs the XML schema with target namespace
"http://www.microsoft.com/books" from the XML schema collection
"myCollection" within (relational) schema dbo.
SELECT XML_SCHEMA_NAMESPACE (N'dbo', N'myCollection',
N'http://www.microsoft.com/books')
Querying XML schemas
If you want to query XML schemas that you have loaded into XML schema collections, you may do so in the following ways:
- Write Transact-SQL queries on catalog views for XML schema namespaces.
- Create
a table containing an XML data type column to store your XML schemas,
in addition to loading them into the XML type system. You can query the
XML column using the XML data type methods. Furthermore, you can build
XML index on this column. However, maintaining consistency between the
XML schemas stored in the XML column and the XML type system is left to
the application. For example, if you drop the XML schema namespace from
the XML type system, you have to drop it also from your table to
preserve consistency.