Balaji Rathakrishnan
Christian Kleinerman
Brad Richards
Ramachandran Venkatesh
Vineet Rao
Isaac Kunen
Published: May 2005
Summary:
This paper describes the new CLR integration features of SQL Server
2005 and how database application developers and architects can take
advantage of them to write user-defined procedures, functions, and
triggers, as well as define new types and aggregates. (46 printed pages)
We
compare CLR-based programming against existing programming models
supported in SQL Server, such as Transact-SQL and extended stored
procedures, highlight the relative strengths and weaknesses of each
technique, and provide a set of high-level guidelines on how to choose
between the available programming alternatives. We also provide a set
of code examples illustrating CLR integration features.
Contents
Introduction
Overview of CLR Integration
Manual Deployment
Building, Deploying, and Debugging Using Visual Studio
CLR and Its Alternatives
CLR vs. Transact-SQL
CLR vs. XPs
Code in the Middle Tier
Example: Production Scheduling
Solving Common Database Programming Tasks and Problems
Data Validation using the .NET Framework Library
Producing Result Sets
Performing Custom Aggregations over Data
User Defined Types
Conclusion
Introduction
Microsoft
SQL Server 2005 significantly enhances the database programming model
by hosting the Microsoft .NET Framework 2.0 Common Language Runtime
(CLR). This enables developers to write procedures, triggers, and
functions in any of the CLR languages, including Microsoft Visual C#
.NET, Microsoft Visual Basic .NET, and Microsoft Visual C++. This also
allows developers to extend the database with new types and aggregates.
This article describes how best to take advantage of this
technology from the perspective of a database application developer,
comparing CLR integration techniques with existing programming language
support in SQL Server: Transact-SQL (T-SQL) and extended stored
procedures (XPs). This paper is not a reference for these features;
reference information can be found in the SQL Server 2005 Books Online.
The samples in this paper assume the April 2005 CTP of SQL Server 2005
or later.
The target audience of this article includes database
application developers, architects, and administrators. The paper
assumes a working familiarity with .NET Framework-based programming and
database programming.
Overview of CLR Integration
What
follows is a brief overview of the SQL Server functionality that is
enabled by CLR integration and how Visual Studio 2005 supports these
features.
Manual Deployment
Registering and executing managed code in the database consists of the following steps:
- The developer writes a managed program as a set of class definitions. SQL Server routines—stored procedures, functions, or triggers—are written as static (or Shared
in Microsoft Visual Basic .NET) methods of a class. User-defined types
and aggregates are written as entire classes. The developer compiles
the code and creates an assembly.
- The assembly is uploaded into a SQL Server database, where it is stored in the system catalogs using the CREATE ASSEMBLY data definition language (DDL) statement.
- Transact-SQL
(T-SQL) objects, such as routines, types, and aggregates, are then
created and bound to entry points (methods in the case of routines and
classes for types and aggregates) in the assembly that has already been
uploaded. This is accomplished using the CREATE PROCEDURE/FUNCTION/TRIGGER/TYPE/AGGREGATE statements.
- After
routines are created, they can be used like T-SQL routines by
applications. For example, CLR functions can be called from T-SQL
queries and CLR procedures can be called from a client application or
from a T-SQL batch as if they were T-SQL procedures.
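As a concrete sketch of steps 2 through 4, the following T-SQL registers a hypothetical assembly and binds a stored procedure to one of its static methods (the file path, assembly, class, and method names are illustrative only, not from this paper):
-- Step 2: upload the compiled assembly into the database
CREATE ASSEMBLY MyLib
FROM 'C:\assemblies\MyLib.dll'
WITH PERMISSION_SET = SAFE;
-- Step 3: bind a T-SQL procedure to a static method in the assembly
CREATE PROCEDURE dbo.MyProc
AS EXTERNAL NAME MyLib.[MyNamespace.MyClass].MyMethod;
-- Step 4: invoke it like any T-SQL procedure
EXEC dbo.MyProc;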
Building, Deploying, and Debugging Using Visual Studio
Visual Studio 2005 supports development, deployment, and debugging of managed code in SQL Server 2005. A new SQL Server Project
provides code templates that make it easy for developers to get started
writing code for CLR-based database routines, types and aggregates.
This project also allows developers to add references to other
assemblies in the database.
When a SQL Server Project is
built, it is compiled into an assembly. Deploying the project uploads
the assembly binary into the SQL Server database that is associated
with the project. The deploy operation also automatically creates the
routines, types, and aggregates defined in the assembly in the database
based on the custom attributes (SqlProcedure, SqlFunction, SqlTrigger, etc.) in the code. Deploying also uploads the source code and debugging symbols (the .pdb file) associated with the assembly.
As
debugging is a fundamental part of the developer experience for any
platform, SQL Server 2005 and Visual Studio 2005 provide database
programmers with such capabilities. A key feature in debugging SQL
Server 2005 objects is the ease of setup and use. Connections to the
computer running SQL Server may be debugged in much the same way as
processes running under a traditional operating system. Functionality
of the debugger is not affected by the type of connection to the server
that the client has: both Tabular Data Stream (TDS) and HTTP
connections can be debugged. In addition, debugging works seamlessly
across languages, allowing the user to step from T-SQL into CLR methods
and vice versa.
CLR and Its Alternatives
In
evaluating the use of CLR integration a developer needs to compare it
to the other available options. Here we aim to provide a base for that
comparison, positioning it against existing programming techniques:
Transact-SQL, extended stored procedures, and code in the middle tier.
In this section we will concentrate on user-defined routines.
CLR vs. Transact-SQL
Transact-SQL
(T-SQL) is the native programming language supported by SQL Server.
Like most versions of SQL, it contains data manipulation features and
data definition features. The data manipulation features can be broadly
categorized into two parts: a declarative query language (composed of SELECT/INSERT/UPDATE/DELETE statements) and a procedural language (WHILE,
assignment, triggers, cursors, etc.). Broadly speaking, CLR support in
SQL Server provides an alternative to the procedural portion of T-SQL.
Even
without CLR support, it is important to recognize that database
applications should use the declarative query language as much as
possible. This portion of the language is able to leverage the power of
the query processor, which is best able to optimize and perform bulk
operations. Database applications should only resort to procedural
programming to express logic that cannot be expressed within the query
language.
All of this remains true with CLR support in SQL
Server: the CLR should not be used to write procedural code that can be
expressed using the declarative features of the T-SQL language.
Developers should be aware that there are a number of significant
enhancements to the T-SQL query language in SQL Server 2005 that
augment the expressive power of the T-SQL query language, and should
ensure that they are taking full advantage of them before writing
procedural code, whether in the CLR or not. Some of these added
features include:
- The ability to write recursive queries to traverse recursive hierarchies in a table
- New analytical functions such as RANK and ROW_NUMBER that allow ranking rows in a result set
- New relational operators such as EXCEPT, INTERSECT, APPLY, PIVOT and UNPIVOT
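As a brief illustration of the first two items, the following sketch uses a recursive common table expression over a hypothetical t_employees(empId, mgrId) table to traverse a management hierarchy, and ranks the results with ROW_NUMBER:
WITH Reports (empId, mgrId, depth) AS
(
    SELECT empId, mgrId, 0 FROM t_employees WHERE mgrId IS NULL
    UNION ALL
    SELECT e.empId, e.mgrId, r.depth + 1
    FROM t_employees e JOIN Reports r ON e.mgrId = r.empId
)
SELECT empId, depth,
       ROW_NUMBER() OVER (ORDER BY depth, empId) AS rowNo
FROM Reports;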
Developers
should view the CLR as an efficient alternative for logic that cannot
be expressed declaratively in the query language.
Let us look at
some scenarios where CLR-based programming can complement the
expressive power of the T-SQL query language. Often, there is a need
for embedding procedural logic inside a query that can be called as a
function, for example to perform a complex computation or transformation
on a per-row basis.
SQL
Server 2000 introduced T-SQL functions (both scalar and table-valued)
that enable these scenarios. With SQL Server 2005, these functions can
be more easily written using the CLR languages, since developers can
take advantage of the much more extensive libraries in the .NET
Framework API. In addition, CLR programming languages provide rich data
structures (such as arrays, lists, etc.) that are lacking in T-SQL, and
can perform significantly better due to the different execution models
of the CLR and T-SQL.
Functions are, in general, good candidates
to be written using the CLR, since there is seldom a need to access the
database from within a function: values from the database are usually
passed as arguments. This then plays to the strength of the CLR, which
is better at computational tasks than T-SQL.
Data Access from the CLR
We
now look at the CLR as an option for writing routines that perform data
access, both from the perspective of the programming model and
performance.
In T-SQL, query language statements such as SELECT, INSERT, UPDATE, and DELETE
are simply embedded within procedural code. Managed code, on the other
hand, uses the ADO.NET data access provider for SQL Server (SqlClient). In
this approach, all query language statements are represented by dynamic
strings that are passed as arguments to methods and properties in the
ADO.NET API.
Because of this difference, data access code
written using the CLR can be more verbose than T-SQL. More importantly,
because the SQL statements are encoded in dynamic strings they are not
compiled or validated until they are executed, impacting both the
debugging of the code and its performance. However, the database
programming model with ADO.NET is very similar to that used in the
client or middle-tiers, which makes it easier both to move the code
between the tiers and to leverage existing skills.
Again, it is
important to note that both T-SQL- and CLR-based programming models use
the same SQL query language; only the procedural portions differ.
As
already mentioned, managed code has a decisive performance advantage
over T-SQL with respect to most procedural computation, but for
data-access T-SQL generally fares better. Therefore, a good general
rule is that computation- and logic-intensive code is a better choice
for implementation in the CLR than is data-access intensive code.
Let
us now look at some typical primitives and patterns in data-access
programming, comparing how T-SQL and managed programming using the
integrated CLR and ADO.NET perform in these scenarios.
Sending Results to the Client
Our
first scenario involves sending a set of rows to the client without
"consuming" them in the server, i.e., no navigation of the rows is done
inside the routine. With T-SQL, simply embedding a SELECT statement in the T-SQL procedure has the effect of sending rows produced by the SELECT to the client. With managed code, the SqlPipe object is used to send results to the client. T-SQL and ADO.NET perform approximately equally in this scenario.
Submitting SQL Statements
Submitting
SQL statements from the CLR involves traversing additional layers of
code to switch between managed and native SQL code. Because of this,
T-SQL has a performance advantage when issuing a SQL query. Note that
after the statement is submitted to the query processor, there is no
difference in performance based on the source of the statement (whether
in T-SQL or managed code). If the query is complex and takes a long
time to evaluate then the difference in performance between T-SQL and
managed code will be negligible. For short, simple queries, the
overhead of the additional code layers can impact the performance of a
managed procedure.
Typical data-access intensive stored
procedures are likely to involve submitting a sequence of SQL
statements. If the SQL statements are simple and do not take a
significant amount of time to execute, then the calling overhead from
managed code can dominate the execution time; such procedures will
perform better written in T-SQL.
Forward-Only, Read-Only Row Navigation
In T-SQL, forward-only, read-only navigation is implemented using a cursor. In CLR code it is implemented with a SqlDataReader.
Typically, there is some processing done for each piece of data.
Ignoring this, T-SQL has an advantage, since the navigation of rows
using the CLR is slightly slower than in T-SQL. However, since the CLR
will significantly outperform T-SQL on any processing done on the data,
the CLR performance will overtake that of T-SQL as the amount of
processing increases.
Additionally, one should be conscious of
the potential for additional latency when using T-SQL cursors. Although
some queries will necessarily exhibit some latency because they must
materialize intermediate results, STATIC or KEYSET
cursors will always materialize the final result set in a temporary
table before producing any results. A cursor can be either explicitly
declared as STATIC or KEYSET, or implicitly converted to one due to certain properties of the query and data. The CLR SqlDataReader will always produce results as they become available, avoiding this latency.
Row-Navigation with Updates
If
the problem involves updating rows based on the current position of the
cursor, then there is no relevant performance comparison: this
functionality is not supported through ADO.NET, and must be done
through T-SQL updateable cursors. Keep in mind, however, that it is
generally better to use UPDATE statements to update rows in bulk, saving cursor-based modifications for when they cannot be expressed with declarative SQL.
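For reference, here is a minimal sketch of this T-SQL-only pattern, a positioned update using an updateable cursor over a hypothetical table t1(col1 int):
DECLARE @v int;
DECLARE c CURSOR FOR
    SELECT col1 FROM dbo.t1 FOR UPDATE OF col1;
OPEN c;
FETCH NEXT FROM c INTO @v;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- update the row on which the cursor is currently positioned
    UPDATE dbo.t1 SET col1 = @v * 2 WHERE CURRENT OF c;
    FETCH NEXT FROM c INTO @v;
END
CLOSE c;
DEALLOCATE c;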
Summary
Here is a summary of the guidelines we have seen for choosing between the CLR and T-SQL:
- Use declarative T-SQL SELECT, INSERT, UPDATE, and DELETE
statements whenever possible. Procedural and row-based processing
should be used only when the logic is not expressible using the
declarative language.
- If the procedure is simply a wrapper for declarative T-SQL commands it should be written in T-SQL.
- If
the procedure primarily involves forward-only, read-only row navigation
through a result set with some processing of each row, using the CLR is
likely more efficient.
- If the procedure involves both
significant data access and computation, consider separating the
procedural code into a CLR portion that calls into a T-SQL procedure to
perform data access, or a T-SQL procedure that calls into the CLR to
perform computation. Another alternative is to use a single T-SQL batch
that includes a set of queries that are executed once from managed code
to reduce the number of round trips of submitting T-SQL statements from
managed code.
Later sections discuss more in-depth when and how to appropriately use T-SQL and CLR when working with result sets.
CLR vs. XPs
In
previous releases of SQL Server, extended stored procedures (XPs) were
the only alternative to T-SQL with which to write server-side code with
logic that was difficult to write in T-SQL. CLR integration provides a
more robust alternative to XPs. In addition, with CLR integration, many
stored procedures can be better expressed as table-valued functions,
allowing them to be invoked and manipulated using the query language.
Some of the benefits of using CLR procedures over XPs are:
- Granular control: SQL Server
administrators have little control over what XPs can or cannot do.
Using the Code Access Security model, a SQL Server administrator can
assign one of three permission buckets—SAFE, EXTERNAL_ACCESS, or UNSAFE—to exert varying degrees of control over the operations that managed code is allowed to perform.
- Reliability: Managed code, especially in the SAFE and EXTERNAL_ACCESS
permission sets, provides a more reliable programming model than XPs
do. Verifiable managed code ensures that all access to objects is
performed through strongly typed interfaces, reducing the likelihood
that the program accesses or corrupts memory buffers belonging to SQL
Server.
- Data access: With XPs, an explicit connection
back to the database—a loop-back connection—must be made in order to
access the local SQL Server database. In addition, this loop-back
connection must be explicitly bound to the transaction context of the
original session to ensure that the XP participates in the transaction
in which it is invoked. Managed CLR code can access local data using a
more natural and efficient programming model that takes advantage of
the current connection and transaction context.
- Additional Data Types: The managed APIs support new data types (such as XML, (n)varchar(max), and varbinary(max)) introduced in SQL Server 2005, while the ODS APIs have not been extended to support these new types.
- Scalability:
The managed APIs that expose resources such as memory, threads, and
synchronization are implemented on top of the SQL Server resource
manager, allowing SQL Server to manage these resources for CLR code.
Conversely, SQL Server has no view or control over the resource usage
of an XP. If an XP consumes too much CPU time or memory, there is no
way to detect or control this from within SQL Server. With CLR code,
SQL Server can detect that a given thread has not yielded for a long
period of time and force the task to yield so that other work can be
scheduled. Consequently, using managed code provides for better
scalability and robustness.
As mentioned above, for data
access and sending result sets to the client, CLR routines outperform
XPs. For code that does not involve data access or sending results,
comparing the performance of XPs and managed code is a matter of
comparing managed code with native code. In general, managed code
cannot beat the performance of native code in these scenarios.
Furthermore, there is an additional cost during transitions from
managed to native code when running inside SQL Server because SQL
Server needs to do additional book-keeping on thread-specific settings
when moving out to native code and back. Consequently, XPs can
significantly outperform managed code running inside SQL Server for
cases where there are frequent transitions between managed and native
code.
For most procedures, the benefits of managed code make
CLR procedures a more attractive alternative than XPs. For cases where
the performance is primarily determined by computationally-intensive
processing and frequent managed-native transitions, the benefits of the
CLR should be weighed against the raw performance advantage of XPs.
Code in the Middle Tier
Another
choice for developers is to place their logic outside the database.
This allows them to write their code in their choice of languages. By
providing a rich programming model in the database, CLR integration
offers developers the choice of moving such logic into the database. Of
course, this clearly does not mean that all—or even most—code should be
moved to the database.
Moving logic to the database tier can
reduce the amount of data flowing on the network, but puts an
additional load on the valuable CPU resources of the server. This
tradeoff should be considered carefully before making code placement
decisions for an application. The following considerations can favor
the database tier as the preferred code location:
- Data validation: Keeping the data
validation logic in the database offers better encapsulation of this
logic with the data, avoiding duplication of validation logic across
different data touch points, such as back-end processing, bulk upload,
data updates from the middle tier, etc.
- Network traffic reduction:
Placing logic in the database may be appropriate for data processing
tasks that involve processing a large amount of data while producing a
very small percentage of that data. Typical examples include data
analysis applications such as demand forecasting, scheduling production
based on forecast demands, etc.
Of course, these
considerations are important even without CLR integration; CLR
integration simply helps ensure that the choice of programming language
does not interfere with the right code location decision.
Example: Production Scheduling
Production
scheduling is a common task in manufacturing companies. At a high
level, it involves creating a plan for when to produce items in order
to satisfy demand, while minimizing the total cost of producing and
storing the items. Several algorithms exist to take demand forecasts,
inventory storage costs, and production line setup costs as input, and
produce a manufacturing strategy as output.
Assuming future
demand forecasts are stored in a table in SQL Server, an implementation
of such an algorithm has the following characteristics:
- It takes a large amount of data (demand forecasts) as input.
- It produces a small result (such as number of units to produce at a given date or dates).
- It requires considerable computation to derive the output from the inputs.
Implementing
such an algorithm in the middle tier is an option, but there will be a
large cost in shipping demand data out of the database. Implementing it
in T-SQL as a stored procedure is also feasible, but the lack of
complex data types will make implementation difficult, and the
performance of T-SQL will be an issue because of the quantity and
complexity of the computations required. Of course, performance
characteristics will vary depending on the actual volume of data and
complexity of the algorithm.
To verify the suitability of CLR
integration to such a scenario, we took a specific production
scheduling algorithm—a dynamic programming implementation of the
Wagner-Whitin algorithm—and implemented it using both the CLR and
T-SQL. As expected, CLR integration greatly outperformed T-SQL. The
implementation in C# was also more straightforward, since the algorithm
uses single- and multi-dimensional arrays not available in T-SQL.
Overall, the CLR version performed several orders of magnitude better
than a T-SQL implementation.
Let us assume the following simple database schema that keeps track of the list of products that can be produced.
Table t_products:
| Column name | Type | Description |
| --- | --- | --- |
| pid | int | Primary key ID of a product |
| pName | nvarchar(256) | Name of the product |
| inventoryCost | int | Cost of storing this product per period |
| startupCost | int | Cost of setting up a manufacturing line to begin fabrication of this product |
In addition, the following table stores the demand forecast information per product per day. We assume that the pid column is a foreign key into the table t_products, and that there is a uniqueness constraint on the pid, demandDate pair.
Table t_salesForecast:
| Column name | Type | Description |
| --- | --- | --- |
| pid | int | Product ID |
| demandDate | smalldatetime | Day for which demand is forecasted |
| demandQty | int | Demand forecast for given product |
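A possible DDL rendering of this schema, including the foreign key and the uniqueness constraint described above (the constraint name is illustrative), is:
CREATE TABLE t_products
(
    pid int PRIMARY KEY,
    pName nvarchar(256),
    inventoryCost int,
    startupCost int
);
CREATE TABLE t_salesForecast
(
    pid int REFERENCES t_products(pid),
    demandDate smalldatetime,
    demandQty int,
    CONSTRAINT u_pid_demandDate UNIQUE (pid, demandDate)
);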
We
created a stored procedure that takes, in addition to this product
data, parameters expressing the range of dates for which it should
create a production schedule.
This stored procedure returns a rowset with the schema in the following table.
| Column name | Type | Description |
| --- | --- | --- |
| product | nvarchar(256) | Name of product |
| period | datetime | Production day |
| quantity | int | Number of items to be manufactured |
The
C# version of the code is listed below to illustrate the kind of
scenario that can significantly benefit from CLR integration:
using System;
using System.Data;
using System.Collections;
using System.Data.SqlTypes;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;
public partial class ProductionScheduler
{
const int MAXNAME = 256; // The maximum name size
[Microsoft.SqlServer.Server.SqlProcedure] // Flag as a SQL procedure
public static void Schedule(SqlDateTime start, SqlDateTime end)
{
// Guarantee that we have a valid connection while we run
using (SqlConnection conn =
new SqlConnection("context connection=true"))
{
conn.Open(); // open the connection
SqlPipe pipe = SqlContext.Pipe; // get the pipe
// Find all the products in the database with any demand
// whatsoever along with data about their production costs.
// Make sure they are ordered.
ArrayList items = new ArrayList();
SqlCommand cmd = new SqlCommand(
" SELECT DISTINCT tp.pid, pname, startupCost,"
" inventoryCost" +
" FROM t_products tp" +
" JOIN t_salesForecast ts" +
" ON tp.pid = ts.pid" +
" ORDER BY pid",
conn);
SqlDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
items.Add(new Item(
reader.GetInt32(0), reader.GetSqlChars(1),
reader.GetInt32(2), reader.GetInt32(3)));
}
reader.Close();
// Now get all the production schedule information, ordered
// by PID and demand date
cmd = new SqlCommand(
"SELECT pid, demandDate, demandQty" +
" FROM t_salesForecast" +
" WHERE demandDate >= @start" +
" AND demandDate <= @end" +
" ORDER BY pid, demandDate",
conn);
cmd.Parameters.AddWithValue("@start", start);
cmd.Parameters.AddWithValue("@end", end);
reader = cmd.ExecuteReader();
// Read each section of schedule information into the items.
// Guard against an empty result set before handing the reader
// to the items.
if (reader.Read())
{
for (int i = 0; (i < items.Count) && (!reader.IsClosed); i++)
{
((Item)(items[i])).readData(reader);
}
}
// ensure the reader is closed
if (!reader.IsClosed) reader.Close();
foreach (Item item in items)
{
// Compute the schedule and report it
item.ComputeSchedule();
item.OutputSchedule(pipe);
}
}
}
class Item
{
// Information about the product we are scheduling. These will
// be pulled from the database.
private int pid;
private SqlChars name;
private int startCost;
private int holdCost;
// Store how many dates we have.
private int size = 0;
// The dates on which we have demand. These are guaranteed to
// be unique by the database, and we will load them in order.
private ArrayList dates = new ArrayList();
// Store what the demand was on each date.
private ArrayList quantities = new ArrayList();
// Our schedule, which we have not yet computed.
int[] schedule = null;
// Set up the metadata for the return
SqlMetaData[] metadata = new SqlMetaData[] {
new SqlMetaData("product", SqlDbType.NVarChar, MAXNAME),
new SqlMetaData("period", SqlDbType.DateTime),
new SqlMetaData("quantity", SqlDbType.Int)
};
public Item(int pid, SqlChars name, int startCost, int holdCost)
{
this.pid = pid;
this.name = name;
this.startCost = startCost;
this.holdCost = holdCost;
}
/*
* Read data from the stream until the PID does not match
* ours anymore. We assume the reader is cued up to our
* information and we leave it cued to the next item's
* information UNLESS there is no more information, in which
* case we close the reader to indicate as much.
*/
public void readData(SqlDataReader reader)
{
size = 0;
do
{
if (reader.GetInt32(0) == pid)
{
size++;
dates.Add(reader.GetDateTime(1));
quantities.Add(reader.GetInt32(2));
}
else
{
return;
}
}
while (reader.Read());
// reader ran out. close it.
reader.Close();
}
/*
* This method is called to compute the production schedule
* for the item. It does no I/O, but puts in motion the
* dynamic programming algorithm which produces the schedule.
*/
public void ComputeSchedule()
{
int[] days = ComputeProductionDays();
schedule = new int[size];
for (int i = 0; i < size; i++)
{
schedule[days[i]] += (Int32)(quantities[i]);
}
}
/*
* Pipe the schedule to the user, computing it if need be.
*/
public void OutputSchedule(SqlPipe pipe)
{
// Ensure that the schedule has been computed.
if (schedule == null)
{
ComputeSchedule();
}
// Make a record in which to store the data.
SqlDataRecord record = new SqlDataRecord(metadata);
record.SetSqlChars(0, name);
// Start up the output pipe.
pipe.SendResultsStart(record);
for (int i = 0; i < size; i++)
{
// Pipe each day of production. Omit zero production
// days.
if (schedule[i] != 0)
{
record.SetDateTime(1, (DateTime)(dates[i]));
record.SetInt32(2, schedule[i]);
pipe.SendResultsRow(record);
}
}
pipe.SendResultsEnd();
}
/*
* Compute the table and then walk it to find the best
* days to produce the item.
*/
private int[] ComputeProductionDays()
{
// We fill this in. It says when each day's quota is
// actually produced.
int[] productionDays = new int[size];
// First, compute the table.
int[][] table = ComputeTable();
// Then walk the table, creating a second table which encodes
// the best production days.
int[] optimal = new int[size + 1];
int[] optimalLoc = new int[size];
optimal[size] = 0;
for (int i = size - 1; i >= 0; i--)
{
int min = table[i][i] + optimal[i + 1];
int minloc = i;
for (int j = i+1; j < size; j++)
{
int temp = table[i][j] + optimal[j + 1];
if (temp < min)
{
min = temp;
minloc = j;
}
}
optimal[i] = min;
optimalLoc[i] = minloc;
}
// Finally, decode the optimal values into production days.
int pday = 0;
int until = optimalLoc[0] + 1;
for (int i = 0; i < size; i++)
{
if (until == i)
{
pday = i;
until = optimalLoc[i] + 1;
}
productionDays[i] = pday;
}
// We now have a list of the days on which we will produce the good.
return productionDays;
}
/*
* The main part of the dynamic programming solution. Each entry
* table[i,j] stores the cost of producing enough of the good on
* day i to meet needs through day j. This table is only half-
* filled when complete.
*/
private int[][] ComputeTable()
{
int[][] table = new int[size][];
for (int i = 0; i < size; i++) table[i] = new int[size];
for (int i = 0; i < size; i++)
{
// If we produce the good on the same day we ship it we
// incur a startup cost.
table[i][i] = startCost;
// For other days, we have the cost for the previous
// cell plus the cost of storing the good for this long.
for (int j = i + 1; j < size; j++)
{
table[i][j] = table[i][j - 1] +
(((int)quantities[j]) * holdCost *
diff((DateTime)(dates[i]), (DateTime)(dates[j])));
}
}
return table;
}
/*
* A utility to compute the difference between two days.
*/
private int diff(DateTime start, DateTime end)
{
TimeSpan diff = end.Subtract(start);
return diff.Days;
}
}
}
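Assuming the assembly has been deployed and the Schedule method registered as a stored procedure named dbo.Schedule, a production schedule could be requested as follows (the date range is illustrative):
DECLARE @start datetime, @end datetime;
SET @start = '20050101';
SET @end = '20050331';
EXEC dbo.Schedule @start, @end;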
Solving Common Database Programming Tasks and Problems
The
previous section positioned CLR based programming at a high level in
relation to T-SQL, extended stored procedures (XPs), and code in the
middle-tier. In this section, we look at a set of programming tasks and
patterns that a database application developer might encounter, and
discuss how (or how not) to use CLR integration to solve them. We
provide several code examples, both in C# and Visual Basic .NET.
Data Validation using the .NET Framework Library
CLR
integration in SQL Server 2005 allows users to leverage the rich
functionality provided by the .NET Framework class libraries to solve
their database programming problems.
An example of this is the
use of regular expressions to provide more complete textual
pattern-matching than what is available using the LIKE operator in T-SQL. Consider the following code, which is nothing more than a simple wrapper around the Regex class in the System.Text.RegularExpressions namespace.
In Visual Basic .NET:
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports System.Text.RegularExpressions
Partial Public Class Validation
<SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
Public Shared Function RegExMatch(ByVal pattern As String, _
ByVal matchString As String) As Boolean
Dim r1 As Regex = New Regex(pattern.TrimEnd(Nothing))
Return r1.Match(matchString.TrimEnd(Nothing)).Success
End Function
<SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
Public Shared Function ExtractAreaCode(ByVal matchString As String) _
As SqlString
Dim r1 As Regex = New Regex("\((?<ac>[1-9][0-9][0-9])\)")
Dim m As Match = r1.Match(matchString)
If m.Success Then
Return m.Value.Substring(1, 3)
Else
Return SqlString.Null
End If
End Function
End Class
In C#:
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
public partial class Validation
{
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static bool RegExMatch(string pattern, string matchString)
{
Regex r1 = new Regex(pattern.TrimEnd(null));
return r1.Match(matchString.TrimEnd(null)).Success;
}
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlString ExtractAreaCode(string matchString)
{
Regex r1 = new Regex("\\((?<ac>[1-9][0-9][0-9])\\)");
Match m = r1.Match(matchString);
if (m.Success)
return m.Value.Substring(1, 3);
else return SqlString.Null;
}
}
Let us assume that the RegExMatch() and ExtractAreaCode() methods have been registered as user-defined functions in the database with the RETURNS NULL ON NULL INPUT
option, so that the function returns null when any of its inputs are
null. This allows us to avoid writing any special null handling code
inside the function.
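A registration sketch for RegExMatch, assuming the assembly has been cataloged in the database under the hypothetical name RegExLib, might look like this:
CREATE FUNCTION dbo.RegExMatch
(@pattern nvarchar(4000), @matchString nvarchar(4000))
RETURNS bit
WITH RETURNS NULL ON NULL INPUT
AS EXTERNAL NAME RegExLib.Validation.RegExMatch;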
We can now define constraints on columns of
a table that use the above code to validate e-mail addresses and phone
numbers as follows:
CREATE TABLE contacts
(
firstName nvarchar(30),
lastName nvarchar(30),
emailAddress nvarchar(30) CHECK
(dbo.RegExMatch('[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu)',
emailAddress) = 1),
usPhoneNo nvarchar(30) CHECK
(dbo.RegExMatch(
'\([1-9][0-9][0-9]\) [0-9][0-9][0-9]\-[0-9][0-9][0-9][0-9]',
usPhoneNo)=1),
areaCode AS dbo.ExtractAreaCode(usPhoneNo) PERSISTED
)
Note that the areaCode column is a persisted computed column that extracts the area code from the usPhoneNo column using the ExtractAreaCode() function. The areaCode column could be indexed, facilitating queries against the table that look for contacts by area code.
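Because the computed column is persisted and the underlying function is registered as deterministic and precise, such an index is straightforward; a minimal sketch (the index name and area code value are illustrative):
CREATE INDEX ix_contacts_areaCode ON contacts(areaCode);
SELECT firstName, lastName, usPhoneNo
FROM contacts
WHERE areaCode = N'425';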
More
generally, this example demonstrates how one can leverage the .NET
Framework libraries to augment the T-SQL built-in function library with
useful functions that are hard to express in T-SQL.
Producing Result Sets
The
need to produce result sets from a database object (such as a stored
procedure or a view) running inside the server is one of the most
common database programming tasks. If the result set can be built using
a single query then this can be achieved by simply using a view or an
inline table-valued function. However, if there is a need for multiple
statements and procedural logic to build the result set then there are
two options: stored procedures and table-valued functions. While SQL
Server 2000 has table-valued functions, they can only be written in
T-SQL. With CLR integration in SQL Server 2005 these functions can also
be written using a managed language. In this section we look at how to
write stored procedures and table-valued functions using the CLR.
From
T-SQL, it is possible to return relational results either as the return
value of a table-valued function or through the ever-present implicit
"caller's pipe" within a stored procedure: from anywhere within a
stored procedure—regardless of the nesting of execution levels—an executed SELECT statement will return results to the caller. More precisely, this is true of SELECT statements that do not perform variable assignment. FETCH, READTEXT, PRINT, and RAISERROR statements also implicitly return results to the caller.
Note that "the caller" hasn't been properly defined; it will depend on the invocation context of a stored procedure.
If
a stored procedure is invoked from any of the client data access APIs
(such as ODBC, OLEDB, or SQLClient), the caller is the actual API and
whichever abstraction it provides to represent results (e.g., hstmt, IRowset, or SqlDataReader).
This means that, in general, results produced from within a stored
procedure will be returned all the way back to the invoking API,
bypassing all T-SQL frames on the stack, as in the following example:
CREATE PROC proc1 AS
SELECT col1 FROM dbo.table1;
CREATE PROC proc2 AS
EXEC proc1;
Upon execution of procedure proc2, the results produced by proc1 will go to the caller of proc2. There is only one way in which proc2 can capture the results produced: by streaming the results to disk using INSERT...EXEC, targeting either a permanent or temporary table, or a table variable.
CREATE PROC proc2 AS
DECLARE @t TABLE(col1 INT);
INSERT @t (col1) EXEC proc1;
-- do something with results
In the case of INSERT...EXEC, the caller is the target table or view of the INSERT statement.
SQL
Server 2005 CLR stored procedures introduce a new type of caller. When
a query is executed using the ADO.NET provider from within a managed
frame, the results are made available through the SqlDataReader object and can be consumed within the stored procedure.
In Visual Basic .NET:
...
Using conn As New SqlConnection("context connection = true")
conn.Open()
Dim cmd As SqlCommand = new SqlCommand( _
"SELECT col1 FROM dbo.table1", conn)
Dim reader As SqlDataReader = cmd.ExecuteReader()
Do While reader.Read()
' Do something with current row
Loop
End Using
...
In C#:
...
using (SqlConnection conn= new SqlConnection("contect connection = true"))
{
...
SqlCommand cmd = new SqlCommand(
"SELECT col1 FROM dbo.table1", conn);
SqlDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
// do something with current row
}
...
}
...
The remaining issue is how a managed routine returns its
results to the caller. This is done differently in CLR stored
procedures and table-valued functions. A stored procedure uses a static
instance of SqlPipe available from the SqlContext
class to send data back, while a table-valued function implements an
interface which allows SQL Server to retrieve the results. Both of
these are discussed next.
CLR Stored Procedures and SqlPipe
Of the methods available in the SqlPipe class, the simplest is ExecuteAndSend(),
which takes a command object as an argument. This method executes the
command, but instead of making the results of the execution available
to the managed frame, the results are sent to the invoker of the stored
procedure. This is semantically equivalent to embedding a statement
inside a T-SQL stored procedure, and, while clumsier, it is on par with
the T-SQL equivalent in terms of performance.
Consider a simple stored procedure that executes a SELECT in T-SQL:
CREATE PROC proc1 AS
SELECT col1 FROM dbo.table1;
The equivalent in C# would be:
...
[Microsoft.SqlServer.Server.SqlProcedure]
public static void proc1()
{
using (SqlConnection conn =
new SqlConnection("context connection = true"))
{
conn.Open();
SqlCommand cmd = new SqlCommand(
"SELECT col1 FROM dbo.table1", conn);
SqlContext.Pipe.ExecuteAndSend(cmd);
conn.Close();
}
}
...
And in Visual Basic .NET:
...
<Microsoft.SqlServer.Server.SqlProcedure()> _
Public Shared Sub VBproc1()
Using conn As New SqlConnection("context connection=true")
conn.Open()
Dim cmd As SqlCommand = new SqlCommand( _
"SELECT col1 FROM dbo.table1", conn)
SqlContext.Pipe.ExecuteAndSend(cmd)
conn.Close()
End Using
End Sub
...
SqlPipe.ExecuteAndSend() works well for scenarios
where the data to be returned is produced directly by a query being
executed. However, there may be cases in which it is desirable either
to manipulate the data before sending it, or to send data which was
obtained from sources outside the local SQL Server instance.
SqlPipe provides a group of methods that work together to enable applications to return arbitrary results to the caller: SendResultsStart(), SendResultsRow(), and SendResultsEnd(). For the most part, these APIs are similar to the srv_describe and srv_sendrow APIs available for extended stored procedures.
SendResultsStart() takes a SqlDataRecord
as an argument and indicates the beginning of a new result set. This
API reads the metadata information from the record object and sends it
to the caller. Rows can subsequently be returned by invoking SendResultsRow() once for each row that is to be sent. After all of the desired rows have been sent, a call to SendResultsEnd() is required to indicate the end of the result set.
Example: Returning a RSS Feed in a CLR Stored Procedure
The
following C# code fragment represents a portion of a stored procedure
that reads an XML document—a Really Simple Syndication (RSS) feed from
MSDN—from the web, uses System.Xml classes to parse it, and returns the information in relational form. Note that the code must be deployed in an EXTERNAL_ACCESS or UNSAFE
assembly because the Code Access Security (CAS) permissions that are
required to access the Internet are available only in these permission
sets.
...
using (SqlConnection conn =
new SqlConnection("context connection = true"))
{
// Retrieve the RSS feed
XPathDocument doc = new XPathDocument(
"http://msdn.microsoft.com/sql/rss.xml");
XPathNavigator nav = doc.CreateNavigator();
XPathNodeIterator i = nav.Select("//item");
// create metadata for four columns
// three of them are string types and one of them is a datetime
SqlMetaData[] rss_results = new SqlMetaData[4];
rss_results[0] = new SqlMetaData("Title", SqlDbType.NVarChar, 250);
rss_results[1] = new SqlMetaData("Publication Date",
SqlDbType.DateTime);
rss_results[2] = new SqlMetaData("Description",
SqlDbType.NVarChar, 2000);
rss_results[3] = new SqlMetaData("Link", SqlDbType.NVarChar, 1000);
// construct the record which holds metadata and data buffers
SqlDataRecord record = new SqlDataRecord(rss_results);
// cache the SqlPipe instance to avoid repeated access to
// SqlContext.Pipe
SqlPipe sqlpipe = SqlContext.Pipe;
// send the metadata, do not send the values in the data record
sqlpipe.SendResultsStart(record);
// for each xml node returned, extract four pieces
// of information and send back each item as a row
while (i.MoveNext())
{
record.SetString(0, (string)
i.Current.Evaluate("string(title[1]/text())"));
record.SetDateTime(1, DateTime.Parse((string)
i.Current.Evaluate("string(pubDate[1]/text())")));
record.SetString(2, (string)
i.Current.Evaluate("string(description[1]/text())"));
record.SetString(3, (string)
i.Current.Evaluate("string(link[1]/text())"));
sqlpipe.SendResultsRow(record);
}
// signal end of results
sqlpipe.SendResultsEnd();
}
...
Note that between calls to SendResultsStart() and SendResultsEnd(), the SqlPipe is set to a busy state, and invoking any Send method other than SendResultsRow() will result in an error. The SendingResults property is set to true while the SqlPipe is in this busy state.
Table-Valued Functions
CLR
integration also enables support for table-valued functions (TVFs)
written in managed languages. Similar to the T-SQL equivalent, CLR TVFs
are primarily used to return tabular results. The most notable
difference is that T-SQL table-valued functions temporarily store
results in a work table, whereas CLR TVFs are capable of streaming the
results produced, meaning that results do not have to be fully
materialized before returning from the function.
Note that
although T-SQL also has the notion of inline TVFs which do not
temporarily store results, inline T-SQL TVFs are for the most part
syntactic sugar to specify sub-queries, possibly with parameters.
Managed TVFs return a standard IEnumerable interface. This interface provides an IEnumerator of objects representing the rows of the table, which the query processor retrieves one at a time until the enumerator's MoveNext() method returns false.
These
objects are opaque to SQL Server, and must be cracked by another
function. The method for filling the row and the schema of the table
are defined in the annotation for the TVF. This method takes an object
as input and returns the fields of the row using reference parameters,
implying that the signature of this method is not fixed: its signature
must match the schema of the table.
Example: A Table-Valued Function to Retrieve a RSS Feed
Here we present the RSS retrieval recast as a table-valued function in C#. Note the correspondence between the SqlFunction annotation on the RSS_TVF() method and the signature of the FillTVFRow() method.
using System;
using System.Data;
using System.Data.Sql;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Xml.XPath;
using System.Collections;
public partial class UserDefinedFunctions
{
[SqlFunction(FillRowMethodName = "FillTVFRow",
TableDefinition = "Title nvarchar(250), " +
"PublicationDate datetime, " +
"Description nvarchar(2000), " +
"Link nvarchar(1000)")
]
public static IEnumerable RSS_TVF()
{
// RssReader requires the feed URL as a constructor argument
return new RssReader("http://msdn.microsoft.com/sql/rss.xml");
}
public static void FillTVFRow(object row, out SqlString str,
out SqlDateTime date, out SqlString desc, out SqlString link)
{
// crack the row (an object array) into its four fields
object[] rowarr = (object[])row;
str = (SqlString)(rowarr[0]);
date = (SqlDateTime)(rowarr[1]);
desc = (SqlString)(rowarr[2]);
link = (SqlString)(rowarr[3]);
}
}
public class RssReader : IEnumerable
{
XPathDocument doc;
XPathNavigator nav;
// Construct helper class, initializing metadata for the results
// reading from the RSS feed, creating the iterator
public RssReader(string site)
{
// Retrieve the RSS feed
//doc = new XPathDocument("http://msdn.microsoft.com/sql/rss.xml");
doc = new XPathDocument(site);
nav = doc.CreateNavigator();
}
public IEnumerator GetEnumerator()
{
return new RSSEnumerator(this);
}
private class RSSEnumerator : IEnumerator
{
XPathNodeIterator i;
Object[] current;
RssReader reader;
public RSSEnumerator(RssReader reader)
{
this.reader = reader;
Reset();
}
public void Reset()
{
i = reader.nav.Select("//item");
}
public bool MoveNext()
{
if (i.MoveNext())
{
current = new Object[4];
current[0] = new SqlString((string)
i.Current.Evaluate("string(title[1]/text())"));
current[1] = new SqlDateTime(DateTime.Parse((string)
i.Current.Evaluate("string(pubDate[1]/text())")));
current[2] = new SqlString((string)
i.Current.Evaluate("string(description[1]/text())"));
current[3] = new SqlString((string)
i.Current.Evaluate("string(link[1]/text())"));
return true;
}
else return false;
}
public Object Current
{
get
{
return current;
}
}
}
}
A simple query to consume the results from this table-valued function would look like the following:
SELECT *
FROM dbo.RSS_TVF()
Naturally, richer queries can be expressed over the TVF form of this data. Assuming we have a CanonicalURL() function that returns the canonical version of a URL, the data from the RSS feed can easily be returned using canonical URLs:
SELECT title, publicationDate, description, dbo.CanonicalURL(link)
FROM dbo.RSS_TVF()
ORDER BY publicationDate
Note that in this example, we are not leveraging the
streaming capabilities of the TVF because we are consuming the whole
RSS feed, building a navigator on top, and then iterating over the
individual items as calls to MoveNext() are made. However, it
is possible to consume the results from a Web source using a streaming
API, and iterate over the produced XML with an XmlReader. It
is important to note that given the execution model difference between
table-valued functions in the CLR and those in T-SQL, a major
performance difference may be observed in favor of CLR TVFs, especially
for scenarios in which it is possible to stream the results.
Example: Cracking Scalars into Rows
There
is often a need to pass multi-valued arguments in an application. For
example, an order processing system may need a stored procedure that
inserts an order into a table of orders. A desired argument of the
stored procedure is likely to be the line-items in the order; however,
this runs into the limitation that T-SQL does not support table-valued
arguments, and lacks collections and arrays.
One way around this is to encode the collection as a scalar value—as an nvarchar or xml,
for example—and pass it as an argument to the stored procedure. The
stored procedure can use a table-valued function that takes the scalar
input and converts it into a set of rows, which can then be inserted
into a table of line-items or otherwise manipulated.
While the
table-valued function can be written in T-SQL, it will have better
performance if written in the CLR. Because it can take advantage of the
string manipulation functions in the .NET Framework (such as String.Split), it is also exceedingly simple to implement.
The
following shows the implementation of a table-valued function that
takes a semicolon-separated input string, and returns the pieces in the
form of a set of rows.
In Visual Basic .NET:
Imports System.Collections
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Partial Public Class UserDefinedFunctions
' This needs to return an IEnumerable, but since an array
' does, in this case we do not need to define a new class
' that implements it: we can simply return the array.
<SqlFunction(FillRowMethodName:="FillRow", _
TableDefinition:="value nvarchar(60)")> _
Public Shared Function GetStrings(ByVal str As SqlString) _
As IEnumerable
Return str.Value.Split(";"c)
End Function
' This method does the decoding of the row object. Since the
' row is already a string, this method is trivial. Note that
' this method is pointed to by the annotation on the
' GetStrings method.
Public Shared Sub FillRow(ByVal row As Object, _
ByRef str As String)
str = CType(row, String)
End Sub
End Class
In C#:
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
/*
* This needs to return an IEnumerable, but since an array
* does, in this case we do not need to define a new class
* that implements it: we can simply return the array.
*/
[SqlFunction(FillRowMethodName="FillRow",
TableDefinition="value nvarchar(60)")]
public static IEnumerable GetStrings(SqlString str)
{
return str.Value.Split(';');
}
/*
* This method does the decoding of the row object. Since the
* row is already a string, this method is trivial. Note that
* this method is pointed to by the annotation on the
* GetStrings method.
*/
public static void FillRow(object row, out string str)
{
str = (string)row;
}
}
If we assume that the GetStrings() method is
registered as a TVF of the same name, then the following is a fragment
of T-SQL of a stored procedure that uses this TVF to extract the line
items from an order in tabular form.
CREATE PROCEDURE Insert_Order @cust_id int, @lineitems nvarchar(8000)
AS
BEGIN
...
INSERT LineItems
SELECT * FROM dbo.GetStrings(@lineitems)
...
END
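For completeness, a registration sketch for the TVF itself, assuming the assembly has been cataloged under the hypothetical name StringUtils:
CREATE FUNCTION dbo.GetStrings(@str nvarchar(4000))
RETURNS TABLE (value nvarchar(60))
AS EXTERNAL NAME StringUtils.UserDefinedFunctions.GetStrings;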
Which One to Use?
The decision whether to use a stored procedure in conjunction with the SqlPipe—either
implicitly in T-SQL or explicitly in a CLR routine—or a table-valued
function depends on several factors that must be taken into
consideration, including composability requirements, the source of the
data, the need for side-effects, and the typing requirements for the
results. We discuss each of these in turn.
Composability Requirements
It
may be desirable to reuse or further manipulate results produced inside a
TVF or a stored procedure. Table-valued functions are more versatile
from a composability perspective, as the return type of a TVF is a
relational rowset that can be used in any place where such a construct
is allowed. In particular, it can be used in the FROM clause of SELECT statements, and as such the results produced can benefit from the composability of SELECT in sub-queries, INSERT...SELECT statements, derived tables, table expressions, etc.
On the other hand, from within the T-SQL language, stored procedures can be composed only as part of the INSERT...EXEC combination that allows produced results to be stored in a permanent or temporary table. The INSERT operation represents an actual copy of the data, which will likely have a performance impact.
If
composability and reuse of results from within the server is a
requirement, TVFs are a better alternative. If the results produced
need to only be streamed back to the client- or middle-tier, either
approach is reasonable.
Source of the Data
The
source of the data being returned is another important factor in
deciding between T-SQL- and CLR-based implementations. Results can be
produced either by reading from some source in the local instance using
the ADO.NET provider, or from a source external to SQL Server. For
external sources, a CLR based implementation is a better choice than
T-SQL because of the ease with which the logic accessing the external
data can be implemented.
In the case of generating results based
on a query executed on the local instance using the ADO.NET provider, a
stored procedure would generally execute a query, iterate through the
result, and perform some operation on the rows before sending them back
through a SqlPipe.
With a TVF, one would expect to wrap a SqlDataReader in a custom implementation of IEnumerable
that transformed the results as they were read. However, SQL Server
2005 does not allow requests to be pending by the time a table-valued
function returns: any query executed through the ADO.NET provider must
be fully executed and results completely consumed before the function
body can return. An error is raised if the return statement is executed
while SqlDataReader operations from the ADO.NET provider are
pending. This implies that in most cases where data is being returned
from the local database instance it cannot be streamed through a CLR
TVF. If other factors, such as composability, require this to be
written as a TVF, writing it in T-SQL may be the only option.
Otherwise, a managed stored procedure that returns results through SqlPipe can be used.
For
the case in which results are to be produced from within a stored
procedure based on data from the local instance, use of the SendResults
APIs makes sense only in cases where the results require some
procedural modification or manipulation. If the results are to be sent
to the caller unmodified, SqlPipe.ExecuteAndSend() should be used, as it has better performance.
Operations with Side-Effects
In
general, operations which produce side-effects—operations which change
the state of the database, such as DML statements or transaction
operations—are disallowed from user defined functions, including
table-valued functions. These operations may be desired, however. For
example, one may wish to set a SAVEPOINT transaction, perform some operation, and roll back to the SAVEPOINT in the case of an error.
Given
that side-effects are disallowed from user-defined functions, such a
scenario could only be implemented through a stored procedure, and
results would have to be returned through SqlPipe. Note, however, that operations with side-effects are not allowed to execute through the ADO.NET provider while the SqlPipe
is busy sending results. These operations are allowed only before the
result set has been started or after it has been completed.
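In T-SQL terms, the savepoint pattern described above looks like the following sketch (the savepoint name is illustrative); because it changes transaction state, it can live only in a stored procedure, not in a function:
BEGIN TRANSACTION;
SAVE TRANSACTION beforeWork;
-- perform some operation here
IF @@ERROR <> 0
    ROLLBACK TRANSACTION beforeWork;
COMMIT TRANSACTION;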
Typing of and Number of Results
The description of results produced by a CLR stored procedure through SqlPipe
differs from that of a CLR TVF, consistent with their counterparts in
T-SQL. A TVF is strongly typed, and as part of the registration (CREATE FUNCTION) statement, it must statically define the type of its return value.
On
the other hand, a stored procedure declaration makes no statement about
results produced—or even whether it produces them. This may seem a
convenience, and though it certainly provides greater flexibility,
greater care must be taken in writing applications that execute stored
procedures, which can dynamically redefine the shape of their results.
However, if the schema for the results needs to be variable across
invocations, a stored procedure should be used, since only SqlPipe provides this flexibility.
In fact, the weak typing of results produced through SqlPipe
inside stored procedures goes beyond the schema of a single result set
and includes the possibility of returning a variable number of result sets.
Both the types of and number of result sets can be determined
dynamically by the stored procedure.
Summary
The
following table summarizes the guidelines on how to choose whether a
particular application should be written in T-SQL or the CLR, and
whether a stored procedure or table-valued function should be used.
| Condition | Yes | No |
| --- | --- | --- |
| Composability required? | TVF | Procedure or TVF |
| External data source (vs. accessing only local data)? | CLR TVF or CLR procedure | T-SQL TVF or procedure |
| Side effects required? | Procedure | Procedure or TVF |
| Fixed results schema? | Procedure or TVF | Procedure |
| More than one result set? | Procedure | Procedure or TVF |
| Ability to stream results? | CLR TVF | T-SQL TVF |
For most of this section, sending results through the SqlPipe has been tightly associated with stored procedures. Even though SqlPipe
and the possibility to return results are also available in the body of CLR
triggers, this practice is highly discouraged, as it can lead to
unexpected results for those issuing Data Manipulation Language or Data
Definition Language statements against objects that have triggers defined
on them.
Performing Custom Aggregations over Data
There
are a number of scenarios where aggregation may need to be performed on
data, including performing statistical calculations, such as finding
averages, standard deviations, etc. If the desired aggregation function
is not built in, there are several ways to add the functionality in SQL
Server 2005:
- Write the aggregation as a user-defined aggregate (UDA).
- Write the aggregate using a CLR stored procedure.
- Use a server-side cursor in T-SQL.
Let
us examine the three alternatives in the context of a simple
aggregation function which calculates the product of a given set of
values.
Example: Product Implemented as a User-Defined Aggregate Function
Here
is the code for this task written as a user-defined aggregate. The
logic for calculating the running product is in the Accumulate() method. The Merge() method defines what happens when two partial aggregates are merged.
In Visual Basic .NET:
Imports System
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
<Serializable()> _
<SqlUserDefinedAggregate(Format.Native)> _
Public Structure ProductAgg
    Private product As SqlInt32

    Public Sub Init()
        product = 1
    End Sub

    Public Sub Accumulate(ByVal value As SqlInt32)
        product = product * value
    End Sub

    Public Sub Merge(ByVal group As ProductAgg)
        product = product * group.product
    End Sub

    Public Function Terminate() As SqlInt32
        Return product
    End Function
End Structure
In C#:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
[Serializable]
[SqlUserDefinedAggregate(Format.Native)]
public struct ProductAgg
{
    private SqlInt32 product;

    public void Init()
    {
        product = 1;
    }

    public void Accumulate(SqlInt32 value)
    {
        product = product * value;
    }

    public void Merge(ProductAgg group)
    {
        product = product * group.product;
    }

    public SqlInt32 Terminate()
    {
        return product;
    }
}
After this type is built and registered with SQL Server, it can be used just like a built-in aggregate from T-SQL:
SELECT dbo.ProductAgg(intcol)
FROM tbl
GROUP BY col
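The registration step, assuming the compiled assembly has been cataloged under the name myassembly (an assumed name; any assembly name will do), might look like the following sketch:
CREATE AGGREGATE ProductAgg (@value int)
RETURNS int
EXTERNAL NAME [myassembly].[ProductAgg]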
Example: Product as a Managed Stored Procedure
A
stored procedure can be created that iterates over the data to perform
the computation. This iteration is accomplished using the SqlDataReader class as shown below.
In Visual Basic .NET:
Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.Data.SqlClient
Partial Public Class StoredProcedures
    <SqlProcedure()> _
    Public Shared Sub VBProductProc(ByRef value As SqlInt32)
        ' The empty product is 1
        value = 1
        Using conn As New SqlConnection("context connection = true")
            conn.Open()
            Dim cmd As SqlCommand = New SqlCommand()
            cmd.Connection = conn
            cmd.CommandText = "SELECT intcolumn FROM tbl"
            Dim r As SqlDataReader = cmd.ExecuteReader()
            Using r
                Do While r.Read()
                    value = value * r.GetSqlInt32(0)
                Loop
            End Using
            conn.Close()
        End Using
    End Sub
End Class
In C#:
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Data.SqlClient;
public partial class StoredProcedures
{
    [SqlProcedure]
    public static void ProductProc(out SqlInt32 value)
    {
        // Ensure that we write to value.
        // Empty product is 1.
        value = 1;
        using (SqlConnection conn =
            new SqlConnection("context connection = true"))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand();
            cmd.Connection = conn;
            cmd.CommandText = "SELECT intcolumn FROM tbl";
            SqlDataReader r = cmd.ExecuteReader();
            using (r)
            {
                while (r.Read()) // advance to the next row
                {
                    value *= r.GetSqlInt32(0);
                }
            }
        }
    }
}
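Before the procedure can be called it must be registered; a sketch, again assuming the assembly name myassembly, is:
CREATE PROCEDURE ProductProc (@value int OUTPUT)
AS EXTERNAL NAME [myassembly].[StoredProcedures].[ProductProc]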
This can then be invoked using the T-SQL EXEC statement:
DECLARE @p int
EXEC ProductProc @p OUTPUT
Example: Product as a T-SQL Stored Procedure that Uses a Cursor
Finally,
a T-SQL stored procedure can be created that executes a query and
performs the calculation using a T-SQL cursor to iterate over the data.
T-SQL:
create procedure TSQL_ProductProc (@product int output)
as
begin
    declare @value int
    declare c insensitive cursor for select intcolumn from tbl
    set @product = 1
    open c
    fetch next from c into @value
    while @@FETCH_STATUS = 0
    begin
        set @product = @product * @value
        fetch next from c into @value
    end
    close c
    deallocate c
end
go
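This version is invoked the same way as the CLR procedure:
DECLARE @p int
EXEC TSQL_ProductProc @p OUTPUT
SELECT @p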
Summary
The decision of whether to use
a UDA or one of the other solutions to produce results depends on
several factors, including the composability requirements, specifics of
the aggregation algorithm, and the need for side-effects.
A UDA
is actually a standalone object that can be used from any T-SQL query,
generally in the same places that a system aggregate can be used. There
are no assumptions about the query that it operates on. For example, it
can be included in view definitions (although not in indexed views) and
in scalar sub-queries.
UDAs may be evaluated before an ORDER BY
clause of a query, so there are no guarantees about the order in which
values are presented to the aggregation function. Therefore, if the
aggregation algorithm must consume values in a particular order, a UDA
cannot be used. Similarly, a UDA consumes the values from an entire
group and returns a single value. If this does not fit with the problem
then another technique must be used.
Also, a UDA can perform no data access and can have no side effects; if either of these is necessary, then a stored procedure should be used.
Although UDAs have a
number of restrictions, they are likely to provide the best performance
of the options presented, so aggregation should generally be performed
through a UDA unless other requirements prevent it.
User Defined Types
Now
we come to one of the more powerful, but often misunderstood, features
of SQL Server 2005. With user-defined types (UDTs), one can extend the
scalar type system of the database. This goes beyond just defining an
alias for a system type, which has been available in previous releases
of SQL Server. Defining a UDT is as simple as writing a class in
managed code, creating an assembly, and then registering the type in
SQL Server by using the CREATE TYPE statement. The following code skeleton illustrates the contract that a UDT must satisfy:
In Visual Basic .NET:
<Serializable()> _
<Microsoft.SqlServer.Server.SqlUserDefinedType(Format.Native)> _
Public Structure SimpleUdt
    Implements INullable

    Public Overrides Function ToString() As String
        ...
    End Function

    Public ReadOnly Property IsNull() As Boolean _
        Implements INullable.IsNull
        ...
    End Property

    Public Shared ReadOnly Property Null() As SimpleUdt
        ...
    End Property

    Public Shared Function Parse(ByVal s As SqlString) As SimpleUdt
        ...
    End Function
End Structure
In C#:
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.Native)]
public struct SimpleUdt : INullable
{
    public override string ToString() { ... }

    public bool IsNull
    {
        get { ... }
    }

    public static SimpleUdt Null
    {
        get { ... }
    }

    public static SimpleUdt Parse(SqlString s) { ... }
}
This can then be loaded and used in T-SQL:
CREATE TYPE simpleudt EXTERNAL NAME [myassembly].[SimpleUdt]
CREATE TABLE t (mycolumn simpleudt)
When to Create a UDT
UDTs in SQL Server 2005 are not an object-relational extensibility mechanism; they are a way to extend the scalar type system of the database. The scalar type system includes the columnar types that ship with SQL Server (for example, int, nvarchar, and uniqueidentifier). With UDTs, a new type can be defined that can be used in place of a built-in scalar type. Create a UDT if the type is an atomic value that is appropriately modeled as a column.
Good
candidates for implementation as a UDT include custom date or time data
types in various calendars, and currency data types. A UDT is a single
class that exposes all behaviors available on the type and encapsulates
the underlying data stored by the type: all data access uses the
programmatic interface of the UDT. Often one can leverage existing features of the .NET Framework, such as its internationalization or calendar support, to provide behavior that would be hard to implement otherwise.
When Not to Create a UDT
A
UDT should not be used to model complex business objects, such as
employees, contacts, or customers. A UDT is treated as a unit by SQL
Server, to which it is opaque. Some issues with complex UDTs include
the 8KB size limitation on types, indexing limitations, and the fact
that the entire value must be updated when any value in the UDT is
updated.
Even in cases where the type is relatively simple, such as a point or color class, it is usually better to use a mid-tier object-relational mapping technology rather than a UDT. UDTs should generally be reserved for cases where the data is truly atomic.
Factors to Consider When Designing a UDT
Since UDTs are columnar, indexes can be defined over entire UDT values, as can constraints such as uniqueness. UDTs can also be used in comparison and ordering scenarios.
Comparing UDT values is accomplished by comparing the underlying binary representation of the type. If Format.Native is used as the persistence mechanism, the persisted form is created using the same field order as declared in the type, so care should be taken to lay the fields out in the order that produces the desired comparison behavior.
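As an illustrative sketch (this type is hypothetical and follows the skeleton convention used earlier, with the remaining contract members elided), a date-like UDT that should sort chronologically under byte ordering would declare its most significant field first:
[Serializable]
[SqlUserDefinedType(Format.Native, IsByteOrdered = true)]
public struct SimpleDate : INullable
{
    // Field order drives the persisted byte order, and therefore
    // comparisons: year first, then month, then day, so binary
    // ordering matches chronological ordering.
    private SqlInt32 year;
    private SqlInt32 month;
    private SqlInt32 day;
    ...
}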
Except for comparison, every operation on a UDT requires
that the UDT value be de-serialized and a method invoked. This pattern
has a cost associated with it, which should be accounted for when
assessing whether to model a type as a UDT. UDTs are best used when
there are complex behaviors of the type which need to be modeled. If
the type is relatively simple then it may be better to avoid the UDT
construct.
Finally, one can use static methods of a UDT as a
convenient packaging mechanism to store a library of related functions.
Static methods can be invoked from T-SQL using the following syntax:
select Type::Function(@arg1)
Example: Non-Western Dates
We
would like to store date and time values using the Um Al Qura calendar,
which is different from the Gregorian calendar used by the SQL Server datetime
data type. We would like this data type to have the same basic set of behaviors, namely string conversion, retrieval of date parts, and date arithmetic.
The following example of a user-defined type is a simple implementation of this data type. It uses the UmAlQuraCalendar type, which is new in the .NET Framework version 2.0. It would be straightforward to augment this example to provide the full complement of methods needed.
Um Al Qura UDT in C#:
using System;
using System.Data;
using System.Data.Sql;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Globalization;
[Serializable]
[SqlUserDefinedType(Format.Native, IsByteOrdered = true)]
public struct UmAlQuraDateTime : INullable
{
    /*
     * Private state.
     */
    private long dtTicks;
    private bool isNull;

    // Calendar object used for all calendar-specific operations
    private static readonly UmAlQuraCalendar s_calendar =
        new UmAlQuraCalendar();

    // For correct formatting we need to provide a culture code for
    // a country that uses the Um Al Qura calendar: Saudi Arabia.
    private static readonly CultureInfo ci =
        new CultureInfo("ar-SA", false);

    /*
     * Null-Handling
     */
    // get a null instance
    public static UmAlQuraDateTime Null
    {
        get
        {
            UmAlQuraDateTime dt = new UmAlQuraDateTime();
            dt.isNull = true;
            return dt;
        }
    }

    public bool IsNull
    {
        get
        {
            return this.isNull;
        }
    }

    /*
     * Constructors
     */
    public UmAlQuraDateTime(long ticks)
    {
        isNull = false;
        dtTicks = ticks;
    }

    public UmAlQuraDateTime(DateTime time) : this(time.Ticks)
    {
    }

    /*
     * Factory routines.
     */
    public static UmAlQuraDateTime Parse(SqlString s)
    {
        if (s.IsNull) return Null;
        DateTime t = DateTime.Parse(s.Value);
        return new UmAlQuraDateTime(t);
    }

    public static UmAlQuraDateTime ParseArabic(SqlString s)
    {
        if (s.IsNull) return Null;
        DateTime t = DateTime.Parse(s.Value, ci);
        return new UmAlQuraDateTime(t);
    }

    public static UmAlQuraDateTime FromSqlDateTime(SqlDateTime d)
    {
        if (d.IsNull) return Null;
        return new UmAlQuraDateTime(d.Value);
    }

    public static UmAlQuraDateTime Now
    {
        get
        {
            return new UmAlQuraDateTime(DateTime.Now);
        }
    }

    /*
     * Conversion Routines
     */
    public DateTime DateTime
    {
        get { return new DateTime(this.dtTicks); }
    }

    public SqlDateTime ToSqlDateTime()
    {
        return new SqlDateTime(this.DateTime);
    }

    public override String ToString()
    {
        return this.DateTime.ToString(ci);
    }

    public String ToStringUsingFormat(String format)
    {
        return this.DateTime.ToString(format, ci);
    }

    /*
     * Methods for getting date parts.
     */
    public int Year
    {
        get
        {
            return s_calendar.GetYear(this.DateTime);
        }
    }

    public int Month
    {
        get
        {
            return s_calendar.GetMonth(this.DateTime);
        }
    }

    public int Day
    {
        get
        {
            return s_calendar.GetDayOfMonth(this.DateTime);
        }
    }

    /*
     * Date arithmetic methods.
     */
    public UmAlQuraDateTime AddYears(int years)
    {
        return new
            UmAlQuraDateTime(s_calendar.AddYears(this.DateTime, years));
    }

    public UmAlQuraDateTime AddDays(int days)
    {
        return new
            UmAlQuraDateTime(s_calendar.AddDays(this.DateTime, days));
    }

    public double DiffDays(UmAlQuraDateTime other)
    {
        TimeSpan diff = this.DateTime.Subtract(other.DateTime);
        return diff.Days;
    }
}
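Loading the type involves cataloging the assembly and then creating the type. A sketch of these two steps follows; the assembly name and file path are hypothetical:
CREATE ASSEMBLY UmAlQura FROM 'C:\samples\UmAlQura.dll'
CREATE TYPE UmAlQuraDateTime EXTERNAL NAME [UmAlQura].[UmAlQuraDateTime]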
Once this type is loaded in SQL Server, the type can be used
via T-SQL. Here are some T-SQL examples using this UDT along with the
results they produce.
First we parse an Um Al Qura date and print it out in two formats along with the western equivalent:
DECLARE @d UmAlQuraDateTime
SET @d = UmAlQuraDateTime::ParseArabic('01/02/1400')
PRINT @d.ToString()
PRINT @d.ToStringUsingFormat('F')
PRINT @d.ToSqlDateTime()
This produces:
[Output screenshot omitted: the date in two Um Al Qura formats, followed by the western equivalent.]
We can convert western dates into Um Al Qura as well:
DECLARE @n DateTime
SET @n = 'March 20, 2005'
DECLARE @d UmAlQuraDateTime
SET @d = UmAlQuraDateTime::FromSqlDateTime(@n)
PRINT @n
PRINT @d.ToString()
The result:
[Output screenshot omitted: the western date followed by its Um Al Qura representation.]
Finally, we can create and modify tables with columns of this type:
CREATE TABLE dates (
western DateTime,
umalqura UmAlQuraDateTime
)
INSERT INTO dates(western) VALUES ('June 1, 2005')
INSERT INTO dates(western) VALUES ('July 1, 2005')
UPDATE dates
SET umalqura = UmAlQuraDateTime::FromSqlDateTime(dates.western)
SELECT western, umalqura.ToString() as umalqura FROM dates
This outputs a table:
[Output screenshot omitted: each western date alongside its Um Al Qura string.]
Conclusion
This
paper has presented guidelines, concrete usage scenarios, and samples
using CLR integration features in SQL Server 2005. Database application
developers and architects should use this paper in conjunction with
documentation on other SQL Server 2005 features, such as Transact-SQL,
XML, and Service Broker.
In the future, we plan to provide more
articles in this area with guidelines addressing manageability,
monitoring, and troubleshooting aspects of CLR integration.