SQL Server 2005 Integration Services, Part 2: Developing Custom Components for Project REAL
Jyoti Jacob, Scalability Experts, Inc.
Donald Farmer, Microsoft Corp.
Len Wyatt, Microsoft Corp.
October 2005
Applies to:
Microsoft SQL Server 2005 Beta 2
IDW 9 Community Preview
Summary:
The first phase of Project REAL included the migration of existing SQL
Server 2000 DTS packages to SQL Server 2005 Integration Services (SSIS)
for a large electronic retailer. One source of information for the data
warehouse was from TLog files, which contained data from point-of-sale
cash registers in stores. TLog files store data in packed decimal
format. To extract and transform this data into a format suitable for
loading into the relational data warehouse tables, it was necessary to
unpack and parse the files. To achieve this, custom pipeline components
were implemented to read and transform the data by extending the SSIS
object model. This paper discusses what we learned while implementing the pipeline components for extracting and parsing data from the TLog files. (44 printed pages)
You can download the Microsoft Word version of this document [ http://download.microsoft.com/download/2/0/e/20ea2a17-949b-44eb-87cf-c8f982e70f37/project_real_custom_components.doc ].
Contents
Introduction: Project REAL
Phase One Implementation
Core ETL Processing
Why Custom Components?
Proposed Design for Loading the TLog Data into the Data Warehouse
SSIS Object Model and Pipeline Components
Implementation of the Custom Components
The Custom Source Adapter
The Custom TLog Transformation
Deployment
Generating a Strong Name for the Assembly
Compiling the Task
Moving the Task to the Appropriate Directories
Adding the Task to the SSIS Toolbox
Creating the Package Using the Custom Components
Debugging
Lessons Learned
Conclusion
For more information
Introduction: Project REAL
Project
REAL is an effort to discover best practices for creating business
intelligence applications based on Microsoft SQL Server 2005 by
creating reference implementations that are based on actual customer
scenarios. This means that customer data is brought in-house and is
used to work through the same issues that the customers face during
deployment. These issues include:
- The design of schemas
- The implementation of a data extraction, transformation, and loading (ETL) process
- The sizing of systems for production
- The management and maintenance of the systems on an ongoing basis
By
working with real deployment scenarios, we gain a complete
understanding of how to work with the tools. Our goal is to attempt to
address the full range of concerns that a large company would face
during its own real-world deployment. This paper focuses on the
experience of developing custom components while working on the ETL
portion of Phase 1 of Project REAL.
Project REAL uses data from
two Microsoft business intelligence customers. Phase 1 of the project
was modeled on a large electronics retailer that keeps sales and
inventory data in an SQL Server 2000 data warehouse. SQL Server 2000
Data Transformation Services (DTS) is used to manage the flow of data
into the relational database and then into SQL Server 2000 Analysis
Services cubes for reporting and interactive querying. This customer
maintains approximately 200 GB of data in their relational store. All
of this data is subsequently processed into Analysis Services cubes.
The Phase 1 implementation focuses primarily on the concerns that an
existing SQL Server 2000 customer might have when carrying out a
migration to SQL Server 2005. Our results largely represent the
migration of existing functionality, with a few new capabilities used
where appropriate. In the area of ETL, there was a substantial amount
of work to do to create packages with SQL Server 2005 Integration
Services (SSIS) based on the existing SQL Server 2000 DTS packages.
Note SQL
Server Integration Services (SSIS) is the new name assigned to the
component formerly branded DTS. The product was renamed after Beta 2;
thus many screen shots of the new SSIS in this paper still use the old
name.
Phase 2 of Project REAL is based on a larger data set from
a different customer, and exercises more of the new capabilities of SQL
Server 2005 than does Phase 1. This is because Phase 2 is primarily a
new implementation of a SQL Server 2005 solution. Look for more papers
about Project REAL on the
Project REAL [ http://www.microsoft.com/sql/bi/projectreal/ ] Web site.
Project
REAL is a joint venture between Microsoft Corporation, Unisys, EMC,
Scalability Experts, Panorama, Proclarity, and Intellinet. Scalability
Experts and Microsoft Corporation performed the work described in this
paper.
Phase One Implementation
The Phase 1 customer has two primary sources of information that flow into the data warehouse. These are:
- TLog files, which are the output of
point-of-sale (POS) cash registers in stores (not to be confused with
SQL Server transaction log files).
- Flat file extracts from a JDA system (a packaged retail software application), which are used to manage the business.
The overall flow of data, which is managed by SQL Server 2000 DTS, is shown in Figure 1.

Figure 1. Phase 1 customer data flow
Core ETL Processing
The core ETL processing that is performed to load this customer's data warehouse is as follows.
- TLog
files come from point-of-sale (POS) cash registers in a special highly
compressed format that must be decoded before the files can be loaded
into a database. The customer's application compresses these POS
transactions into a packed decimal format based on predefined
specifications.
- The application assigns each file a name by
using a numeric sequence. It stores the name in the appropriate
directory according to the store number. This naming convention is
necessary to support files that span days as well as to support
situations where several files are written per day.
- The
customer uses Perl scripts to parse the TLog binary files into multiple
text files before the data is loaded into the database. The scripts
unpack the data using a predefined template. The scripts then
uncompress the data according to a set of rules defined in a format
specification (.ini) file.
- The output from the scripts is
stored in flat files. The flat files are then read by a DTS package and
the data is loaded into the database. Thus, two extra steps are
required before the data can be processed—one to parse the data and one
to load the flat file output.
To take advantage of the
new features and performance improvements in SSIS, we decided to
redesign the existing DTS package in SQL Server 2000. This would allow
us to measure performance improvement over a simple migration of the
existing methodology. As part of this migration, Scalability Experts
created a TLog parser that runs in the SSIS pipeline. This effectively
eliminates a costly extra step and helps reduce the storage
requirements for processing TLog files. Each TLog file, containing
compressed data, is read directly into the SSIS pipeline by using a custom source component and parsed by using a custom transformation component. The next section details the advantages of changing the architecture.
Note This
paper describes what we learned from writing the custom components that
were required to implement the TLog parser. For general high-level
lessons that were learned about SSIS in Phase 1 of Project REAL, see
SQL Server 2005 Integration Services: Lessons from Project REAL [ http://msdn2.microsoft.com/en-us/library/ms345141.aspx ] .
So
that we could fully exercise the system, the customer provided not only
the initial state of the data warehouse, but also three months of daily
incremental updates from both the sales and inventory systems. This
allowed us to simulate a full processing cycle, as if run over time,
including automated partition management in both the relational and
OLAP databases.
In addition to the work on ETL that is
described in this paper, the team executed various other activities in
Phase 1 of Project REAL including:
- Migrating the relational data from SQL Server 2000 to SQL Server 2005, while preserving the schema precisely
- Masking the customer's data to protect confidential information
- Migrating the Analysis Services 2000 database to SQL Server 2005 Analysis Services (SSAS)
- Verifying client connectivity using the customer's preferred front-end tools (Microsoft Excel and Proclarity)
- Creating sample reports against the new cubes using SQL Server 2005 Reporting Services (SSRS)
- Fully
implementing a new Inventory cube in SSAS. The customer previously had
encountered difficulties working with semiadditive measures because of
the high volume of data. Because of this, the customer had stopped work
in this area. New capabilities in Analysis Services 2005 now make
working with semiadditive measures feasible even when working with large
data volumes.
The remainder of this paper documents
what we learned during the process of moving the customer's ETL
processing to SQL Server 2005 Integration Services. The paper also
highlights numerous best practices.
Why Custom Components?
It
is common for sales and distribution applications to compress raw
transactions into compressed files to save space. In Project REAL, as
discussed in Core ETL Processing,
processing one batch of data consisted of reading compressed TLog files
from point-of-sale (POS) applications. To see why it was necessary to
implement custom components, it is helpful to understand the
requirements for loading the POS data. The POS data is stored in packed
decimal files as shown in Figure 2.

Figure 2. Sample TLog packed file
This
file can be unpacked using the template by which it was packed. The
template is a sequence of characters that define the order and type of
the values that were used to pack the values into a binary file.
Following is a sample portion of the template file that was used for
unpacking the packed decimal files in this implementation.
"00" => ["H2","H4","H5","H6","H4","H8"]
"01" => ["H2","A12","H4"]
"03" => ["H2","A5","A12"]
Where
H: A hex string (high nybble first)
A: An ASCII string
Digit: number of bytes that contain the particular encoding
The
first two hexadecimal values of each row of the packed data represent
the row type of the template. That is, these first two numbers define
the template for the rest of the row. In the sample template there are
three row types defined. These are "00", "01", and "03". For rows
starting with "03", the first 2 bytes will be a hexadecimal value,
followed by 5 bytes of ASCII characters, followed by 12 bytes of ASCII
characters.
Using this template, the file is unpacked.
After
the file is unpacked, the data needs to be parsed further to extract
detailed information about the transactions. Parsing the file is a
challenge because there are multiple row delimiters, and each row is a
separate transaction with an unequal number of columns. So, the field
types that are required to parse the file are defined in a
specification file. Following is a sample specification file for
parsing the TLog file.
[00]
Filename="Header Item"
DelimiterCharacter=","
Outputs="%store,0-2-2,0-1-0,0-2,-------------------
[03]
Filename="Sales Receipt"
DelimiterCharacter=","
Outputs="%store,3-0,3-1,............................
The
first two hexadecimal values of each unpacked transaction define the
type of transaction information. This might be payment information such
as discounts and rebates applied, or it might be user information. For
example, in the sample template, transaction type "00" defines the
transaction as a Header. Based on the type of transaction, the
specification file contains the values and methods necessary to parse
the data and extract meaningful data.
The customer used Perl
scripts to process this information. The extensible SSIS object model
and support for .NET languages presented us the opportunity to
implement the ETL process of unpacking, parsing, and extracting as a
set of custom components in the SSIS pipeline itself. This in turn
helped us to improve performance and manageability. This would not have
been possible had we used the existing Perl scripts to parse the TLog
files outside of the SSIS pipeline.
The perceived advantages that would result from the implementation of this process were as follows.
- Efficient and robust ETL because the
extraction of the data from packed decimal TLog files would be included
in the same SSIS pipeline instead of in a separate manual process. The
TLog data that would be extracted in this process could be further
processed using built-in SSIS transformation tasks and stored in any
destination of choice (such as SQL Server or flat files) in the same
pipeline.
- Better manageability because the Perl parser was
replaced by the custom components in the SSIS pipeline. Hence, it was
not necessary to implement or execute any additional applications
outside of the SSIS pipeline. Also, additional storage is not required
to store the interim parsed files before loading them into the
relational tables in the data warehouse.
- Faster ETL process
because additional I/O is not incurred as the data does not need to be
written to or read from an intermediate format on disk between each
transformation.
- More flexibility and control over the data
flow because the end user determines the specific outputs (subtables)
that are to be extracted.
- The TLog files contain approximately
24 different kinds of output information such as Rebates, User
Description, Discounts, and Header Information. Previously, the Perl
parser extracted all of these subtables into different files (24 files
for each TLog file), irrespective of whether or not this was required
by the data warehouse application. The proposed design allows the user
to pick and choose the required outputs before the data is transferred
to the downstream destination. Hence, unnecessary data flow is avoided,
leading to improved performance and manageability.
- Ease of modification because the custom components were implemented in
managed languages such as Visual C# .NET, which offer better debugging
support and easier code modification than scripting languages such as Perl.
In addition, the proposed design provides the template file for
unpacking the compressed files and the specifications for parsing the
data as separate input to the custom components. Hence, any changes to
the format specifications could be easily applied without modifying the
custom components themselves.
Because the performance and manageability benefits of implementing
custom components far outweighed the development cost, we decided to
go ahead with this project. Because this process would manage the flow
of data from packed decimal files to the data warehouse, it was decided
to implement this task as a pipeline component in the data flow. For
better manageability and to implement modular programming, the
unpacking of the packed decimal file and the parsing of the file to
extract meaningful data were implemented as separate components. The
following two separate custom components were implemented.
- A source adapter that would unpack the compressed file and convert it into its associated encoding.
- A custom transformation that would parse the file based on a given specification.
Proposed Design for Loading the TLog Data into the Data Warehouse
This
section discusses the design that was developed to extract and
transform the point-of-sales data stored in the TLog files that
originated from different stores.
The TLog data files for each
store are grouped together in a directory. Directories are named
according to the store number. In each directory, the TLog files are
arranged in numerical sequence. An additional Start file is included in
the directory that contains the name of the last processed file. To
access a TLog file, the particular store directory must first be
accessed by looping through the collection of store directories. Then,
another looping process loops through each TLog store directory to
extract a particular TLog file. This TLog file must be unpacked and
parsed before it can be further transformed into a format that is
suitable for loading in the SQL Server data warehouse.
Before
the data in the TLog files can be processed, the files must be accessed
for processing. To access the source files: 1) loop through the store
directories, and 2) for each store directory, loop through the TLog
files. In SSIS, these processes are often handled by different control
flow tasks.
Once the TLog file is accessed, it is necessary to
unpack the data. The unpacked data must be further parsed based on
defined specifications to extract meaningful information. This must be
done before additional transformations can be applied and the data
loaded into the destination. Currently, there are no built-in tasks
available to unpack and parse packed decimal files. Hence, we created
custom components to unpack and parse the TLog files based on the
defined specifications in the SSIS pipeline. For better manageability,
two separate custom components were implemented—one to read and unpack
the file and another to parse the file.
To achieve the above design, the following control flow and Data Flow tasks were used.
Control flow tasks
The following control flow tasks were used to access the source TLog files.
- A custom For Each Directory container that looped through the different store directories containing the TLog files.
Note A custom For Each Directory task is
included in the SSIS programming samples in the September CTP.
(File path: <installedsamplesdirectory>\Integration
Services\Programming Samples\Control Flow\ForEachDirectory Sample
directory)
- For each extracted directory, the For Each File built-in container was used to loop over the TLog files in each directory.
- The file path of each TLog file was extracted to be set dynamically as the source table in the data flow source adapter.
- A
Data Flow task read, unpacked, and parsed the data and loaded it into
the required relational destination tables in the data warehouse.
Data Flow tasks
The following Data Flow tasks were used to extract, transform, and load data.
- A custom source adapter was implemented to read
the packed decimal TLog file and unpack it into its equivalent encoding
based on the provided template.
- A custom transformation task was used to transform the data using the specification provided in the .ini input file.
- Additional transformations were done as required.
- SQL Server Destination Adapter, which is the built-in destination component, was used to load the data into SQL Server.
SSIS Object Model and Pipeline Components
An
understanding of the new SSIS object model is prerequisite to a
discussion of the implementation of custom components. This section
discusses the SSIS object model and its various components.
In the SSIS object model, the main runtime components are tasks and containers.
Data management and processing are handled by individual tasks. As
shown in Figure 3, these tasks are grouped together in containers.
Containers provide iterative and branching control flow in the package.
The order in which the DTS Runtime Engine executes tasks and containers
is defined by precedence constraints. The precedence constraints define
the conditions for the execution of the tasks and containers.

Figure 3. SSIS object model
Data
flow tasks are those tasks that extract data from different sources and
then transform, cleanse, and modify the data and store it in the
appropriate destination. You can think of a Data Flow task as an
acyclic directed graph, and the source adapters, transformations, and
destination adapters as the graph nodes. Each of these components is
connected by directed paths, which determine the flow of data.
One
of the biggest improvements that SSIS 2005 has achieved over DTS 2000
is in performance. For Data Flow tasks, in-memory data stores called buffers handle
most of the data flow between components. The data flow engine creates
the execution plan and defines the required buffers for each component.
When the buffers are full, the data is pushed to the downstream
component. Data flow is managed in memory, rather than through
expensive I/O operations, and the result is faster data movement and
transformation.
One of the challenges in this project was
extending the object model to implement the custom components. However,
the entire underlying SSIS object model has been reengineered with
extensibility in mind. The new SSIS object model provides support for
the core Microsoft .NET Framework. It leverages Microsoft Visual
Studio® to provide a powerful integrated development environment (IDE)
for both design and debugging purposes.
To extend the SSIS
object model, the developer needs to consider two stages of development
for a SSIS package. The SSIS architecture handles the design and
runtime of the package separately.
Note: In
this paper, the person who implements the source code of the custom
components is referred to as the developer and the person who creates
the SSIS package by using these custom components in Business
Intelligence Development Studio is the user or end user.
Each
component in the Data Flow task is connected in a graphical format with
data paths that control the flow of data as shown in Figure 4. Each
upstream component extracts and processes the data and pushes it to the
next downstream component. Each component consists of two important
subelements: the user interface (UI) and metadata.
Metadata
defines the role of the component in the layout. The metadata consists
of component information such as custom properties, the input and
output collection, and the columns in each input and output collection.
The data is transferred between the components using in-memory stores
called pipeline buffers as shown in Figure 4.

Figure 4. Data Flow task
In
the design state, the user selects the data flow components and the
appropriate data paths. The user is allowed to change the metadata of
each component in this stage. For instance, a user might define custom
properties for the different components and determine the flow of data.
The user can also decide the number of inputs and outputs for each
component and their associated column collection. In addition, the user
can specify the flow of data from each output to the next downstream
component. During the design stage, the component must be able to react
to these metadata changes. For instance, the components may allow or
disallow adding or removing inputs and outputs. The components should
be flexible enough to accommodate these changes.
In the
runtime phase, the data flow components are executed. During this
phase, the data flow engine defines the execution plan for the data
flow, then creates and assigns the data buffers. It assigns the worker
threads and calls the different components such as source, transform,
and destination in the order in which the user arranged them in the
graphical format. In addition, the data flow engine handles errors and
completion of the tasks.
To create a custom data flow component, you need to derive from the PipelineComponent base class. PipelineComponent implements the IDTSDesigntimeComponent90 interface, which defines the design-time behavior of the component. To define the runtime behavior, the IDTSRuntimeComponent90
interface must be implemented. Both interfaces expose methods and events
that you can override for your component. To implement and override
virtual methods for the data flow components, add the following
references to your project.
- Assembly: Microsoft.SqlServer.DTSPipelineWrap
- Default path: %Program Files%\Microsoft SQL Server\90\SDK\Assemblies\Microsoft.SqlServer.DTSPipelineWrap.dll
- Assembly: Microsoft.SqlServer.DTSRuntimeWrap
- Default path: %Program Files%\Microsoft SQL Server\90\SDK\Assemblies\Microsoft.SqlServer.DTSRuntimeWrap.dll
- Assembly: Microsoft.SqlServer.ManagedDTS
- Default path: %Program Files%\Microsoft SQL Server\90\SDK\Assemblies\Microsoft.SqlServer.ManagedDTS.dll
- Assembly: Microsoft.SqlServer.PipelineHost
- Default path: %Program Files%\Microsoft SQL Server\90\SDK\Assemblies\Microsoft.SqlServer.PipelineHost.dll
Before
the package is executed, the pipeline component is in the design state.
In this state, the metadata can continuously change. In this state, the
user can determine the layout by adding or removing components, and by
setting the data flow path and metadata for each component. The SSIS
designer in Business Intelligence Development Studio interacts with the
component through the design-time IDTSDesigntimeComponent90 interface. In the design state, each node or component that is selected by the user contains a Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSComponentMetaData90
interface and a reference to the design-time interface of the
corresponding user component. The data paths define the movement of
data from the upstream component to the downstream components. This
data path is defined by the collection of Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSPath90 objects, which define the movement of data between components.
The
runtime methods are executed by the worker threads, which are also
managed by the Data Flow task. The worker threads call the runtime IDTSRuntimeComponent90 interface to execute the runtime methods.
During
runtime, the Data Flow task generates an execution plan for the
execution of the data flow components. The execution plan contains an
execution tree for each sequence of components that start from either a
source or an asynchronous component and continue until either the
destination or another asynchronous component is encountered. Based on
the execution plan, a pipeline buffer is created for the output column
collection for each sequence of components. This pipeline buffer
contains the column collection of all the components until an
asynchronous component or destination component is encountered. If an
asynchronous component is encountered, a pipeline buffer that contains
the columns of the next sequence of components is created. Hence, the
buffer is reused for a sequence of components in a particular execution
tree.
This provides optimization by allowing multiple
components to reuse the same buffer objects, which increases efficiency
by reducing the number of expensive memory copies of the buffer.
To
achieve this, the engine maintains a constant buffer definition in the
runtime execution plan. When the upstream component creates the
pipeline buffer, it allocates space for the columns that will be
created by the downstream components. Since upstream components only
write to the space that is reserved for their columns, there can be
more columns at the component level at runtime than are selected by the
end user at design time.
The runtime engine assigns the
pipeline buffers. The output rows are stored in the buffer until it is
full. Then the data is pushed to the next component. The buffer cells
hold the value of each column for auditing purposes. The columns are
provided with unique lineage identifiers. These lineage identifiers map
to the column index in the buffers. During execution, the columns are
located by using methods such as FindColumnByLineageID.
The initial programming includes deriving from the PipelineComponent base class, defining the namespace, and applying the DtsPipelineComponent attribute to the class. For the custom source adapter, the namespace and class were defined as follows.
#region Using directives
using System;
using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.SqlServer.Dts.Runtime;
#endregion
namespace TLogSourceComponent
{
    #region ClassAttribute
    [
    DtsPipelineComponent
    (
        DisplayName = "TLOG Source",
        ComponentType = ComponentType.SourceAdapter,
        IconResource = "TLogSourceComponent.TLogSource.ico",
        Description = "This source component unpacks the packed decimal fields in a POS TLOG and returns a single column as record/string/substring structure."
    )
    ]
    #endregion
    public class TLGSrc : PipelineComponent
    {
        // Implementation code goes here.
    }
}
Implementation of the Custom Components
The object model can be extended to create custom tasks by using any .NET Framework language. This particular solution was implemented in managed code by using C#, in the Visual Studio 2005 development environment. The rest of the implementation and API usage are explained in C#.
The custom components were created as separate projects for better manageability and debugging. Each project was compiled to generate the required assembly (.dll) files. These assemblies can be shared with hosting applications such as Business Intelligence Development Studio.
The remainder of this section discusses the implementation details of the custom source adapter and custom transformation.
The Custom Source Adapter
The
source adapter allows the user to extract data from different data
sources. The source adapter can contain several data and error outputs.
As the source adapter is the first component in the Data Flow task, it
does not contain any inputs. Because the source adapter does not
contain inputs, it connects directly to the data source and extracts
the required data.
The output can contain one or more columns of
a defined data type to transfer the data downstream in the data flow
pipeline. The same is true for the error output columns. However, in
the implementation, the developer can explicitly deny the user the
ability to add or remove columns or outputs by overriding different
methods, such as DeleteOutputColumn() and InsertOutputColumn(),
and then displaying error messages. If the implementation allows the
removal of columns, at design time the end user can choose the
different required outputs and their associated columns.
For the TLog source adapter, the following behavior was required.
- Read from the packed decimal source file.
- Unpack the file based on the specification specified in the associated input template file.
- Combine the unpacked data into a text data type column and send it to the next component.
As
discussed earlier, there are two important phases for creating custom
components. These are the design-time methods and the runtime methods.
Following is a discussion of the virtual methods that need to be overridden at design time and runtime.
Designing a source adapter
In
the design phase of a source adapter, the first step is to define and
initialize the metadata of the component. The metadata of a component
consists of:
- The name, description, and custom properties of the component.
- The input and output collection of the component, including the name, description, and properties of each output and input.
- The
column collection that is included in each input and output of the
component, including the data type, size, precision, and length of each
column, as well as any column-level custom properties that have been
defined.
During design time, the metadata for a component is defined by using the ProvideComponentProperties() method. The package calls the ProvideComponentProperties()
method to initialize the output/input collection and the custom
properties. This is the first method that is called when the end user
adds a component to the Data Flow task in the design state. This method
initializes component properties such as inputs, outputs, and custom
properties. Though this method sounds like a constructor, it is not
exactly a constructor. It is not called each time the design-time
instance or runtime instance of the component is instantiated.
Following is the code for the ProvideComponentProperties method.
public override void ProvideComponentProperties()
{
    /// Support resetting the component.
    ComponentMetaData.RuntimeConnectionCollection.RemoveAll();
    RemoveAllInputsOutputsAndCustomProperties();
    // Add the custom properties.
    IDTSCustomProperty90 tlogFile = ComponentMetaData.CustomPropertyCollection.New();
    tlogFile.Description = "Specify the source TLOG file name including the full path.";
    tlogFile.Name = "TLogFile";
    IDTSCustomProperty90 templateFile = ComponentMetaData.CustomPropertyCollection.New();
    templateFile.Description = "Specify the template file name including the full path.";
    templateFile.Name = "TemplateFile";
    IDTSOutput90 output = ComponentMetaData.OutputCollection.New();
    output.Name = "TLOGOutput";
    output.ExternalMetadataColumnCollection.IsUsed = true;
}
The first two methods, RemoveAllInputsOutputsAndCustomProperties() and ComponentMetaData.RuntimeConnectionCollection.RemoveAll(),
are used to reset an instance of the existing component by deleting the
defined inputs, outputs, custom properties, and connections.
To
connect to the external source, new connections need to be defined. To
achieve this, custom properties were defined. The TLog Source adapter
contains the following two custom properties.
- The connection string of the TLog packed decimal file.
- The path to the template files that contain the specification that is required to unpack the packed decimal file.
A
source adapter connects directly to the data source to extract the
data. This is usually the first component in the Data Flow graph
layout. Hence, there are no inputs associated with an upstream
component.
The unpacked file is sent from the custom source adapter to the downstream component as a single row with a single column of the DT_TEXT
data type. The entire unpacked file is sent as a single row because the
unpacked output contains different row delimiters and an unequal number
of columns per row; transferring the data as a single row made processing
it in the transformation phase easier. Hence,
there is only one output defined. It is important that you define the DataType, Length, Precision, and CodePage
properties of any output columns. All of these properties need to be
set because these properties are read-only and each is dependent on the
setting of the other. These properties can be set by calling the SetDataTypeProperties() method.
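For example, the single TLog output column is configured as a text (DT_TEXT) column that uses code page 1252; the same call appears in the ReinitializeMetaData code shown later in this section.
outColumn.SetDataTypeProperties(DataType.DT_TEXT, 0, 0, 0, 1252);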
Note If you are not sure of the correct data type for the output, the base class provides the following three helper methods.
- ConvertBufferDataTypeToFitManaged: Gets the DTS data type that is best used from managed code.
- BufferTypeToDataRecordType: Returns a managed data type based on a DTS data type.
- DataRecordTypeToBufferType: Gets the DTS data type that corresponds to a managed type.
Two other methods that needed to be overridden for the custom source adapter were Validate() and ReinitializeMetadata().
The Validate() method of the PipelineComponent
class provides the validation necessary to ensure that the component
has been configured correctly. It checks for the input and output
objects, including the data types of the columns. In addition, it
checks properties to make sure they are correct. It checks the custom
properties to ensure that input values and connection strings have been
assigned as required. This method is called every time the user edits
the component in the design state. This ensures the validity of the
component. This method is also called during the pre-execute phase at
runtime. To see if there are errors, check the status of the DTSValidationStatus enumeration return value. You can inform the end user if the input values or configuration are not correct.
For
the TLog source adapter component, this method checks to see if correct
connection paths are defined for the TLog packed decimal file and the
template file. It also checks to make sure these files exist. Based on
the type of error it finds, the Validate method can return one of following values. The end user can then be warned or errors can be rectified.
- VS_NEEDSNEWMETADATA: An error exists in the component metadata and can be fixed by the component itself.
- VS_ISBROKEN:
The component contains an error, which the end user can rectify in the
SSIS designer in Business Intelligence Development Studio, by editing
the components.
- VS_ISCORRUPT: Errors exist that are due to
modifications to the source code or to the package XML. To rectify such
errors, the developer must delete and re-create the component.
- VS_ISVALID: The component is properly configured.
The developer can post warnings and errors back to the user by using the FireWarning() and FireError() methods.
The following code is used to validate the TLog file custom property.
#region Validate
[CLSCompliant(false)]
public override DTSValidationStatus Validate()
{
    bool Cancel;    // set by FireError to indicate whether execution should stop
    IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
    IDTSCustomProperty90 tlogFile = ComponentMetaData.CustomPropertyCollection["TLogFile"];
    if (tlogFile.Value == null || ((String)tlogFile.Value).Length == 0)
    {
        ComponentMetaData.FireError(0, ComponentMetaData.Name,
            "TLOG file is required.", "", 0, out Cancel);
        return DTSValidationStatus.VS_ISBROKEN;
    }
    else if (!File.Exists((String)tlogFile.Value))
    {
        // A TLOG file name was specified; verify that the file exists and is accessible.
        // (File.Exists requires a using directive for System.IO.)
        ComponentMetaData.FireError(0, ComponentMetaData.Name,
            String.Format("TLOG file '{0}' does not exist or is not accessible.", (String)tlogFile.Value),
            "", 0, out Cancel);
        return DTSValidationStatus.VS_ISBROKEN;
    }
    // Validate the other properties
}
#endregion
Similar
code validates the path for the template file. The code also checks to
see if the metadata is correct by checking the mapping of the output
column.
if (ComponentMetaData.OutputCollection[0].OutputColumnCollection.Count == 0)
{
return DTSValidationStatus.VS_NEEDSNEWMETADATA;
}
/// Validate the output columns against the external metadata
if (ComponentMetaData.ValidateExternalMetadata)
{
if (!ComponentValidation.DoesExternalMetaDataMatchOutputMetaData(output))
{
ComponentMetaData.FireWarning(0, ComponentMetaData.Name, "The ExternalMetaDataColumns do not match the output columns.", "", 0);
return DTSValidationStatus.VS_NEEDSNEWMETADATA;
}
If the metadata is not correct and can be rectified by the component, then the VS_NEEDSNEWMETADATA enumeration value is returned. Remember that the Validate() method should be used only to raise errors; the state of the component should not be changed in this method. Because the designer calls this method frequently, changes to the component in this method can result in unpredictable behavior. Hence, errors should be repaired in the ReinitializeMetaData() method. For example, if the output column properties are not correct, the ReinitializeMetaData() method gets the IDTSOutputColumn90 object, sets the appropriate data type, and maps it correctly to the external metadata columns. The code is as follows.
public override void ReinitializeMetaData()
{
IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
/// Start clean.
output.OutputColumnCollection.RemoveAll();
output.ExternalMetadataColumnCollection.RemoveAll();
IDTSOutputColumn90 outColumn = ComponentMetaData.OutputCollection[0].OutputColumnCollection.New();
IDTSCustomProperty90 dataSourceColumnName = outColumn.CustomPropertyCollection.New();
dataSourceColumnName.Name = "DataSourceColumnName";
dataSourceColumnName.Value = "DecodedString";
dataSourceColumnName.Description = "The name of the column at the data source.";
/// Set the output column's properties.
outColumn.Name = "DecodedString";
outColumn.SetDataTypeProperties(DataType.DT_TEXT, 0, 0, 0, 1252);
IDTSExternalMetadataColumn90 eColumn = output.ExternalMetadataColumnCollection.New();
eColumn.Name = outColumn.Name;
eColumn.DataType = outColumn.DataType;
eColumn.Precision = outColumn.Precision;
eColumn.Length = outColumn.Length;
eColumn.Scale = outColumn.Scale;
outColumn.ExternalMetadataColumnID = eColumn.ID;
}
Runtime
The data flow engine moves the
data between different components by using in-memory buffers. The data
flow engine creates and manages the pipeline buffers that are used for
moving data. At execution time, the input and output collections are
assigned and the associated buffers are created. For the custom source
adapter, as there are no inputs, only the outputs are added. The output
is assigned by the Data Flow task by allocating the Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer objects that contain all of the required columns in the data flow.
PreExecute, PrimeOutput, and ProcessInput are runtime methods that should be used to manage the outputs and inputs. PreExecute is called before PrimeOutput and ProcessInput.
This method is your first chance to access the output and input
collections. Each input and output collection contains the column
collection. Columns in the collection are identified by their unique
lineage identifiers (lineage IDs). The pipeline buffers store the
column data in the associated buffer columns. These buffer columns are
identified by their column index. If there are multiple columns in the
input or output collection, to locate the column of a particular input
or output in the buffer, you must map the lineage ID of a column in the
column collection to the column index of the associated buffer. This is
because the order of the columns in the buffer is not maintained in the
same order as the columns in the input/output collection. To achieve
the mapping, you must locate the lineage ID of the column in the input or output column collection, and then pass that lineage ID, together with the buffer ID, to the FindColumnByLineageID method of the Microsoft.SqlServer.Dts.Pipeline.PipelineComponent.BufferManager property (an IDTSBufferManager90 object), which returns the corresponding column index in the buffer.
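The following is a minimal, generic sketch of that mapping in the PreExecute() method. It is shown for an input column (the TLog source adapter itself has no inputs), and the field name inputColumnIndex is illustrative.
private int inputColumnIndex;  // illustrative field that caches the buffer column index

public override void PreExecute()
{
    IDTSInput90 input = ComponentMetaData.InputCollection[0];
    // Map the lineage ID of the first input column to its index in the pipeline buffer.
    inputColumnIndex = BufferManager.FindColumnByLineageID(
        input.Buffer, input.InputColumnCollection[0].LineageID);
}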
As there is no input associated with this component, the ProcessInput() method is generally not used when creating source adapters.
The next important method for the custom source adapter is the PrimeOutput()
method. This method is used for calling external methods to process and
manage the outputs. It adds rows to the output buffers to send data to
the downstream component. This method initializes an object of the TLogReader
class to call different methods to unpack the file. The TLog file and
the template file that are defined through the custom properties of the
source adapter are sent as input parameters to the methods in the TLogReader
class, which unpacks the packed decimal file. The return output is
inserted into the output buffer. As the buffer becomes full, the data
is pushed to the downstream component and processed further.
public override void PrimeOutput(int outputs, int[] outputIDs, PipelineBuffer[] buffers)
{
IDTSOutput90 output = ComponentMetaData.OutputCollection[0];
PipelineBuffer buffer = buffers[0];
try
{
buffer.AddRow();
TLOGReader reader = new TLOGReader();
IDTSCustomProperty90 tlogFile = ComponentMetaData.CustomPropertyCollection["TLogFile"];
IDTSCustomProperty90 templateFile = ComponentMetaData.CustomPropertyCollection["TemplateFile"];
reader.ProcessTLogFile(((String)tlogFile.Value), ((String)templateFile.Value));
buffer[0] = reader.ReaderOutput.ToString();
}
catch
{
}
finally
{
/// Notify the data flow that we are finished adding rows to the output.
buffer.SetEndOfRowset();
}
}
The TLogReader
class contains methods to unpack the packed decimal file. The data in
the packed decimal file is extracted as a stream of binary data. The
Perl language contains functions called Pack() and Unpack(). Pack() takes
an array or list of values and packs it into a binary structure,
returning the string containing the structure. Unpack() does the
reverse of Pack(). It takes the binary structure and unpacks it by
using the template that was used to pack the data. The template is a
sequence of characters that give the order and type of the values that
were used to pack the values into a binary structure. Unfortunately,
the .NET Framework does not contain any equivalent functions to unpack
binary files by using a template. Hence, a custom method was
implemented called Unpack. This method accepts the
binary data and the template specification file as input. The template
contains a sequence of characters that give the order and type of
values. Based on the encoding that is provided in the template file,
this custom Unpack method converts the binary data into appropriate values.
The
template file that was used to unpack the packed decimal file that was
provided as input to the TLog source adapter is as follows.
"00" => ["H2","H4","H5","H6","H4","H8"]
"01" => ["H2","A12","H4"]
"03" => ["H2","A5","A12"]
Where
H: A hex string (high nybble first).
A: An ASCII string.
Digit: Number of bytes that contain the particular encoding.
As
discussed earlier, the first two hexadecimal numbers of each row
represent the row type and define the rest of the specification to
parse the file. Based on this specification, the file is unpacked. For
example, if the template file specifies that the first two bytes are
hexadecimal (high nybble first), the unpack function unpacks the file
in the following manner.
switch (FormatSpecifier[0])
{
case 'H': //Hex
String CurrentHex;
int HexStrLen = 0;
foreach (Byte CurrentByte in Bytes)
{
if (ExpectedCharCount != -1 && HexStrLen >= ExpectedCharCount)
break;
CurrentHex = String.Format("{0:X2}", CurrentByte).ToLower();
Result.Append(CurrentHex);
HexStrLen += 2;
}
break;
Similarly, ASCII characters are also unpacked by using the proper encoding. The result set is appended as a DT_TEXT data type. The complete file is unpacked in this manner and the result set is sent to the output pipeline buffer.
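A minimal sketch of the corresponding ASCII branch, assuming the same local variables (FormatSpecifier, Bytes, ExpectedCharCount, and Result) as the hexadecimal case above, might look like the following.
case 'A': // ASCII
    int AsciiCharCount = 0;
    foreach (Byte CurrentByte in Bytes)
    {
        if (ExpectedCharCount != -1 && AsciiCharCount >= ExpectedCharCount)
            break;
        // Append the byte as a single ASCII character.
        Result.Append((char)CurrentByte);
        AsciiCharCount++;
    }
    break;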
A sample output of the unpacked file looks like the following:
00:0006:0006:0408021033:00:32:101254:f42587:3297::f9:38::f433:","","99:ZIPCODE:11999","90:DESC:XYZ-24C
BW","01:288206:1999:f130:000000:f128::00","90:DESC:XYZ
ITEM,"01:112811:f790:f130:000011:f128::00",
"20:0023:0011:0408020945:10:101112:I:"
"20:0019:0011:0405020945:10:1012:I:"
The Custom TLog Transformation
The
data is read and extracted from the packed decimal file by the custom
source adapter. It is then unpacked. The unpacked data is sent to the
custom transformation component where it can be parsed. The custom
transformation receives a single row with a single column as the input
and parses the data based on the specification file into a format that
is meaningful and can be understood. The specification file contains
the template that is used to parse the hexadecimal and ASCII characters
into a recognizable format, which can be processed or cleansed further
in the data flow pipeline. This specification file is loaded as an .ini
configuration file.
The unpacked input to the TLog transformation component contains rows with the following characteristics.
- The rows can either be delimited by a comma or EndOfLine character.
- The columns are delimited by colons.
- The
first two hexadecimal digits in each row of the unpacked data define
the type of transaction and hence the format of the rest of the row.
Based
on the specification, the data is read and the transactions are
separated into several outputs. Example outputs might be header
information, rebates information, sales information, or user
information.
As discussed previously, the POS application can
keep a file open for several days. Or, several files can be opened in a
single process. The close of a file is defined by a close transaction.
Designing an asynchronous transformation component
The
transformation component can be either synchronous or asynchronous. The
transformation component is defined as asynchronous if a component
cannot output rows until it has received all its input rows, or when
the transformation does not produce exactly one output row for each row
received as input. In this project, the custom source adapter sends the
data as a single row with a single column. This data needs to be parsed
into multiple different outputs based on the specification. The entire
input is received from the custom TLog source adapter and processed to
generate multiple different rows. These rows are redirected to separate
outputs. Hence, the output in the TLog transformation component is
asynchronous.
The input, output, and custom properties are defined in the ProvideComponentProperties method.
By default, the base class creates an input and an output and sets the
output to be synchronous with the input. In this project, since the
output is asynchronous in this component, this property is defined by
setting the SynchronousInputID property to 0 as shown in the following code.
public override void ProvideComponentProperties()
{
/// Support resettability.
/// The base class calls RemoveAllInputsOutputsAndCustomProperties to reset the
/// component. Used here to highlight the functionality of the base class.
base.RemoveAllInputsOutputsAndCustomProperties();
// Let the base component add the input and output.
base.ProvideComponentProperties();
// Name the input and output, and make the output asynchronous.
ComponentMetaData.InputCollection[0].Name = "TLOG Input";
IDTSOutput90 headerOutput = ComponentMetaData.OutputCollection[0];
headerOutput.Name = "HeaderOutput";
headerOutput.Description = "Header rows are directed to this output.";
headerOutput.SynchronousInputID = 0;
/// Add the other outputs.
IDTSOutput90 closeOutput = ComponentMetaData.OutputCollection.New();
closeOutput.Name = "CloseOutput";
closeOutput.Description = "Close rows are directed to this output.";
.....
.....
......//Other outputs are defined ....
.....
.....
/// Add the custom property to the component.
/// Config File is the .ini template file. The TLog parser will parse the files
/// based on this template.
IDTSCustomProperty90 INIproperty = ComponentMetaData.CustomPropertyCollection.New();
INIproperty.Name = "Config File";
INIproperty.Description = " Ini file to process the tlog information";
/// Each TLog file is associated with a store.
IDTSCustomProperty90 StoreName = ComponentMetaData.CustomPropertyCollection.New();
StoreName.Name = "Store Name";
StoreName.Description = "Name of the store to be processed";
}
For
each output, the processed rows contain different numbers of columns.
Because the output contains an unequal number of columns for each row,
for ease of implementation, each row is pushed with all the columns
appended as a single string. Hence, there is a single column per
output. The output data can be further transformed by using other
built-in components such as the Derived Column component, before the
data is loaded into the data warehouse relational tables.
Because
there is only one column as output, the user is not allowed to add or
remove columns from the outputs at design time. However, users are
allowed to process any number of outputs that they want to. Hence,
users can delete any outputs that they do not require. To prevent users
from adding or removing columns from the output, methods such as InsertOutput() and DeleteOutputColumn() were overridden to throw appropriate error messages.
public override void DeleteOutputColumn(int outputID, int outputColumnID)
{
throw new Exception("Columns cannot be deleted. Remove the output instead");
}
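A companion override can block new outputs from being added in the same way. The following is a sketch; the paper shows only the DeleteOutputColumn() override.
public override IDTSOutput90 InsertOutput(DTSInsertPlacement insertPlacement, int outputID)
{
    throw new Exception("Outputs cannot be added to this component.");
}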
At design time, the output columns are defined based on the input column properties. The PipelineComponent.SetUsageType()
method is used to create a mapping between the input and output
columns, as well as to define the data type and other properties of the
output column.
public override IDTSInputColumn90 SetUsageType(int inputID, IDTSVirtualInput90 virtualInput, int lineageID, DTSUsageType usageType)
{
IDTSVirtualInputColumn90 vCol = virtualInput.VirtualInputColumnCollection.GetVirtualInputColumnByLineageID(lineageID);
IDTSInputColumn90 col = null;
/// Only support for text columns.
if ((vCol.DataType != DataType.DT_TEXT))
throw new Exception("Only Text datatypes are supported.");
/// If the usageType is UT_IGNORED, then the column is being removed.
/// So throw the appropriate error
if (usageType == DTSUsageType.UT_IGNORED)
{
throw new Exception("columns cannot be removed");
}
else if (usageType == DTSUsageType.UT_READWRITE)
{
throw new Exception("Read write not supported.");
}
else
{
/// Let the base class add the input column.
col = base.SetUsageType(inputID, virtualInput, lineageID, usageType);
/// Add an output column
AddOutputColumn(ComponentMetaData.OutputCollection[0].ID, col);
AddOutputColumn(ComponentMetaData.OutputCollection["CloseOutput"].ID, col);
.....
.....
.....................// to add additional output columns to the outputCollection
.....
.....
}
return col;
}
Runtime methods for an asynchronous component
The
runtime methods are called when the package is executed. The first time
that the output and input collection can be accessed during execution
is in the PreExecute() method. The difference between the
custom source adapter and the custom transformation component is in the
input rows from the upstream component. There are no inputs for the
source adapter. It directly connects to the external source. The custom
transformation component receives inputs from the upstream component,
processes the data, and sends it on to the downstream component(s). If
the output is synchronous, it follows the simple 'row in and row out'
rule. For asynchronous outputs, because the input collection is
different from the output collection, different pipeline buffers are
created for the inputs and outputs.
Note If
the transformation component is synchronous, the same input row is sent
as an output row after the data is transformed. As there is an output
for every input, the same input buffer can be re-used to store the
output as the buffer contains the entire column collection. A single
buffer is shared between the input and the output. The buffer contains
the columns for both the input and output.
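For contrast, a minimal sketch of a synchronous ProcessInput() method (this is not the pattern used by the TLog transformation) transforms each row in place in the shared buffer. The columnIndex variable is an illustrative field that holds a buffer column index resolved earlier, for example in PreExecute().
public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
    while (buffer.NextRow())
    {
        // Transform the value in place; the same buffer row flows downstream.
        string value = buffer.GetString(columnIndex);
        buffer.SetString(columnIndex, value.Trim());
    }
}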
For the
asynchronous component, the input rows (the output rows from the
upstream component) are received by the input buffer when the output
buffer in the upstream component is full during the ProcessInput() method.
This method is repeatedly called during the input process whenever the
buffer is passed to it. The difference between the ProcessInput() and PrimeOutput() methods is that ProcessInput() is called repeatedly, once for each buffer that is passed to it, while PrimeOutput() is called only once. The data is pushed to the downstream component whenever the pipeline buffer is full. The EndOfRowset property is set to true after the last row. This indicates to the downstream component that the input rows are finished. Similarly, the input rows are processed until the EndOfRowset is reached. For the TLog transformation, there is only one input from the upstream source adapter.
Because the EndOfRowset property defines the end of the input rows, the developer needs to set this property for the output rows when all the rows have been processed. This is achieved by calling the SetEndOfRowset() method. The following code shows the implementation of the ProcessInput() method.
public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
    try
    {
        // Read all the rows of the input before processing.
        // TinputBuffer is a member variable that holds the unpacked input string.
        if (!buffer.EndOfRowset)
        {
            IDTSInput90 input = ComponentMetaData.InputCollection.GetObjectByID(inputID);
            while (buffer.NextRow())
                TinputBuffer = buffer.GetString(0);
        }
        else
        {
            // The final buffer has arrived; parse the accumulated input.
            // Extract the custom property for the config file.
            IDTSCustomProperty90 INIFile = ComponentMetaData.CustomPropertyCollection["Config File"];
            // Extract the custom property of the store name.
            // Call the TLog parser.
            PipelineBuffer[] buffers = new PipelineBuffer[_Buffers.Count];
            IDictionaryEnumerator myEnumerator = _Buffers.GetEnumerator();
            // Set the end of rowset for all the output buffers.
            while (myEnumerator.MoveNext())
            {
                PipelineBuffer NextBuffer = (PipelineBuffer)myEnumerator.Value;
                if (NextBuffer != null)
                    NextBuffer.SetEndOfRowset();
            }
        }
    }
    catch
    {
        throw new Exception("Error in processing the file");
    }
}
The TLog parser accepts the unpacked string from the
source adapter as input and accepts the .ini configuration file as
additional input by setting the custom property. It processes the rows
based on the specifications provided in the .ini configuration file.
Following is a sample specification in the .ini file.
[00]
Filename="Header Item"
DelimiterCharacter=","
Outputs="%store,0-2-2,0-1-0,0-2,----------------------------
[03]
Filename="Sales Receipt"
DelimiterCharacter=","
Outputs="%store,3-0,3-1,.............................
The DelimiterCharacter option in the .ini file (a comma in the sample shown) specifies the row delimiter. Based on this delimiter, several rows are extracted from the single row that is received from the upstream source adapter. Each row is further divided based on the EndOfLine character, and each column is delimited by a colon. The hexadecimal digits in the first column of each row classify the type of transaction, such as header or sales. Based on the transaction type, additional specifications such as subtypes or field values can be extracted and used for parsing the row. The type and field information of the data is indicated in the Outputs option of the specification for each type of transaction.
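To make the specification concrete, the following is a hedged sketch (illustrative only, not the actual Project REAL parser) of how a single unpacked string might be split according to such a specification. The delimiters and the hexadecimal type column follow the description above, while the method name and the handling of each transaction type are assumptions.
//Illustrative only: split the unpacked string into rows using the
//delimiter from the .ini file (a comma in the sample), then split each
//row into colon-delimited columns. The first column carries the
//hexadecimal transaction type (for example, "00" for a header item).
private void ParseUnpackedString(string unpacked, char rowDelimiter)
{
    foreach (string row in unpacked.Split(rowDelimiter))
    {
        if (row.Length == 0)
            continue;
        string[] columns = row.Split(':');
        string transactionType = columns[0];
        switch (transactionType)
        {
            case "00":
                //Extract the fields listed in the Outputs option of the
                //[00] "Header Item" section
                break;
            case "03":
                //Extract the fields listed in the Outputs option of the
                //[03] "Sales Receipt" section
                break;
            //... other transaction types ...
        }
    }
}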
Once the rows are parsed by the TLog custom transformation, they are written to the output buffers that the PrimeOutput() method assigns to each of the selected outputs; each output is assigned its own buffer. For the TLog Parser transformation, the output is pushed to the downstream component as a single column. As discussed in Designing an asynchronous transformation component, this design was adopted because the rows contain varying numbers of output columns.
#region PrimeOutput
/// <summary>
/// The data flow task provides a buffer for each output
/// that is connected to a downstream component.
/// </summary>
public override void PrimeOutput(int outputs, int[] outputIDs, PipelineBuffer[] buffers)
{
    _Buffers = new Hashtable();
    for (int x = 0; x < outputIDs.Length; x++)
    {
        IDTSOutput90 output = ComponentMetaData.OutputCollection.GetObjectByID(outputIDs[x]);
        switch (output.Name)
        {
            case "HeaderOutput":
                HeaderBuffer = buffers[x];
                _Buffers.Add("HeaderOutput", (PipelineBuffer)this.HeaderBuffer);
                break;
            //...
            //TODO: Add the other outputs selected by the user during design
            //      stage and assign the buffers
            //...
        } //end of switch
    } //end of loop
} //end of PrimeOutput
#endregion
The unpacked and transformed data looks like the following:
1,00,0004,0402030853,03,6,0,0,0,0,0,0,2628,0,0,1,0,0,0
1,1,01,0004,0005,0402030910,222222,1500,000000,128,0,0,0,0
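Because each selected output receives its parsed data as a single column, a record such as the header row above might be written to the Header output buffer roughly as follows (a hedged sketch; HeaderBuffer is the buffer cached in PrimeOutput(), and the single output column is assumed to be at index 0).
//Hedged sketch: push one parsed header record to the Header output
//buffer cached in PrimeOutput().
private void WriteHeaderRow(string parsedRecord)
{
    if (HeaderBuffer != null)
    {
        HeaderBuffer.AddRow();                    //allocate a new output row
        HeaderBuffer.SetString(0, parsedRecord);  //assumed single column at index 0
    }
}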
Deployment
Once the coding is complete and the code has been verified for correctness, each component is compiled into its own DLL and deployed in SQL Server Integration Services. It is important to ensure that developers have the security permissions required to access the referenced libraries, to copy the compiled assemblies into the global assembly cache and the pipeline components directory, and to create packages in SSIS. Because each assembly must be saved in the global assembly cache, it must be assigned a strong name for deployment.
Following are the requirements for deploying a pipeline component.
- Generate a strong name for the assembly.
- Compile the task.
- Move the task to the appropriate directories.
- Add the task to the SSIS Toolbox.
- Create the package using the custom components.
This section describes each of these requirements.
Generating a Strong Name for the Assembly
Because this assembly needs to be shared among applications, it must be copied into the global assembly cache in addition to the component directory. Assemblies can be copied into the global assembly cache only if they have a strong name: a fully qualified name that includes the assembly's name, culture, public key, and version number, which guarantees that the assembly is globally unique. You can use the Strong Name tool (sn.exe) to generate a strong name.
The Strong Name tool generates a public/private key pair. This key pair is used during compilation to create a strong-named assembly. The tool is run from the command prompt by specifying the name of the output file that will contain the key pair.
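For example, the following command generates a key pair file named TLogprocess.snk (the file name referenced in the AssemblyInfo.cs example below).
sn.exe -k TLogprocess.snk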
You can specify the location of the key pair file in two ways: by editing the AssemblyInfo.cs file or by using the Assembly Linker tool (Al.exe). The following line in the AssemblyInfo.cs file includes the path of the file containing the key pair.
[assembly:AssemblyKeyFileAttribute(@"..\..\TLogprocess.snk")]
In addition, the version number must not change each time the project is compiled. To achieve this, specify a constant version number in the AssemblyInfo.cs file. The following attribute fixes the version number at 1.0.0.0.
[assembly:AssemblyVersion("1.0.0.0")]
Compiling the Task
The projects can be compiled either in Visual Studio 2005 or from the command prompt.
To compile the TLog transformation task in Visual Studio 2005
- On the File menu, select Open, and then select Project.
- Open the TLogProcess.csproj project.
Note: The TLog transformation task project is named TLogProcess.
- Press F5, or click Start on the Debug menu to compile and run the project.
To compile the file using the command prompt
- Use the Change Directory command to change to the directory containing the custom component projects.
- Type the following command.
for /r %f in (*.sln) do msbuild.exe "%f"
Moving the Task to the Appropriate Directories
During installation of SSIS, separate directories are created to hold the assemblies of tasks and pipeline components. For pipeline components, the default directory is %SystemDrive%\Program Files\Microsoft SQL Server\90\DTS\PipelineComponents. After the tasks are compiled, copy the assembly files (.dll) into this directory.
Because these assemblies must be shared, it is also necessary to save a copy of the assembly files in the global assembly cache. These assemblies have already been signed with a strong name. To copy a file into the global assembly cache, you can use the command-line tool gacutil.exe or drag the assembly into the %windir%\assembly directory.
The following command uses gacutil.exe to copy the file into the global assembly cache.
gacutil.exe -iF "<assembly path>"
Adding the Task to the SSIS Toolbox
Once the assemblies have been compiled and copied to the appropriate directories, the last step is to add them to the SSIS Toolbox so that users can access the custom components. Placing the components in the directories does not by itself make them visible in the Toolbox. SQL Server Integration Services (formerly the DTS Server) caches information about tasks and components because enumerating them is expensive. Hence, developers need to restart the service to refresh the list of components. Once the service is restarted, the components can be added to the Toolbox.
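For example, assuming the default service name of the SQL Server 2005 Integration Services service (MsDtsServer), the service can be restarted from a command prompt as follows.
net stop MsDtsServer
net start MsDtsServer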
To access the custom components
- On the Tools menu, select Choose Toolbox Items.
- In the Choose Toolbox Items dialog box, select the SSIS Data Flow Items tab.
- Select the check boxes for the data flow components that you want to appear in the SSIS Designer in Business Intelligence Development Studio. As shown in Figure 5, selecting the TLogSource (source adapter) and TLogProcess (transformation) components makes these components visible in the SSIS Designer.
Figure 5. Selecting the custom components in Toolbox for display
Now
users can view the components in the Toolbox when they open the SSIS
Designer. Of course, they must have appropriate permissions to work
with the SSIS Designer and to access tasks on the machine. The user can
drag and use these components the same as any other built-in component.
Once the custom properties are assigned, the components are ready to
execute.
Creating the Package Using the Custom Components
This
section covers how to create a Data Flow task by using the custom
components. In this example, we create a sample package by using the
custom source adapter to unpack a packed decimal file, parse it with
the custom TLog transformation component, and write it to flat files.
To add the custom components in a package
- To start Business Intelligence Development Studio, open the Programs menu and select Microsoft SQL Server 2005 CTP. Select Business Intelligence Development Studio.
- On the File menu, select New, and then select Project. For the project type, select Business Intelligence Projects.
- In the New Project Templates dialog box, select Integration Services Project.
- Provide the Name, Location, and Solution Name of the project. The Package Design Editor will open.
- Drag and drop the Data Flow task from the Toolbox onto the SSIS designer surface in Business Intelligence Development Studio.
- Open the Data Flow Task Editor. From the Toolbox in the Data Flow Task Editor, select and add the TLog Source Adapter to the design surface.
- Right-click the TLog Source Adapter component and select Show Advanced Editor, as shown in Figure 6.
Figure 6. Adding the TLog source component
- In
the Advanced Editor for TLog Source window, assign the custom
properties. Set the connection path of the packed decimal file and the
path of the template file as shown in Figure 7.
Figure 7. Setting custom properties for the TLog source adapter
- Ensure that the settings on the Column Mappings tab and the Input and Output Properties tab of the TLog Source Adapter are correct. Click Column Mappings on the Source Adapter and select the input columns to include. (See Figure 8.)
Figure 8. Mapping the input columns to the output collection
- Next, select the ProcessTLog
task (ProcessTLog is the name given to the TLog transformation
component). Connect the output of the TLog Source Adapter to the
ProcessTLog component.
- Open the advanced property editor for the ProcessTLog component. Under Custom Properties, set the Config File property to the path of the .ini configuration file and set the Store Name property, as shown in Figure 9.
Figure 9. Properties for the TLog transformation task
- Select the Input and Output Properties tab. This tab contains all the outputs. The custom transformation component has been implemented in such a way as to prevent users from removing output columns. If the user tries to remove an output column, the error message in Figure 10 is displayed. Note, however, that although you are not allowed to remove columns from the defined outputs, you are allowed to remove a complete output.
Figure 10. Error message thrown when the user attempts to remove a column
As shown in Figure 11, out of 24 outputs, only the Header and Close outputs are selected to send data downstream. As also shown in Figure 11, the property window for an output column displays the column's data type and other properties.
Figure 11. Properties of the output column for the transformation component
- To extract the data from the packed decimal file and save it to destination flat files, the Flat File Destination Adapter is selected. When the output of the custom TLog transformation is connected to the Flat File destination, the user can select which output to connect, as shown in Figure 12.
Figure 12. Selecting the output to map to the destination
- To store the Header information in a file, assign the file connections and the mapping for the Flat File Destination.
- Run the package. The output is stored in the Header flat file, as shown in Figure 13.
Figure 13. Successful completion of the data flow for TLog packed decimal files
Debugging
It is imperative that your code is tested thoroughly before it is put into production. The developer should initially test the code for syntactic correctness and general logic. Compiling the project ensures that the code is syntactically correct. For semantic validation, the developer needs to verify the correctness of the code by attaching a debugger to the process that is hosting the task and setting appropriate breakpoints. The implemented components are hosted by applications such as Business Intelligence Development Studio and can be debugged by associating the source code with the process of the environment that is hosting the task. Breakpoints can be set at different locations in the Visual Studio development environment and the output data checked to ensure that the implemented logic works as expected.
One way to debug the runtime methods of implemented custom components is to use the DTEXEC command utility to run the package.
To use DTEXEC to run the package
- Create an SSIS package that uses the implemented custom components.
- Open Visual Studio and open the project for the custom component.
- Select Project, and then select <projectname> Properties.
- Set the project to Debug mode.
- Select the Debug property page.
- Set the following properties.
Mode: Debug
Start external program: <path to the exe>\DTEXEC.exe
Command line arguments: /f "<pathtoyourSSISpackage>"

Figure 14. Using DTEXEC to run package
DTEXEC is the command-line utility for running SSIS packages. Set breakpoints and watch points on the appropriate methods in the implemented code. When the code is executed using F5 or Start in Debug mode, DTEXEC launches the associated SSIS package, and you can inspect the output values to verify the validity and correctness of the code.
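The package can also be run directly from a command prompt in the same way; for example (the package path shown here is illustrative only):
DTEXEC.exe /f "C:\SSISPackages\TLogSample.dtsx"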
You can also debug the runtime methods of a running package by attaching to the DtsDebugHost.exe process from the project that contains the source code. For this approach, a pause needs to be introduced during execution of the package. A pause can be introduced by setting an OnPreExecute breakpoint on the task. Once the package execution is paused, you can attach to the DtsDebugHost.exe process and debug the source code by inserting breakpoints at the appropriate places.
You can also set debugging properties in Business Intelligence Development Studio. For example, you can set data viewers between two data flow components to monitor the data flowing to the downstream component and the number of rows and columns that are transferred. In this case, the TLog Process transformation task parses and sends around 338 rows to the Header file. As shown in Figure 15, you can also see the content of the rows to validate the data. In addition, the data viewer indicates the number of rows transferred to the next component, which makes it a useful debugging tool.

Figure 15. Output displayed in the data viewer
When the custom components have been coded, tested, and deployed properly, they can be used like any other built-in data flow component.
Lessons Learned
For
Project REAL, to improve the efficiency and performance of the ETL
package, we decided to include the parsing phase of the packed decimal
files in the SSIS pipeline itself. To achieve this, it was necessary to
implement custom components.
For designing the custom
components, the initial challenge was to understand the new SSIS object
model and programmability features. Understanding which interfaces and
methods to override is the first key to developing a custom component.
One striking thing that we learned while implementing these custom
components was how easily you can extend the current model to create
your own custom tasks and components. In addition to understanding and extending the SSIS object model, successful implementation of these custom components required decompressing and parsing the packed decimal TLog files according to complex specifications. We had to implement custom methods to unpack and parse the data into the required format.
Following is a summary of some of the lessons we learned during the process.
- To ensure correct execution, the component must be configured properly. The developer should ensure that the correct input and output collections are defined. Any additional properties required for configuration should be exposed as custom properties. These properties should be defined clearly in the ProvideComponentProperties() method and validated using the Validate() method (a minimal sketch of this pattern follows this list).
- The external data source for the source adapter needs to be defined by using either the ConnectionManagers class or a custom property.
- Implementing
a custom component involves designing for both the design-time and the
runtime methods. During design time, the metadata may change
frequently. Hence, flexibility is an important criterion to include in
the design when you implement design-time methods. Adding or removing
input or output columns or changing the metadata should not break the
component. Proper validation methods and error messages must be
incorporated in the design to ensure that the input and external
sources are validated and that users do not remove the required outputs
when they create the package.
- One of the critical functions is the Validate() method, as it ensures that the component is configured correctly for execution. One important thing to remember is that the Validate() method should only be used to validate the configuration and to raise error events; errors should not be fixed inside this method. The rule to remember is that the state of the component must not be changed in the Validate() method. This rule is necessary because the designer assumes that this method is used only for raising events on error and calls it at will. If the component changes during these calls, the behavior of the application can be unpredictable.
- Pipeline buffers are a very important factor to consider when designing the runtime methods. The first step is to determine whether your component is synchronous or asynchronous. If the component is synchronous, the same input buffer is used for the output. If the component is asynchronous, the PrimeOutput() method must specify the pipeline buffers that are associated with the outputs. Hence, an asynchronous component is the only type of data flow component that receives an input buffer from the ProcessInput() method and output buffers from the PrimeOutput() method.
- The
main challenge that we faced in this project was the fact that the TLog
transformation component has 24 outputs (if the user chooses to use all
24 outputs). To ensure proper data flow, 24 pipeline buffers had
to be initialized and assigned to the appropriate outputs. In addition,
in order to extract the correct data, it is imperative that the columns
of the required outputs are matched correctly with the associated
column index in the buffers.
- SQL Server Integration Services (formerly the DTS Server) caches information about all the tasks and components in the Toolbox to avoid costly enumeration. Consequently, if the SQL Server Integration Services service is running, it must be restarted during deployment so that the newly deployed custom components are picked up.
- For deployment, make sure that the developer of the custom component has the necessary privileges to copy the assembly into the global assembly cache and the pipeline components directory. The custom component needs to be copied to both directories.
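The following minimal sketch (illustrative only, and simplified from what the Project REAL components actually do) shows the pattern described in the first and fourth points above: ProvideComponentProperties() defines the Config File custom property used by the transformation, and Validate() only checks the configuration and raises an error event without changing the state of the component.
//Illustrative sketch of design-time configuration and validation.
public override void ProvideComponentProperties()
{
    base.RemoveAllInputsOutputsAndCustomProperties();

    //Define the custom property that holds the path of the .ini file.
    IDTSCustomProperty90 configFile = ComponentMetaData.CustomPropertyCollection.New();
    configFile.Name = "Config File";
    configFile.Description = "Path of the .ini configuration file";

    //Inputs and outputs (for example, the 24 TLog outputs) are added here.
    ComponentMetaData.InputCollection.New().Name = "TLogInput";
}

public override DTSValidationStatus Validate()
{
    object configPath = ComponentMetaData.CustomPropertyCollection["Config File"].Value;
    if (configPath == null || ((string)configPath).Length == 0)
    {
        //Raise an error event only; do not try to repair the configuration
        //or otherwise change the state of the component here.
        bool cancel;
        ComponentMetaData.FireError(0, ComponentMetaData.Name,
            "The Config File property is not set.", "", 0, out cancel);
        return DTSValidationStatus.VS_ISBROKEN;
    }
    return base.Validate();
}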
Conclusion
The new SSIS object model has been re-engineered with extensibility in mind. Because the object model can be extended using managed languages, you can easily create your own custom tasks and components, or customize any object in the SSIS model.
This paper illustrates one application of the extensibility mechanisms: loading data from an unusual format directly into a production data warehouse. This approach eliminates intermediate storage and processing steps when handling the data, which results in higher performance and a more maintainable system. These mechanisms are easy to learn, and data flow performance can be tuned through careful use of the pipeline buffers. The result is a more manageable, flexible, and robust ETL process for loading data into the data warehouse.
For More Information
To learn more about SQL Server 2005, including Integration Services, visit the
Microsoft SQL Server 2005 Developer Center [ http://www.microsoft.com/sql/ ]