Elizabeth Vitt, Intellimentum
Hitachi Consulting
Technical Reviewers:
Donald Farmer
Microsoft Corporation
Stacia Misner
Hitachi Consulting
March 2006
Applies to:
SQL Server 2005
Master Data Management (MDM)
SQL Server Integration Services (SSIS) 2005
Summary:
This white paper describes how application developers can leverage the
functionality of Microsoft SQL Server 2005 Integration Services to
address the data integration challenges of Master Data Management
applications. (19 printed pages)
Download the Word document version of this article:
http://download.microsoft.com/download/f/1/c/f1cf7b8d-7fb9-4b71-a658-e748e67f9eba/data_sol_for_mdm.doc
Contents
Introduction
SSIS Solution Overview
Accessing Diverse Data Sources
Managing Unique Identifiers
Cleansing Attributes
Categorizing Entity Activity
Extracting Attributes
Conclusion
Introduction
Data
management is one of the most complex and expensive challenges facing
the agile enterprise. Frustrated with the difficulty and inaccuracy of
manually assembling inconsistent, redundant, and outdated data, many
organizations are seeking a new generation of data management solutions
to seamlessly convert hundreds of data sources into powerful data
assets that can be shared across the enterprise.
To guide their
data management efforts, many enterprises adopt Master Data Management
(MDM), a lifecycle strategy for creating, organizing, and managing
comprehensive data assets across the enterprise. MDM solutions provide
a master system of record for a particular business entity and then
publish this master view to a variety of consumer applications. MDM
applications can be built around virtually any data entity—customers,
partners, products, vendors, employees, and so on—wherever a master
data view is needed. In fact, many MDM applications have distinct names
to describe their solution area. For example, Product Information
Management (PIM) is used to describe product-focused MDM applications
and Customer Data Integration (CDI) is used to describe
customer-focused MDM applications. Since MDM solutions span a variety
of business and technical functions, creating a successful MDM
application requires a multi-disciplinary approach to data management
that includes data integration, data modeling, data publishing, and
data synchronization.
Of all these components, data
integration continues to be the most critical and expensive aspect of
an MDM solution. Master views are created by integrating data from a
variety of internal data sources such as enterprise resource planning
(ERP), customer relationship management (CRM), business intelligence
(BI), and legacy systems, as well as external data from partners,
suppliers, or syndicators. These integration processes typically
present a variety of data integration challenges that can result in
high development costs and extended project timelines. To work through
these challenges in a cost-effective manner, IT requires a flexible
data integration architecture that offers a broad set of technologies
and tools to create customized integration solutions.
Microsoft
SQL Server Integration Services (SSIS) 2005 provides a full-featured
enterprise data integration platform and rich development environment
for creating, organizing, and managing comprehensive data assets across
the enterprise that can be leveraged specifically for MDM applications.
SSIS is also backed by an expansive ecosystem of partners who have built
specialized MDM application solutions on the core SQL Server platform.
Using SSIS 2005, an enterprise can successfully create a broad range of
data integration solutions that support master business views, quickly
overcome data management challenges, and reduce overall master data
management costs.
SSIS Solution Overview
MDM Data Integration Requirements
While
every MDM solution will likely have distinct business and technical
requirements, as you integrate data for your MDM solutions you will
encounter three general categories of data: unique identifiers,
attributes, and transactions. Each of these data categories has
specific data integration challenges that you will need to overcome as
you build your MDM solution.
- Unique identifiers—Unique identifiers
are the building blocks for defining a business entity's master system
of record. As you bring together data from multiple data sources, you
must have a consistent mechanism to uniquely identify an entity across
different business functions. Unfortunately, there is not always a
common unique identifier that identifies a particular entity in the
same way across all of the source systems. For example, when creating
an employee MDM application, you discover that your organization's
human resources (HR) system, sales system, sales forecasting system,
Microsoft Active Directory directory service, and CRM system each use
different unique identifiers for sales representatives. To build the
MDM solution, you must establish an alternate method for successfully
combining this data together without duplicating data. Duplicating data
may sound like something that is easy to avoid; however, you will find
that some types of duplicate data are more difficult than others to
uncover.
- Attributes—Once you determine the unique
identifier for an entity, you can add attributes that provide
meaningful business context, categorize the business entity into one or
more groups, and describe the entity's relationship to other business
entities. These attributes may be directly obtained from source
systems. For example, you may bring together employee profile
attributes from your HR system with employee e-mail addresses from your
organization's Active Directory. Alternatively, attributes can be
derived once the data is combined. As you combine attributes, you need
to manage conflicts between attributes from various systems. For
example, you may have employee address information that differs between
source systems. To correct the problem, you need to define business
rules that determine which address provides the best system of record
for each employee.
- Transactions—Transactions describe
the activities that a particular business entity engages in. For
example, a sales representative might participate in sales
opportunities, sales transactions, and sales forecasts. Within MDM
applications, transaction data frequently requires some degree of
aggregation to summarize the source detail into a meaningful format.
Once aggregated, transaction data can be used to derive attributes that
characterize an entity based on its activity or lack of activity.
SSIS Data Integration for MDM
SQL
Server 2005 Integration Services (SSIS), the successor to SQL Server
2000 Data Transformation Services, is an enterprise data-integration
platform for a new generation of integration solutions that must
quickly integrate large volumes of data from diverse data sources with complex
transformations and data cleansing.
Behind the scenes, the high-performance architecture of SSIS includes two powerful engines:
- Parallel runtime engine—The runtime engine coordinates the flow of control between tasks within SSIS.
- Fast, efficient data flow engine—The
data flow engine extracts data from one or more data sources, performs
any necessary transformations on the extracted data, and then delivers
that data to one or more destinations. To maximize efficiency, the data
flow engine takes advantage of in-memory processing to eliminate the
overhead normally added by physically copying and staging data in
tables at different points in the data integration process. By
manipulating the data in memory as it is transferred from source to
destination, the data flow engine reduces the number of manual steps
required to stage data, saves processing time, and helps you resolve
integration challenges within shorter data processing windows.
To support a broad range of data requirements for MDM solutions, SSIS provides the following data integration capabilities:
- Broad data connectivity—To integrate
diverse data, SSIS provides access to a variety of enterprise data
systems, structures, and formats that provide the unique identifier,
attribute, and transaction data for MDM applications.
- Robust data cleansing—To
ensure that data is combined in a meaningful manner, SSIS provides data
cleansing tools that can be leveraged to address common MDM data
integration challenges such as deriving unique identifiers and handling
attribute conflicts.
- Flexible data transformations—To
complement data cleansing, SSIS provides a variety of data
transformations that can be used to derive meaningful attributes from
source data and aggregate transaction data.
The sections
that follow describe how you can use these SSIS data integration
capabilities to solve common issues for MDM applications.
Accessing Diverse Data Sources
For
many enterprises, bringing together master data from CRM, ERP, and BI
systems involves unifying data from dozens, and possibly hundreds, of
diverse data sources. In a CDI scenario, for example, you may need to
bring together CRM data from a relational database management system
(RDBMS), sales data from mainframe file extracts, customer segmentation
data from Microsoft Excel, and third party demographic data via Web
services. Assembling customer data from these diverse data sources
requires an integration solution that can successfully access and
interpret their distinct interfaces, structures, and data types.
Solution
SSIS provides access to a variety of diverse data sources, including
databases, flat files, spreadsheets, packaged applications, and XML
data, through OLE DB, ODBC, and Microsoft .NET data providers supplied
by Microsoft and third parties.
SSIS can also
interpret and translate XML data by using an XML Schema Definition
(XSD) file or inline schemas. In addition, SSIS can consume data from
Web services using proxy classes and the Web service support of the
Microsoft .NET Framework. With this built-in support for XML and Web
services, SSIS can effectively participate in an organization's Service
Oriented Architecture (SOA) to create loosely coupled data integration
solutions.
To connect to sources and destinations for which a
provider is unavailable, SSIS connectivity can be easily extended by
creating custom data sources with scripted source components or
reusable adapters. You can also complement the data access
functionality of SSIS by leveraging the connectivity supplied by other
Microsoft data integration technologies such as Microsoft Host Integration
Server for mainframe connectivity and Microsoft BizTalk Server for
business-to-business data exchange.
A unique and valuable
connectivity feature of SSIS is its flexible definition of a data
destination. A destination may be a data store such as a database, but
it can also be an ADO.NET DataReader, which can be published on demand to downstream applications without the need to persist data in a data store.
In
the following sections, you will see several examples of how SSIS
connects to a variety of sources to bring together data for MDM
solutions.
Managing Unique Identifiers
While
data connectivity provides the mechanism to access master data from
various source systems, data cleansing ensures that the master data is
successfully integrated with a high degree of data quality and
consistency. To support the data cleansing needs of MDM, SSIS provides
robust functionality to help you manage unique identifiers across data
sources.
When you create an MDM solution that combines data from
multiple data sources, you may have difficulty merging data when there
is no common, unique identifier that is shared across the systems. This
situation usually occurs when each source system has its own distinct
way to represent and identify a business entity.
Consider the
following example. Your organization's CRM, sales, customer self-service,
and call center systems each use different identifiers to
uniquely identify customers. In the absence of a common unique
identifier, you must consider using an alternate method to link records
across systems. In this particular scenario, each system contains an
instance of the customer's first name and last name. At first glance,
you may think that the combination of first name and last name provides
the perfect unique identifier; however, upon further investigation, you
realize that only about 20 percent of the data can be exactly matched
on first name and last name. Although only a small percentage of the
records match exactly, there may still be additional duplicate customers
whose names are misspelled or entered slightly differently in each source
system.
For
example, in the CRM system, a customer is identified as Terry
Ramos; however, in the sales system she is identified as Theresa G
Ramos. Even though Terry Ramos does not identically match Theresa G
Ramos, the different representations refer to the same person. While
you may be able to identify some of these imprecise duplicates using
manual methods, you will likely require a more systematic and robust
solution that can be applied to an entire source data set to identify
unique and duplicate records.
Solution
To
create this solution, SSIS provides data-cleansing functions that you
can combine to manage unique identifiers in a variety of scenarios.
In
the ideal scenario, a robust unique identifier is shared among multiple
data sources. In this case, you can create an SSIS data flow to bring
data together using equi-joins or equi-match lookups across the data
sources. For more complex scenarios, when equivalent matching is not
feasible, you can apply SSIS fuzzy logic functionality to identify the
best possible data match.
Fuzzy logic provides a mechanism to
perform imprecise data matches. Within SSIS, there are two
transformations that use fuzzy logic: fuzzy lookups and fuzzy grouping.
- Fuzzy Lookups—Fuzzy lookups are well suited for matching dirty source
data to a known set of cleansed, standardized data such as a reference
table. As SSIS performs the fuzzy lookup, similarity and confidence
scores indicate the quality of each match. The similarity score
indicates how closely the input data resembles its proposed match; a
value of one indicates an identical match, and the closer the value is
to one, the closer the match. To complement similarity, confidence
describes the level of certainty that SSIS has about the match. For
example, when three distinct people have the same name, SSIS uses the
confidence score to indicate how certain it is that it has chosen the
correct person for the match. A simple acceptance condition built on
these scores is sketched after this list.
- Fuzzy Grouping—When standardized reference
data is not available, fuzzy grouping develops a master set of unique
records from a data set containing a combination of unique and
duplicate records. When you start MDM application development, creating
a master data set is often your first data integration challenge.
To create this system of record, fuzzy grouping examines source
data and derives a reference set of unique, canonical records. All
records in the data set are then evaluated against this derived
reference set for potential matches. Fuzzy grouping also creates two
additional fields, key_in and key_out, to help you identify unique and
duplicate records. The key_in column stores an SSIS-generated unique ID
for each record in the input data set. As the fuzzy grouping
transformation looks for matches, the key_out column stores the ID of
the proposed matched set. You can perform comparisons between the
key_in and key_out fields to identify whether a record is a part of the
reference set or a part of the set that is being matched. For example,
the records in the derived reference set have a key_in ID equal to the
key_out ID. As with the fuzzy lookup, similarity scores indicate the
quality of each proposed match.
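Whichever fuzzy transformation you use, the scores it produces can be evaluated with expressions in downstream transformations. As a minimal sketch of the acceptance condition referenced earlier, and assuming the Fuzzy Lookup output columns keep their default names of _Similarity and _Confidence, a Conditional Split condition that accepts only trusted matches might look like the following:
[_Similarity] >= 0.85 && [_Confidence] >= 0.80
The thresholds shown here are illustrative and should be tuned against your own data; the Fuzzy Grouping transformation exposes a comparable similarity score that can be filtered in the same way.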
While both fuzzy lookups and fuzzy grouping
transformations provide the core functionality to manage unique
identifiers, they can also be used in conjunction with other SSIS data
transformation functions to develop a comprehensive data integration
solution. The conditional split transformation, for example, allows you
to break up an input data set into multiple outputs based on your
business logic. Using this transformation, you can route records to one
or more destinations depending on the record's data content.
SSIS Example
Figure
1 illustrates an SSIS solution that uses a Fuzzy Grouping
transformation to manage customer unique identifiers across a variety
of data sources.
The SSIS solution uses the following steps to manage unique identifiers:
Figure 1. Managing unique identifiers
- Data Sources—To
begin, connectivity is established to all four data sources: a CRM OLE
DB data source, a Sales OLE DB data source, a Call Center flat file,
and a Customer Self Service XML file. For each data source, the first
name, last name, and the source system unique identifier columns are
extracted. The first name and last name columns will be the primary
inputs for the data cleansing logic. Even though they are not shared
across sources, the source system unique identifiers have been
extracted to maintain links back to each record's data source. These
identifiers remain unchanged through all steps of the data flow.
- Union All—The
source data is then combined into a single data set using a Union All
transformation. At this point, the data set contains a mix of duplicate
and unique data.
- Fuzzy Grouping—The Fuzzy Grouping
transformation consumes the mixed source data, defines a reference data
set of first and last names, and then attempts to match records to that
reference data set. The Fuzzy Grouping transformation has been configured
with a minimum similarity threshold of .25, so candidate matches with a
similarity below .25 are discarded. This threshold needs to be adjusted
based on the specific characteristics of your source data.
- Conditional Split—The
output of the fuzzy grouping transformation is sent to a Conditional
Split transformation that conditionally processes the data by
identifying three groups of records:
- Group 1—Reference Records. These
are records in which the key_in field equals the key_out field. All
other source records have been evaluated against this reference data
set.
- Group 2—Duplicate Records. These are
records that have strong matches to the reference dataset. A strong
match is a match having a similarity of > .9. Again, you have the
ability to adjust this threshold based on your specific data set.
- Group 3—Unique Records.
These are records that did not have a strong match to the reference
data set; that is, they have a similarity score of less than .9.
To provide tuning flexibility, the .9 value in this example can be
defined by a variable that is populated from an XML configuration file,
as illustrated in the expression sketch after this list. By maintaining
the similarity threshold value separately from the solution, you can more
easily tune the SSIS solution after reviewing the results of each
execution.
- Data Destinations—For illustrative purposes, each
of the groups created by the Conditional Split transformation has been
loaded to a different destination. In a real-world situation, you would
combine the reference records and unique records together to create a
master list of unique customer records.
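As a sketch of how the Conditional Split conditions in this walkthrough might be expressed (assuming the Fuzzy Grouping output columns are surfaced as key_in, key_out, and a similarity score column named _score; the exact names in your package may differ), the three groups could be defined as follows:
Group 1 (Reference Records): [key_in] == [key_out]
Group 2 (Duplicate Records): [key_in] != [key_out] && [_score] > @[User::MinSimilarity]
Group 3 (Unique Records): the default output, which receives all remaining rows
Here, @[User::MinSimilarity] is a hypothetical package variable that holds the .9 threshold; binding it to an XML configuration file lets you tune the value without editing the data flow.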
Cleansing Attributes
While
managing unique identifiers can help you cleanse duplicate records, you
will likely also require functionality to cleanse your data attributes.
In many situations, you need to perform cleansing to manage conflicting
attributes across different data sources.
Consider the
following example. You are creating a partner MDM solution that
integrates partner data from your CRM system, sales system, and partner
profile system. Each system contains distinct attributes that describe
the partner as well as overlapping attributes that are in conflict. One
such overlapping attribute is the partner's industry classification.
Each source system contains an industry classification for a partner
that, in many cases, varies across systems. For example, Contoso, a
large hotel chain, has an industry classification of hospitality in the
sales system, a classification of retail in the partner profile system,
and no industry classification in the partner CRM system. To handle
these overlapping attributes, business rules are required to specify
how the attribute conflicts are to be resolved.
For example, business rules could define the order of precedence of source system as follows:
- Sales—If a sales industry
classification exists for the partner, then it should take precedence
as the master industry classification.
- Partner profile—If the sales industry classification is empty, then the partner profile industry should be used.
- CRM—When both the sales industry and partner profile industry classification are empty, the CRM industry should be used.
To
resolve conflicting attributes, you must create an integration solution
that merges data from the CRM, sales, and partner systems, retrieves
any distinct attributes from each system, and then applies the business
rules to handle the conflicting industry classification.
Solution
SSIS
provides a wide range of opportunities to apply custom business rules
for cleansing attribute data. One transformation in particular, the
Derived Column transformation, provides a great deal of flexibility in
applying custom logic.
The Derived Column transformation is
defined using an expression. This customizable expression can contain a
combination of operators, functions, and variables to manipulate the
source data. In a manner of speaking, the Derived Column transformation
can be seen as a consolidation of many different types of
transformations in one easy-to-customize unit.
When you are
cleansing attribute data, you will likely be implementing a set of
conditional business rules to establish the best possible system of
record. In the partner MDM scenario, the derived column expression will
contain the conditional business rules that determine which industry
classification should be applied.
Once the business rule is
defined by an expression, you can either use the derived column to
replace a column from the source data or to add a new column into the
data flow. In the partner MDM scenario, a new column, called master
industry classification, will be added into the data flow. Even though
a new master column is added to resolve the conflicts, the industry
classifications from the source systems are still maintained in
separate columns in order to provide a complete system of record.
SSIS Example
Figure 2 illustrates an SSIS solution example that uses a Derived Column transformation to cleanse attribute data.
In
this Partner MDM application example, you need to apply business rules
that assign each partner a master industry classification. Unlike the
previous example, a universal partner ID already exists to allow you to
easily bring together data from the various source systems.
Figure 2. Cleansing attributes
The SSIS solution uses the following steps to cleanse attributes:
- Source Data Connectivity—To begin, partner
data is extracted from a CRM OLE DB data source, a Sales OLE DB data
source, and a Partner Profile flat file. In this example, only the
partner ID and industry classification field will be extracted into the
data flow. In a real-world scenario, you will likely use this
opportunity to extract additional required attributes from each of the
source systems.
- Merge Join—Data from each data source
is then joined together by using a Merge Join transformation. In order
to provide a complete set of partners, a full outer join is used across
the systems. Merge joins are discussed in more detail in the Categorizing Entity Activity example later in this paper.
- Derived Column—Applied
to a complete data set, the Derived Column transformation uses
conditional logic to create a new master industry classification. The
expression for this example might look like the following:
ISNULL([SalesIndustry]) ? (ISNULL([PartnerProfileIndustry]) ? [CRMIndustry] : [PartnerProfileIndustry]) : [SalesIndustry]
This expression takes the sales industry when it is present, falls back to the partner profile industry when the sales value is empty, and uses the CRM industry only when both are empty, matching the order of precedence defined earlier.
While
the business rules specify how to create the master industry
classification, a new issue has arisen. The industry classifications
among the systems use slightly different naming standards, producing a
highly disparate list of industries. For example, the industry list has
the following example values: Banking, Financial Services-Banking, and FinServ_Banking. To produce a uniform list of industries, you can use a fuzzy lookup transformation to cleanse the industry list.
- Fuzzy Lookup—To resolve the naming issue, the
Fuzzy Lookup transformation matches the master industry classification
to a reference list of industries.
- Conditional Split—The
output of the Fuzzy Lookup transformation is sent to a Conditional
Split transformation that produces two groups of data based on the
value of the similarity threshold:
- Group 1—Strong Match. These records
have a similarity threshold of > .9, indicating that a strong match
was made between the source data and the industry reference table.
- Group 2—No Strong Match.
These records have a similarity threshold of < .9, indicating a
strong match was not found. In this scenario, the master industry
classification value is left as is.
- Data Destinations—In
this example, each of the groups has been loaded to a different
destination. In a real-world situation, you could replace the
Conditional Split with another Derived Column transformation that
updates the master industry classification with the reference industry
wherever a strong match exists.
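As a rough sketch of that Derived Column alternative, and assuming the Fuzzy Lookup similarity column keeps its default name of _Similarity and the matched reference value arrives in a hypothetical column named ReferenceIndustry, the expression might look like the following:
[_Similarity] > 0.9 ? [ReferenceIndustry] : [MasterIndustryClassification]
This keeps the original master industry classification whenever no strong match is found, mirroring the Conditional Split logic described above.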
Categorizing Entity Activity
As
you have already seen, creating a master business entity typically
involves consolidating data from multiple source systems. Once you have
identified a mechanism to bridge and cleanse the data, you can begin to
categorize the entity based on the types of transactions or activities
that the entity is involved in. For example, typical activities for
resellers may include selling products and participating in a reseller
partner program. Categorizing resellers by their involvement in one or
both of these activities can help you better optimize marketing
investments, identify growth opportunities, and establish better
relationships with your resellers and customers.
In many
organizations, this type of categorization occurs in a data warehouse.
If you have this information already stored in a data warehouse, you
can simply include the data warehouse as another data source for your
MDM application. If you do not have the information already available,
you can include logic in your SSIS data flows to aggregate and
categorize data based on your specific business requirements.
In this particular scenario, the requirements are to categorize resellers using the following logic:
- Participating and Selling—Resellers who participate in the program and have sales in the past 12 months.
- Participating and Not Selling—Resellers who participate in the program but do not have sales in the past 12 months.
- Not Participating and Selling—Resellers who do not participate in the program but have sales in the past 12 months.
Solution
When
you work with transaction data, you will often need to aggregate data
before including it in your MDM application. Within SSIS, there is an
aggregate transformation that provides the functionality to summarize
records using functions such as Sum, Min, Max, Average, and Distinct Count. You can also configure the level of detail at which SSIS should perform the aggregation by specifying GROUP BY fields.
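If the source query does not already restrict the sales data to the trailing 12 months, that filter can also be applied in the data flow. As a minimal sketch, assuming a hypothetical transaction date column named OrderDate, a Conditional Split condition that keeps only the last 12 months of transactions might look like the following:
[OrderDate] >= DATEADD("Month", -12, GETDATE())
Rows that satisfy this condition flow on to the Aggregate transformation; older transactions are discarded before the sums are computed.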
You
also need to pay close attention to how you join data across your
source systems to ensure that you retrieve the correct data set. When
you join data in SSIS, you will frequently use the Merge Join
transformation. Merge Joins can perform a variety of join types such as
inner joins, right outer joins, left outer joins, and full outer joins.
In this particular scenario, you use a full outer join to build a
complete list of resellers from both source systems. If you were to
perform an equi-join or inner join between the two sources, you would
only bring back those resellers who are in both systems. As specified
in the business rules, not all resellers in the reseller program also
have sales transactions and vice versa. A full outer join allows you to
bring back the full data set for further processing.
SSIS Example
Figure
3 presents an SSIS solution example that demonstrates how you can use
the Aggregate and Merge Join transformations to categorize resellers
based on transaction data.
Figure 3. Categorizing Entity Activity
The SSIS solution uses the following steps to categorize entity activity:
- Data Sources—To begin, reseller data is
extracted from the Sales OLE DB data source and from the list of
Participating Resellers contained in a flat file.
- Aggregate—Because
the sales data in the source system is at the transaction level, it is
more granular than is required for the solution. To adjust the level of
detail, the Aggregate transformation takes the sales data for the past
12 months and sums it by reseller.
- Merge Join—A Merge Join transformation, configured as a full outer
join, brings together the participating resellers with the
resellers from the sales system. The data is joined using a reseller ID
that is shared between the two systems. As an output from the full
outer join, two fields play a significant role in categorizing the
resellers: the Reseller ID field from the sales system and the Reseller
ID field from the reseller program file. These two fields will have the
same values when resellers are found in both the sales data and in the
reseller program data. The Sales Reseller ID will be NULL when a
reseller participates in the program but does not have sales. The
Program Reseller ID will be NULL when a reseller has sales but does not
participate in the program.
- Conditional Split—A Conditional Split directs the resellers to one of three groups based on the results of the merge join:
- Group 1: Participating and Selling—These resellers have NOT NULL values for both the Sales Reseller ID and the Program Reseller ID.
- Group 2: Participating and Not Selling—These resellers have a NULL value for the Sales Reseller ID and a NOT NULL value for the Program Reseller ID.
- Group 3: Not Participating and Selling—These resellers have a NOT NULL value for the Sales Reseller ID and a NULL value for the Program Reseller ID.
- Data Destinations—As with previous examples, each group is loaded into
a different destination for illustrative purposes only. In a real-world
scenario, you can replace the Conditional Split with a simple Derived
Column transformation that contains the same conditional logic as the
Conditional Split step, as sketched below, to include a new
categorization column in the data output.
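As a sketch of that Derived Column alternative, and assuming the two join outputs are exposed as hypothetical columns named SalesResellerID and ProgramResellerID, the categorization expression might look like the following:
!ISNULL([SalesResellerID]) && !ISNULL([ProgramResellerID]) ? "Participating and Selling" : ISNULL([SalesResellerID]) ? "Participating and Not Selling" : "Not Participating and Selling"
The nested conditional applies the same three rules as the Conditional Split in Figure 3, but writes the result into a single categorization column instead of routing rows to separate destinations.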
Extracting Attributes
In
addition to deriving interesting attributes from transaction data, to
satisfy requirements for some MDM applications you may need to extract
attributes that are embedded in a text data source. For example, you
could extract product feedback contained in the comments column of a
bug tracking system or skills information stored in employee resumes.
Extracting this type of information requires a robust mechanism to
efficiently identify attributes and to evaluate whether the attributes
are meaningful.
Consider the following example. A customer
call center system contains customer issues, complaints, and general
inquiries concerning products and services. As part of the standard
customer support processes, each customer service representative
profiles and tracks all activities associated with a customer support
call. For example, support representatives profile a customer support
event using structured attributes such as the product and/or service in
question, the type of issue, and the issue's priority. To complement
these structured attributes, they also use a free-form text field to
store other relevant information as a note, such as specific customer
feedback, special circumstances of the support event, or general
observations.
Manually deriving meaningful information from
the note field can be difficult due to the potentially large volume of
text in the field, the heterogeneous nature of the content, and the
inability to associate this information with a successful or
unsuccessful support activity. Accordingly, you require a systematic
solution that can discover potentially meaningful attributes.
Solution
To overcome the challenges associated with extracting attributes from
text data, SSIS provides text mining functionality that helps you extract
business terms from text fields. More specifically, SSIS provides two
transformations that use text mining functionality: Term Lookup and
Term Extraction.
- Term Lookup is useful when you have an
established list of reference terms for which you want to search in
your source data set. To assess significance, the Term Lookup
transformation provides a frequency value that represents the number of
times a reference term appears in the source data. Depending on your
scenario, you may want to conditionally process records based on their
frequency value. For example, you may only want to accept those terms
that appear two or more times.
- Term Extraction is
useful when you do not already have a list of terms but want to create
a dictionary of candidate terms that you will investigate further. To
uncover candidates, the Term Extraction transformation first breaks down
the source text data into terms and then counts the number of
occurrences for each term. To customize the term extraction, you can
configure how SSIS defines a term, such as whether it should consider
nouns, noun phrases, or both, as well as which key terms to exclude.
You can also adjust other parameters that SSIS uses to discover
patterns, such as the number of times a term needs to appear before
extracting it. To indicate the quality of the discovered pattern, the
Term Extraction transformation uses a score to describe the number of
times a term appears in the source data. This score can be expressed as
a raw frequency or as a normalized value. Depending on the
characteristics of your source data, you will likely perform several
iterations of the Term Extraction transformation to determine whether
an attribute is truly meaningful.
To extract attributes
from a notes field as described in the customer call center scenario,
you use a Term Extraction transformation to discover additional profile
characteristics.
SSIS Example
Figure 4
illustrates an SSIS solution that uses the Term Extraction
transformation to find potentially interesting patterns in a notes
field.

Figure 4. Extracting Attributes
The SSIS solution uses the following steps to discover new attributes:
- Source Data—To begin, the call center text data is loaded from an OLE DB data source.
- Term Extraction—The
Term Extraction transformation is applied to the notes column and
produces two outputs: term and score. The transformation has been
configured to:
- Extract both nouns and noun phrases.
- Score on term frequency.
- Extract terms that meet or exceed a frequency threshold of two appearances.
- Exclude
known terms or "noise" words. For this particular scenario, the
following words could be identified as noise words: customer, project,
solution, and opportunity.
- Conditional Split—Based on the score derived by the Term Extraction
transformation, two groups of data output are created: Group 1 contains
high-frequency terms and Group 2 contains low-frequency terms. In this
scenario, a high-frequency term is one with a score of 5 or more; Group 2
contains terms that have a score between 2 and 4. The split conditions
are sketched after this list.
- Data Destination—For
illustrative purposes, each of the term groups is loaded into a
different destination. With the high-frequency terms identified, you
can investigate these terms further to determine whether they suggest
meaningful attributes. For example, in this scenario, you might
discover that certain supplier names have a high frequency within the
notes field. Because supplier information is currently not profiled
explicitly in your call center system, you might create a new profiling
attribute called partner involvement to track the third party partners
who participate in a support event.
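As a minimal sketch of the score-based split referenced above, and assuming the Term Extraction outputs keep their default column names of Term and Score, the two groups could be defined with conditions such as the following:
Group 1 (high-frequency terms): [Score] >= 5
Group 2 (low-frequency terms): [Score] >= 2 && [Score] < 5
Because the transformation is configured with a frequency threshold of two appearances, terms scoring below 2 never reach the split.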
The value of using
SSIS text mining transformations is the ease with which they can be
incorporated into the regular data integration activities of any MDM
solution. To expand on the call center scenario, you can use the newly
derived partner involvement attribute in the following MDM solutions:
- As part of a customer MDM solution, you can profile the relationships between customers and partners.
- As part of a product MDM solution, you can identify the relationships between products and supporting partners.
- As
part of a partner MDM solution, you can identify those partners who are
involved in successful and unsuccessful customer support incidents.
Conclusion
SSIS
provides a compelling data integration platform for effectively and
efficiently solving the integration challenges of MDM applications by:
- Providing a wide range of data transformation
and cleansing functionality to manage unique identifiers, attributes,
and transaction data.
- Providing a comprehensive, robust, and
scalable platform and a set of development tools to create and manage
large-scale data integration solutions.
- Maximizing return on investment (ROI) through the lowest total cost of ownership when compared with competitive platforms.
About the authors
Elizabeth Vitt, Intellimentum
Elizabeth
Vitt has ten years of business development, project management,
consulting, and training experience in business intelligence. Her
industry experience includes BI implementations in retail,
manufacturing, and financial services. She has specialized experience
as an educator in data warehousing, ETL, and OLAP design and
implementation. Elizabeth is an author of Microsoft Official Curricula
courses for Microsoft Business Intelligence product offerings as well as
the Microsoft Press title Business Intelligence: Making Better Decisions
Faster.
In anticipation of the launch of SQL Server 2005, Elizabeth has
successfully implemented Microsoft SQL Server 2005 for early adopter
customers.
Hitachi Consulting
As
Hitachi, Ltd.'s (NYSE: HIT) global consulting company, Hitachi
Consulting is a recognized leader in delivering proven business and IT
solutions to Global 2000 companies across many industries. We leverage
decades of business process, vertical industry, and leading-edge
technology experience to understand each company's unique business
needs. From business strategy development through application
deployment, our consultants are committed to helping clients quickly
realize measurable business value and achieve sustainable ROI.
Hitachi
Consulting is also a Microsoft Certified Gold Partner for Business
Intelligence, exclusive provider of curriculum and instructors for the
Microsoft SQL Server 2005 Business Intelligence Ascend training
program, and an experienced systems integrator with successful SQL
Server 2005 BI implementations at companies participating in the
Microsoft Technology Adoption Program (TAP).
We offer a
client-focused, collaborative approach and transfer knowledge
throughout each engagement. For more information, visit
www.hitachiconsulting.com.
Hitachi Consulting—Inspiring your next success