Elizabeth Vitt, Intellimentum
Hitachi Consulting
Technical Reviewers:
Donald Farmer
Microsoft Corporation
Stacia Misner
Hitachi Consulting
March 2006
Applies to:
SQL Server 2005
Master Data Management (MDM)
SQL Server Integration Services (SSIS) 2005
Summary:
This white paper describes how application developers can leverage the
functionality of Microsoft SQL Server 2005 Integration Services to
address the data integration challenges of Master Data Management
applications. (19 printed pages)
Download the Word document version of this article:
http://download.microsoft.com/download/f/1/c/f1cf7b8d-7fb9-4b71-a658-e748e67f9eba/data_sol_for_mdm.doc
Contents
Introduction
SSIS Solution Overview
Accessing Diverse Data Sources
Managing Unique Identifiers
Cleansing Attributes
Categorizing Entity Activity
Extracting Attributes
Conclusion
Introduction
Data
management is one of the most complex and expensive challenges facing
the agile enterprise. Frustrated with the difficulty and inaccuracy of
manually assembling inconsistent, redundant, and outdated data, many
organizations are seeking a new generation of data management solutions
to seamlessly convert hundreds of data sources into powerful data
assets that can be shared across the enterprise.
To guide their
data management efforts, many enterprises adopt Master Data Management
(MDM), a lifecycle strategy for creating, organizing, and managing
comprehensive data assets across the enterprise. MDM solutions provide
a master system of record for a particular business entity and then
publish this master view to a variety of consumer applications. MDM
applications can be built around virtually any data entity—customers,
partners, products, vendors, employees, and so on—wherever a master
data view is needed. In fact, many MDM applications have distinct names
to describe their solution area. For example, Product Information
Management (PIM) is used to describe product-focused MDM applications
and Customer Data Integration (CDI) is used to describe
customer-focused MDM applications. Since MDM solutions span a variety
of business and technical functions, creating a successful MDM
application requires a multi-disciplinary approach to data management
that includes data integration, data modeling, data publishing, and
data synchronization.
Of all these components, data
integration continues to be the most critical and expensive aspect of
an MDM solution. Master views are created by integrating data from a
variety of internal data sources such as enterprise resource planning
(ERP), customer relationship management (CRM), business intelligence
(BI), and legacy systems, as well as external data from partners,
suppliers, or syndicators. These integration processes typically
present a variety of data integration challenges that can result in
high development costs and extended project timelines. To work through
these challenges in a cost-effective manner, IT requires a flexible
data integration architecture that offers a broad set of technologies
and tools to create customized integration solutions.
Microsoft
SQL Server Integration Services (SSIS) 2005 provides a full-featured
enterprise data integration platform and rich development environment
for creating, organizing, and managing comprehensive data assets across
the enterprise that can be leveraged specifically for MDM applications.
SSIS is also backed by an expansive ecosystem of partners who have built
specialized MDM application solutions on the core SQL Server platform.
Using SSIS 2005, an enterprise can successfully create a broad range of
data integration solutions that support master business views, quickly
overcome data management challenges, and reduce overall master data
management costs.
SSIS Solution Overview
MDM Data Integration Requirements
While
every MDM solution will likely have distinct business and technical
requirements, as you integrate data for your MDM solutions you will
encounter three general categories of data: unique identifiers,
attributes, and transactions. Each of these data categories has
specific data integration challenges that you will need to overcome as
you build your MDM solution.
- Unique identifiers—Unique identifiers
are the building blocks for defining a business entity's master system
of record. As you bring together data from multiple data sources, you
must have a consistent mechanism to uniquely identify an entity across
different business functions. Unfortunately, there is not always a
common unique identifier that identifies a particular entity in the
same way across all of the source systems. For example, when creating
an employee MDM application, you discover that your organization's
human resources (HR) system, sales system, sales forecasting system,
Microsoft Active Directory directory service, and CRM system each use
different unique identifiers for sales representatives. To build the
MDM solution, you must establish an alternate method for successfully
combining this data together without duplicating data. Duplicating data
may sound like something that is easy to avoid; however, you will find
that some types of duplicate data are more difficult than others to
uncover.
- Attributes—Once you determine the unique
identifier for an entity, you can add attributes that provide
meaningful business context, categorize the business entity into one or
more groups, and describe the entity's relationship to other business
entities. These attributes may be directly obtained from source
systems. For example, you may bring together employee profile
attributes from your HR system with employee e-mail addresses from your
organization's Active Directory. Alternatively, attributes can be
derived once the data is combined. As you combine attributes, you need
to manage conflicts between attributes from various systems. For
example, you may have employee address information that differs between
source systems. To correct the problem, you need to define business
rules that determine which address provides the best system of record
for each employee.
- Transactions—Transactions describe
the activities that a particular business entity engages in. For
example, a sales representative might participate in sales
opportunities, sales transactions, and sales forecasts. Within MDM
applications, transaction data frequently requires some degree of
aggregation to summarize the source detail into a meaningful format.
Once aggregated, transaction data can be used to derive attributes that
characterize an entity based on its activity or lack of activity.
SSIS Data Integration for MDM
SQL
Server 2005 Integration Services (SSIS), the successor to SQL Server
2000 Data Transformation Services, is an enterprise data-integration
platform for a new generation of integration solutions that must
quickly integrate large volumes of data from diverse data sources with complex
transformations and data cleansing.
Behind the scenes, the high-performance architecture of SSIS includes two powerful engines:
- Parallel runtime engine—The runtime engine coordinates the flow of control between tasks within SSIS.
- Fast, efficient data flow engine—The
data flow engine extracts data from one or more data sources, performs
any necessary transformations on the extracted data, and then delivers
that data to one or more destinations. To maximize efficiency, the data
flow engine takes advantage of in-memory processing to eliminate the
overhead normally added by physically copying and staging data in
tables at different points in the data integration process. By
manipulating the data in memory as it is transferred from source to
destination, the data flow engine reduces the number of manual steps
required to stage data, saves processing time, and helps you resolve
integration challenges within shorter data processing windows.
To support a broad range of data requirements for MDM solutions, SSIS provides the following data integration capabilities:
- Broad data connectivity—To integrate
diverse data, SSIS provides access to a variety of enterprise data
systems, structures, and formats that provide the unique identifier,
attribute, and transaction data for MDM applications.
- Robust data cleansing—To
ensure that data is combined in a meaningful manner, SSIS provides data
cleansing tools that can be leveraged to address common MDM data
integration challenges such as deriving unique identifiers and handling
attribute conflicts.
- Flexible data transformations—To
complement data cleansing, SSIS provides a variety of data
transformations that can be used to derive meaningful attributes from
source data and aggregate transaction data.
The sections
that follow describe how you can use these SSIS data integration
capabilities to solve common issues for MDM applications.
Accessing Diverse Data Sources
For
many enterprises, bringing together master data from CRM, ERP, and BI
systems involves unifying data from dozens, and possibly hundreds, of
diverse data sources. In a CDI scenario, for example, you may need to
bring together CRM data from a relational database management system
(RDBMS), sales data from mainframe file extracts, customer segmentation
data from Microsoft Excel, and third party demographic data via Web
services. Assembling customer data from these diverse data sources
requires an integration solution that can successfully access and
interpret their distinct interfaces, structures, and data types.
Solution
SSIS provides access to a variety of diverse data sources, including
databases, flat files, spreadsheets, packaged applications, and XML
data, through OLE DB, ODBC, and Microsoft .NET data providers supplied
by Microsoft and third parties.
SSIS can also
interpret and translate XML data by using an XML Schema Definition
(XSD) file or inline schemas. In addition, SSIS can consume data from
Web services using proxy classes and the Web service support of the
Microsoft .NET Framework. With this built-in support for XML and Web
services, SSIS can effectively participate in an organization's Service
Oriented Architecture (SOA) to create loosely coupled data integration
solutions.
To connect to sources and destinations for which a
provider is unavailable, SSIS connectivity can be easily extended by
creating custom data sources with scripted source components or
reusable adapters. You can also complement the data access
functionality of SSIS by leveraging the connectivity supplied by other
Microsoft data integration technologies such as Microsoft Host Integration
Server for mainframe connectivity and Microsoft BizTalk Server for
business-to-business data exchange.
A unique and valuable
connectivity feature of SSIS is its flexible definition of a data
destination. A destination may be a data store such as a database, but
it can also be an ADO.NET DataReader, which can be published on demand to downstream applications without the need to persist data in a data store.
In
the following sections, you will see several examples of how SSIS
connects to a variety of sources to bring together data for MDM
solutions.
Managing Unique Identifiers
While
data connectivity provides the mechanism to access master data from
various source systems, data cleansing ensures that the master data is
successfully integrated with a high degree of data quality and
consistency. To support the data cleansing needs of MDM, SSIS provides
robust functionality to help you manage unique identifiers across data
sources.
When you create an MDM solution that combines data from
multiple data sources, you may have difficulty merging data when there
is no common, unique identifier that is shared across the systems. This
situation usually occurs when each source system has its own distinct
way to represent and identify a business entity.
Consider the
following example. Your organization's CRM, sales, customer self-service,
and call center systems each use different identifiers to
uniquely identify customers. In the absence of a common unique
identifier, you must consider using an alternate method to link records
across systems. In this particular scenario, each system contains an
instance of the customer's first name and last name. At first glance,
you may think that the combination of first name and last name provides
the perfect unique identifier; however, upon further investigation, you
realize that only about 20 percent of the data can be exactly matched
on first name and last name. Although only a small percentage of the
records match exactly, there may still be additional duplicate customers
whose names are misspelled or entered slightly differently in each source
system.
For
example, in the CRM system, a customer is identified as Terry
Ramos; however, in the sales system she is identified as Theresa G
Ramos. Even though Terry Ramos does not identically match Theresa G
Ramos, the different representations refer to the same person. While
you may be able to identify some of these imprecise duplicates using
manual methods, you will likely require a more systematic and robust
solution that can be applied to an entire source data set to identify
unique and duplicate records.
Solution
To
create this solution, SSIS provides data-cleansing functions that you
can combine to manage unique identifiers in a variety of scenarios.
In
the ideal scenario, a robust unique identifier is shared among multiple
data sources. In this case, you can create an SSIS data flow to bring
data together using equi-joins or equi-match lookups across the data
sources. For more complex scenarios, when equivalent matching is not
feasible, you can apply SSIS fuzzy logic functionality to identify the
best possible data match.
Fuzzy logic provides a mechanism to
perform imprecise data matches. Within SSIS, there are two
transformations that use fuzzy logic: fuzzy lookups and fuzzy grouping.
- Fuzzy Lookups—Fuzzy lookups are well suited for matching dirty source
data to a known set of cleansed, standardized data such as a reference
table. As SSIS performs the fuzzy lookup, similarity and confidence
scores indicate the quality of each match. The similarity score
indicates how closely the input data resembles its proposed match; a
value of one indicates an identical match, and the closer the value is
to one, the closer the match. To complement similarity, confidence
describes the level of certainty that SSIS has about the match. For
example, when three distinct people have the same name, SSIS uses the
confidence score to indicate how certain it is that it has chosen the
correct person for the match. A simple acceptance condition built on
these scores is sketched after this list.
- Fuzzy Grouping—When standardized reference
data is not available, fuzzy grouping develops a master set of unique
records from a data set containing a combination of unique and
duplicate records. When you start MDM application development, creating
a master data set is often your first data integration challenge.
To create this system of record, fuzzy grouping examines source
data and derives a reference set of unique, canonical records. All
records in the data set are then evaluated against this derived
reference set for potential matches. Fuzzy grouping also creates two
additional fields, key_in and key_out, to help you identify unique and
duplicate records. The key_in column stores an SSIS-generated unique ID
for each record in the input data set. As the fuzzy grouping
transformation looks for matches, the key_out column stores the ID of
the proposed matched set. You can perform comparisons between the
key_in and key_out fields to identify whether a record is a part of the
reference set or a part of the set that is being matched. For example,
the records in the derived reference set have a key_in ID equal to the
key_out ID. As with the fuzzy lookup, similarity scores indicate the
quality of each proposed match.
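Whichever fuzzy transformation you use, the scores it produces can be evaluated with expressions in downstream transformations. As a minimal sketch of the acceptance condition referenced earlier, and assuming the Fuzzy Lookup output columns keep their default names of _Similarity and _Confidence, a Conditional Split condition that accepts only trusted matches might look like the following:
[_Similarity] >= 0.85 && [_Confidence] >= 0.80
The thresholds shown here are illustrative and should be tuned against your own data; the Fuzzy Grouping transformation exposes a comparable similarity score that can be filtered in the same way.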
While both fuzzy lookups and fuzzy grouping
transformations provide the core functionality to manage unique
identifiers, they can also be used in conjunction with other SSIS data
transformation functions to develop a comprehensive data integration
solution. The conditional split transformation, for example, allows you
to break up an input data set into multiple outputs based on your
business logic. Using this transformation, you can route records to one
or more destinations depending on the record's data content.
SSIS Example
Figure
1 illustrates an SSIS solution that uses a Fuzzy Grouping
transformation to manage customer unique identifiers across a variety
of data sources.
The SSIS solution uses the following steps to manage unique identifiers:
Figure 1. Managing unique identifiers
- Data Sources—To
begin, connectivity is established to all four data sources: a CRM OLE
DB data source, a Sales OLE DB data source, a Call Center flat file,
and a Customer Self Service XML file. For each data source, the first
name, last name, and the source system unique identifier columns are
extracted. The first name and last name columns will be the primary
inputs for the data cleansing logic. Even though they are not shared
across sources, the source system unique identifiers have been
extracted to maintain links back to each record's data source. These
identifiers remain unchanged through all steps of the data flow.
- Union All—The
source data is then combined into a single data set using a Union All
transformation. At this point, the data set contains a mix of duplicate
and unique data.
- Fuzzy Grouping—The Fuzzy Grouping
transformation consumes the mixed source data, defines a reference data
set of first and last names, and then attempts to match records to that
reference data set. The Fuzzy Grouping transformation has been configured
with a minimum similarity threshold of .25, so candidate matches with a
similarity below .25 are discarded. This threshold needs to be adjusted
based on the specific characteristics of your source data.
- Conditional Split—The
output of the fuzzy grouping transformation is sent to a Conditional
Split transformation that conditionally processes the data by
identifying three groups of records:
- Group 1—Reference Records. These
are records in which the key_in field equals the key_out field. All
other source records have been evaluated against this reference data
set.
- Group 2—Duplicate Records. These are
records that have strong matches to the reference dataset. A strong
match is a match having a similarity of > .9. Again, you have the
ability to adjust this threshold based on your specific data set.
- Group 3—Unique Records.
These are records that did not have a strong match to the reference
data set; that is, they have a similarity score of less than .9.
To provide tuning flexibility, the .9 value in this example can be
defined by a variable that is populated from an XML configuration file,
as illustrated in the expression sketch after this list. By maintaining
the similarity threshold value separately from the solution, you can more
easily tune the SSIS solution after reviewing the results of each
execution.
- Data Destinations—For illustrative purposes, each
of the groups created by the Conditional Split transformation has been
loaded to a different destination. In a real-world situation, you would
combine the reference records and unique records together to create a
master list of unique customer records.
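As a sketch of how the Conditional Split conditions in this walkthrough might be expressed (assuming the Fuzzy Grouping output columns are surfaced as key_in, key_out, and a similarity score column named _score; the exact names in your package may differ), the three groups could be defined as follows:
Group 1 (Reference Records): [key_in] == [key_out]
Group 2 (Duplicate Records): [key_in] != [key_out] && [_score] > @[User::MinSimilarity]
Group 3 (Unique Records): the default output, which receives all remaining rows
Here, @[User::MinSimilarity] is a hypothetical package variable that holds the .9 threshold; binding it to an XML configuration file lets you tune the value without editing the data flow.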
Cleansing Attributes
While
managing unique identifiers can help you cleanse duplicate records, you
will likely also require functionality to cleanse your data attributes.
In many situations, you need to perform cleansing to manage conflicting
attributes across different data sources.
Consider the
following example. You are creating a partner MDM solution that
integrates partner data from your CRM system, sales system, and partner
profile system. Each system contains distinct attributes that describe
the partner as well as overlapping attributes that are in conflict. One
such overlapping attribute is the partner's industry classification.
Each source system contains an industry classification for a partner
that, in many cases, varies across systems. For example, Contoso, a
large hotel chain, has an industry classification of hospitality in the
sales system, a classification of retail in the partner profile system,
and no industry classification in the partner CRM system. To handle
these overlapping attributes, business rules are required to specify
how the attribute conflicts are to be resolved.
For example, business rules could define the order of precedence of source system as follows:
- Sales—If a sales industry
classification exists for the partner, then it should take precedence
as the master industry classification.
- Partner profile—If the sales industry classification is empty, then the partner profile industry should be used.
- CRM—When both the sales industry and partner profile industry classification are empty, the CRM industry should be used.
To
resolve conflicting attributes, you must create an integration solution
that merges data from the CRM, sales, and partner systems, retrieves
any distinct attributes from each system, and then applies the business
rules to handle the conflicting industry classification.
Solution
SSIS
provides a wide range of opportunities to apply custom business rules
for cleansing attribute data. One transformation in particular, the
Derived Column transformation, provides a great deal of flexibility in
applying custom logic.
The Derived Column transformation is
defined using an expression. This customizable expression can contain a
combination of operators, functions, and variables to manipulate the
source data. In a manner of speaking, the Derived Column transformation
can be seen as a consolidation of many different types of
transformations in one easy-to-customize unit.
When you are
cleansing attribute data, you will likely be implementing a set of
conditional business rules to establish the best possible system of
record. In the partner MDM scenario, the derived column expression will
contain the conditional business rules that determine which industry
classification should be applied.
Once the business rule is
defined by an expression, you can either use the derived column to
replace a column from the source data or to add a new column into the
data flow. In the partner MDM scenario, a new column, called master
industry classification, will be added into the data flow. Even though
a new master column is added to resolve the conflicts, the industry
classifications from the source systems are still maintained in
separate columns in order to provide a complete system of record.
SSIS Example
Figure 2 illustrates an SSIS solution example that uses a Derived Column transformation to cleanse attribute data.
In
this Partner MDM application example, you need to apply business rules
that assign each partner a master industry classification. Unlike the
previous example, a universal partner ID already exists to allow you to
easily bring together data from the various source systems.
Figure 2. Cleansing attributes
The SSIS solution uses the following steps to cleanse attributes:
- Source Data Connectivity—To begin, partner
data is extracted from a CRM OLE DB data source, a Sales OLE DB data
source, and a Partner Profile flat file. In this example, only the
partner ID and industry classification field will be extracted into the
data flow. In a real-world scenario, you will likely use this
opportunity to extract additional required attributes from each of the
source systems.
- Merge Join—Data from each data source
is then joined together by using a Merge Join transformation. In order
to provide a complete set of partners, a full outer join is used across
the systems. Merge joins are discussed in more detail in the Categorizing Entity Activity example later in this paper.
- Derived Column—Applied
to a complete data set, the Derived Column transformation uses
conditional logic to create a new master industry classification. The
expression for this example might look like the following:
ISNULL([SalesIndustry]) ? (ISNULL([PartnerProfileIndustry]) ? [CRMIndustry] : [PartnerProfileIndustry]) : [SalesIndustry]
This expression takes the sales industry when it is present, falls back to the partner profile industry when the sales value is empty, and uses the CRM industry only when both are empty, matching the order of precedence defined earlier.
While
the business rules specify how to create the master industry
classification, a new issue has arisen. The industry classifications
among the systems use slightly different naming standards, producing a
highly disparate list of industries. For example, the industry list has
the following example values: Banking, Financial Services-Banking, and FinServ_Banking. To produce a uniform list of industries, you can use a fuzzy lookup transformation to cleanse the industry list.
- Fuzzy Lookup—To resolve the naming issue, the
Fuzzy Lookup transformation matches the master industry classification
to a reference list of industries.
- Conditional Split—The
output of the Fuzzy Lookup transformation is sent to a Conditional
Split transformation that produces two groups of data based on the
value of the similarity threshold:
- Group 1—Strong Match. These records
have a similarity threshold of > .9, indicating that a strong match
was made between the source data and the industry reference table.
- Group 2—No Strong Match.
These records have a similarity threshold of < .9, indicating a
strong match was not found. In this scenario, the master industry
classification value is left as is.
- Data Destinations—In
this example, each of the groups has been loaded to a different
destination. In a real-world situation, you could replace the
Conditional Split with another Derived Column transformation that
updates the master industry classification with the reference industry
wherever a strong match exists.
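As a rough sketch of that Derived Column alternative, and assuming the Fuzzy Lookup similarity column keeps its default name of _Similarity and the matched reference value arrives in a hypothetical column named ReferenceIndustry, the expression might look like the following:
[_Similarity] > 0.9 ? [ReferenceIndustry] : [MasterIndustryClassification]
This keeps the original master industry classification whenever no strong match is found, mirroring the Conditional Split logic described above.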
Categorizing Entity Activity
As
you have already seen, creating a master business entity typically
involves consolidating data from multiple source systems. Once you have
identified a mechanism to bridge and cleanse the data, you can begin to
categorize the entity based on the types of transactions or activities
that the entity is involved in. For example, typical activities for
resellers may include selling products and participating in a reseller
partner program. Categorizing resellers by their involvement in one or
both of these activities can help you better optimize marketing
investments, identify growth opportunities, and establish better
relationships with your resellers and customers.
In many
organizations, this type of categorization occurs in a data warehouse.
If you have this information already stored in a data warehouse, you
can simply include the data warehouse as another data source for your
MDM application. If you do not have the information already available,
you can include logic in your SSIS data flows to aggregate and
categorize data based on your specific business requirements.
In this particular scenario, the requirements are to categorize resellers using the following logic:
- Participating and Selling—Resellers who participate in the program and have sales in the past 12 months.
- Participating and Not Selling—Resellers who participate in the program but do not have sales in the past 12 months.
- Not Participating and Selling—Resellers who do not participate in the program but have sales in the past 12 months.
Solution
When
you work with transaction data, you will often need to aggregate data
before including it in your MDM application. Within SSIS, there is an
aggregate transformation that provides the functionality to summarize
records using functions such as Sum, Min, Max, Average, and Distinct Count. You can also configure the level of detail at which SSIS should perform the aggregation by specifying GROUP BY fields.
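If the source query does not already restrict the sales data to the trailing 12 months, that filter can also be applied in the data flow. As a minimal sketch, assuming a hypothetical transaction date column named OrderDate, a Conditional Split condition that keeps only the last 12 months of transactions might look like the following:
[OrderDate] >= DATEADD("Month", -12, GETDATE())
Rows that satisfy this condition flow on to the Aggregate transformation; older transactions are discarded before the sums are computed.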
You
also need to pay close attention to how you join data across your
source systems to ensure that you retrieve the correct data set. When
you join data in SSIS, you will frequently use the Merge Join
transformation. Merge Joins can perform a variety of join types such as
inner joins, right outer joins, left outer joins, and full outer joins.
In this particular scenario, you use a full outer join to build a
complete list of resellers from both source systems. If you were to
perform an equi-join or inner join between the two sources, you would
only bring back those resellers who are in both systems. As specified
in the business rules, not all resellers in the reseller program also
have sales transactions and vice versa. A full outer join allows you to
bring back the full data set for further processing.
SSIS Example
Figure
3 presents an SSIS solution example that demonstrates how you can use
the Aggregate and Merge Join transformations to categorize resellers
based on transaction data.
Figure 3. Categorizing Entity Activity
The SSIS solution uses the following steps to categorize entity activity:
- Data Sources—To begin, reseller data is
extracted from the Sales OLE DB data source and from the list of
Participating Resellers contained in a flat file.
- Aggregate—Because
the sales data in the source system is at the transaction level, it is
more granular than is required for the solution. To adjust the level of
detail, the Aggregate transformation takes the sales data for the past
12 months and sums it by reseller.
- Merge Join—A Merge Join transformation, configured as a full outer
join, brings together the participating resellers with the
resellers from the sales system. The data is joined using a reseller ID
that is shared between the two systems. As an output from the full
outer join, two fields play a significant role in categorizing the
resellers: the Reseller ID field from the sales system and the Reseller
ID field from the reseller program file. These two fields will have the
same values when resellers are found in both the sales data and in the
reseller program data. The Sales Reseller ID will be NULL when a
reseller participates in the program but does not have sales. The
Program Reseller ID will be NULL when a reseller has sales but does not
participate in the program.
- Conditional Split—A Conditional Split directs the resellers to one of three groups based on the results of the merge join:
- Group 1: Participating and Selling—These resellers have NOT NULL values for both the Sales Reseller ID and the Program Reseller ID.
- Group 2: Participating and Not Selling—These resellers have a NULL value for the Sales Reseller ID and a NOT NULL value for the Program Reseller ID.
- Group 3: Not Participating and Selling—These resellers have a NOT NULL value for the Sales Reseller ID and a NULL value for the Program Reseller ID.
- Data Destinations—As with previous examples, each group is loaded into
a different destination for illustrative purposes only. In a real-world
scenario, you can replace the Conditional Split with a simple Derived
Column transformation that contains the same conditional logic as the
Conditional Split step, as sketched below, to include a new
categorization column in the data output.
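As a sketch of that Derived Column alternative, and assuming the two join outputs are exposed as hypothetical columns named SalesResellerID and ProgramResellerID, the categorization expression might look like the following:
!ISNULL([SalesResellerID]) && !ISNULL([ProgramResellerID]) ? "Participating and Selling" : ISNULL([SalesResellerID]) ? "Participating and Not Selling" : "Not Participating and Selling"
The nested conditional applies the same three rules as the Conditional Split in Figure 3, but writes the result into a single categorization column instead of routing rows to separate destinations.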
Extracting Attributes
In
addition to deriving interesting attributes from transaction data, to
satisfy requirements for some MDM applications you may need to extract
attributes that are embedded in a text data source. For example, you
could extract product feedback contained in the comments column of a
bug tracking system or skills information stored in employee resumes.
Extracting this type of information requires a robust mechanism to
efficiently identify attributes and to evaluate whether the attributes
are meaningful.
Consider the following example. A customer
call center system contains customer issues, complaints, and general
inquiries concerning products and services. As part of the standard
customer support processes, each customer service representative
profiles and tracks all activities associated with a customer support
call. For example, support representatives profile a customer support
event using structured attributes such as the product and/or service in
question, the type of issue, and the issue's priority. To complement
these structured attributes, they also use a free-form text field to
store other relevant information as a note, such as specific customer
feedback, special circumstances of the support event, or general
observations.
Manually deriving meaningful information from
the note field can be difficult due to the potentially large volume of
text in the field, the heterogeneous nature of the content, and the
inability to associate this information with a successful or
unsuccessful support activity. Accordingly, you require a systematic
solution that can discover potentially meaningful attributes.
Solution
To overcome the challenges associated with extracting attributes from
text data, SSIS provides text mining functionality that helps you extract
business terms from text fields. More specifically, SSIS provides two
transformations that use text mining functionality: Term Lookup and
Term Extraction.
- Term Lookup is useful when you have an
established list of reference terms for which you want to search in
your source data set. To assess significance, the Term Lookup
transformation provides a frequency value that represents the number of
times a reference term appears in the source data. Depending on your
scenario, you may want to conditionally process records based on their
frequency value. For example, you may only want to accept those terms
that appear two or more times.
- Term Extraction is
useful when you do not already have a list of terms but want to create
a dictionary of candidate terms that you will investigate further. To
uncover candidates, the Term Extraction transformation first breaks down
the source text data into terms and then counts the number of
occurrences for each term. To customize the term extraction, you can
configure how SSIS defines a term, such as whether it should consider
nouns, noun phrases, or both, as well as which key terms to exclude.
You can also adjust other parameters that SSIS uses to discover
patterns, such as the number of times a term needs to appear before
extracting it. To indicate the quality of the discovered pattern, the
Term Extraction transformation uses a score to describe the number of
times a term appears in the source data. This score can be expressed as
a raw frequency or as a normalized value. Depending on the
characteristics of your source data, you will likely perform several
iterations of the Term Extraction transformation to determine whether
an attribute is truly meaningful.
To extract attributes
from a notes field as described in the customer call center scenario,
you use a Term Extraction transformation to discover additional profile
characteristics.
SSIS Example
Figure 4
illustrates an SSIS solution that uses the Term Extraction
transformation to find potentially interesting patterns in a notes
field.

Figure 4. Extracting Attributes
The SSIS solution uses the following steps to discover new attributes:
- Source Data—To begin, the call center text data is loaded from an OLE DB data source.
- Term Extraction—The
Term Extraction transformation is applied to the notes column and
produces two outputs: term and score. The transformation has been
configured to:
- Extract both nouns and noun phrases.
- Score on term frequency.
- Extract terms that meet or exceed a frequency threshold of two appearances.
- Exclude
known terms or "noise" words. For this particular scenario, the
following words could be identified as noise words: customer, project,
solution, and opportunity.
- Conditional Split—Based on the score derived by the Term Extraction
transformation, two groups of data output are created: Group 1 contains
high-frequency terms and Group 2 contains low-frequency terms. In this
scenario, a high-frequency term is one with a score of 5 or more; Group 2
contains terms that have a score between 2 and 4. The split conditions
are sketched after this list.
- Data Destination—For
illustrative purposes, each of the term groups is loaded into a
different destination. With the high-frequency terms identified, you
can investigate these terms further to determine whether they suggest
meaningful attributes. For example, in this scenario, you might
discover that certain supplier names have a high frequency within the
notes field. Because supplier information is currently not profiled
explicitly in your call center system, you might create a new profiling
attribute called partner involvement to track the third party partners
who participate in a support event.
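As a minimal sketch of the score-based split referenced above, and assuming the Term Extraction outputs keep their default column names of Term and Score, the two groups could be defined with conditions such as the following:
Group 1 (high-frequency terms): [Score] >= 5
Group 2 (low-frequency terms): [Score] >= 2 && [Score] < 5
Because the transformation is configured with a frequency threshold of two appearances, terms scoring below 2 never reach the split.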
The value of using
SSIS text mining transformations is the ease with which they can be
incorporated into the regular data integration activities of any MDM
solution. To expand on the call center scenario, you can use the newly
derived partner involvement attribute in the following MDM solutions:
- As part of a customer MDM solution, you can profile the relationships between customers and partners.
- As part of a product MDM solution, you can identify the relationships between products and supporting partners.
- As
part of a partner MDM solution, you can identify those partners who are
involved in successful and unsuccessful customer support incidents.
Conclusion
SSIS
provides a compelling data integration platform for effectively and
efficiently solving the integration challenges of MDM applications by:
- Providing a wide range of data transformation
and cleansing functionality to manage unique identifiers, attributes,
and transaction data.
- Providing a comprehensive, robust, and
scalable platform and a set of development tools to create and manage
large-scale data integration solutions.
- Maximizing return on investment (ROI) through the lowest total cost of ownership when compared with competitive platforms.
About the authors
Elizabeth Vitt, Intellimentum
Elizabeth
Vitt has ten years of business development, project management,
consulting, and training experience in business intelligence. Her
industry experience includes BI implementations in retail,
manufacturing, and financial services. She has specialized experience
as an educator in data warehousing, ETL, and OLAP design and
implementation. Elizabeth is an author of Microsoft Official Curricula
courses for Microsoft Business Intelligence product offerings as well as
the Microsoft Press title Business Intelligence: Making Better Decisions
Faster.
In anticipation of the launch of SQL Server 2005, Elizabeth has
successfully implemented Microsoft SQL Server 2005 for early adopter
customers.
Hitachi Consulting
As
Hitachi, Ltd.'s (NYSE: HIT) global consulting company, Hitachi
Consulting is a recognized leader in delivering proven business and IT
solutions to Global 2000 companies across many industries. We leverage
decades of business process, vertical industry, and leading-edge
technology experience to understand each company's unique business
needs. From business strategy development through application
deployment, our consultants are committed to helping clients quickly
realize measurable business value and achieve sustainable ROI.
Hitachi
Consulting is also a Microsoft Certified Gold Partner for Business
Intelligence, exclusive provider of curriculum and instructors for the
Microsoft SQL Server 2005 Business Intelligence Ascend training
program, and an experienced systems integrator with successful SQL
Server 2005 BI implementations at companies participating in the
Microsoft Technology Adoption Program (TAP).
We offer a
client-focused, collaborative approach and transfer knowledge
throughout each engagement. For more information, visit
www.hitachiconsulting.com.
Hitachi Consulting—Inspiring your next success