Data cleaning (also called data cleansing or scrubbing) is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleansing is a major part of the so-called ETL process. Data cleansing deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems or global web-based information systems, the need for data cleansing increases significantly. Most data cleansing is typically performed in a separate data staging area before loading the transformed data into the warehouse. A data cleansing approach should satisfy several requirements. First of all, it should detect and remove all major errors and inconsistencies both in individual data sources and when integrating multiple sources. The approach should be supported by tools to limit manual inspection and programming effort and be extensible to easily cover additional sources. Furthermore, data cleansing should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata. Mapping functions for data cleansing and other data transformations should be specified in a declarative way and be reusable for other data sources as well as for query processing. Especially for data warehouses, a workflow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.
Rahm and Do (University of Leipzig, Germany) in their 2000 IEEE paper on the subject classify major data quality problems to be solved by data cleaning and data transformation into the nature of the integration required (single source vs. multi source). They also describe the different approaches to be taken in phases to address the data cleansing process. The approaches, described in detail in the authors’ work, are summarized as follows:
Data analysis: To detect and create an inventory of errors and inconsistencies to be removed. May involve manual inspection of the data or data samples as well as profiling programs.
Definition of transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity and the ”dirtyness” of the data, a large number of data transformation and cleaning steps may have to be executed.
Verification: The correctness and effectiveness of a transformation workflow and the transformation definitions should be tested and evaluated. Multiple iterations of the analysis, design and verification steps may be needed.
Transformation: Execution of the transformation steps either by running the ETL workflow for loading and refreshing a data warehouse or during answering queries on multiple sources.
Backflow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources in order to give legacy applications the improved data too and to avoid redoing the cleaning work for future data extractions.
(The following is an excerpt from the excellent work of Vivek Mehra and Anupam Manglik titled Extending Enterprise BI Capabilities: New Patterns for Data Integration that appeared on Business Intelligence Best Practices at http://www.bi-bestpractices.com/view-articles/4763)
Organizations face a constantly changing business environment; they must deal with regulatory requirements, mergers and acquisitions, and the introduction of new channels for the business. This requires enterprise business intelligence (BI) capabilities to evolve in lockstep with the new business strategies. To ensure an ongoing alignment of BI strategies with business needs, data integration strategies also need careful and continuous assessment. It is critical to understand an enterprise’s existing data flow.
Enterprise data can be broadly divided into two categories:
(a) data supporting operational systems (such as accounting, human resources, general ledger, and so on), and
(b) data that supports BI functions (such as executive dashboards, reporting, and analytics).
Data supporting BI activities typically flows through four logical steps across the enterprise. These steps form the information supply chain:
1. Acquiring data from systems of record
2. Consolidating and harmonizing the data within a given domain
3. Integrating data across domains, performing calculations, and deriving information
4. Disseminating data to end users via BI applications
Each of these steps may include activities such as data cleansing; extraction, transformation, and loading (ETL); validation against metadata; and data quality feedback loops that ensure data trustworthiness and minimal data degradation.
Data Integration Patterns
Integration and dissemination functions have traditionally used physical stores (data warehouses and marts) as the primary mechanism for creating an integrated BI view. However, with the advent of EII tools (enterprise information integration), new virtual data integration patterns have now become viable. We will explore these data integration patterns and discuss their benefits and constraints.
Three primary data integration patterns can be applied to integrating disparate sources to provide an integrated BI view.
1. The Physical Pattern
This pattern (see Figure 1) uses a data warehouse to integrate and persist data from various sources, and employs data marts to disseminate data. These data marts are created and optimized for specific business needs. BI applications such as dashboards, reports, and analytics use data marts to present the data to business users.

The physical pattern has traditionally been the preferred approach for data integration. One of the significant drivers for adopting this approach is its superior query performance. The integrated data in data marts is stored in a manner that optimizes query performance for a particular BI need.
This approach is also popular because, by definition, the physical data integration pattern provides for the business need of capturing historic data and capturing summary information. Multi-year trending analysis requires historical information, but transactional systems typically capture limited historical data due to performance reasons. In such cases, a data warehouse is used to capture historical data. Data marts house summary information required by a specific set of users. This avoids repeated summarization of data for each user query for summary data, and enhances query performance. Data marts are often also used to store and disseminate atomic level data for specific user groups.
Implementation of the physical data integration pattern is a multi-stage process and can be complex, lengthy, and costly. Organizational constraints, such as transfer of ownership of physical data to a different group, may also impede the adoption of physical integration.
2. The Virtual Pattern
This pattern (see Figure 2) integrates data from disparate sources virtually and uses virtual data marts to disseminate data. This pattern therefore combines the data integration and dissemination functions in the traditional BI information supply chain. Virtual integration eliminates physical movement of data from data sources to target integrated data platform.

Implementation of the virtual integration pattern has been made viable by the availability of a new set of enterprise information integration (EII) tools, which promise real-time data integration from a variety of data sources such as relational databases, delimited text files, Web services, and multidimensional databases. EII tools use predefined metadata to populate views that make integrated data appear relational to an end user. Some EII products available in the market include BEA’s Liquid Data, Composite Software’s Information Server, and MetaMatrix’s Metamatrix Server. These product suites vary in features and technical strengths and should be selected based on a specific set of business and technical requirements.
The primary obstacles to adopting such tools for data integration have been linked to the performance of queries. Virtual integration accesses data from source systems directly, which in many situations is not optimized for BI queries. Hence query performance may be slow in such cases. However, EII features (such as the caching of query results and availability of powerful hardware that can execute these queries in parallel mode) help overcome the performance constraints and make these tools a viable alternative for some situations.
EII offers benefits such as access to real-time data and transparency to the source data systems. EII tools can be used to provide integrated BI across different data formats as they can integrate data available in different formats. Further, EII tools can disseminate data to end users or systems in a variety of formats (Excel, Web services, and text files, among others) and integrate with BI reporting and analytics tools as well. Output interfaces like Web services provide reusable data services to organizations that can then be leveraged by different applications.
The virtual pattern does not work well when data summarization is required or when historical information must be captured. Also, since the data is provided directly from the source data systems, availability of source systems is an important consideration in applying the virtual data integration pattern. Further, EII does not match the data cleansing capabilities provided by ETL routines involved in physical integration and provides limited cleansing capabilities. For repeatable, non-trivial data cleansing needs such as de-duping, an ETL tool will be required to cleanse the data. Typically, the cleansing requires setting up a physical staging area. The dirty data goes through ETL, validation, and data quality routines and moves to the staging area before the EII tool can access it for integration.
3. The Hybrid Pattern
This pattern (see Figure 3) manifests itself in combinations of virtual and physical patterns to integrate data for BI. The use of virtual or physical pattern combinations for integrating a set of data sources is driven by the constraints of those data sources. Some of the possible combinations include:
- Using the physical pattern for integrating data sources and then disseminating the integrated data virtually. The virtual dissemination is attained using EII tools that expose the integrated data as Web services. This combination is appropriate when there is a need to capture historical data, or there are data source constraints such as low data source availability. Both of these imply a requirement for persistence or physical integration. However, at the same time there may be a need for a high degree of reuse of the disseminated data service through Web services, implying a need for virtual integration to coexist with the physically integrated data.
- Using a physical pattern for integrating a set of data sources, then virtually integrating this physical data set with another set of data sources. This integrated data can then be disseminated virtually. An administrator can use the EII tool to create views that access the integrated data, as well as create “über-views” that integrate underlying views. These views can be queried by the end user or application. This combination is appropriate when some of the source data systems necessitate physical integration due to various constraints but others can be integrated virtually. A real-life scenario is integrating data from transactional systems with data in an existing data warehouse owned by a different organization. The transactional systems can be integrated physically, then virtually integrated with the existing data warehouse to provide an integrated BI view.
The hybrid pattern provides benefits of both the physical and virtual approaches when applied in a combination that meets data integration constraints specific to the enterprise.
