Data cleaning (also called data cleansing or scrubbing) is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleansing is a major part of the so-called ETL process. Data cleansing deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems or global web-based information systems, the need for data cleansing increases significantly. Most data cleansing is typically performed in a separate data staging area before loading the transformed data into the warehouse. A data cleansing approach should satisfy several requirements. First of all, it should detect and remove all major errors and inconsistencies both in individual data sources and when integrating multiple sources. The approach should be supported by tools to limit manual inspection and programming effort and be extensible to easily cover additional sources. Furthermore, data cleansing should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata. Mapping functions for data cleansing and other data transformations should be specified in a declarative way and be reusable for other data sources as well as for query processing. Especially for data warehouses, a workflow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.
Rahm and Do (University of Leipzig, Germany) in their 2000 IEEE paper on the subject classify major data quality problems to be solved by data cleaning and data transformation into the nature of the integration required (single source vs. multi source). They also describe the different approaches to be taken in phases to address the data cleansing process. The approaches, described in detail in the authors’ work, are summarized as follows:
Data analysis: To detect and create an inventory of errors and inconsistencies to be removed. May involve manual inspection of the data or data samples as well as profiling programs.
Definition of transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity and the ”dirtyness” of the data, a large number of data transformation and cleaning steps may have to be executed.
Verification: The correctness and effectiveness of a transformation workflow and the transformation definitions should be tested and evaluated. Multiple iterations of the analysis, design and verification steps may be needed.
Transformation: Execution of the transformation steps either by running the ETL workflow for loading and refreshing a data warehouse or during answering queries on multiple sources.
Backflow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources in order to give legacy applications the improved data too and to avoid redoing the cleaning work for future data extractions.