The following criteria should be considered while evaluating vendors and technologies for meeting data quality requirements in an organization:

  1. Enterprise Data Quality Framework – Evaluate the degree of effort required to integrate the wide range of functional capabilities into a single architecture and product, so that end users will ideally have a single point of access and integration point for the Data Quality  domain. This requirement may be a long term goal as it is almost impossible to achieve enterprise wide data quality platform from a single initiative. A multi-year program consisting of small subject-oriented projects is a more feasible approach. However, the vendor plays an important role throughout the implementation of the program, and therefore should be evaluated against this criteria. Of course, it is critical to analyze if a true enterprise wide implementation is required, as a localized departmental one may suffice; it really depends on the organization’s mandates.
  2. Functionality Coverage – Evaluate the extent of functionalities an capabilities of the data quality platform/solution. The usual capabilities that are required from a robust data quality platform include data profiling, reference data standardization, value matching, data rule analysis and rules discovery, data enrichment using supplemental data feeds and data quality monitoring. Depending on the specific requirements, evaluate a platform based on the depth of coverage for these and other capabilities. Of course, a solution that offers all of these may not be a best fit for the organization’s requirements, therefore a one size fits all approach may not be ideal either.
  3. Adaptability and Usage by non-IT Resources – Recent economic and organizational evolution has increased the focus on the business, non-IT resources to own and to some extent manage domains such as data quality and master data management. To meet this requirement, it becomes necessary for the system and the tools to be more business user friendly. They should be able to be managed easily and effectively with minimum confusing clutter in the presentation layer, and a seamless workflow that is transparent to the end users. The extent and degree of these requirements should be properly evaluated.
  4. Operational Characteristics of the Data quality Platform – Data Quality solutions can be deployed in multiple ways depending on an organization’s budgetary and organizational requirement. The three main options are on-premises software deployment, hosted solutions and software as a service [SaaS]. The technical capability of a product or solution to be suitable for the appropriate deployment option, the capability of a vendor to support and provide partnership for each of these options as well as the technology requirement to manage the data and code that will reside in each layer should be analyzed.
  5. Centralized Data Quality Management Governance Framework – It is always a good idea to think about managing the data quality aspects similar to the fashion in which a master data or a metadata repository will be owned and operated by a dedicated team. To that effect, establishing the roots of a centralized data governance framework will reap benefits when the data quality initiative is operational. The capability of a vendor to support seeding such an initiative, least of all by sharing best practices, proposed organizational models and workflows will go a long way in helping out setting up the direction for data quality success. Rolled into this may be the task of increasing awareness across the business and technology community of the importance of data quality and therefore the concept of design for quality so that quality is built into future systems and not treated as an after the fact exercise.
  6. Non-technology characteristics, such as cost, maintenance/support, upgrades, pricing models, viability and lateral partnerships of the vendors are critical factors since data quality is a program rather than a technology and the relationship with the vendor is a long term one.

There has been a drive by increasingly technically savvy business teams to perform self-service analysis on the business information that they intimately know in terms of element relationships and business rules. At the same time these business teams do not want to involve their IT counterparts in all steps of the analysis. The primary reason for this drive is simply the overhead required in terms of time and resources to involve IT in generating critical information that drive business decisions. If business teams could get their hands on tools that allow them pull information in from the data repositories and allow them to do the analysis they need fast and in an easy, non-technically daunting manner, the value of such tools is easily bought by business sponsors. Of course, in no way does this imply that IT is redundant, IT has to be involved in the enterprise reporting level and to maintain the infrastructure that supports the so-called local departmental analytics, and they will always be brought into picture when a certain local analysis is so critical that it needed to be “productionalized” and deployed to a larger audience.

Unfortunately, thus far such “magic” self-service tools that business teams could master and use locally did not exist, or if they did they were either in spreadsheet formats with restricted analysis capabilities, or embedded inside of bigger complex software suites and therefore were cost prohibitive. The fast response times from such local applications where the business user wanted results rather than waiting for response prompts from the reporting applications were absent as well.

The advent of in-memory analytics has sought to address the desire by business users to do their own analysis to a large extent. Such niche tools have been around for a long time. However, the changing dynamics of the industry (consolidations), the decreased cost of processing power and memory, as well as the technical advances in operating systems has suddenly found these niche tools on the “right side of the wall”. These tools are classified  in the in-memory analytics niche of business intelligence tools. Thus far they serve to compliment rather than compete against existing business intelligence suites. Several niche players have emerged in this space, the leaders being Qliktech (with Qlikview), and Tibco (Spotfire). The technologies basically load data from sources into the memory of the machine they are running on, rather than landing or reading data (or parts of the data) to the disk as is done in other business intelligence tools. Inherent efficiencies of a 64-bit OS platform, as well as the current typical RAM memory configurations in multiples of GBs allow complex and large data sets to be simultaneously loaded to the memory, providing near-instantaneous response times to user queries.

Bigger players in the business intelligence space have started offering comparable technologies as part of their bigger suites. Of note, Micorosft announced a release of their in-memory engine (thus far known as Project Gemini) in latest release of their flagship SQL Server platform. Oracle has also announced in-memory capabilities in the latest  build of their 11g database. Given the increased value especially perceived by end user groups, it will not be surprising if in-memory features become an integral part of all business intelligence suites in the near future.

The level of aggregation as well as the granularity of details required from management information reports is not known to absolute certainity at the time of analysis and design of a data warehouse. At the same time, it is not optimal to assume that the data warehouse should hold the most granular data available in the transactional systems. Therefore, the data warehouse must be designed in such a way that it provides optimal support for aggregation on the fly and for navigation through aggregated hierarchies, thus supporting reporting in any desired layout.

While there are sound methods for the analysis and design of ordinary transaction processing  systems, a comparable method for the development of management information systems such as a data warehouse remains not solidly defined. Inmon deals with many phenomena related to data warehouse design, but leaves the ‘how’ of it completely untouched. Schouten’s article (link) on analysis and design of data warehouses is an attempt to devise a proper way of thinking and working to achieve this goal. Schouten outlines two complementary methods for the analysis of data warehouse relations; one simple and the other advanced. The simple method exploits the knowledge contained in an ordinary relational schema. The advanced method is based on the analysis of derivation rules. Subsequently, the design of data warehouses based on these methods is investigated. Special attention has been given to the actuality of data warehouses that contain historical information, to the transitivity of derivations, to the navigation through aggregation hierarchies via so-called drill paths and to the maintenance of various aggregation levels within a single data warehouse relation.

One of the goals of a data warehouse is to provide management information that will be used to make organizational policy decisions. By nature, all management information is derived from operational information, i.e. information created and stored in transactional information management systems. Therefore, it follows that management information is almost always an aggregated, derived form of the operational information. A distinct observation about the difference between analysis and design as applicable to data warehousing systems is as follows: analysis consists of the detection of derivable facts and the applicable derivative rules, while design consists of grouping the derivable facts into well formed relationships in the data warehouse.

Citation: H. Schouten, “Analysis and Design of Data Warehouses”;  Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW’99)

Data cleaning (also called data cleansing or scrubbing) is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleansing is a major part of the so-called ETL process. Data cleansing deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems or global web-based information systems, the need for data cleansing increases significantly. Most data cleansing is typically performed in a separate data staging area before loading the transformed data into the warehouse. A data cleansing approach should satisfy several requirements. First of all, it should detect and remove all major errors and inconsistencies both in individual data sources and when integrating multiple sources. The approach should be supported by tools to limit manual inspection and programming effort and be extensible to easily cover additional sources. Furthermore, data cleansing should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata. Mapping functions for data cleansing and other data transformations should be specified in a declarative way and be reusable for other data sources as well as for query processing. Especially for data warehouses, a workflow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.

Rahm and Do (University of Leipzig, Germany) in their 2000 IEEE paper on the subject classify major data quality problems to be solved by data cleaning and data transformation into the nature of the integration required (single source vs. multi source). They also describe the different approaches to be taken in phases to address the data cleansing process. The approaches, described in detail in the authors’ work, are summarized as follows:

Data analysis: To detect and create an inventory of errors and inconsistencies to be removed. May involve manual inspection of the data or data samples as well as profiling programs.

Definition of transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity and the ”dirtyness” of the data, a large number of data transformation and cleaning steps may have to be executed.

Verification: The correctness and effectiveness of a transformation workflow and the transformation definitions should be tested and evaluated. Multiple iterations of the analysis, design and verification steps may be needed.

Transformation: Execution of the transformation steps either by running the ETL workflow for loading and refreshing a data warehouse or during answering queries on multiple sources.

Backflow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources in order to give legacy applications the improved data too and to avoid redoing the cleaning work for future data extractions.

There is a continual pressure to reduce information latency times in the quest to keep a finger on the pulse of a business. This pressure stems from the rapid changes in the business landscape that have shaped IT strategy over the past four or five years. The move to real time information delivery demands just-in-time reporting. This demand is not restricted to the operational or transactional reports; the demand has expanded to include the analytics and business intelligence as well. Traditionally, these analytics and business intelligence applications were sourced from data warehouses that would be updated in a nightly batch window. As such, a day’s latency was the norm for analytics, and sufficed to know how the business did at the end of the last business day. However, in today’s world, it is necessary to know how the business is doing every hour. This is especially true for high volume rapid transaction systems such as financials, retail and airlines.

The demand for real time reporting and analytics is translated to stringent requirements on the backed data integration systems. The traditional flow of data to a data warehouse is via the ETL routines. These routines need to be architected to meet the near-real time latency requirements. Some typical challenges in a real time data integration environment are:

  • Necessity to conduct business 24/7 is reducing batch windows. As more business is done across time zones and over the web, many organizations are faced with the problem of shrinking batch windows, making it more difficult for traditional ETL tools to extract data in the short time available.
  • Large volumes of information are difficult to handle in a batch window. As more information is gathered – such as online transaction data, inventory data, and customer information – the effort involved in moving it to the warehouse increases drastically. Many organizations are finding that an eight hour batch window is no longer sufficient for traditional ETL (extract, transform, load) tools to integrate all of the needed data.
  • The need to track all changes for auditing purposes. Organizations need to comply with regulations, which often requires them to continuously track all changes to data and not just the net result of those changes.
  • Growing need to detect and react to business events as they happen. Many organizations are looking for ways to detect business events in production systems and have those events trigger a response in another system. For example, a cell phone company would like to send a text message to a customer running low on minutes asking if him if he would like to purchase more.
  • Organizations want to deploy new applications using data on legacy systems without paying for an increase in workload. Often, legacy systems are already maxed out and new capacity is very expensive. Organizations want new applications on UNIX or Windows to avoid this cost, but integrating the data from those legacy systems without increasing the load on them is a key challenge
  • Increasing need to keep data in sync across the enterprise. Customers want up-to-the-minute access to order, payment and inventory data so they can buy products, pay bills and check delivery status online. Employees need much of the same so they can better service customers and make wise business decisions. To accomplish this, eCommerce data needs to be in sync with business applications and data needs to flow in real-time across the enterprise.

 For more information, the reference for this post is here (Link to IBM)

Companies constantly morph their operational models to find optimal ways of accessing new revenue streams and to cut costs. As a result of these morphing, the underlying technology infrastructure needs to adjust and adapt as well. Mergers and acquisitions, reductions and streamlining, integration with supply chain partners demand constant changes to the data models, often creating lots of new niche challenges. Companies look forward to take advantage of the new operational models to launch new initiatives and offer new products to grow even more.

However, many companies fail to realize the benefits, often as a result of lack if proper information. For many companies, the resulting changes to the underlying data infrastructure comes with often unsurmountable challenges. Data integration from multiple sources, standardization of codes etc become difficult to maintain. Understanding the customer demands and partner transactions is obscured by lack of integrated common data models. Any in many cases, the data surfaced by the end user facing business intelligence systems is inconsistent across different departments, leading to large data reconciliation and audit efforts.

Dr. Fay Cobb Payton and Dr. Robert Handfield of NCSU studied these possible problem sources and proposed guidelines for these issues (Data Warehousing Implementation and Outsourcing Challenges: An Action Research Project with Solectron (with R. Handfield), Communications of AIS, 2003 (12), 633-648), applicable to scenarions where the data warehouse is implemented by a service provider. They recommend that companies wishing to integrate data from outside parties must first re-evaluate their internal processes to ensure that the data already present is in a usable format. Firms should centralize the data from multiple departments to a single platform and continually evaluate the data’s quality and reliability by proper implemenation of a data governance program and using established quality metrics. Additionally, they recommend that the data warehouse performance be continually monitored. Lastly, on the organizational front, they recommend defining and honing the skill sets required in-house before chosing an data warehouse integration partner.

An integral part of an MDM solution is the identification of the business critical reference data elements. Such data elements are often embodied as codes in a data model. Those codes that are candidates to be managed via an MDM solution can be termed as global data codes. Global data codes are the key cornerstones to achieve integration of transaction systems, and enable reporting across functions and businesses. It is essential to identify these global codes and classify them in the proper business context.

Examples of global data codes, or master data objects might be customer, supplier, product, employee, business group, payment term, currency, language etc.

Though this sounds like a simple concept, identifying global data codes can be a challenging task, especially in organizations where silo-ed data management systems exist as a result of an inconsistent, or a total lack of, standard data management practices. Exactly what qualifies to be a global data code to be managed by the MDM system is a careful exercise to be undertaken by a joint team of business and technical functions.

One of the challenges in identifying global data codes is that often the same business entity can be represented by different codes, or similar sounding descriptives across an organization. An empirical analysis across all existing fragmented business data models to define a consistent enterprise-wide data model is the first step to identify the overlapping definitions of the codes. Once the common data model and the associated semantic metadata has been identified and centralized, the analyst will find it much easier to identify overlapping or conflicting definitions of the global data codes, and thus be able to rank the codes based on their usage and criticality and then make them candidates for the MDM solution.

Defining Master Data
Master data is a set of core data elements—with their associated hierarchies, attributes, properties, and dimensions—such as customer, product, legal entity, chart of accounts, employee, vendor, market channel, geographic location, etc., that span the enterprise IT systems and drive the business. Master data is not transaction data. Transaction data is information that is generated and captured by operational systems and describes the activities, or transactions, of the business. Master data is core elements of the business that are applied to multiple transactions and are used to categorize, aggregate, and evaluate the transaction data. Master data is the company’s set of “control” data that enables IT systems and users alike to understand the meaning of each piece of data in a transaction.

Effect of managing master data

Effectively managing master data facilitates quality information by establishing a mechanism for shared internal information controls: business units have autonomy over the information they use, but IT has ultimate oversight and control of corporate information. Auditability is another by-product of managing master data. Effectively standardizing and managing master data, gives a company the ability to track the change history of corporate information, as well as the ability to roll back to a prior state in the event of a systems failure or for comparing prior states with current or proposed future states. Effective master data management also enables enhanced, seamless reporting capabilities, which can result in reduced time needed to implement changes to the business, i.e., improved flexibility and adaptability. Finally, effective data relationship management facilitates referential integrity across systems that share master data and can, by doing so, support alignment of information and decision making.

Given the analytical nature of data warehouses (DWs) and the need for understanding the source data and its fluid, changing user requirements, a data warehouse as an entire project is not an appropriate candidate for outsourcing. There are benefits in hiring consultants for some or all of the work to having access to experienced consultants in scope, discovery, development and implementation to assist the in-house staff in knowledge development, skills and staff augmentation. In an outsourced situation, contracts are very clear about what will and will not be included. The very nature of a data warehouse is that the users are never able to articulate all their requirements initially, meaning each new major opportunity requires renegotiations and contract changes. By the time these details are worked out, the opportunity might be lost. Strategic outsourcing makes sense in most medium to large data warehouse projects. The outsourcing contractor can supply the needed expertise and personnel at the various development phases. Though the up-front costs of an outsourcing firm to be higher than hiring in-house personnel, the long-term savings will be far greater with a professional outsourcing firm than by retaining in-house personnel.

The transient and changing nature of business transactions can be captured by appropriate versioning methods in the data warehouse. Variation of data over time is applicable for both reference as well as transactional data. An example of changing reference data is an individual changing home address as a result of a move. If the business requires the former and current addresses to be captured and retained, then the customer dimension will typically be modelled as a Type-2 (or as needs be, a Type-3) dimension to capture the old and new addresses with the associated start and end dates. Changes in the transactional data are easy to observe: a loan paid down over time by the borrower has its unpaid balance reduce with every payment (thus ties to the time dimension), and the changing unpaid principal balance captured in a fact table over time presents an example of the versioning of transactional data.

There are subtle flavors of archived, or versioned data that can be expanded on. For simplicity, lets call them the following:

  1. As Is – Current transactional and reference data (Current nature of the facts and dimension as they are known to be true today)
  2. As Of – Past transactional and reference data (Past value of a transaction tied to the then applicable dimension value at a former point in time)
  3. As Was – Current transactional and past reference data (Current value of a transaction tied to a dimension value applicable at a firmer point in time)

Another way to represent this information would be to present them in a question format. These would be typical business questions around a certain transaction:

  1. What is the value today, and how does it classified today? – As Is
  2. What is the value today, and how would it have been classified in the past? – As Of
  3. What was the value then, and how was it classified at that point in time? – As Was

Lets take an example. Lets go back to the example of a loan that is paid down over time. A simple dimensional model will be a Loan dimension, perhaps with a Loan Index Type attribute (showing what financial index is used to compute the loan’s interest rate), and a Loan Position Fact table, which sits at the intersection of the Loan Dimension and the Time Dimension (monthly grain). Lets say the value of the loan for two months are as follows:

  • On April 20, Loan 123 was tied to the Treasury index and had an UPB of $210K
  • On May 20, Loan 123 is indexed to LIBOR and has an UPB of $209K (the borrower made a payment of $1K towards the loan – a transaction; and the loan was modified to be tied to another index – a change in the reference data)
  1. An as-is query run on May 20 will reveal Loan 123 tied to LIBOR index having an UPB of $209K
  2. An as-of query run on May 20 for April 20 will return the following: Loan 123 with an UPB of $210K indexed to Treasury
  3. An as-was query run on May 20 for April 20 will return the following: Loan 123 with an UPB of $209K indexed to Treasury

The first two have obvious applications – how did a business unit (in this case a loan) look like today, or looked like in the past. The As-Was is a bit tricky. Think about a situation where the accounting treatment for a loan changes based on the index it is tied to. In that case, a business analyst would want to check how a current loan would have looked like if the index had been different in the past. An As-Was query would be what will be useful in that situation. In conclusion, it is easier to think about the data versioning in these three terms.

There can be additional variations, however, these three flavors – as is, as was, as of – are the main variations that can be used to drive the dimensional model.

top