At a high level a generic data warehousing architecture can be said to comprise of the following components:
- Data Staging Layer (DSL)
- Operational Data Store (ODS)
- Data Warehouse (DW, EDW for an Enterprise)
- Data Marts (DM)
Each components functions are follows:
Data Staging Layer
- Decouples extraction processes from data cleansing processes
- Houses the transformation of formats of data from the incoming disparate systems to the single target data warehousing technology platform
- It reduces impact of loads on transaction system
- Maintains a ‘Snapshot’ that allows repeatable / re-startable ETL processes
- Is an area to store data using different load schedules to the DW
- Is not typically available to end users for data consumption
- Acts as an area to store 3rd party data
- Employs data manipulation not appropriate for the DW
- The data in Data Staging Area is: a copy of operational data, deleted after a short period so non-historical, in flat tables atomic in granularity
Operational Data Store (ODS)
ODS is a hybrid of data warehouse and operational systems. Functions of an ODS are:
- Allows OLTP response times, update capabilities and DSS capabilities
- Can be used for operational reporting because data from multiple source systems are integrated at this point
- Operational reporting is typically more structured so dimensional models are not required
- Data in the ODS is: kept for 6 months so non-historical, in flat tables or 3NF, often a copy of operational data, and is atomic in granularity
For architectural purposes, ODS’ are classified as follows:
- Class I – Real-time
- Class II – Two-hour to four-hour
- Class III – Daily
- Class IV -Aggregated data from the data warehouse
Enterprise Data Warehouse (EDW)
An EDW is the central hub of the BI infrastructure in some architectural approaches. The function of the EDW is to be the single point of truth for enterprise. Data in the EDW is:
- cleansed and harmonized during the ETL process
- Distributed to various other systems
- Often 3NF but can be dimensional
- Usually atomic but sometimes aggregate
The degree of normalization in the EDW data model is a tradeoff between multiple properties or characteristics of the data warehouse. The following matrix captures the trade-offs:
A Data mart is a departmental, or subject oriented, subset of the EDW. Data in the data mart comes from the EDW if one exists Relationship between EDW and data marts is often called a hub-and-spoke architecture. Data in Data Mart is often aggregated but may be atomic, dimensional, may or may not be historical, and may be volatile.
Data Warehousing Architectural Options
There are three architectural approaches for setting up a flexible, future-oriented data warehouse. This architectural decision is critical since the selection will play a large role in many aspects of the data warehouse such as data modeling methodology uses and physical infrastructure characteristics. The data warehouse architecture will determine the locations of the data warehouses and data marts themselves, and where the control resides. For example, the data can reside in a central location that is managed centrally. Alternatively, the data can reside in distributed local and/or remote locations that are either managed centrally or independently.
At a higher level, the three architectural approaches are:
- Enterprise data warehouse (EDW)
- Dependent data marts
- Independent data marts
These architectural choices do not have to be used exclusively by themselves, thy can be used in combinations as suited to the organizational needs. For example an often used option is to combine the EDW and dependent data marts in the same instance. The data marts are chained to follow in lock step with EDW, rather than directly from the staging area. The grain of the marts are restricted by the lowest grain of the EDW, as is the data latency and recency.
Enterprise Data Warehouse:
An enterprise data warehouse supports a large part of business requirements for a more fully integrated data warehousing environment that has a high degree of data access and usage across departments or lines of business. The data warehouse is designed and constructed based on the needs of the business as a whole. It is a common repository for decision-support data that is available across the entire organization. The term “Enterprise” reflects the scope of data access and usage, not the physical structure.
This type of data warehouse is characterized as having all the data under central management. However, centralization does not necessarily imply that all the data is in one location or in one common systems environment. While the data is centralized, it is logically centralized rather than physically co-located. When this is the case, by design, it then may also be referred to as a hub and spoke implementation. The key point is that the environment is managed as a single integrated entity.
Independent data mart architecture:
An independent data mart architecture, as the name implies, is comprised of standalone data marts that are controlled by particular workgroups, departments, or lines of business. There typically is no connectivity with data marts in other workgroups, departments, or lines of business. Therefore, these data marts do not share any conformed dimensions and conformed facts between them. This is one of the concerns when using an independent data mart. The data in each may be at a different level of currency, and the data definitions may not be consistent – even for data elements with the same name. Independant data marts are primary candidates for data mart consolidation for companies around the world today. The proliferation of such independent data marts has resulted in issues such as:
- Increased hardware and software costs for the numerous data martsIncreased resource requirements for support and maintenance
- Many redundant and inconsistent implementations of the same data
- Development of many extract, transform, and load (ETL) processes
- Lack of a common data model, and common data definitions, leading to inconsistent and inaccurate analyses and reports
- No data integration or consistency across the data marts
- Time spent, and delays encountered, while deciding what data can be used, and for what purpose
- Concern and risk of making decisions based on data that may not be accurate, consistent, or current
- Many heterogeneous hardware platforms and software environments that were implemented, because of cost, available applications, or personal preference, resulting in even more inconsistency and lack of integration.
- Inconsistent reports due to the different levels of data currency stemming from differing update cycles; and worse yet, data from differing data sources
Dependent data mart architecture
An interconnected data mart architecture is basically a distributed implementation. Although separate data marts are implemented in a particular workgroup, department, or line of business, they are integrated, or interconnected, to provide a more global view of the data. These data marts are connected to each other using conformed dimensions and conformed facts. Each of the dependant data marts typically has a common staging area. At the highest level of integration, the combination of all dependent data marts could be thought of as a distributed enterprise data warehouse. In an implementation where the EDW and dependent data mart architectures are combined, the Staging Area is basically replaced by the EDW.