Data Vault Architecture for Enterprise Data Warehouse

By Debashis Parida

Enterprise Data Warehouse (EDW) architecture needs to be robust, scalable, and flexible to adapt any change in the business environments. Traditionally, the relational and structured data sources feed data to data warehouse. In latest time, there is a demand to integrate structured data with unstructured data coming mainly from social media, external contents available in public domain, documents, email, etc. The architecture built based on the normalized and dimensional design techniques neither respond to the fast changes to the business requirements nor applicable to solutions where Big Data is applicable. Data vault based architecture is suitable to address the fast changes to requirements.

A good data architecture determines what are the key building blocks in data environment and ensures that the data needs of the enterprise are met in data warehousing solution. Through that the design, data flow, data management and data governance processes are defined. A typical data warehousing solution follows either Kimball based solution or Inmon’s based solution.

Kimball has introduced an often-used, two-layer architecture for the data warehouse. The raw data from the source systems is loaded into stage layer before it gets cleansed, transformed and loaded into data warehouse. The advantage of Kimball's proposed architecture is that it is easy to build a dimensional store from a source system. However, the disadvantage is that it is complex to rework on the same dimensional model to incorporate the information for similar kind of sources or any modification of the source system. It has impacts to the existing ETL packages. This needs additional effort to rework on everything with less re-usability of existing solution.

In Inmon's defined three-layer architecture, data marts are developed on top of the data warehouse which is in 3NF form. The business users access the data marts using Business Intelligence (BI) tool, similar to the two-layered architecture. It is quite easy to develop data marts from the data warehouse in three-layer architecture as enterprise data is cleaned, standardized, and integrated. However, it is more time consuming and complex to work on data model and data processing for enterprise data warehouse in one go is suggested top-down approach.

Data vault architecture takes the best of solutions from both the architectures mentioned above and brings a new perspective to address the challenges in those architecture.

Data Vault Architecture:

Data Vault architecture has the following layers:

Staging Area: It stores the data as is available in source systems. It does not store historical information and captures only primitive data, mostly in independent structures.
Data Warehouse: The warehouse layer captures the data in integrated form using Data Vault design. Data Vault design in data warehouse has core vault and business vault that are for core data and processed data respectively. In this way, data warehouse in Data Vault architecture holds both raw and derived data.
Information Delivery Layer: Different data marts are maintained in the form of dimensional and other structures to fulfill the requirements of downstream requirements including analytical report and dashboard requirements.
Data

Data vault design has three components: Hub, Link, and Satellite. Hub has on the business key, Link captures the relationship between the business keys and Satellite has the non-business keys to capture majorly descriptive information. In data vault model, multiple satellite tables are there by following rate of change of information and types of information. The satellite tables capture the historical updates like Type II Slowly Changing Dimension in case of dimensional design. Data Vault has both the properties of normalization and dimensional design concept. The data vault architecture provides many-to-many relationship among the business keys in Link table which brings flexibility and extendibility in solution [1].

Data Vault Architecture

As EDW is treated as single source of truth at enterprise level, it needs to be flexible, scalable, and robust to accept any change happening to upstream layer, be it source systems, data structures, business rules, or addition of new source applications. Unlike other architectures, Data Vault architecture based EDW can handle easily following scenarios where changes are unavoidable at enterprise level:

Acquisition of new business When business grows in-organically, the organization acquire other organizations. In that process, new applications of similar of dissimilar natures get added at the enterprise level.
Upgradation of Source Applications: When the applications would upgrade to new version, different changes happened in that process such as the source data structures, relationship among the data structures, data format, etc.
Addition of new Applications: When the business processes expand, it demands new applications or application functionalities to support those processes.
Change in Business Rules: Quite often, the business rules pertaining to data definition / data dictionary, data filter, data range, data standardization, data aggregation, data hierarchy get changed and modified rules are used at enterprise level.
Change in Business Requirements: Based on the usage of the data warehouse, additional or change in business requirements come from the business users. If there are multiple changes come in the requirements, the dimensional design needs major changes and may need complete revamp of modelling [4].
Emergence of new technologies: The vendors is now favorable of considering Hadoop into their solutions apart from all latest technologies. The relational data is now combined with the information extracted from big data in the form of clickstream records, log files, sensor data, images and much more.

Major advantages of Data Vault architecture solution are:

For any change in source systems and requirements, no rework is required in the data model
Due to consistent design, the Extraction, Transformation, and Loading (ETL) scripts follow standard techniques to load the end to end warehouse
Simple mapping with the source systems ensures straight forward ETL codes
It supports parallel ETL process from different data sources. The loading sequence is hubs, then links, finally satellites.
It supports real-time, unstructured data and NoSQL environments [3]
It is flexible, scalable, consistent, and adaptable to the needs of the enterprise data warehouse
It supports historical traceability
It provides information about the benefits and advantages of the data, its historical changes based on new rules for the business user
Many-to-many relationships provide flexibility because changes to the business rules need not require to re-engineer the links
An incremental build-out of the enterprise data warehouse gets benefit from rapid delivery to integrate the source systems and deploy formation marts.

Comparison of Architecture:

	Inmon 3NF EDW	Kimball Bus Architecture	Data Vault
Implementation Strategy	Top-down approach The design needs big bang approach where it needs to be finalized at the start.	Bottom-up approach with iterative and incremental implementation methodology Long development time due to redesigns caused by delayed requirements.	Methodology follows agile development methods It supports pure incremental implementation [2].
Design	Fully normalized with optimal number of joins Design is not easy for data loading to dimensional model	De-normalized with less joins Designed for analytical query processing	Semi-normalized with many joins Unlike other designs, it is flexible to allow different relationships (one-to-one, one-to-many, many-to-one, many-to-many) without changing the data model structure.
Data Management	Referential integrity is maintained properly	Expensive updates and deletes	Data load happens based on design pattern (Hub and Link before Satellite tables)
Audit Mechanism	Duplicate data sources need priority which is difficult to maintain the audit tracking	Not fully auditable	Fully auditable
Data Presentation	Not suitable for end-user queries.	Star schema is suitable for user queries and reports	Not suitable for end-user queries

Following table covers a comparison of different architectures used for data warehouse implementation.

Conclusion:

Data Vault architecture is a mature data architecture as it supports impact free, robustness, flexibility and scalability in the solution through its optimal design in the data model. Coforge has different best practices, pre-defined templates and guidelines for implementing data warehouse solutions. Based on the customer’s requirements and needs, Coforge implements a specific architecture. In recent years, most of the organizations are considering agile based implementations. Data Vault architecture is a suitable architecture to implement enterprise data warehouse in agile approach.

Reference:

[1] Dan Linstedt, “Data Vault Series 4 – Link Tables”, TDAN.com, January 2004. https://tdan.com/data-vault-series-4-link-tables/5172

[2] Dan Linstedt, “Data Vault Series 2 – Data Vault Components”, TDAN.com, January 2003. https://tdan.com/data-vault-series-2-data-vault-components/5155

[3] Dan Linstedt and Michael Olschimke, “Building a Scalable Data Warehouse with Data Vault 2.0”, 2016.

[4] W. Inmon, D. Strauss, G. Neuschloss, “DW 2.0: The Architecture for the Next Generation of Data Warehousing”, Elsevier, 2008.