Creating a centralized data repository


The client grew its assets under management (AUM) from $3.5bn to $10bn over two years. There is currently no centralized data repository, and each department has rolled out its own solution. As expected, this leads to data trust issues and manual report collation.


A Hadoop-based data lake with:

  • Controlled data ingress through an ETL layer
  • Data quality checks with resolution workflow
  • Flexible, dynamic schema that can evolve through the solution lifetime
  • Reporting solution with canned reports as well as self-service ability.

The technology stack: Hortonworks HDP 2.6 big data stack; Pentaho for ETL; Activiti workflow engine; Tableau reporting suite.

Key features

  • Incoming data feed definitions are entirely configuration-based and held in the Apache Atlas metadata store
  • ETL layer is powered by Pentaho
  • Data moves from raw to staging to production, with extensive quality checks at each step to ensure veracity
  • Exception reporting and resolution (whether quality-related or otherwise). The workflow service runs standalone, so it can also serve other enterprise workflow needs
  • A reporting data mart to power Tableau reporting suite
  • Self-service reporting is available through standard Tableau tools
  • Data governance (QA, Security and Lineage) built into the design
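
The configuration-driven feed definition and the raw-to-staging quality gate described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual implementation: the feed name, field names, rule keys and the `check_record` and `promote` helpers are all hypothetical, and in the described solution the feed metadata would live in the metadata store while the checks would run inside the ETL layer.

```python
# Hypothetical feed definition. In the described solution this metadata
# would be held in the metadata store rather than in code.
FEED_CONFIG = {
    "feed": "positions_daily",
    "fields": {
        "account_id": {"type": str, "required": True},
        "market_value": {"type": float, "required": True, "min": 0.0},
        "currency": {"type": str, "required": True,
                     "allowed": {"USD", "EUR", "GBP"}},
    },
}


def check_record(record, config):
    """Return a list of human-readable quality issues for one record."""
    issues = []
    for name, rule in config["fields"].items():
        value = record.get(name)
        if value is None:
            if rule.get("required"):
                issues.append(f"{name}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            issues.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            issues.append(f"{name}: below minimum {rule['min']}")
        if "allowed" in rule and value not in rule["allowed"]:
            issues.append(f"{name}: value {value!r} not allowed")
    return issues


def promote(raw_records, config):
    """Split raw records into clean staging rows and exceptions.

    Exceptions carry the failing record plus its issues, so a separate
    resolution workflow could pick them up.
    """
    staging, exceptions = [], []
    for record in raw_records:
        issues = check_record(record, config)
        if issues:
            exceptions.append({"record": record, "issues": issues})
        else:
            staging.append(record)
    return staging, exceptions
```

Because the rules are data rather than code, onboarding a new feed in a design like this would mean adding a new configuration entry, while `promote` stays unchanged.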

Key wins

  • Low-cost solution - traditional warehouse/BI technology stacks carry heavy licensing, installation and maintenance costs. The ODA solution meets all the architectural needs of a modern data management solution while using open-source technologies entirely.
  • Opened the door to new-generation capabilities such as machine learning and data science.

Solution Overview

  • Multi-account infrastructure using AWS Landing Zone to help reduce the time in setting up secure and scalable workloads while implementing an initial security baseline through the creation of core accounts and resources
  • Infrastructure-as-Code (IaC) through automation templates based on Terraform/CloudFormation to help minimize the time required to roll out the basic infrastructure for application environments within individual accounts and to support the end-to-end application lifecycle
  • Configuration management using AWS Config to ensure configuration meets standard baselines and to facilitate rapid assessment of multi-account IT infrastructure components. This also helps simplify compliance auditing, security analysis and change management
  • Organization wide centralized monitoring and compliance using CloudWatch and CloudTrail integrated with Splunk
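
The idea of checking every account against a standard baseline can be illustrated with a small sketch. It is plain Python and does not call AWS Config or any other AWS API: the setting names, the `BASELINE` dictionary and the `find_drift` helper are assumptions made for illustration only.

```python
# Hypothetical baseline: settings every account is expected to meet.
BASELINE = {
    "cloudtrail_enabled": True,
    "s3_public_access_blocked": True,
    "root_mfa_enabled": True,
}


def find_drift(accounts, baseline):
    """Map each account name to the settings that deviate from the baseline.

    A setting that is absent from an account is reported with value None.
    Fully compliant accounts are left out of the result.
    """
    drift = {}
    for name, settings in accounts.items():
        deviations = {
            key: settings.get(key)
            for key, expected in baseline.items()
            if settings.get(key) != expected
        }
        if deviations:
            drift[name] = deviations
    return drift
```

A report built from a result like this would list only the non-compliant accounts and the specific settings to fix, which is the kind of output that makes multi-account audits quicker to review.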