Active Metadata Management: Extraction, Cataloging and Management
By Kalyana Chakravarthy Yellapragada
Data offers the promise of many benefits, from profits to efficiency, but to realize these benefits it must be understood and managed. Data of all kinds is growing exponentially and is outpacing organizations' ability to manage it. In recent years, data has also been universally recognized as a valuable asset. It has great potential to maximize efficiency, pinpoint new opportunities, and report on the status of mission goals. Like any other asset, however, it requires management and tracking.
A Data Catalog is an automated inventory of data assets, augmented by machine learning (AI/ML). It enables users to discover and explore all available data sources, improves their understanding of these sources, lets them collaborate with other users to enrich the quality of the assets, and helps the organization achieve more value from its data.
Industry scenario or Technology evolution
What is Data Catalog
Data Catalogs promote intelligent and secure data sharing by centralizing, labelling, and monitoring your organization’s data assets. This single control plane allows for better collaboration, stronger regulatory compliance, and reduced overhead.
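The idea of centralizing and labelling assets can be illustrated with a minimal sketch. It is not how any particular catalog product works; the class names, field names and sample assets below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data asset (all field names are illustrative)."""
    name: str
    source: str
    description: str
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """A tiny in-memory inventory of data assets with keyword search."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Centralizing: every asset is registered once, keyed by its name.
        self._entries[entry.name] = entry

    def label(self, name: str, tag: str) -> None:
        # Labelling: tags are stored in lower case so search is case-insensitive.
        self._entries[name].tags.add(tag.lower())

    def search(self, keyword: str) -> list[str]:
        # Match the keyword against the name, the description and the tags.
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower() or kw in e.tags
        )


# Hypothetical sample assets.
catalog = DataCatalog()
catalog.register(CatalogEntry("sales_orders", "warehouse", "Customer orders by region"))
catalog.register(CatalogEntry("hr_employees", "hr_db", "Employee master data"))
catalog.label("hr_employees", "Sensitive")

print(catalog.search("orders"))
print(catalog.search("sensitive"))
```

A real catalog would populate such entries automatically from source systems and persist them; the sketch only shows the shape of the inventory and the search over it.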
What Data Catalog Contains
A single source of the truth for the enterprise.
Ability to find new data sources and monetize Data elements
User Data Monitoring and Auditing of Data Lineage
Role-based access control to ensure Data Security
Open-source or COTS solutions, On-premise or Cloud-based
Integration with Data Governance
Data Usage: Who is using the Data and What is the purpose of using the Data?
Data Greenness: Are we using the most up-to-date version of the data we need?
Data Security: How do I restrict permissions so that access to data is controlled to only certain rows and columns? Can I grant limited or read-only access easily?
Data Overhead: Is there another department in this organization that could use the same data? Is there a way for me to find out if we’re already buying it?
Data Redundancies: Are there different departments doing similar work on the same data?
Data Discrepancies: How can we link all of the data we have to ensure we’re conforming to the same standards across our whole organization?
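The Data Security question above, restricting access to certain rows and columns, can be sketched as a role-based policy check. This is an assumption-laden illustration rather than the mechanism of any specific product: the roles, column names and sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Policy:
    """What a role may see: a set of columns and a row-level predicate."""
    allowed_columns: frozenset[str]
    row_filter: Callable[[Row], bool]


# Hypothetical roles and policies.
POLICIES = {
    "analyst_emea": Policy(
        frozenset({"order_id", "region", "amount"}),
        lambda row: row["region"] == "EMEA",
    ),
    "auditor": Policy(
        frozenset({"order_id", "region", "amount", "customer_email"}),
        lambda row: True,
    ),
}


def read_rows(role: str, rows: list[Row]) -> list[Row]:
    """Return only the rows and columns that the role's policy permits."""
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"role {role!r} has no access")
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Hypothetical sample data.
ORDERS = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "customer_email": "a@example.com"},
    {"order_id": 2, "region": "APAC", "amount": 80.0, "customer_email": "b@example.com"},
]

visible = read_rows("analyst_emea", ORDERS)
print(visible)
```

Here the EMEA analyst sees a single row without the email column, while the auditor role sees every row and column. Access is read-only by construction, because the function returns filtered copies and never modifies the source rows.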
Business needs in focus
A Data Catalog addresses a variety of business needs. The following are some of them:
Data Lake modernization: Enterprises load data from various source systems into a Data Lake in raw format with minimal metadata, so users find it difficult to identify, understand and access the data. A Data Catalog on top of the Data Lake gives users such as business analysts and data scientists the information they need.
Cloud Modernization: Many enterprises are moving to cloud infrastructure and services. A Data Catalog provides visibility of data across on-premise, cloud and hybrid environments. In addition, its data lineage records the source and target systems involved, which helps verify that no data is lost during cloud migration and integration. A Data Catalog also supports cloud spend analysis based on source data, so users can make informed cost-comparison decisions.
Data Democratization: In enterprises, data is spread across multiple departments and stored in different systems, which makes it challenging to organize, maintain and use effectively and efficiently. A Data Catalog provides reliable, predefined and pre-approved data. In addition, systems do not need to wait for the required datasets, and the catalog includes a streamlined approval process for using any dataset. This increases productivity and reduces the time spent searching for datasets before analyzing them and sharing the findings.
Discovering Sensitive Data: A Data Catalog helps discover sensitive data objects that the business does not know exist and that are subject to compliance requirements such as PII protection and GDPR.
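Sensitive data discovery can be sketched as pattern-based column classification. Commercial catalogs typically combine such rules with machine learning; the version below uses only regular expressions, and the patterns, the 80% threshold and the sample table are assumptions made for illustration.

```python
import re

# Hypothetical detection rules; a real deployment would need many more.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def classify_column(values, threshold=0.8):
    """Return a sensitivity label if enough non-empty values match one pattern."""
    non_empty = [str(v).strip() for v in values if v not in (None, "")]
    if not non_empty:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            return label
    return None


def scan_table(table):
    """Map each column that looks sensitive to its detected label."""
    found = {}
    for column, values in table.items():
        label = classify_column(values)
        if label is not None:
            found[column] = label
    return found


# Hypothetical sample table, stored column by column.
table = {
    "contact": ["ann@example.com", "bob@example.org", ""],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "city": ["Pune", "Noida"],
}

report = scan_table(table)
print(report)
```

The scan flags the contact and tax_id columns and leaves city unlabelled. Classifying by the share of matching values, rather than by column name, is what lets such a scan surface sensitive data the business did not know it held.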
Technology Ecosystem, Products and Platforms
There are many products in the industry that support Data Catalog. Some of them are Data.World, Informatica, Collibra and Alation.
The Gartner report below highlights product capabilities in various sectors.
Approach for Technology evaluation and selection
We can evaluate the tools by comparing their properties, categories and functionalities.
For a Data Catalog, some of the functionalities to compare are as follows.
Column similarity Identification
Data classification of structured, semi structured and unstructured data
Intelligent recommendation of other Data sets similar to those the users are working on
Questions & Answers
Metadata Repository & Metadata Search
Data Lineage from Java, COBOL and SQL code
Holistic relationship view
Logical, Physical Data Model Integration
Business Glossary association with technical Metadata.
Searching Technical assets using Business Glossary
Data Governance Integration
Access to the Metadata Knowledge Graph through REST APIs
Knowledge Graph acting as a Data virtualization layer
API connectivity: Documents, Big Data, Cloud Platforms, Applications, ETL tools, BI and Databases, etc.
Data Catalog self services based on Data Domains, Searches
Catalog customizations: Ratings, Rankings and Reviews
Collaboration with other users.
Manage Data Access, Track user behavior and Guide the Analysts
Catalog as Cloud Native SaaS offering
Built-in Query Editor
Chat and comment streams that accompany each Data asset
Ease of deployment
Quality of end user Training
Timeliness of Vendor Response
Quality of Technical support
Quality of peer user community
Advanced Features and Focus Areas
Ability to understand needs
Number of customers
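Of the functionalities listed above, column similarity identification is easy to illustrate. One common approach, assumed here rather than taken from any specific product, is to compare the sets of distinct values in two columns using Jaccard similarity. The column names, sample values and the 0.5 threshold are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_columns(columns: dict[str, list], threshold: float = 0.5):
    """Return (left, right, score) for column pairs whose value sets overlap enough."""
    names = sorted(columns)
    # Normalize values so that "IN" and "in " count as the same value.
    value_sets = {n: {str(v).strip().lower() for v in columns[n]} for n in names}
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = jaccard(value_sets[left], value_sets[right])
            if score >= threshold:
                pairs.append((left, right, round(score, 2)))
    # Highest-scoring pairs first.
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical columns from two different systems.
columns = {
    "crm.country": ["IN", "US", "UK", "DE"],
    "erp.country_code": ["in", "us", "uk", "fr"],
    "erp.currency": ["INR", "USD", "GBP"],
}

matches = similar_columns(columns)
print(matches)
```

The two country columns share three of five distinct values and are reported as similar with a score of 0.6, while the currency column matches neither. Comparing every pair of columns is quadratic, so at scale this would be replaced by sampling or sketching techniques; the example only shows the underlying measure.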
Outcome or Business benefits envisaged
Data Catalog can help improve your data cost-to-value ratio, spur collaboration and creativity with better data access, and solidify your data-driven culture.
Provides the trustworthy Data sources required for Data Governance and Data Privacy.
Scalable and flexible Data Infrastructure
Provides organized insights into Data by breaking down silos among users and creating a central Data Domain.
Ensures stronger Data Governance and Data Stewardship.
Monitors the Data effectively.
Acts as a distribution hub by sharing datasets among users based on role-based permissions.
Enables organizations to discover, govern and monetize the Data.
Makes the data flexible, visible, usable, shareable and secure.
Coforge value proposition or USP
Coforge has the following value proposition to implement Data Catalog.
In summary, a Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users find the data they need, serves as an inventory of available data, and provides information to evaluate the fitness of data for intended uses.
Coforge has end-to-end experience in implementing Data Catalog, showcasing results according to customer needs, and guiding customers to identify the best ways to implement it.