Active Metadata Management: Extraction, Cataloging and Management

By Kalyana Chakravarthy Yellapragada

Data offers the promise of many benefits, from profits to efficiency, but to realize these benefits it must be understood and managed. Data of all kinds is growing exponentially and is surpassing organizations’ ability to manage it. In recent years, data has also been universally recognized as a valuable asset: it has great potential to maximize efficiency, pinpoint new opportunities, and report on the status of mission goals. Like any other asset, however, it requires management and tracking.

Synopsis

A Data Catalog is an automated inventory of data assets, augmented by machine learning (AI/ML). It enables users to discover and explore all available data sources, enhances their understanding of these sources, enables collaboration with other users to enrich the quality of the assets, and helps achieve more value from the organization’s data.

Industry scenario or Technology evolution

What is Data Catalog

Data Catalogs promote intelligent and secure data sharing by centralizing, labelling, and monitoring your organization’s data assets. This single control plane allows for better collaboration, stronger regulatory compliance, and reduced overhead.

What Data Catalog Contains

  • A single source of truth for the enterprise.
  • Ability to find new data sources and monetize data elements
  • Knowledge Graphs
  • Data Lineage
  • Impact Analysis
  • User Data Monitoring and Auditing the Data Lineage
  • Role-based access control to ensure Data Security
  • Open-source or COTS, on-premise or cloud-based solutions
  • Integration with Data Governance
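
Data Lineage and Impact Analysis from the list above can be illustrated with a small sketch. It assumes lineage is stored as simple upstream-to-downstream edges; the asset names and the function names (build_graph, impacted_assets) are hypothetical and do not reflect any particular catalog product.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream asset, downstream asset).
# The asset names are illustrative only and do not come from a real system.
LINEAGE_EDGES = [
    ("crm.customers", "lake.raw_customers"),
    ("lake.raw_customers", "warehouse.dim_customer"),
    ("erp.orders", "warehouse.fact_orders"),
    ("warehouse.dim_customer", "bi.sales_dashboard"),
    ("warehouse.fact_orders", "bi.sales_dashboard"),
]


def build_graph(edges):
    """Map each asset to the list of assets that read directly from it."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        graph[upstream].append(downstream)
    return graph


def impacted_assets(edges, changed_asset):
    """Return every asset downstream of changed_asset, in breadth-first order."""
    graph = build_graph(edges)
    seen = set()
    order = []
    queue = deque([changed_asset])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:  # visit each asset once, even with shared parents
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # Which assets should be reviewed if the CRM customer table changes?
    print(impacted_assets(LINEAGE_EDGES, "crm.customers"))
```

Walking the same edges in the opposite direction would answer the lineage question instead: where a given report's data came from.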

Industry Scenarios

  • Data Usage: Who is using the data, and for what purpose?
  • Data Greenness: Are we using the most up-to-date version of the data we need?
  • Data Security: How do I restrict permissions so that access is limited to certain rows and columns? Can I grant limited or read-only access easily?
  • Data Overhead: Is there another department in this organization that could use the same data? Is there a way for me to find out if we’re already buying it?
  • Data Redundancies: Are there different departments doing similar work on the same data?
  • Data Discrepancies: How can we link all of the data we have to ensure we’re conforming to the same standards across our whole organization?
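
The Data Security question above (limiting access to certain rows and columns) can be sketched as a role-based policy applied at read time. This is a minimal illustration under stated assumptions: the roles, column names, sample rows and the read_as function are all hypothetical, and a real catalog would normally delegate enforcement to the underlying data platform.

```python
# Hypothetical policies: each role lists the columns it may read
# and a row-level filter deciding which rows it may see.
POLICIES = {
    "analyst_emea": {
        "columns": {"order_id", "region", "amount"},
        "row_filter": lambda row: row["region"] == "EMEA",
    },
    "auditor": {
        "columns": {"order_id", "region", "amount", "customer_email"},
        "row_filter": lambda row: True,  # auditors may see every row
    },
}

# Illustrative sample data only.
ROWS = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "customer_email": "a@example.com"},
    {"order_id": 2, "region": "APAC", "amount": 75.5, "customer_email": "b@example.com"},
]


def read_as(role, rows):
    """Return only the rows and columns that the given role is allowed to read."""
    policy = POLICIES.get(role)
    if policy is None:
        # Roles without a policy get no access at all (deny by default).
        raise PermissionError(f"role {role!r} has no access to this dataset")
    return [
        {col: val for col, val in row.items() if col in policy["columns"]}
        for row in rows
        if policy["row_filter"](row)
    ]


if __name__ == "__main__":
    print(read_as("analyst_emea", ROWS))
```

Denying access by default means that a role nobody has reviewed cannot read anything until a policy is added for it.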

Business needs in focus

Data Catalog addresses a variety of business needs. The following are some of them:

  • Data Lake modernization: Enterprises store data from various source systems in a Data Lake in raw format with minimal metadata, so it is difficult for users to identify, understand and access the data. A Data Catalog on top of the Data Lake provides the information that users such as business analysts and data scientists need.
  • Cloud Modernization: Many enterprises are moving towards cloud infrastructure and services. Data Catalog provides visibility of data across on-premise, cloud, and hybrid environments. In addition, its data lineage shows the sources and targets involved and helps ensure that no data is lost during cloud migration and integration. Data Catalog also supports cloud spend analysis based on source data, so users can make informed cost-comparison decisions.
  • Data Democratization: In enterprises, data is spread across multiple departments and stored in different systems. As a result, it is challenging to organize, maintain and utilize the data effectively and efficiently. Data Catalog provides reliable, predefined and pre-approved data. In addition, consuming teams and systems do not need to wait for the datasets they require, because the catalog includes a streamlined approval process for using any dataset. This increases productivity and reduces the time spent searching for datasets, leaving more time to analyze and share findings.
  • Discovering Sensitive Data: Data Catalog helps discover sensitive data objects that the business does not know exist, such as personally identifiable information (PII) subject to regulations like GDPR.
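
As a rough illustration of sensitive data discovery, the sketch below samples column values and flags columns whose values mostly match a known pattern. The two patterns, the 80% threshold and the function names are assumptions chosen for the example; commercial catalogs typically combine many more rules with machine learning classifiers.

```python
import re

# Hypothetical detection rules. Real deployments would need far more patterns
# and locale-specific formats; these two are for illustration only.
PATTERNS = {
    "email": re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(values, threshold=0.8):
    """Return labels whose pattern matches at least `threshold` of the non-empty values."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return []
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels


def scan_table(table):
    """table maps column name -> sampled values. Returns only the flagged columns."""
    flagged = {}
    for column, values in table.items():
        labels = classify_column(values)
        if labels:
            flagged[column] = labels
    return flagged


# Illustrative sample; the column names do not hint at their content on purpose.
SAMPLE = {
    "contact": ["a@example.com", "b@example.org", None, "c@example.net"],
    "notes": ["called twice", "n/a"],
    "tax_id": ["123-45-6789", "987-65-4321"],
}

if __name__ == "__main__":
    print(scan_table(SAMPLE))
```

Classifying by content rather than by column name is what allows a scan like this to surface sensitive fields that nobody labelled as such.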

Technology Ecosystem, Products and Platforms

Many products in the industry provide Data Catalog capabilities, including Data.World, Informatica, Collibra and Alation.

Below is one of the Gartner reports, which highlights product capabilities in various sectors.

Approach for Technology evaluation and selection

Tools can be evaluated by comparing them across various properties, categories and functionalities.

Some of the functionalities relevant to a Data Catalog are as follows.

  • Domain discovery
  • Data profiling
  • Semantic search
  • Column similarity Identification
  • Data classification of structured, semi-structured and unstructured data
  • Intelligent recommendation of other datasets similar to those users are working on
  • Questions & Answers
  • Alerts & Notifications
  • Metadata Repository & Metadata Search
  • Data Lineage
  • Data Lineage with Java, COBOL and SQL code.
  • Impact analysis
  • Holistic relationship view
  • Logical, Physical Data Model Integration
  • Business Glossary association with technical Metadata.
  • Searching Technical assets using Business Glossary
  • Data Governance Integration
  • Access to the metadata knowledge graph through REST APIs
  • Export Metadata.
  • Knowledge Graph
  • Knowledge Graph acts as Data virtualization
  • API connectivity: Documents, Big Data, Cloud Platforms, Applications, ETL tools, BI and Databases, etc.
  • Data Catalog self services based on Data Domains, Searches
  • Catalog customizations: Ratings, Rankings and Reviews
  • Collaboration with other users.
  • Manage Data Access, Track user behavior and Guide the Analysts
  • Catalog as Cloud Native SaaS offering
  • Built-in query editor, coupled with the chat and comment streams that accompany each data asset
  • Ease of deployment
  • Quality of end user Training
  • Timeliness of vendor response
  • Quality of Technical support
  • Quality of peer user community
  • Solution Vision
  • Advanced Features and Focus Areas
  • Planned Enhancements
  • Execution Roadmap
  • Innovation Roadmap
  • Partner ecosystem
  • Ability to understand needs
  • Product Revenue
  • Number of customers
  • Pricing Flexibility
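
One common way to compare tools against criteria like those above is a weighted scoring matrix. The sketch below is an assumption about how such a comparison could be run, not the method of any particular analyst report: the weights, the 1 to 5 scores and the tool names are hypothetical placeholders.

```python
# Hypothetical weights (percent of total importance) for a few of the criteria.
WEIGHTS = {
    "data_lineage": 40,
    "semantic_search": 30,
    "ease_of_deployment": 20,
    "pricing_flexibility": 10,
}

# Hypothetical scores on a 1-5 scale; "Tool A" and "Tool B" are placeholders,
# not real products.
SCORES = {
    "Tool A": {"data_lineage": 4, "semantic_search": 3, "ease_of_deployment": 5, "pricing_flexibility": 2},
    "Tool B": {"data_lineage": 5, "semantic_search": 4, "ease_of_deployment": 2, "pricing_flexibility": 3},
}


def weighted_score(scores, weights):
    """Weighted average of one tool's scores; weights need not sum to 100."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank_tools(all_scores, weights):
    """Return (tool, score) pairs sorted from best to worst."""
    ranked = [
        (name, round(weighted_score(scores, weights), 2))
        for name, scores in all_scores.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_tools(SCORES, WEIGHTS):
        print(name, score)
```

The weights make the organization's priorities explicit, so a tool that is easy to deploy does not outrank one with stronger lineage support unless deployment effort really matters more.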

Outcome or Business benefits envisaged

  • Helps improve your data cost-to-value ratio, spurs collaboration and creativity through better data access, and solidifies your data-driven culture.
  • Provides the trustworthy data sources required for Data Governance and Data Privacy.
  • Supports a scalable and flexible data infrastructure.
  • Provides organized insight into data by breaking down silos among users and creating a central Data Domain.
  • Ensures stronger Data Governance and Data Stewardship.
  • Monitors the data effectively.
  • Acts as a distribution hub by sharing datasets among users based on role-based permissions.
  • Enables organizations to discover, govern and monetize the data.
  • Makes data flexible, visible, usable, shareable and secure.

Coforge value proposition or USP

Coforge has the following value proposition to implement Data Catalog.

Conclusion

In summary, a Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users find the data they need, serves as an inventory of available data, and provides information to evaluate the fitness of data for its intended uses.

Coforge has the end-to-end experience to implement Data Catalog, showcase the results as per customer needs, and guide the customer in identifying the best ways to implement it.