Data Catalog


Data Catalog Reference Model and Market Study published

Every day, employees create new datasets or obtain them from external sources. The amount of data is steadily increasing, and so is the number of formats in which data is available. Searching for data and retrieving it thus becomes a real challenge. For example, almost everybody has failed to find a market study or sales report in the place where it is supposed to be stored in the company’s information systems. This is often followed by valuable time wasted on finding the right person to ask for help. Some kind of Google for searching and retrieving datasets would help enormously in this situation.

For any company that wants to make use of big data and data analytics it is key to make datasets available across the entire organization and in a transparent fashion. By cataloging data and empowering users to retrieve it, enterprises can build up this capability.

Data Catalogs as an integrated platform matching data supply and demand

The idea of a Data Catalog seems intuitive. It can be compared to a traditional library or, more broadly, considered as a platform that matches data supply and demand. In reality, however, companies find it challenging to define the scope and the users of their Data Catalog. Uncertainty also arises from the fact that the market for Data Catalog solutions is still very dynamic: a growing number of vendors offer Data Catalog tools of differing scope and functionality.

CC CDQ research results on Data Catalogs – Reference model and market study

Data Catalogs were a focus of the research activities of the Competence Center Corporate Data Quality (CC CDQ) in 2018. This research resulted in a reference model and a market study, both developed in collaboration with researchers from Fraunhofer ISST and a large number of data management experts. They help companies assess the Data Catalog solutions available on the market and choose the one most suitable for their specific needs. The report is available to CC CDQ members in the CC CDQ knowledge base.

A few key insights from the report are presented below:

Since the 1970s, companies have established a number of concepts for documenting data and making it available in a structured manner. At the beginning, companies mainly used system-specific data dictionaries, which essentially served as technical documentation of database tables. But system landscapes became more and more complex over time.

It became necessary to plan the data architecture as well as the integration of data between systems. This implied linking system-specific data documentations to generic logical models. At the same time, it became increasingly important to analyze business data to facilitate decision-making. Companies therefore established data warehouses for central storage of business data. These data warehouses documented business data with regard to various dimensions in metadata repositories.

In addition, business glossaries were used to define the semantics and allow users to correctly interpret data. In recent years, companies have started to view data as a strategic asset. To treat data accordingly, they began to systematically parse, index and catalog data. So the concept of the Data Catalog was born.

The concept of the Data Catalog can be illustrated with the help of concepts known from library science (see Table 1). In a traditional library, a librarian takes care of books and other physical media. Cataloguing the items is a central task and associated with documenting essential information in a structured manner so that users can find, access and borrow the books, DVDs, etc. they are interested in.

Just as a librarian is responsible for cataloguing and organizing books and other media, a Data Catalog needs a similar role for describing, maintaining, updating and extending datasets. This also requires ways to define roles and responsibilities, and even to manage workflows (e.g. to manage data quality or to understand usage). In this way, a Data Catalog supports certain aspects of data governance.

Table 1: Comparison of a Library and a Data Catalog

From the user’s perspective, a Data Catalog should offer not just a search function, but also functions that allow users to quickly inspect datasets and find out whether they are relevant to their search. Once users have found the right dataset, they can get more information about it and where it is stored; alternatively, the data can be downloaded directly from the catalog. Similar to borrowing books from a library, each request and download of a dataset is registered by the system (e.g., by means of a shopping cart function, as typically used by online shops).

Just as a library usually lends books only to authorized persons, a Data Catalog allows access restrictions to be specified for certain datasets (for example, access to personal data can be granted to authorized groups of users only). A function that goes beyond the concept of a traditional library is rating and commenting on datasets. The Data Catalog thereby turns into a collaborative platform on which users provide each other with information and make the search for the right dataset more efficient.
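The user-facing functions described above (search, registered requests, access restrictions, ratings and comments) can be illustrated with a minimal Python sketch. It is not taken from the report or from any Data Catalog product: the names Dataset, DataCatalog, allowed_groups and request_log, as well as the rule that an empty group set means "unrestricted", are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Dataset:
    """One catalog entry (hypothetical structure, for illustration)."""
    name: str
    description: str
    location: str
    # Groups allowed to access the dataset; an empty set means unrestricted.
    allowed_groups: Set[str] = field(default_factory=set)
    ratings: List[int] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)


class DataCatalog:
    def __init__(self) -> None:
        self.datasets: Dict[str, Dataset] = {}
        # Every successful request is registered as (user, dataset name).
        self.request_log: List[Tuple[str, str]] = []

    def register(self, dataset: Dataset) -> None:
        """Data supply: add a dataset to the inventory."""
        self.datasets[dataset.name] = dataset

    def search(self, term: str) -> List[str]:
        """Data demand: case-insensitive match on name and description."""
        term = term.lower()
        return sorted(
            name for name, d in self.datasets.items()
            if term in name.lower() or term in d.description.lower()
        )

    def request(self, user: str, groups: Set[str], name: str) -> str:
        """Check access, register the request, return the storage location."""
        dataset = self.datasets[name]
        if dataset.allowed_groups and not (dataset.allowed_groups & groups):
            raise PermissionError(f"{user} may not access {name}")
        self.request_log.append((user, name))
        return dataset.location

    def rate(self, name: str, stars: int, comment: str = "") -> None:
        """Collaboration: store a rating from 1 to 5 and an optional comment."""
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        dataset = self.datasets[name]
        dataset.ratings.append(stars)
        if comment:
            dataset.comments.append(comment)

    def average_rating(self, name: str) -> Optional[float]:
        """Return the mean rating, or None if nobody has rated the dataset."""
        ratings = self.datasets[name].ratings
        return sum(ratings) / len(ratings) if ratings else None
```

In this sketch, a dataset containing personal data would be registered with allowed_groups={"hr"}, so that a request from a user outside that group raises PermissionError and is not added to the request log.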

From reviewing academic and practitioner publications, we derive the following definition:

"A Data Catalog is an integrated platform for data curation, bringing data supply and demand together. It offers users functions to register data; to retrieve and use data; and to assess and analyze data. A Data Catalog therefore should provide a data inventory (for data supply) and features for data discovery (for data demand) as key components. Additional features should support data governance, data assessment, and data analytics, alongside with appropriate features for catalog administration and data collaboration."

Data Catalog as an Integrated Platform for Bringing Data Supply and Demand Together

A main research outcome is the Data Catalog reference model. It was developed based on experience from five Data Catalog projects and integrates feedback from the extended CC CDQ network. The reference model comprises three components:

  1. The role model describes eight user groups and use cases to be supported by a Data Catalog; they include data manager and Chief Data Officer, data citizen, data owner, data analyst, data protection officer, data steward, data architect and solution architect.

  2. The functional model describes the main functional building blocks to be provided by a Data Catalog. It comprises basic function groups (Administration, Data Inventory, Data Collaboration, Data Governance, Data Assessment, Data Discovery, and Data Analytics), plus a set of cross-sectional functions that can be applied across these functional areas.

  3. The information model describes how data should be documented in a Data Catalog. It is currently still under development and will be available in the next version of the report.
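One way to put the functional model to work when scoping a project or comparing tools is to treat its seven function groups as a checklist. The following Python sketch assumes exactly that; the helper names missing_groups and rank_tools are hypothetical, and the simple "count the gaps" ranking is an illustrative assumption, not the assessment method used in the report.

```python
from enum import Enum


class FunctionGroup(Enum):
    """The seven basic function groups named in the functional model."""
    ADMINISTRATION = "Administration"
    DATA_INVENTORY = "Data Inventory"
    DATA_COLLABORATION = "Data Collaboration"
    DATA_GOVERNANCE = "Data Governance"
    DATA_ASSESSMENT = "Data Assessment"
    DATA_DISCOVERY = "Data Discovery"
    DATA_ANALYTICS = "Data Analytics"


def missing_groups(required, offered):
    """Return the required function groups a tool does not offer, in enum order."""
    return [g for g in FunctionGroup if g in required and g not in offered]


def rank_tools(required, tools):
    """Sort tool names by the number of required groups they miss (fewest first).

    `tools` maps a tool name to the set of function groups it offers.
    Ties are broken alphabetically by tool name.
    """
    return sorted(
        tools,
        key=lambda name: (len(missing_groups(required, tools[name])), name),
    )
```

A project team would first agree on the set of required function groups and then pass in what each shortlisted tool offers, for example rank_tools(required, {"Tool A": {...}, "Tool B": {...}}), where the tool names and their coverage are placeholders to be filled in from the team's own evaluation.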

Overall, 29 Data Catalog solutions have been identified, and 15 of them are part of the detailed market analysis in the report. We found that a broad variety of tools are marketed under the term “Data Catalog”, addressing diverse functional areas and user requirements. Basically, two groups of solutions can be distinguished:

  • On the one hand, there are tools that primarily target the management of data lakes and/or the management of use cases (particularly those of data analysts). These tools take advantage of machine learning technology to build up a data inventory by scanning, collecting and describing data in a highly automated fashion. In addition, these tools offer analytics functions to support the management of data lake environments. Solutions in this category are, for example, Anzo Smart Data Lake 4.0 (Cambridge Semantics), Enterprise Data Catalog (Informatica), Smart Data Catalog (Waterline), or Data Management Platform (Zaloni).

  • On the other hand, there are tools that focus on data collaboration and data governance. These tools primarily aim at supporting data management workflows. With these tools, the data inventory is not built up automatically, but through manual action on the part of the Data Catalog users. Solutions in this category are, for example, Adaptive Metadata Manager, Data Governance Center (Collibra), Information Value Management (Datum), InfoSphere Information Governance Catalog (IBM), or Information Steward (SAP).
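To make the difference between the two groups more tangible, the following Python sketch shows what building a data inventory by scanning can look like in its simplest form. It is a deliberately simplified, rule-based stand-in: it uses no machine learning, it only handles CSV files, and it does not reflect how any of the products named above work. The function name scan_csv_files and the fields of each inventory entry are assumptions for illustration.

```python
import csv
from pathlib import Path


def scan_csv_files(root):
    """Walk `root` recursively and return one inventory entry per CSV file.

    Each entry describes the file by name, location, column names
    (taken from the first row) and the number of data rows.
    """
    inventory = []
    for path in sorted(Path(root).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # empty file -> no columns
            row_count = sum(1 for _ in reader)  # remaining rows are data
        inventory.append({
            "name": path.stem,
            "location": str(path),
            "columns": header,
            "rows": row_count,
        })
    return inventory
```

In the second group of tools, entries like these would instead be created and maintained by the catalog users themselves, typically as part of a governed workflow.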

Recommendations and support for your Data Catalog projects

The Data Catalog reference model and market study are intended to support Data Catalog projects. First, you can use the reference model to define and scope a Data Catalog project. Second, the reference model and market study help you assess the Data Catalog solutions available on the market and choose the one most suitable for your specific needs.

Outlook – How to implement and succeed with a data catalog?

Data Catalogs will continue to be a focus topic of CC CDQ’s research activities in 2019. Since the market is evolving rapidly, we will continue our research and update the report in 2019. We will also work closely with CC CDQ companies on Data Catalog design and implementation, to capture and derive good practices.

Any questions about Data Catalogs or comments on our study? Just contact us!