Data catalogs


Every day, employees create new datasets or obtain them from external sources. The amount of data is steadily increasing, and so is the number of formats in which it is available. Searching for and retrieving data thus becomes a real challenge. For example, most people know the situation of failing to find a market study or sales report where it is supposed to be stored in the company’s information systems. This is often followed by wasting valuable time looking for the right person to ask for help. Some kind of Google for searching and retrieving datasets would help enormously in this situation.

For any company that wants to make use of big data and data analytics, it is key to make datasets available across the entire organization in a transparent fashion. By cataloging data and empowering users to retrieve it, enterprises can build up this capability.

Data catalogs as a crucial element of future-proof data management

Establishing a data catalog is an important milestone for a company on its journey to becoming a data-driven organization. From our point of view, a data catalog supports three basic principles of future-proof data management:

  • data democratization: Data must comply with the FAIR principles; i.e., it has to be
    • Findable,
    • Accessible,
    • Interoperable, and
    • Reusable.
  • data citizenship: Employees who use or edit data become data citizens. As such, they are granted certain rights (above all, convenient access to all corporate data they need for their daily routines). But data citizens also have duties, as they are expected to update and maintain the data they use.
  • data sharing culture: Data silos must be broken up. Data does not belong to a person or function, but must be shared to unfold its value.

But what exactly is a data catalog? And how did it evolve from earlier concepts? To answer these questions, it is useful to take a brief look at how data management has developed over time.

From data dictionaries to data catalogs

Since the 1970s, companies have established a number of concepts for documenting data and making it available in a structured manner. In the beginning, companies mainly used system-specific data dictionaries, which served essentially as technical documentation of database tables. But system landscapes became more and more complex over time.

It became necessary to plan the data architecture as well as the integration of data between systems. This implied linking system-specific data documentation to generic logical models. At the same time, it became increasingly important to analyze business data to facilitate decision-making. Companies therefore established data warehouses for central storage of business data. These data warehouses documented business data with regard to various dimensions in metadata repositories.

In addition, business glossaries were used to define the semantics and allow users to correctly interpret data. In recent years, companies have started to view data as a strategic asset. To treat data accordingly, they began to systematically parse, index and catalog data. So the concept of the data catalog was born.

What functionality should a data catalog offer? The basic functionality of a data catalog is very similar to that of a traditional library …

Data catalog as a digital library

A data catalog is a digital medium for users to search for and retrieve datasets. Organizing a data catalog can be compared to working in a library, where a librarian takes care of things and catalogs the books. In the case of a data catalog, someone is responsible for describing, maintaining, updating and extending the datasets. It is also possible to define roles, responsibilities, and even workflows (e.g., to manage data quality). In this regard, a data catalog supports certain aspects of data governance.

A data catalog should not just offer a search function. Functions allowing users to efficiently check datasets and find out whether they are relevant for their search are equally important. Once the user has found the right set of data, she or he may view it or download it from the catalog.

Similar to borrowing books from a library, each request and download of a dataset is registered by the system (e.g., by means of a shopping cart function, as it is typically used by online shops). 
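The library analogy can be made concrete with a small sketch. The following Python example is only an illustration under our own assumptions: the names `Dataset`, `Catalog`, `search`, and `request`, as well as the sample entries, are hypothetical and do not refer to any particular product. It shows a catalog that holds descriptive metadata, offers a simple keyword search, and registers each request for a dataset.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Dataset:
    """Metadata describing one dataset (all field names are illustrative)."""
    name: str
    description: str
    steward: str  # person responsible for maintaining the entry
    tags: set[str] = field(default_factory=set)


class Catalog:
    """Hypothetical in-memory catalog: register, search, and log requests."""

    def __init__(self) -> None:
        self._datasets: dict[str, Dataset] = {}
        # Each log entry is (timestamp, user, dataset name).
        self.request_log: list[tuple[datetime, str, str]] = []

    def register(self, dataset: Dataset) -> None:
        self._datasets[dataset.name] = dataset

    def search(self, term: str) -> list[Dataset]:
        """Case-insensitive substring match on name, description, and tags."""
        needle = term.lower()
        return [
            d for d in self._datasets.values()
            if needle in d.name.lower()
            or needle in d.description.lower()
            or any(needle in t.lower() for t in d.tags)
        ]

    def request(self, user: str, name: str) -> Dataset:
        """Return a dataset's entry and record who requested it and when."""
        dataset = self._datasets[name]
        self.request_log.append((datetime.now(timezone.utc), user, name))
        return dataset


catalog = Catalog()
catalog.register(Dataset("sales_report_2023", "Quarterly sales figures",
                         "alice", {"sales", "finance"}))
catalog.register(Dataset("market_study_emea", "External market study for EMEA",
                         "bob", {"market"}))

hits = catalog.search("sales")
requested = catalog.request("carol", hits[0].name)
```

A real catalog would index far richer metadata and persist the request log; the sketch is meant only to show how search and request registration relate to each other.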


Unlike a library, where usually anybody can borrow any book, a data catalog allows access restrictions to be specified for individual datasets. For example, access to personal data may be granted to certain groups of users only.

Another function that goes beyond the concept of a traditional library is the ability to rate and comment on datasets. The data catalog thereby turns into a collaborative platform for users to provide each other with information and make the search for the right dataset more efficient.
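The two functions that go beyond the library, group-based access restrictions and ratings with comments, might be sketched as follows. This is a minimal illustration, not a description of how any vendor implements them; `CatalogEntry`, `allowed_groups`, the 1 to 5 star scale, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry with access groups, ratings, and comments."""
    name: str
    # Assumption: an empty set means the dataset is open to everybody.
    allowed_groups: set[str] = field(default_factory=set)
    ratings: list[int] = field(default_factory=list)
    comments: list[tuple[str, str]] = field(default_factory=list)

    def can_access(self, user_groups: set[str]) -> bool:
        """Grant access if unrestricted or if the user shares a group."""
        return not self.allowed_groups or bool(self.allowed_groups & user_groups)

    def rate(self, user: str, stars: int, comment: str = "") -> None:
        """Store a rating (1 to 5 stars) and, if given, a comment."""
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self.ratings.append(stars)
        if comment:
            self.comments.append((user, comment))

    def average_rating(self) -> Optional[float]:
        """Mean of all ratings, or None if nobody has rated yet."""
        return mean(self.ratings) if self.ratings else None


# Personal data restricted to one group; the sales report is open to all.
hr = CatalogEntry("employee_records", allowed_groups={"hr"})
sales = CatalogEntry("sales_report_2023")

sales.rate("carol", 4, "Complete, but Q4 is provisional")
sales.rate("dave", 5)
```

In practice, access rules are usually managed by a central identity system rather than stored on each entry; keeping them on the entry here simply keeps the example self-contained.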

So, what data catalog solutions are available on the market? And how can they be distinguished from each other? We have identified four different types of providers.

Data catalog software solutions

A large variety of solutions is available on the market. They typically focus on one of the following aspects:

  • generalist: business-oriented data catalogs that support data governance and help manage data as an asset (e.g., Collibra);
  • analytics enabler: data catalogs that support data scientists and data engineers in analyzing data in data lakes (e.g., Alation);
  • system expander: data catalogs which are embedded in the vendor’s own system landscape and which can be either of the two previous types (e.g., Microsoft Azure);
  • external data provider: data marketplaces offering data for a fee or free of charge (e.g.,

The CC CDQ will be monitoring the data catalog market and various use cases in the upcoming months. An update on this topic will be presented in the course of our workshop in June.
