Sourcing and Managing External Data

External Data in Data Management

External Data Often Remains an Untapped Resource

With increasing amounts of data available via the Internet or obtained from specialized data providers, external data is becoming more and more relevant. External data complements internal data and helps to improve advanced analytics, optimize business processes (e.g. with geolocation, weather, or traffic data), reduce internal data maintenance efforts, and create new services. However, despite its increasing relevance, external data remains an untapped resource for most companies.

What is External Data?

Although most companies have an intuitive understanding of external data, there is no common definition.

In practice, external data is often associated with specific debates like "open data", "linked open data", or "data marketplaces". The following definition has been developed by the CC CDQ and reflects the understanding of most companies.

Definition


External data refers to any type of data that has been captured, processed, and provided from outside the company.
(Krasikov, Eurich and Legner, 2020)

The Four Types of External Data: Open Data, Paid Data, Shared Data, and Social Media Data

Based on a review of current practices, we distinguish four relevant external data types: open data, paid data, shared data, and social media data. While all four types have a common feature of stemming from external data sources, they differ in provenance, access, costs, structure, and further dimensions.

Dimension | Open Data | Paid Data | Shared Data | Social Media
Provenance | Governments, NGOs, companies | Professional data providers | Companies' internal data, authoritative sources | User-generated content
Access | Open data platforms, meta platforms, direct links | Dedicated portals or software | Bilateral exchange or intermediary | Connection to official access points or web-crawling of social media platforms
Price and listing | Freely available | Provided at a cost (e.g. pay for use, subscriptions, freemium) | Fees can be charged by any intermediary | Freely available, subject to copyrights
Structure | Semi-structured, unstructured | Structured | Structured, semi-structured, unstructured | Unstructured

  • Open Data

    Open data can be defined as data that is freely available, and can be used as well as republished by everyone without restrictions from copyright or patents.

  • Paid Data

    Paid data is commercially available data, acquired directly from specialized data providers (or brokers) and data marketplaces, and offered at a certain cost.

  • Shared Data

    Shared data refers to the data which is shared between companies within business ecosystems (for instance within the CDQ Data Sharing Community). 

  • Social Media

    Social media data refers to the data shared by users of social media platforms (e.g. Facebook, LinkedIn, Twitter), including metadata (e.g. location, time, language).

Open Data

Open data can be defined as “data that is freely available, and can be used as well as republished by everyone without restrictions from copyright or patents” (Braunschweig et al., 2012).

As of April 2020, there were about 4,000 open data sources worldwide, among them the European Open Data Portal.

The variety of themes and topics covered by open datasets lays the ground for many usage scenarios in the business context. For instance:

  • Demographic and economic statistics improve marketing and customer targeting analytics
  • Multiple codes and standards (HS codes, dangerous goods, GTINs, ISO country codes, etc.) may be used to enrich a company's existing data
  • Data from official corporate registers can help to improve the business partner data quality by removing duplicates or adding new entries
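
The enrichment and de-duplication scenarios above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the record fields (`id`, `country_code`, `register_id`), the function names, and the three-entry country table are assumptions made for the example. A real setup would load the full ISO 3166-1 list and actual register identifiers from the respective open data source.

```python
# Hypothetical excerpt of an open reference list (ISO 3166-1 alpha-2 codes).
ISO_COUNTRY_NAMES = {"CH": "Switzerland", "DE": "Germany", "FR": "France"}


def enrich_with_country_name(records, reference=ISO_COUNTRY_NAMES):
    """Normalize the country code and add the official country name.

    Records whose code is not in the reference list get country_name=None,
    which flags them for manual review.
    """
    enriched = []
    for record in records:
        code = str(record.get("country_code", "")).strip().upper()
        enriched.append(
            {**record, "country_code": code, "country_name": reference.get(code)}
        )
    return enriched


def find_duplicate_register_ids(records):
    """Group record ids by register id and return only ids used more than once."""
    ids_by_register = {}
    for record in records:
        register_id = record.get("register_id")
        if register_id:
            ids_by_register.setdefault(register_id, []).append(record["id"])
    return {rid: ids for rid, ids in ids_by_register.items() if len(ids) > 1}


if __name__ == "__main__":
    # Illustrative business partner records (all values are made up).
    partners = [
        {"id": 1, "name": "Example AG", "country_code": " ch ", "register_id": "REG-001"},
        {"id": 2, "name": "Example Ltd", "country_code": "DE", "register_id": "REG-001"},
        {"id": 3, "name": "Sample SA", "country_code": "XX", "register_id": "REG-002"},
    ]
    print(enrich_with_country_name(partners))
    print(find_duplicate_register_ids(partners))
```

Matching on a register identifier is only one possible duplicate criterion. In practice, fuzzy matching on names and addresses is often needed as well.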

An exhaustive overview of how open data can be used in the business setting is available in the CC CDQ working report “Open data use cases” (Krasikov et al., 2019).

Paid Data

Paid data is commercially available data, acquired directly from specialized data providers (or brokers) and data marketplaces, and offered at a certain cost.

One of the typical providers of paid data is Dun & Bradstreet (D&B) with their D&B master data solutions, which offer commercial data, analytics, and insights for business. Other popular providers are Nielsen for market research data or Reuters for financial data.

Given that paid external data is provided at a cost, it is in the provider’s best interest to deliver high-quality data and to provide an exhaustive description of it.

Typically, paid data is provided as structured information, or even delivered with extracted knowledge directly to the end user.

Shared Data

This external data type refers to the data which is shared between companies within business ecosystems (for instance within the CDQ Data Sharing Community or industry platforms, such as Skywise or GDSN).

Within a protected environment, companies can share their internal data with their business partners and benefit from community efforts, e.g. in terms of maintenance, completeness, and up-to-dateness. Examples of sharing and exchange environments include:

  • CDQ Data Sharing Community, in which leading multinational companies not only enrich their own data with public sources, but also validate their data against the records of their business partners. This concept goes beyond the simple exchange of data and also offers the opportunity to share common knowledge, such as business rules
  • Global Data Synchronization Network (GDSN) provided by GS1 as global data pool for retail and consumer goods industries
  • Skywise, a data platform, initiated by Airbus and Palantir Technologies, connects the aviation value chain and includes over a hundred airlines worldwide, as well as suppliers

While sharing data can yield multiple benefits, establishing the infrastructure for the data exchange may be costly for the intermediary, which can result in fees for the users.

Nonetheless, transparency within the community, the joint interest in maintaining high data quality, and the trustworthiness of the sources make this type of external data particularly valuable.

Social Media

Social media data refers to the data shared by users of social media platforms, including the metadata (e.g. location, time, language, biographical data).

Availability of this data varies significantly from one social media platform (e.g. Facebook, LinkedIn, Twitter) to another, primarily because of the unstructured or semi-structured format.

From an enterprise perspective, social media is an invaluable source of information. For example:

  • Personalized advertising can be based on users’ previous social media postings and preferences
  • Hashtags, which are widely used across multiple social media platforms, allow analyzing trends and are an important source for market analytics
  • Contact information provided by the users of social media platforms might be considered a source of up-to-date information to validate a company’s internal records
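
As a rough illustration of the hashtag-based trend analysis mentioned above, the following Python sketch counts hashtags across a batch of post texts. It assumes the posts have already been collected as plain strings. The function name and the simple `#\w+` pattern are choices made for this example and do not reflect any platform's official API or hashtag rules.

```python
import re
from collections import Counter

# Simplified pattern: a '#' followed by letters, digits, or underscores.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def count_hashtags(posts):
    """Return a Counter of lower-cased hashtags found in the given post texts."""
    counts = Counter()
    for post in posts:
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(post))
    return counts


if __name__ == "__main__":
    # Illustrative posts (made up for the example).
    sample_posts = [
        "Trying the new #Espresso blend, love it #coffee",
        "Monday needs #coffee",
        "Tea time! #tea",
    ]
    # Show the two most frequent hashtags in the sample.
    print(count_hashtags(sample_posts).most_common(2))
```

Counting is only the first step. Tracking these counts over time windows is what turns them into a trend signal.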

One of the major challenges is the extraction of information from social media platforms. Because the content appears very fast, monitoring or web-scraping software is necessary to ensure a constant flow of information. Even though no direct fees apply to extracting data from social media platforms, processing efforts might be costly.


What are Typical Challenges of Sourcing and Managing External Data?

According to recent studies, most companies:

  • Do not coordinate external data sourcing: They have no formal sourcing process for external data
  • Do not leverage the potential of external data sources: They have no role of a “data hunter”, specializing in discovering new alternative data sources
  • Find it challenging to identify, integrate, and maintain external data sources given their heterogeneity and lack of transparency (about semantics, quality, update cycles, etc.)

How to Use External Data: Examples of Usage of External Data in Business

External data can also be useful in the following situations:

  • Providing data-driven insights: Data analytics can be enhanced with external data in operational areas, like customer relationship management, HR, supply chain and warehousing. For example, a grocer who wants to improve the demand forecast with the help of external data can rely on the weather data, data from suppliers, and economic data
  • Improving business processes: many companies already use geolocation, weather and traffic data to plan and manage their deliveries; additional information about exceptional events, such as disasters, can help them avoid disruptions in the supply chain
  • Enhancing data management capabilities: sourcing external data reduces data maintenance efforts. It may be also used to enrich internal data and improve data quality
  • Enabling new services: external data is also used to innovate and introduce new products and services matching consumers' needs

Advanced analytics, ML and AI
Enriching with external data helps to improve the model quality of a company's analytics algorithms

Use example: a digital media company used external data to better predict and improve employee retention rates

Relevant sources: postings from job websites, economic and labor sources, social media data

Business processes
External data enhances a company's business processes

Use example: a logistics company uses external data to predict and avoid disruptions to clients' supply chains

Relevant sources: demographic data, social media data, geolocation data

Data management
Reduction of data maintenance, enrichment, and data quality improvement

Use example: maintenance of up-to-date reference data in order to fill in import/export documents (SAD) in a timely and precise manner

Relevant sources: commodity codes, tax tariffs, country codes, industry classification codes

Creation of new services
External data helps to innovate and introduce new types of services and business models

Use example: an agricultural giant launched a new service to help farmers predict and optimize crop yields

Relevant sources: geolocation data, weather data, IoT data
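
To make the analytics scenario more concrete, such as the grocer who combines sales with weather data, here is a minimal Python sketch. It joins internal daily sales with external weather observations by date and compares average sales on rainy and dry days. The field names (`date`, `units`, `rain_mm`), the function names, and the 1 mm threshold are assumptions for illustration only. A real demand forecast would use a proper model rather than two averages.

```python
from statistics import mean


def join_sales_with_weather(sales, weather):
    """Attach the rainfall of the same date to each sales record.

    Sales records without a matching weather observation are skipped.
    """
    weather_by_date = {obs["date"]: obs for obs in weather}
    joined = []
    for sale in sales:
        obs = weather_by_date.get(sale["date"])
        if obs is not None:
            joined.append({**sale, "rain_mm": obs["rain_mm"]})
    return joined


def average_units_by_rain(joined, threshold_mm=1.0):
    """Average units sold on rainy vs. dry days (None if a group is empty)."""
    rainy = [row["units"] for row in joined if row["rain_mm"] >= threshold_mm]
    dry = [row["units"] for row in joined if row["rain_mm"] < threshold_mm]
    return {
        "rainy": mean(rainy) if rainy else None,
        "dry": mean(dry) if dry else None,
    }


if __name__ == "__main__":
    # Illustrative internal sales and external weather data (made up).
    sales = [
        {"date": "2020-04-01", "units": 100},
        {"date": "2020-04-02", "units": 60},
        {"date": "2020-04-03", "units": 80},
    ]
    weather = [
        {"date": "2020-04-01", "rain_mm": 0.0},
        {"date": "2020-04-02", "rain_mm": 5.0},
    ]
    print(average_units_by_rain(join_sales_with_weather(sales, weather)))
```

Skipping unmatched dates keeps the sketch short. Depending on the use case, it may be preferable to keep those records and mark the missing weather values explicitly.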


Managing External Data is a Research Focus in the CC CDQ

The Competence Center Corporate Data Quality (CC CDQ) currently works on improving external data sourcing practices. The dedicated co-innovation group consists of data management experts representing over 10 global companies from different industries (manufacturing, IT, pharmaceutical, transportation, retail, packaging, and insurance).

How to Join this Initiative?

If you would like to be part of the Community of the Competence Center Corporate Data Quality, please contact Tobias Pentek.
