Using Machine Learning for Improving Data Quality

Why Use Machine Learning to Improve Data Quality?

Growing data volumes put companies under pressure to systematically manage and control their data assets. At the same time, common data management practices do not scale well enough to handle ever-increasing data volumes. Companies, therefore, need to rethink their data management. The good news is that substantial progress in artificial intelligence (AI) and machine learning (ML), in terms of learning from data and automating repetitive tasks, can support you in your data management activities.

Machine learning has proven its potential in real-world business settings: with an ML-enabled data curation system, the curation costs for data cleansing, data transformation, and deduplication could be reduced by 90% (Stonebraker, Bruckner, and Ilyas 2013).

How Can Machine Learning Support our Data Management and Help us Improve our Data Quality?

To assess the role and potential of ML in data management, the Competence Center Corporate Data Quality (CC CDQ) collected and analyzed ML use cases from academic research, software vendors, and data management experts (Fadler & Legner, 2018). With our study, we aim to identify typical application scenarios that can help data managers find potential areas of application for ML in data management. So far, we have developed a taxonomy for classifying use cases and derived 11 typical application scenarios for machine learning in data management from 44 collected use cases.

Our study reveals that ML can be applied in all phases of the data life-cycle to achieve the following:

  • To create and enrich data assets in an efficient, user-friendly way
  • To maintain high-quality data by supporting proactive and reactive data maintenance as well as data unification
  • To manage the data life-cycle, especially when it comes to sensitive data and retiring data
  • To increase the use of data by making it easier for users, especially data scientists, to discover data

The following overview provides more details about the data life-cycle phases and ML application scenarios.

Data Management Activity: Acquire and Create Data
  • Role(s) Involved: Data Collector
  • Problem Areas: Typos; wrong/invalid data entries; blank fields; manual effort
  • Learning: Data entry patterns; data incidents; data extraction patterns; data creation patterns
  • ML Application Scenarios: ML-assisted data creation (e.g. auto-filling values in forms, automatic extraction of data) and data enrichment

Data Management Activity: Unify and Maintain Data
  • Role(s) Involved: Data Manager, Data Integrator
  • Problem Areas: Data integration across multiple systems (leading to inconsistencies); correction of data errors; definition of business rules
  • Learning: Data repairing patterns; association rules; outliers and anomalies; similarities
  • ML Application Scenarios: ML-assisted data maintenance (reactive: data correction; proactive: business rules) and data unification (matching and deduplication)

Data Management Activity: Protect and Retire Data
  • Role(s) Involved: Data Protection Officer, Data Manager
  • Problem Areas: Lack of transparency where personally identifiable information (PII) is stored; compliance with data protection regulations
  • Learning: PII identifiers; fraudulent data access behavior
  • ML Application Scenarios: AI-/ML-assisted data protection (e.g. identification of sensitive data, detection of fraudulent behavior) and data retirement (end of life)

Data Management Activity: Discover and Use Data
  • Role(s) Involved: Data Scientist
  • Problem Areas: Finding and cleaning relevant data; identification of data relationships
  • Learning: Data recommendations; linking of datasets
  • ML Application Scenarios: AI-/ML-assisted data discovery (e.g. recommendations, linking of datasets)
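
To make the data unification scenario (matching and deduplication) more tangible, here is a minimal Python sketch. It is only an illustration, not a method from the study: it uses plain string similarity from Python's standard library instead of a trained model, and the record names, the `find_duplicate_candidates` helper, and the 0.85 threshold are assumptions made up for this example. In an ML-assisted setup, such similarity scores would typically be one of several features a model learns from.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicate_candidates(records, threshold=0.85):
    """Return (index, index, score) for record pairs at least `threshold` similar."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


# Hypothetical business partner names, invented for this example.
records = [
    "Acme Industrial Supplies GmbH",
    "ACME Industrial Supplies  GmbH",  # differs only in case and spacing
    "Acme Industrial Suplies GmbH",    # contains a typo
    "Northwind Logistics AG",
]

candidates = find_duplicate_candidates(records)
for i, j, score in candidates:
    print(f"{records[i]!r} ~ {records[j]!r} (similarity {score})")
```

The threshold is a judgment call: a lower value finds more duplicate candidates but also produces more false matches that a data manager has to review.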

Although the use of ML in data management is still at an early stage, the first implementations are very promising! For instance, Bosch has been able to almost completely automate the manual process of commodity code assignment in product master data creation with the help of machine learning. With this approach, Bosch can meet the increasing demand for this assignment task across the enterprise with a scalable solution. You can find detailed information in Bosch's winning application for the CDQ Good Practice Award 2018.
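
As a rough illustration of how a code assignment task can be framed as text classification, consider the following Python sketch. It is not Bosch's implementation, whose details are not described here. It assumes a small multinomial naive Bayes classifier written with the standard library only, and the product descriptions and the `CODE-…` labels are hypothetical placeholders rather than real commodity codes.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # training texts per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = self.tokenize(text)
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = self.tokenize(text)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the smoothed log likelihood of every token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: product descriptions with placeholder codes.
texts = [
    "steel hex bolt m8",
    "stainless steel screw m6",
    "rubber sealing ring",
    "rubber gasket for pump",
    "copper cable 3 core",
    "insulated copper wire",
]
labels = [
    "CODE-FASTENERS", "CODE-FASTENERS",
    "CODE-SEALS", "CODE-SEALS",
    "CODE-CABLES", "CODE-CABLES",
]

model = NaiveBayesTextClassifier().fit(texts, labels)
print(model.predict("steel bolt m6"))
print(model.predict("rubber gasket"))
```

A production setup would need far more training data, and would usually route low-confidence predictions to a human for review instead of assigning them automatically.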

The bottom line of our analysis is that ML has the potential to significantly enhance data management practices and improve data quality. ML allows for managing data assets in an intelligent and more scalable way, but also disrupts the way data is managed.

How Can I Get Access to Further Research Results?

In order to get access to further information, we kindly ask you to evaluate the benefits and implementation status of the identified application scenarios. Answering our questions should not take longer than 10 minutes.

Benefits of participating in the survey:

  • You will get immediate access to a presentation with ML application scenarios and a summary of our current findings
  • You will receive the upcoming research paper "Managing Data as an Asset with the Help of Artificial Intelligence" in your inbox as soon as it is published
  • You will receive the results of this study

Start Survey now!

Managing Data as an Asset with the Help of AI

A recently published CC CDQ study investigates how companies can use artificial intelligence to draw business advantages from rapidly increasing amounts of data. Request the study now!

Any questions about Machine Learning or comments on our study?

Just contact us!

CDQ Suite for Collaborative Data Management

The CDQ Suite helps companies get rid of tedious data maintenance work and sustain high data quality. It delivers Data Quality as a Service (DQaaS) with zero maintenance for data-driven organizations. Improve your data quality now!
