Advanced analytics for more data-driven insights and self-learning systems
Data volumes are exploding, and companies are striving to use advanced analytics for more data-driven insights and self-learning systems. This means rethinking their enterprise analytics platforms (EAP) and the way data is managed. Enabling scalable data onboarding and analytics delivery processes with little human intervention but strong governance is key to extracting value from Big Data and Analytics successfully.
From Data Warehouses and Business Intelligence to Data Lakes
Data warehouses and business intelligence (BI) tools have been the cornerstone of existing analytics platforms and work well when the use of data is known from the beginning. Here, data is first cleaned and then loaded with a pre-defined schema (schema-on-write) to achieve a single version of the truth (SVOT). However, this approach does not serve the needs of modern data science, where data is examined in an explorative manner to develop advanced analytics models. In this case, data lake infrastructures are more suitable. Here, data is loaded without a pre-defined structure (schema-on-read) to enable multiple versions of the truth (MVOT). But data lakes are not only used to explore data and develop data science pipelines; they also serve to run these pipelines in production.
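The schema-on-write versus schema-on-read contrast can be sketched in a few lines of Python. This is a minimal illustration, not a real storage implementation: the record shapes, column names and schemas are invented for the example.

```python
import json

# Schema-on-write (data warehouse style): records are validated and cast
# into one fixed schema *before* they are stored; anything outside the
# schema is dropped or rejected at load time.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}

def load_schema_on_write(record: dict) -> dict:
    return {col: cast(record[col]) for col, cast in WAREHOUSE_SCHEMA.items()}

# Schema-on-read (data lake style): records are stored as-is, and each
# consumer applies its own schema only when the data is read.
def store_raw(record: dict) -> str:
    return json.dumps(record)  # persisted untouched, extra fields kept

def read_with_schema(raw: str, schema: dict) -> dict:
    record = json.loads(raw)
    return {col: cast(record[col]) for col, cast in schema.items()}

raw = store_raw({"order_id": "42", "amount": "9.90", "comment": "rush order"})
# The same stored record can serve multiple "versions of the truth":
finance_view = read_with_schema(raw, {"amount": float})
ops_view = read_with_schema(raw, {"order_id": int, "comment": str})
```

The warehouse path yields one conformed truth; the lake path defers interpretation, so two consumers can read the same raw record through different schemas.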
Enterprise Analytics Platform to support Information Supply Chains
A future-proof Enterprise Analytics Platform supports the delivery of different types of information products - from simple, pre-defined reports to ad-hoc analyses, real-time insights, data science development and data science production. These information products fulfil distinct analytics needs in an organization. Each information product can be associated with a specific information supply chain, i.e. the successive processing steps required to produce and deliver it in a scalable way.
Figure 1 provides an overview of a generic information supply chain, with six generic processing steps along the phases of data onboarding and information product delivery. Data onboarding includes data sourcing, preparation and storage. Information product delivery comprises the steps of data processing, access and usage.
Figure 1: Enterprise Analytics Platform © Competence Center Corporate Data Quality
For each step, the platform provides different components depending on the analytics needs:
- Sourcing: Comprises all activities required to access data from the data sources available to the company. Data sources can be internal, e.g. ERP systems, or external, e.g. social media. Depending on the source, data can be structured, e.g. business transactions, or unstructured, e.g. text documents.
- Preparation: Comprises all activities to prepare data before storing it. In the case of the data warehouse, data is cleaned after extraction and then loaded into a pre-defined structure. In the case of the data lake, data is extracted and loaded into the infrastructure as it is.
- Storage: In this phase all data is stored in a specific format and form depending on the storage medium and concept. The data warehouse stores data centrally with a pre-defined schema. In the data lake, data is physically distributed and stored without defining its schema upfront.
- Processing: Comprises various activities to transform data for a specific analytical purpose. In data warehouses, data is analyzed in batches. In data lakes, data can be analyzed in batches, but also in streams for real-time processing (so-called Lambda architecture). Data stored in the data warehouse is extracted, transformed and loaded into data marts to process data of a certain domain. In a data lake, data is analyzed in a distributed manner, with multiple compute nodes processing data in parallel.
- Access: Provides access to data in various forms, depending on the analytical purpose. This can be self-service analytics tools, pre-defined reports or dashboards, or dedicated environments for data science exploration and development.
- Usage: Data is used for different purposes. We can distinguish reporting, ad-hoc analysis, development, production and real-time insights.
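The two onboarding styles behind the preparation and processing steps above - cleaning before loading in the warehouse versus loading as-is in the lake and transforming later - can be sketched as follows. The record shapes and cleansing rule are invented for illustration.

```python
def extract(source):
    # Pull records from an operational system (here: just a list)
    return list(source)

# Warehouse path: extract, transform, then load (cleansing happens up front,
# so only conformed rows ever reach storage)
def etl_to_warehouse(source, warehouse):
    for rec in extract(source):
        if rec.get("amount") is None:  # cleansing rule applied before loading
            continue
        warehouse.append({"id": int(rec["id"]),
                          "amount": round(float(rec["amount"]), 2)})

# Lake path: extract and load as-is; transformation is deferred to the
# processing step of whichever pipeline later consumes the data
def el_to_lake(source, lake):
    lake.extend(extract(source))

source = [{"id": "1", "amount": "10.504"}, {"id": "2", "amount": None}]
warehouse, lake = [], []
etl_to_warehouse(source, warehouse)
el_to_lake(source, lake)
# The warehouse holds only the cleansed row; the lake keeps everything,
# including the record a later use case might still need.
```

This is also why the lake supports multiple versions of the truth: nothing is discarded at load time, so each downstream pipeline can apply its own transformations.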
Exemplary information supply chains
The following two examples illustrate different types of information supply chains – with differences in data onboarding and analytics delivery.
To democratize data and increase its use in daily decision making, companies provide self-service analytics tools, such as Tableau or Power BI, to their employees. With these tools, users can easily analyze and aggregate data without programming skills and visualize it in an interactive way. When it comes to data onboarding, master and transaction data is extracted from operational systems, transformed and loaded into a data warehouse in a unified format. The data warehouse holds data from various domains. To analyze data of interest, it first needs to be loaded into a data mart before it can be accessed with self-service analytics tools.
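The warehouse-to-data-mart step in this supply chain can be illustrated with an in-memory SQLite database. Table and column names are invented for the example; a real warehouse would use a dedicated database platform.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Central warehouse table holding cleansed transactions from several domains
con.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)",
                [("EMEA", "A", 100.0), ("EMEA", "B", 50.0), ("APAC", "A", 75.0)])

# Domain-specific data mart: a pre-aggregated slice of the warehouse that a
# self-service tool such as Tableau or Power BI would connect to
con.execute("""CREATE TABLE mart_emea AS
               SELECT product, SUM(amount) AS revenue
               FROM warehouse_sales
               WHERE region = 'EMEA'
               GROUP BY product""")

rows = con.execute("SELECT product, revenue FROM mart_emea ORDER BY product").fetchall()
# rows -> [('A', 100.0), ('B', 50.0)]
```

The mart narrows the warehouse to one domain and pre-aggregates it, which is what makes drag-and-drop analysis in self-service tools fast and safe for non-programmers.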
In order to develop advanced analytics use cases, data is provided to data scientists in dedicated environments, typically called data labs or sandboxes. In these environments, data scientists can use the tools they are most comfortable with and experiment with the provided data as they wish. For the purpose of the use case, data either needs to be newly onboarded or is already accessible on the data lake. Following this "pull principle" for data onboarding avoids loading data into the data lake that is never used. Within their dedicated environments, data scientists can explore and develop pipelines using the distributed infrastructure of the data lake in a scalable way.
Data management for Big Data and Analytics - where to start?
Data management is key to provide scalable enterprise analytics platforms and leverage their full potential. The Competence Center Corporate Data Quality (CC CDQ) currently works on extending the Data Excellence Model to cope with the new challenges (see table below):
© Competence Center Corporate Data Quality
The required actions to manage Big Data and Analytics comprise organizational as well as technical enablers. The CC CDQ has initiated a co-innovation group to elaborate on these enablers and to develop stable management concepts for the Enterprise Analytics Platform using a proven research methodology. In this close collaboration between researchers and practitioners, existing data management practices will be extended to ensure the scalability of analytics use cases. The organizational enablers will be addressed with a process model, a role model and data ownership principles. The technical enablers comprise an analytics product lifecycle and data catalog workflows.
Figure 2: Organizational and Technical Enablers © Competence Center Corporate Data Quality