Data Augmentation: Bringing New Life to Your Data

By Yaniv Leven | December 13, 2016 | Updated On: March 30, 2021 | Data Stack

If you recognize your data as an asset, then augmenting it simply means growing your business assets. With data augmentation, you can run manipulations on existing data, combine multiple sources from inside your business, and enrich them with data from outside it.

Once multiple internal and external data sources are connected through the cloud and modern data management solutions, users can generate insights that were traditionally locked away. In this article we discuss the challenges of data augmentation and suggest a number of practices that will help you address some of them.

The business value of data augmentation

Before diving into the challenges and practices, let’s look at a few cases that demonstrate the added value that data augmentation provides. The most typical example is using unstructured data such as email, smartphone calls, and appointment data to augment customer relationship management through data science and machine learning. Data augmentation can also generate value for operations by linking real-time status data of field personnel (field engineers in oil or telecommunication companies, for example) with historical data about events and site equipment.

In addition to internal business operations benefits, quality data can generate direct revenues. Organizations today not only use data to improve operational efficiency within their business, but also require their data engineers to ensure its quality in order to create new revenue streams.

To achieve this, data professionals need to build systems that can seamlessly access, link and correlate high volumes of new and existing data from various sources, and then find patterns and trends.

However, these tasks today are more challenging than ever, due to vast amounts of ever-changing data.

Data: Massive amounts, high velocity

The massive amounts of historical data in a data warehouse, together with the constant streaming of new data, pose an ongoing challenge to organizations that depend on up-to-date, readily available business intelligence systems. This is particularly challenging because the lines between traditional batch jobs (e.g., ETL) and real-time data integration are increasingly blurred.

Along with the endless amounts of new data, there is an increase in the heterogeneity of data types. Raw data, before processing, comes in multiple forms and includes a mixture of unstructured, semi-structured, structured, and archived data, of which only a subset is valuable. Only after the data has been transformed, indexed, and enriched does it become accessible and valuable for BI purposes. In addition to the amount and types of data, there is also the velocity of change. Incorporating data augmentation processes means coping with ever-changing environments and with business demands to analyze new sources of data.

These types of challenges increase the complexity of your data lifecycle management. For this reason, augmentation should be addressed by your data management system, which should make the life of a data engineer easier.

Best practices for data augmentation

Augmentation is one of the last stages in the management process of your data. It enhances the quality of your data after it has been monitored, profiled and integrated. Data augmentation techniques include those based on heuristics, tagging to create groups, data aggregation using statistics, or the probability of events.
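
To make two of these techniques more concrete, here is a minimal sketch of heuristic tagging and statistical aggregation. It is an illustration only: the records, the field names, the 10-minute threshold, and the helper functions `tag_record` and `aggregate_by_customer` are all hypothetical and not taken from any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer interaction records (illustrative data only).
records = [
    {"customer": "a", "channel": "email", "minutes": 4},
    {"customer": "a", "channel": "phone", "minutes": 12},
    {"customer": "b", "channel": "email", "minutes": 2},
    {"customer": "b", "channel": "email", "minutes": 3},
]


def tag_record(record, long_threshold=10):
    """Heuristic tagging: return a copy of the record with an 'engagement' group."""
    tag = "high" if record["minutes"] >= long_threshold else "low"
    return {**record, "engagement": tag}


def aggregate_by_customer(rows):
    """Statistical aggregation: per-customer count, mean duration, and email share."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer"]].append(row)

    summary = {}
    for customer, items in grouped.items():
        emails = sum(1 for r in items if r["channel"] == "email")
        summary[customer] = {
            "interactions": len(items),
            "mean_minutes": mean(r["minutes"] for r in items),
            # Empirical probability that a contact from this customer is an email.
            "email_share": emails / len(items),
        }
    return summary


tagged = [tag_record(r) for r in records]
summary = aggregate_by_customer(tagged)
```

The original records are left untouched; each technique produces new, derived fields, so an augmentation step like this can be rerun or revised without losing the source data.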

Below is a short list of best practices and recommendations to help you augment an existing data warehouse with new capabilities, with minimal disruption to ongoing operations.

Use a data explorer that supports formats such as JSON, CSV, or XML, and that can provide a basic view of the raw data, its format, and its values. Indexing and correlating the raw data using a tool such as IBM’s Watson Explorer can then help you identify relationships inside the data.
Keep data hierarchies, subject-oriented aggregates, and data dimensions in your DWH. In addition, federate data from your DWH with new data sources using data virtualization and management tools to extend existing data and schemas. Also make sure that you have the computing resources needed to maintain comprehensive clustering results, and the capability to run intensive analyses.
Use ELT (vs. ETL) technologies that allow you to load all your raw data, and only then transform and enrich it. ETL is useful for dealing with smaller subsets of data and moving them into the data warehouse. However, with the right ELT tool, all of your raw data can be instantly available while transformations take place asynchronously. You can run new transformations, and test and enhance queries directly on the raw data as required.
Use the cloud to store everything in your DWH, including your unstructured data, communications data such as customer feedback, Facebook and other social media data, phone logs, GPS data, photos, emails, and messaging.
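
The ELT recommendation above can be sketched in a few lines. This is a simplified illustration, assuming an in-memory SQLite database stands in for the data warehouse; the table names (`raw_calls`, `calls_per_caller`) and the sample phone-log rows are hypothetical. It shows only the ordering (load raw data first, transform afterwards inside the store) and does not model asynchronous execution.

```python
import sqlite3

# Hypothetical raw phone-log rows, kept as untyped text exactly as received.
raw_rows = [
    ("2021-03-01", "alice", "120"),
    ("2021-03-01", "bob", "n/a"),  # malformed duration still lands in the raw table
    ("2021-03-02", "alice", "45"),
]

conn = sqlite3.connect(":memory:")

# Extract + Load: land everything first, with no cleaning or typing.
conn.execute("CREATE TABLE raw_calls (call_date TEXT, caller TEXT, duration TEXT)")
conn.executemany("INSERT INTO raw_calls VALUES (?, ?, ?)", raw_rows)

# Transform: derive a typed, aggregated table later, inside the store.
# Only rows whose duration consists entirely of digits are used.
conn.execute(
    """
    CREATE TABLE calls_per_caller AS
    SELECT caller,
           COUNT(*) AS calls,
           SUM(CAST(duration AS INTEGER)) AS total_seconds
    FROM raw_calls
    WHERE duration <> '' AND duration NOT GLOB '*[^0-9]*'
    GROUP BY caller
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_calls").fetchone()[0]
per_caller = conn.execute(
    "SELECT caller, calls, total_seconds FROM calls_per_caller ORDER BY caller"
).fetchall()
```

Because the malformed row is still present in `raw_calls`, a later transformation can handle it differently (for example, by repairing it instead of filtering it out) without reloading anything from the source.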

No limits, no boundaries

Beyond the limitless cloud resources available to store and process vast amounts of data, today’s world of data has no boundaries.

Modern data processing avoids specific algorithms or thresholds where the expected result is a given. Instead, one should ask what the results will be, given specific inputs.

This can be seen when dealing with machine learning or neural network systems, as these complex modern systems are built to augment their own capabilities. By definition, these intelligent systems don’t follow a set of strict rules, and with self-augmentation they evolve to be a capable part of every software system.

The data layer should be no different.