What is data preparation?
Data preparation, part of the data management process, involves collecting raw data from multiple sources and consolidating it into a file or database for analysis. Data preparation is an initial step in data warehousing, data mining, and machine learning projects.
The data preparation process includes the following activities:
- Data ingestion—copying or loading data from different data sources.
- Data fusion—integrating multiple data sources to create one consistent representation.
- Data cleansing—ensuring data is valid, complete, consistent, uniform, and accurate.
- Data augmentation—adding information that increases value—for example, enriching sales leads data with contact details.
Data preparation steps include:
- Accessing and cataloging data sources, tables, and fields
- Harmonizing, transforming and modeling data for analytics.
- Modifying, filtering, and transforming data to improve data quality.
Data fusion techniques
A key part of data preparation is data fusion—merging several data sources together to create a consolidated data set for analysis. Data fusion techniques include:
- Complementary data fusion—combining related information from different sources to obtain a complete picture. For example, combining product names and IDs with data about product attributes and prices.
- Redundant data fusion—combining the same information from different sources to obtain a higher level of confidence that the data is correct, or diagnose data quality.
- Cooperative data fusion—combining data sources into a target data set that is more complex than the original. For example, combining data about product models with data about sales of those products in the past 10 years.
What are data preparation tools?
Data preparation has traditionally been manual, time-consuming, and error-prone. Many data warehousing projects suffered from infrastructure and consistency issues that stemmed from imperfect data preparation.
Data preparation and data cleansing tools alleviate the problem, enabling data engineers, data scientists and business analysts to quickly explore, understand, and consolidate source data into a consistent format.
Data preparation tools are commonly offered as part of data mining, data integration, Extract-Transform-Load (ETL), or data management tools. Increasingly, data preparation tools support structured, unstructured, and semi-structured data, working with formats like XML, JSON, and text files.
Brief overviews of 5 leading data preparation tools
ClearStory Data, an automated analytics platform, provides data preparation as part of a full data discovery and visualization pipeline. Main data preparation features:
- Flexible modeling of source data using data stories.
- Fast data exploration using BI dashboards.
- Automated data inference that intelligently infers semantics in the data.
- Automated data harmonization that identifies relationships across all data dimensions.
- Automatic discovery of data values, patterns, and correlations across multiple sources.
Datawatch Monarch, a Windows application, accepts file or web pages and automatically extracts their data into analytics-ready rows and columns. It can also connect to databases and big data sources. Main data preparation features:
- Self service—built for non-technical business users.
- Automatic extraction—pulls data from files or web pages and converts it into rows and columns for analysis.
- Data sources—connects to relational databases, big data, and 30 other data sources.
- Workflows—provides automated user-defined workflows to eliminate manual steps.
- Data profiling—profiles and filters of data with visibility into quality issues.
Trifacta Wrangler helps clean and prepare messy data, leveraging machine learning algorithms to suggest common transformations and aggregations. It exports a file ready to be used for data visualization or machine learning projects. Main data preparation features:
- Works with messy data, including JSON, raw CSV, Excel files.
- Integrates with HDFS, Redshift, S3, and many other data sources.
- Automatically profiles data and provides a quick measurement of data quality—how many rows have missing, mismatched, or inconsistent values.
- Visually categorizes data by type.
- Automatically cleans and enriches data—makes intelligent suggestions such as splitting columns, adding, deleting, or changing values, with auto-complete.
- Creates and reuses data preparation recipes.
SAS Data Preparation
SAS Data Preparation, based on the Viya platform, helps you access, profile, cleanse, and transform data from an intuitive interface. Main data preparation features:
- Provides a visual interface for all operations, without requiring knowledge of code or SQL.
- Allows advanced users to customize SAS data quality routines or code in Python.
- Automates data preparation tasks with pre-built transformation and cleansing functions.
- Auto-generates code which users can share with IT to run during data updates.
- Enables collaboration on jobs and project activity with other team members.
- Provides high-performance, in-memory data processing.
Datameer is a platform that covers the entire data lifecycle, from data ingestion tooling to data exploration and consumption. Main data preparation features:
- Versatile—can process any data regardless of complexity—structured, semi-structured and unstructured.
- Spreadsheet interface—intuitive interface business analysts can easily use to transform and blend complex data.
- Visual profiles—visualizes data at every step of the data pipeline.
- Data transformation—270 pre-built functions, from deduplication and normalization to entity extraction, sessionization, and geo-location.
- Pattern recognition—uses algorithms to discover hidden patterns within data, such as behavioral groupings, relationships, and data quality recommendations.
Automated data management: making data preparation a thing of the past
Specialized data preparation tools have taken us a long way from the days of manual data wrangling with spreadsheets and SQL queries. While these tools automate many parts of the data preparation process, they still require humans to manually review data, understand its complexity, and piece together the required data set.
Today many organizations are moving away from the traditional Extract-Transform-Load (ETL) paradigm, in which you prepare data and then load it into a data warehouse. In the new way, called Extract-Load-Transform (ELT), raw data is loaded as-is to a data warehouse—typically deployed on the cloud with practically unlimited storage. Then data preparation and transformations are done automatically and on the fly.
Panoply is a cloud-based data warehouse automation platform that uses machine learning and natural language processing to take source data, automatically cleanse and prepare it for analysis and dashboarding.
Panoply performs takes care of the entire data pipeline—including data ingestion, automatic schema identification, query optimization, and fast query processing at petabyte scale. It is an automated data management platform that makes data preparation—as a separate stage in the data pipeline—a thing of the past.