What is data management?
Data management is a broad and ambiguous concept. The Global Data Management Community (DAMA International) defines it as “the development of architectures, policies, practices and procedures to manage the data lifecycle”. But when people say “data management”, what do they really mean? We suggest five possibilities:
- Cloud data management - the process of integrating data from an organization's ecosystem of cloud applications. The main distinction of cloud data management is that all data storage, intake, and processing takes place in a cloud-based storage medium.
- ETL and data integration - loading data from data sources into a data warehouse, transforming, summarizing and aggregating them into a format suitable for high in-depth analysis.
- Master data management - a method for managing critical organizational data: customers, accounts and parties named in business transactions, in a standardized way that prevents redundancy across the organization.
- Reference data management - defines permissible values that can be used by other data fields, such as postal codes, lists of countries, regions and cities, or product serial numbers. Reference data can be home-grown or externally provided.
- Data analytics and visualization - processing selected data from big data sources and data warehouses, performing advanced data analytics, and allowing analysts and data scientists to slice, dice, and present visualizations and dashboards.
With today’s massive quantities of data, high-quality tools are essential to achieving data management best practices. Organizations use data management tools from all five categories above, to manage and automate the data management process:
- Cloud Data Management tools - built on the cloud, for the cloud, these tools connect to and integrate multiple data sources via API’s, webhooks, or direct database connections.
- ETL tools - help organizations load data from multiple sources, define complex, automated transformations of the data, test the data pipeline, and load data continuously to a target database or data warehouse.
- Data Transformation tools - help with the transformation of raw data into clean, aggregated, analyzable data as it moves from individual data sources to an analytics warehouse--or within the analytics warehouse, at the point of analysis.
- Master Data Management (MDM) tools - help visualize complex sets of master data across the organization, and facilitate data stewardship by subject matter experts, who oversee creation and maintenance of reference data.
- Reference Data Management (RDM) tools - often provided as part of MDM suites, define business processes around reference data, and help stakeholders populate reference data and manage it over time.
- Data visualization and data analytics tools - help organizations explore, analyze and visualize big data sets, and generate reports and dashboards to extract insights and guide business decisions.
Below we cover several great tools from each of these categories, both to help you understand each category and to move closer to selecting the best data management tool for your needs.
Best Cloud Data Management tools
As storage and bandwidth have become cheaper, more and more off-premise solutions for data warehousing and management have become available. Companies with large amounts of data to store, sift through and analyze now routinely store and manage their data entirely in the cloud. This workflow has been made possible by the proliferation of cloud data management tools in the past 5-10 years. While the field has been led primarily by giants like Amazon and Google so far, many smaller companies now offer tools for customers with data needs of all sizes. See below for a list of potential options for cloud data management.
Panoply offers a cloud-native automated data warehouse that makes it easy to integrate and manage all your organization’s data. Key features:
- Large selection of native data connectors allows for easy, one-click data ingestion
- Automated data ingestion and preprocessing frees up IT resources
- Intuitive management dashboard takes the guesswork out of data management and budgeting
- Automated scaling and maintenance of multi node databases for low-maintenance data warehousing
- In-browser SQL editor for data analysis and querying
- Connections to common data visualization and analysis suites such as Tableau, Looker, Chartio
Panoply price: $325/month
2. Amazon Web Services
Amazon Web Services offers an ever-expanding set of tools that can be put together into an effective cloud data management stack. Key services include:
- Amazon S3 for temporary and/or intermediate storage
- Amazon Glacier for long-term backup and storage
- AWS Glue for building data catalogs to categorize, search and query your data
- Amazon Athena for SQL-based data analytics
- Amazon Redshift for data warehousing
- Amazon Quicksight for dashboard construction and data visualization
- Each service is spun up and billed separately, so costs depend on the extent of utilization
AWS Price: variable, dependent on implementation
3. Microsoft Azure
Microsoft’s Azure platform provides a number of different ways to set up a cloud-based data management system, as well as analytics tools that can be used on your Azure-stored data. Like AWS, Azure allows for multiple database/data warehouse styles with a great set of tools for managing them. Key services include:
- Standard SQL databases and VM-based SQL servers
- Blob storage
- NoSQL-style table storage options
- Private cloud deployments
- Azure Data Explorer (ADX) - a recently-added service that allows for real-time analysis of very large streaming data without the need for preprocessing
- Easy integration with Panoply for ELT/ETL services
Azure price: variable, dependent on implementation
4. Google Cloud
Like Amazon, Google’s Cloud platform offers a broad set of tools for cloud-based data management, as well as a workflow manager that can be used to tie the different components together. Key Google Cloud components:
- BigQuery for tabular data storage
- Cloud BigTable for NoSQL database-style storage
- Cloud Pub/Sub and Cloud Data Transfer for data intake; Google Cloud can also connect with a variety of other data sources
- BigQuery analytics for SQL-style queries
- ML Engine for more advanced analyses using machine learning and AI
- Data Studio for GUI-based analysis and dashboard construction
- Cloud Datalab for code-based data science
- Connections to common BI tools like Tableau, Looker, etc.
Google cloud price: variable, dependent on implementation
Best ETL and Data Integration Tools
For the comprehensive list of EL tools check out our list of top ETL tools.
1. Informatica Powercenter
Informatica Powercenter is an on-premise ETL tool with the following key features:
- Seamless connectivity and integration with all types of data sources using out-of-the-box connectors.
- Automated data validation - script-free automated audit and validation of data moved or transformed.
- Advanced data transformations - supports non-relational data, able to parse XML, JSON, PDF, Microsoft Office and IoT data.
- Metadata-driven management - provides graphical views of data flows, impact and lineage.
Informatica PowerCenter price: $2,000/month for the most basic plan
2. Stitch Data
Stitch Data is a cloud-based ETL platform with following key features:
- Pre-integrated with dozens of data sources on and off the cloud, moves data into Amazon Redshift, S3, BigQuery, Panoply, PostgreSQL, and more.
- Easy scheduling for data replication.
- Error handling and alerting with automated resolution when possible.
- API and JSON framework, letting you push data into a data warehouse programmatically.
- Managed cloud service with automatic scaling and enterprise-grade SLAs.
Stitch price: $100 - $1,000/month based on data size
Fivetran is a fully-managed data pipeline with a web interface that integrates data from SaaS services and databases into a single data warehouse. Key features:
Provides direct integration and sends data over a direct secure connection using a sophisticated caching layer.
Caching layer helps to move data from one point to another without ever storing a copy on the application server.
Fivetran does not impose any data limit.
Can be used to centralize a company’s data and integrate all sources to determine Key Performance Indicators (KPIs) across an entire organization.
Fivetran price: Available upon request
Blendo is another cloud-based ETL and data integration service, with the following key features:
- Self service - connects to numerous data sources with a few clicks, moves data to Amazon Redshift, Panoply, PostgreSQL, MS SQL Server, and more.
- Historical data - loads and synchronizes historical data from cloud services.
- Scheduled loading - load data periodically or at selected frequencies from different data sources.
- Data scheme optimization - automated collection, detection and preparation of data using optimal relational schema.
Blendo price: Available upon request
4. Microsoft SQL Server SSIS
Microsoft offers SSIS, a graphical interface for managing ETL using MS SQL Server. Key features include:
- Easy-to-use interface allows users to deploy integrated data warehousing solutions without having to get involved with writing much--or any--code.
- Graphical interface allows for easy drag-and-drop ETL for multiple data types and warehouse destinations, including non-MS DBs.
- Great solution for a team with a mix of technical skill levels, as it’s equally effective for ETL ninjas and point-and-click types alike.
SSIS price: $931 - $15,000
5. Azure Data Factory
In addition to SQL Server SSIS, Microsoft’s on-premise ETL solution, the company also offers Azure Data Factory (ADF), an ETL tool for their cloud-based Azure platform. Key features of ADF:
- ETL pipelines in ADF are built in a graphical interface, allowing for low-code use.
- Wide variety of data connectors for easy data ingestion--except, strangely, support for loading Microsoft Excel files.
- Full support for loading data into Azure data warehouses
Azure Data Factory price: $1 for 1,000 runs per month
Talend open source data integration software products provide software to integrate, cleanse, mask and profile data. Key features of Talend offerings include:
- GUI that enables managing a large number of source systems using standard connectors.
- Master Data Management (MDM) functionality
- Single, consistent and accurate view of key enterprise data.
Talend price: $1,170/user monthly or $12,000 annually
Alooma offers an enterprise-scale data integration platform with great ETL tools built in. Some key features of Alooma offerings:
- Strong focus on rapid pipeline construction, data quality monitoring and error handling.
- Protection against loss or corruption of data in a potentially error-prone ETL process.
- Flexibility to intervene and write your own scripts to monitor, clean and move your data as needed.
- Designed for enterprise-scale operations.
Alooma price: $1,000 - $15,000/month
Best Data Transformation Tools
Dataform is a SQL-based, fully managed data transformation platform for managing processes in your cloud data warehouse. Key features:
- Write SQL workflows as a team in a collaborative IDE. Built in version control and integration with Github.
- Write data quality tests and set alerts if they fail to ensure your data is always reliable.
- Create a centralized repository for data definitions across your company, document your data and discover datasets in a data catalog.
- Run schedules to ensure your data is always up to date.
Dataform price:: From $550/month (startup discounts available)
DBT (Data Build Tool) is a SQL-based data transformation tool that allows you to set up modular transformation flows from the command line. Built with an eye toward streamlining data analytics and engineering workflows, DBT’s key features include:
- SQL-based tool
- Easy workflows for building modular transformation and data modeling flows
- Automated data quality management with a robust testing system
- Streamlined analytics code deployment using environments, package management and continuous integration
- Easy data documentation with auto-generated DAGs and easy data annotation and cataloguing
- Data snapshotting for better historical data analysis
DBT price: $0 for free tier, $100/mo for basic, with quotes available for larger enterprise deployments
Originally developed at Airbnb, Airflow is a popular new open source data infrastructure tool. While it doesn’t do any of the data processing itself, Airflow can help you schedule, organize and monitor ETL processes using python. Key features:
- Directed Acyclic Graphs (DAGs) allow its scheduler to spread your tasks across an array of workers without requiring you to define precise parent-child relationships between data flows
- Handy web-based UI for managing and editing your DAGs
- Nice set of tools that makes it easy to perform “DAG surgery” from the command line.
- Highly extensible and scalable
Airflow price: free and open source
Luigi is an open source Python package developed by Spotify. It’s designed to make the management of long-running batch processes easier, so it can handle tasks that go far beyond the scope of ETL--but it does ETL pretty well, too. Key features:
- Python codebase makes it easy to build efficient data pipelines, especially ones that deal with long-running batch processes
- Easy dependency resolution
- Web interface that allows the user to visualize tasks for workflow management
- Conceptually similar to GNU Make, but isn’t only for Hadoop
- Atomic file system operations ensure that your pipelines will never crash with partial data
Luigi price: Free and open source
Best Master Data Management tools
1. Dell Boomi
Dell Boomi's Master Data Hub has the following key features:
- Defines models via low-code, visual experience.
- Deploys data models and identifies which source systems interact with them.
- Onboards system records into a consolidated repository, automatically merges similar records.
- Enables data stewarding - alerts teams to resolve duplicates and data entry issues.
- Governs data with real-time bidirectional process flows across silos.
Dell Boomi Master Data Hub price: $100,000 - $300,000
Profisee’s Master Data Management has the following key features:
- Stewardship and governance - enables “data stewards” within the organization to manage master data with feedback from analytics.
- Golden record management - standardizes, cleans and matches source data with no coding.
- Event management - detects data changes, distributes events to subscribing systems.
- Integrator - federates master data for global enterprises, with real time bi-directional integration.
- Enterprise workflow - enforces business processes cross-organization, lets administrators manage data steward performance.
- SDK - enables integration of custom applications.
Profisee price: $10.30/hour
3. SAP NetWeaver
SAP NetWeaver MDM, a component of the NetWeaver development platform, has the following key features:
- Automatically extracts master data from all major SAP applications.
- Loads master data from other sources.
- Integrates data using business content like repository structures, validation rules, inbound and outbound mappings.
- Distributes master data to targets.
- Enables programmatic data integration via APIs and web services.
SAP NetWeaver pricing: Available upon request
4. Semarchy xDM
Semarchy is a relative newcomer to the MDM scene, having been launched in 2011. Based out of France, it was founded by a team of former Sunopsis employees. xDM, their main MDM product, has the following features:
- Non-Hadoop dependent.
- Optimized for Oracle.
- Updates and changes are tracked and propagated using metadata, allowing for iterative, “evolutionary” data management.
- Features a step-by-step user interface that can be customized to specific business roles (i.e. analysts, operations, etc.).
Semarch xDM price: Available upon request
5. Tibco MDM
Tibco is a perennial placer in top MDM tool rankings. Key features include:
- Flexibility and scalability for multidomain applications.
- Independent, distributed master data layer.
- Customizable no-code visual interface that can be built to suit various business roles.
- Highly-rated data versioning support.
- SOAP Web services support.
Tibco MDM price: Available upon request
6. Ataccama ONE
Ataccama offers ONE, a highly automated data management tool that can be run on-premise, in the cloud, or in a hybrid setup. Key features include:
- Machine learning-based data curation, cleaning and classification.
- Automated metadata discovery.
- Automated project configuration.
- Domain agnostic, but comes pre-configured with pre-built rules for MDM for typical domains such as customer, contact and product.
- Connectors for wide range of data types and sources.
- GDPR ready.
- Full data audit history with detailed log of applied business rules and transformations.
Ataccama ONE price: Available upon request
7. Stibo STEP
Stibo has the distinction of being probably the oldest company on the list, having been founded in the 18th century as a printing company. The data management arm of the company is a younger addition, and STEP, its main data management tool offering, has the following key features:
- STEP Workbench UI.
- Holistic multidomain MDM platform.
- High level of automation.
- Automated data and language translation for multinational operations.
- Automated review and approval of digital assets.
- Enterprise-level master data management.
Stibo STEP price: Available upon request
Best Reference Data Management tools
Collibra’s Reference Data solution has the following key features:
- Automates workflows to create new codes and code sets.
- Delivers codes and code sets to users in a friendly way.
- Performs accurate data mapping to eliminate barriers to data access.
- Compares data from different parts of the organization.
Collibra price: Available upon request
Magnitude’s Reference Data Management has the following key features:
- Multi-domain modeling - supports business structures from code lists to multi-path, self-referencing hierarchies.
- Automation - provides automation, governance and control over reference data objects and load processes.
- Mapping - provides global to local, external to internal, and specific to general mapping with no disruption to existing elements.
- Governance - provides a customizable workflow to control business processes related to reference data, with model-based security controls allowing users to view, add or update.
- Time variance - enables users to change models, subjects, attributes and associations and retrieve any previous version of the object.
Magnitude Reference Data Management price: Available upon request
3. Informatica MDM Reference 360
Informatica’s MDM Reference 360 has the following key features:
- Fully cloud-based - improved performance and scalability.
- End-to-end platform - embedded data integration, data quality, process management.
- Self service - Master Data Management and workflows built for business users with no technical background.
- Match and merge - merges and cross-references data from new types and sources.
Informatica MDM Reference 360 price: Available upon request
4. Reltio Cloud
Reltio makes Reltio Cloud, a graph-based master data management tool that includes reference data management tools. Reltio is built on graph databases to give it maximum flexibility, both in scaling data stores and in defining clear relationships between the data in your repository. Other key features:
- Easy integration with existing MDM tools and/or other data sources.
- User friendly interface.
- Always-available cloud platform makes zero-downtime upgrades possible.
Reltio Cloud price: Available upon request
Best analytics and visualization tools
Tableau is a BI platform available both on the cloud and as downloadable software, with the following key features:
- Easily connects to data sources.
- Allows easy access to visualizations for teams, partners and clients.
- Enables unlimited data exploration with interactive dashboards.
- Creates “dashboard starters”, actionable dashboards setup in minutes with data from popular web applications.
- Creates interactive maps automatically.
Tableau price: $35-$70 per user per month.
Chartio is a cloud-based BI and visualization platform with the following key features:
- Interactive mode - drag and drop data to create, filter and share dashboards.
- SQL mode - communicate with databases in SQL to directly extract insights.
- Data layering - add successive transformation steps to data to transform query results.
- Visualizations and charts - instantly visualize data; Chartio recommends the most appropriate chart.
- Data blending and drill downs - combine disparate data sources on the fly and get actionable insights without exploring raw data.
Chartio price: $249/month for a team license.
3. Looker BI
Looker is another cloud-based analytics and visualization platform, with the following key features:
- Define metrics once using LookML, Looker’s simple data modeling language, and Looker writes SQL queries to answer any question on those metrics.
- Make data beautiful with easy-to-read dashboards that allow users to drill in and explore.
- Connect directly to databases, with no extracts or software to download.
- Open access to dashboards and reports to everyone, not just analysts or data scientists.
Looker price: $3,000 - $5,000 per month for 10 users.
Metabase offers a user-friendly, open source interface for connecting and analyzing your data. As a data visualization tool, it offers:
- Hundreds of native data connectors.
- Easy setup and access for your entire team.
- Large range of different data visualization options to choose from.
- White labeled embedding (premium embedding tier and up).
- Auditing and data permission management (enterprise tier and above).
Metabase price: Metabase is free and open source, so its free tier offers a range of features that will be suitable for most users. Enterprise plans can range from $3,000 - $10,000/year starting prices.
5. Microsoft Power BI
Power BI, Microsoft’s offering in the business analytics space, is designed to be useful for business analysts and data scientists alike. Main features:
- Simple drag-and-drop interface that is designed to be familiar to users of MS Excel.
- Power BI is a no-code platform, and offers both desktop and web clients.
- Library of pre-built connectors
- Performance tends to lag on very large data sets--look elsewhere for big data solutions
Microsoft Power BI price: $9.99 per user per month, or free for the basic version.
6. Mode Analytics
Mode Analytics offers a web-based data analytics suite aimed at data scientists and analysts, with a focus on collaboration and sharing. Some of Mode’s key features:
- Designed to make it especially easy to connect your data sources to their built-in SQL editor and visualization platform.
- Mode can pipe the results of your SQL queries directly into an R or Pandas dataframe in a Mode-native notebook.
- Share the data, analysis and resulting dashboard all through Mode.
- Mode is a good choice for a team of data scientists who want to be able to collaborate more easily.
Mode Analytics price: available upon request.
Towards automated data management
We covered five ways of thinking about data management tools - Reference Data Management, Master Data Management (MDM), ETL and big data analytics - and a few great tools in each category.
As data infrastructure moves to the cloud, more of the data stack becomes managed and fully integrated. There is no replacement for managing business processes around structured data in large organizations. But cloud-based platforms can help with much of the data management strategy - from treatment and preparation of raw data, to data ingestion, loading, transformation, optimization and visualization - automatically in a single system.
For example, Panoply’s cloud-based automated data warehouse can connect directly to data sources, manage data loading, and automatically transform your data into clean tables that are ready for analysis. Tools which provide an integrated big data stack take us one step closer to a truly holistic concept of data management.