Data processing and analysis can’t happen without data profiling—reviewing source data for content and quality. As data gets bigger and infrastructure moves to the cloud, data profiling is increasingly important.
Need to achieve big data profiling with limited time and resources? Learn about:
- An efficient data profiling process from the father of data warehousing.
- Best practices for learning about your data and obtaining data quality metrics, even without meta data or documentation.
- Top six open source and commercial tools to automate parts of the process.
- The future of autonomous data profiling in a cloud-based environment.
What is data profiling?
Data profiling is the process of reviewing source data, understanding structure, content and interrelationships, and identifying potential for data projects.
Data profiling is a crucial part of:
- Data warehouse and business intelligence (DW/BI) projects—data profiling can uncover data quality issues in data sources, and what needs to be corrected in ETL.
- Data conversion and migration projects—data profiling can identify data quality issues, which you can handle in scripts and data integration tools copying data from source to target. It can also uncover new requirements for the target system.
- Source system data quality projects—data profiling can highlight data which suffers from serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in interfaces, data corruption).
Data profiling involves:
- Collecting descriptive statistics like min, max, count and sum.
- Collecting data types, length and recurring patterns.
- Tagging data with keywords, descriptions or categories.
- Performing data quality assessment, risk of performing joins on the data.
- Discovering metadata and assessing its accuracy.
- Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies, and performing inter-table analysis.
Types of data profiling
There are three main types of data profiling:
Validating that data is consistent and formatted correctly, and performing mathematical checks on the data (e.g. sum, minimum or maximum). Structure discovery helps understand how well data is structured—for example, what percentage of phone numbers do not have the correct number of digits.
Looking into individual data records to discover errors. Content discovery identifies which specific rows in a table contain problems, and which systemic issues occur in the data (for example, phone numbers with no area code).
Discovering how parts of the data are interrelated. For example, key relationships between database tables, references between cells or tables in a spreadsheet. Understanding relationships is crucial to reusing data; related data sources should be united into one or imported in a way that preserves important relationships.
Ralph Kimball, a father of data warehouse architecture, suggests a four-step process for data profiling:
- Use data profiling at project start to discover if data is suitable for analysis—and make a “go / no go” decision on the project.
- Identify and correct data quality issues in source data, even before starting to move it into target database.
- Identify data quality issues that can be corrected by Extract-Transform-Load (ETL), while data is moved from source to target. Data profiling can uncover if additional manual processing is needed.
- Identify unanticipated business rules, hierarchical structures and foreign key / private key relationships, use them to fine-tune the ETL process.
Data profiling and data quality analysis best practices
Basic data profiling techniques:
- Distinct count and percent—identifies natural keys, distinct values in each column that can help process inserts and updates. Handy for tables without headers.
- Percent of zero / blank / null values—identifies missing or unknown data. Helps ETL architects setup appropriate default values.
- Minimum / maximum / average string length—helps select appropriate data types and sizes in target database. Enables setting column widths just wide enough for the data, to improve performance.
- Key integrity—ensures keys are always present in the data, using zero/blank/null analysis. Also, helps identify orphan keys, which are problematic for ETL and future analysis.
- Cardinality—checks relationships like one-to-one, one-to-many, many-to-many, between related data sets. This helps BI tools perform inner or outer joins correctly.
- Pattern and frequency distributions—checks if data fields are formatted correctly, for example if emails are in a valid format. Extremely important for data fields used for outbound communications (emails, phone numbers, addresses).
Data profiling, a tedious and labor intensive activity, can be automated with tools, to make huge data projects more feasible. These are essential to your data analytics stack.
Open source data profiling tools
1. Quadient DataCleaner—key features include:
- Data quality, data profiling and data wrangling
- Detect and merge duplicates
- Boolean analysis
- Completeness analysis
- Character set distribution
- Date gap analysis
- Reference data matching
2. Aggregate Profiler (Open Source Data Quality and Profiling)—key features include:
- Data profiling, filtering, and governance
- Similarity checks
- Data enrichment
- Real time alerting for data issues or changes
- Basket analysis with bubble chart validation
- Single customer view
- Dummy data creation
- Metadata discovery
- Anomaly discovery and data cleansing tool
- Hadoop integration
3. Talend Open Studio—a suite of open source tools, data quality features include:
- Customizable data assessment
- A pattern library
- Analytics with graphical charts
- Fraud pattern detection
- Column set analysis
- Advanced matching
- Time column correlation
Commercial data profiling tools
4. Data Profiling in Informatica—key features include:
- Data stewardship console which mimics data management workflow
- Exception handling interface for business users
- Enterprise data governance
- Map data quality rules once and deploy on any platform
- Data standardization, enrichment, de-duplication and consolidation
- Metadata management
5. Oracle Enterprise Data Quality—key features include:
- Data profiling, auditing and dashboards
- Parsing and standardization including constructed fields, misfiled data, poorly structured data and notes fields
- Automated match and merge
- Case management by human operators
- Address verification
- Product data verification
- Integration with Oracle Master Data Management
6. SAS DataFlux—key features include:
- Extracts, cleanses, transforms, conforms, aggregates, loads and manages data
- Supports batch-oriented and real-time Master Data Management
- Creates real-time, reusable data integration services
- User-friendly semantic reference data layer
- Visibility into where data originated and how it was transformed
- Optional enrichment components
Traditional data profiling, as described in this post, is a complex activity performed by data engineers prior to, and during, ingestion of data to a data warehouse. Data is meticulously analyzed and processed (with partial automation) before it is ready to enter the pipeline.
Today more organizations are moving data infrastructure to the cloud, and discovering that data ingestion can happen at the click of a button. Cloud data warehouses, data management tools and ETL services come pre-integrated with hundreds of data sources. But if you can click a button and move data instantly into your target system, what about data profiling?
Data profiling is more crucial than ever, with huge volumes flowing through the big data pipeline and the prevalence of unstructured data. In an cloud-based data pipeline architecture, you need an automated data warehouse that can take care of data profiling and preparation on its own. Instead of analyzing and treating the data using a data profiling tool, just pour it into the automated data warehouse, and it will automatically be cleaned, optimized and prepared for analysis.
If this sounds like science fiction, it isn’t—Panoply offers the world’s first smart data warehouse, which uses AI technology to automatically model incoming data, define schemas and prepare data for analysis in seconds.
Try Panoply today to see the next generation of data preparation in action.