Anybody who hangs out in IT or data science circles will invariably become familiar with the acronym GIGO, short for “garbage in, garbage out.” Semantics aside, it’s a useful reminder that data quality - including how data is collected and distributed - has a noticeable impact on your analytics. So let’s look at a few areas where engineers and analysts run into GIGO pitfalls - and how to avoid them.
The first step in data verification actually has nothing to do with the data itself. Instead, it centers on the applications, programs, and processes that supply the data for your database. Understanding where your data comes from and how it gets into the database helps ensure the data is accurate and that nothing gets lost in translation. Getting to a single source of data truth is a challenge when you’re only looking through one lens. By widening your perspective, you can make a more informed analysis of your data set and be more confident in any recommendations stemming from that data.
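One lightweight way to keep that provenance visible downstream is to stamp every ingested record with its source and load time. The sketch below is illustrative only; the `_source` and `_ingested_at` field names are assumptions for this example, not a standard convention:

```python
import datetime

def tag_with_provenance(rows, source_name):
    """
    Attach ingestion metadata to each incoming record so downstream
    analysts can always trace which system supplied it and when.
    The metadata field names here are illustrative.
    """
    stamp = datetime.datetime.utcnow().isoformat()
    return [
        {**row, "_source": source_name, "_ingested_at": stamp}
        for row in rows
    ]
```

With records tagged this way, a puzzling value in an analysis can be traced back to the application that produced it instead of being debated in the abstract.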
Investigating the back end of your data collection process is also important in meeting another key tenet of data collection and analytics - “trust, but validate.” In many cases, the data flowing into databases might come from other people or departments in your organization. While back-end investigation and process/import mapping can be time-intensive, it’s important to validate and verify others’ data collection and transfer, since blind trust may get you burned - especially if those team members aren’t familiar with data management or analysis.
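One cheap way to validate a handoff without re-auditing the whole pipeline is to compare simple fingerprints of the file a colleague exported against the file that actually arrived. A minimal sketch, assuming both sides can produce a CSV export:

```python
import csv
import hashlib

def file_checksum(path):
    """MD5 of a file's bytes, for comparing a source export to what arrived."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def row_count(path):
    """Number of data rows in a CSV, excluding the header."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

def validate_transfer(source_path, received_path):
    """Flag a handoff whose contents changed in transit."""
    issues = []
    if file_checksum(source_path) != file_checksum(received_path):
        issues.append("checksum mismatch: file contents differ")
    if row_count(source_path) != row_count(received_path):
        issues.append("row count mismatch: records lost or added")
    return issues
```

An empty list means the files match byte-for-byte; anything else is a prompt to go back to the sender before the data enters your warehouse.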
Back-end work doesn’t just encompass the programs and applications, however. It also means knowing how the raw data is being collected. Knowing your company’s collection methods can help you ensure data entering your databases complies with any regulations that govern your industry or business. While compliance procedures can be tedious, ignorance of the rules isn’t a defensible plea if compliance officers or auditors come knocking.
Keeping the Regulators Happy
Nowhere is the need for regulatory understanding more evident than with the recently implemented GDPR. Apart from requiring many data-gathering entities to develop new processes for collecting data, notifying end users, and deleting data upon end user request, the biggest point for many companies is the potential for huge fines in instances of noncompliance. GDPR rules allow for fines of up to 4% of annual global turnover or €20 million, whichever is greater. For any company those are high stakes, but for small or medium businesses that play in the global marketplace, they could be ruinous.
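The deletion-on-request workflow in particular lends itself to automation. The sketch below is only illustrative: it models tables as in-memory lists of dicts rather than real database access, the `user_id` key is an assumption, and an actual GDPR process would also require audit logging and notifying downstream processors:

```python
def handle_erasure_request(user_id, tables):
    """
    Sketch of a right-to-erasure handler: remove a user's rows from every
    table that holds personal data. `tables` maps table name to a list of
    row dicts, standing in for a real database layer.
    Returns a per-table count of deleted rows for the compliance record.
    """
    deleted = {}
    for name, rows in tables.items():
        before = len(rows)
        rows[:] = [r for r in rows if r.get("user_id") != user_id]
        deleted[name] = before - len(rows)
    return deleted
```

Returning the per-table counts gives you something concrete to retain as evidence that the request was honored.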
While the GDPR rules are being hailed as a win for consumer privacy groups in the EU, they largely target security and data collection practices that weren’t inherently malicious but simply weren’t a priority for many companies. The rules have set a high bar for data management professionals, and industry leaders view them as a template for compliance regimes to come. This is especially true in the US, where consumer data protection advocates are pointing to the GDPR as a benchmark for future legislation.
I's Dotted and T's Crossed
Once you’ve checked your processes, mapped your import points, and checked your regulatory compliance, there are still opportunities for improving data integrity. Setting up cross-referencing points for data verification, doing intermittent sampling, and running test groups can all help ensure your data is valid and usable. Looking out for common sources of data collection errors - ambiguous field labels, incorrect field types, overcollection of non-useful information, missing data sets - can help build a clean data flow throughout your entire analytics operation.
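Several of those checks are easy to automate at the point of ingestion. Here is a minimal sketch; the schema mapping (field name to expected Python type) is a stand-in for whatever contract your pipeline actually enforces:

```python
def audit_records(records, schema):
    """
    Spot-check incoming rows against an expected schema.
    `schema` maps field name -> expected Python type.
    Returns a list of (row_index, problem) pairs covering missing fields,
    wrong field types, and unexpected fields (overcollection).
    """
    problems = []
    for i, row in enumerate(records):
        missing = set(schema) - set(row)
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
        for field, expected in schema.items():
            value = row.get(field)
            if value is not None and not isinstance(value, expected):
                problems.append(
                    (i, f"{field}: expected {expected.__name__}, "
                        f"got {type(value).__name__}")
                )
        extra = set(row) - set(schema)
        if extra:
            problems.append((i, f"unexpected fields: {sorted(extra)}"))
    return problems
```

Run over an intermittent sample of incoming batches, a check like this catches incorrect field types and missing data before they pollute downstream analysis.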
Once you’re comfortable with the veracity of the data, the processes, and the compliance with any governing regulation, it’s all about moving data within and between systems in order to optimize analysis. Making sure data moves or copies the way you need it to hinges on things like schema and replication, which can be tricky in transfers between systems. Matching up the data that’s coming from one system to what’s allowed and interpreted by another system can be an exercise in futility depending on a data source’s supported integrations.
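Before kicking off a transfer, it can pay to diff the two systems’ declared schemas and surface mismatches up front. A simple sketch, assuming each schema can be expressed as a mapping of column name to type name:

```python
def schema_diff(source_schema, target_schema):
    """
    Compare a source system's column types to what the target accepts.
    Returns the columns the target lacks and the columns whose declared
    types disagree: the two mismatches that typically break a transfer.
    """
    missing = [c for c in source_schema if c not in target_schema]
    mismatched = [
        (c, source_schema[c], target_schema[c])
        for c in source_schema
        if c in target_schema and source_schema[c] != target_schema[c]
    ]
    return missing, mismatched
```

Resolving (or at least documenting) these differences before moving data is far cheaper than diagnosing a half-failed load afterward.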
This is where Panoply really shines. Because the Panoply data warehouse architecture is designed to accept any type of schema, it facilitates data transfers between data sources without mismatches. Because it’s cloud-based, it also allows replication of large data sets without tying up valuable server room or RAM. Flexibility of schema and replication, when paired with Panoply’s built-in integrations, makes it easy to mix, match, pair, and unpair data sets from multiple sources.
Managing data is a never-ending job - as new data is created, it will require vetting for usability; as regulations change, processes will require amendment to maintain compliance. But one thing is constant - Panoply helps through it all.