Implementing modern data management best practices helps you make sense of voluminous amounts of disparate data. It brings full transparency to your data lifecycle, allowing seamless extraction of the most useful data, all at the speed of business.
This is part two of a short series on the intersection of cloud computing and data management; check out part one, which focuses on applicable concepts and techniques for data management in an increasingly cloud-based world.
Data management is more than linked to the quality of your data analytics stack; the two are woven together. A high-performance, high-ROI data stack is the result of a clear vision of how best to implement solid data management concepts. And the efficiency and output of that stack drive continuous monitoring of how the data is being managed, ensuring quality controls geared toward outperformance and constant improvement.
Your data analytics stack is built on a framework of four main components: the sources of your data, the ETL process, the data warehouse, and the BI tools for visualizing data. The goal is equally straightforward: a stack that gives end users a 360-degree view of their data.
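To make those four components concrete, here is a minimal, self-contained sketch in Python, using SQLite as a stand-in warehouse; the file name, schema, and fields are all hypothetical choices for illustration, not a prescribed design:

```python
# A toy pass through all four components, using sqlite3 as a stand-in
# warehouse. The file name, schema, and fields are hypothetical.
import csv
import sqlite3

# Create a tiny sample source so the sketch runs end to end.
with open("orders.csv", "w") as f:
    f.write("order_id,amount\nA-1,19.99\nA-2,5.00\n")

def extract(path):
    # Source: read raw rows from a CSV export.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # ETL: fix types and drop incomplete records.
    return [(r["order_id"], float(r["amount"]))
            for r in rows if r.get("order_id") and r.get("amount")]

def load(rows, conn):
    # Warehouse: land the cleaned rows in a target table.
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract("orders.csv")), conn)

# BI layer: the simplest possible "visualization," an aggregate query.
print(conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0])
```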
This journey starts with one very specific question:
What data jobs are essential? These are the top business needs to be addressed and continuously monitored, which leads to a natural follow-up:
What data jobs are desired, or should be incorporated? These are the business needs you know deserve attention and want to address, but haven’t yet been able to.
Those with business domain knowledge are essential for asking the right questions; once the questions are formulated, the data team can implement the necessary steps. This brings us back to the central tenet of our quest, the one regarding velocity toward accessibility: How fast can the end user get new (and useful!) datasets into their hands?
In today’s world, that quest for the most efficient data strategy entails acquiring the best data storage (and access to it) available. This is where the benefits of integrating a cloud data platform into your organization become readily quantifiable.
We live in a world where everything is automated, or becoming so. The scale of omnipresent automation and device connectivity raises realistic questions of practicality (load app preferences and “toast smarter,” anyone?) and, more significantly, issues of data privacy (we all know who). But everyone can emphatically agree on one thing: waiting for deliverables or actionable intel culled from data generation, sourcing, storage, and visualization is something no one wants to do.
One common trait among business leaders is knowing there are volumes of data they either aren’t getting and would like to, or aren’t getting fast enough. Another is the knowledge that the data they need exists in multiple, disparate locations, and they have neither the time nor the tech to store, assemble, and deliver it properly. Finally, there is the scenario where, “because it’s always been done that way,” an inefficient system is allowed to persist, the ramifications of how counterproductive it has become remaining largely invisible.
Managing data means accessing data, and retrieving it from silos only increases the time component, a “luxury” no one should afford. When seeking an effective data warehouse solution, look for automated, end-to-end data management—from initial data collection to analysis and reporting.
Ideally, regardless of source, and whether the data is structured, unstructured, or semi-structured, it should be automatically aggregated as it streams in. The objective is to accurately visualize and analyze the desired data almost immediately, without manual data configuration, schema design, or modeling.
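As an illustration of that schema-on-read ideal, the sketch below flattens semi-structured JSON events as they stream in, without a schema defined up front; the event shapes and field names are invented for the example:

```python
# Illustrative schema-on-read aggregation: semi-structured events arrive in
# different shapes, and we flatten each one as it streams in rather than
# forcing a schema up front. Field names here are hypothetical.
import json

def flatten(record, prefix=""):
    """Recursively flatten nested JSON into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

stream = [
    '{"user": {"id": 1, "plan": "pro"}, "event": "login"}',
    '{"user": {"id": 2}, "event": "purchase", "amount": 9.99}',
]

for line in stream:
    row = flatten(json.loads(line))
    print(row)  # each record lands with whatever columns it actually has
```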
As organizations progress with their journey into the cloud, applying the best practices of data management will help guide use cases toward success in cloud-based environments while simultaneously meeting data standards of compliance. Agility is helpful, as some adjustments will likely be required during the transition; however, existing best practices overall work quite well, and the adjustment should be minimal.
One key data management concept any business leader or PM can embrace is the use of data stewards throughout the organization. These are subject matter experts and direct points of contact who can help clean, verify, and add qualitative aspects to the quantity of data. They are fluent in the organization’s data model, which means they comprehend the complexities of metrics ascribed to datasets, and can oversee the execution of day-to-day tasks with authority, helping ensure that data is sufficiently prepared for visualization and analysis.
Those tasked with data stewardship may also be held accountable through the ownership of data decision making (determining the importance of incoming data flows, the way data should be stored and for how long, etc.). Furthermore, active participation in properly defining the meaning and purpose of each data field or metric means they are fully conversant in the company’s data dictionary, and are able to contribute to any updates as needed.
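To picture what that fluency rests on, here is one possible shape for a machine-readable data dictionary entry that a steward could own and update; every field shown is a hypothetical choice, not a standard:

```python
# One way a data dictionary entry might be kept machine-readable so stewards
# can update definitions alongside the data. All fields are hypothetical.
from dataclasses import dataclass

@dataclass
class DictionaryEntry:
    field: str           # column or metric name
    meaning: str         # plain-language definition, owned by the steward
    steward: str         # direct point of contact
    retention_days: int  # how long the data should be stored

entry = DictionaryEntry(
    field="utm_campaign",
    meaning="Campaign identifier attached to inbound marketing traffic",
    steward="marketing-data@company.example",
    retention_days=365,
)
print(entry)
```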
Examples of data stewards might include a product line manager with special insight into thousands of products and what it takes to make them, or a distribution head who oversees a firm’s supply chain management and is responsible for connecting shippers and carriers. Or picture a marketing director tracking the performance of ad campaign traffic through UTM codes. Extracting insight from touching all the useful data allows that data steward to report back results alongside suggestions on how to improve product exposure for an upcoming launch, or conversion rates for an existing product.
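For the marketing example, here is a sketch of the raw step involved: pulling UTM parameters out of a campaign landing URL with Python’s standard library (the URL and parameter values are made up):

```python
# Extracting UTM parameters from a campaign landing URL, the kind of raw
# material a marketing data steward might summarize. URL is hypothetical.
from urllib.parse import urlparse, parse_qs

url = ("https://example.com/launch?utm_source=newsletter"
       "&utm_medium=email&utm_campaign=spring_launch")

params = parse_qs(urlparse(url).query)
utm = {k: v[0] for k, v in params.items() if k.startswith("utm_")}
print(utm)
# {'utm_source': 'newsletter', 'utm_medium': 'email', 'utm_campaign': 'spring_launch'}
```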
Building a community of data stewards across different spheres of expertise encourages a ripple effect of participation from contributors and data citizens alike, all inspired to enhance data quality throughout the enterprise. This empowers data democratization (a good thing!) while demonstrating what efficiency in data management looks like.
As data moves to and from the cloud, or perhaps originates there, manage the data in cloud applications (as you would with on-premise applications) by being mindful of how that data will be integrated to create a single picture or product. While cloud-based applications may add layers of complexity (though choosing the right cloud service should offset this at the end-user level, and effectively streamline your data management as a whole), the goal remains uniform: moving data from multiple locales and/or disparate systems into an organized database, rendered clean and delivered ready for insights.
In other words, as your operation integrates fully with the cloud, you shouldn’t accept anything less than a data warehouse that is comfortable with fast-paced environments, able to continuously manage capacity and performance as schemas and workloads rapidly evolve.
Maintaining an attentive posture is an imperative. Without it, an enterprise-wide ecosystem will invariably trend toward inefficiency. With the data landscape constantly changing, awareness keeps you anchored yet agile when necessary, and prevents avoidable lapses in performance. As mentioned above, a proper data management protocol flourishes when you warehouse your data with a provider that not only handles your rate of data growth, but does so without handicapping speed, usability, or cost.
If your current data integration infrastructure is on-premise, extending and integrating it into a cloud environment is entirely possible. Make sure any necessary modifications to the tools or overall architecture are addressed and resolved before cloud integration, as having to rebuild later is both risky and (the wrong kind of) disruptive.
Like any management discipline, when overseeing the sourcing, extraction, transformation, storage, and visualization of data, consider the power of having tools do the work for you. This means moving past the old-fashioned way of siloing data and running queries through IT (and then waiting), toward smart cloud storage and self-service BI tools. As you continue to modernize, it also means deciding how you render the data (will you process it in the cloud or on-premise?) and which tools execute the tasks most efficiently. For example, does your enterprise operate more effectively with batch processing or stream processing?
Batch processing lets the data build up, while stream processing renders the data more steadily accessible, spreading the processing over time. Batch processing is moving data in bulk, validating and cleaning the data, and then staging it in a database before loading the data to target tables within the data warehouse (an important check in the pipeline, making it easier to roll back if something goes wrong).
Stream processing is moving data at speed, and should be viewed as preferable, perhaps mandatory, when the insight and value available from data decay quickly as time passes. There are some very powerful tools out there for either scenario, and it’s always a good idea to familiarize yourself with the options.
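Here is a toy contrast between the two models, with a simple doubling step standing in for validation and cleaning; the event source and processing logic are hypothetical:

```python
# A toy contrast between batch and stream processing. process() stands in
# for validation and cleaning; the events and logic are hypothetical.
import time

def process(event):
    # Stand-in for validation and cleaning; here we just double a value.
    return {"value": event["value"] * 2, "processed_at": time.time()}

def batch_job(events):
    # Batch: let the data build up, stage the whole set, then load in one pass.
    staged = [process(e) for e in events]  # staging area before target tables
    return staged                          # easy to roll back if a check fails

def stream_job(events):
    # Stream: handle each event as it arrives, spreading work over time.
    for event in events:
        yield process(event)               # insight is available immediately

events = [{"value": i} for i in range(5)]
print(batch_job(events))                   # everything at once, after a wait
for row in stream_job(events):
    print(row)                             # one row at a time, as produced
```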
Here is an excellent rule to keep in mind: speed of insight depends on the velocity toward accessibility. Chances are that developing data pipelines with an ETL tool will be simpler and faster, with the initial outlay of funds more than paying for itself over time, particularly for projects that are large or growing in complexity. The same mindset applies when seeking a data warehouse solution that provides automated, end-to-end data management. In fact, the full scope of a vendor’s tool may only be revealed after specific and disparate use cases are run.
Just as we seek to provide end users with a comprehensive, 360-degree view of data, so too must we approach data management with an earnest desire to keep an all-encompassing “worldview” of our data ecosystem. That includes personal data, and it is always good policy to never lose sight of the following “Five Ws”:
Who has access to personal or corporate data? Is the data properly stored? Is it just transient?
Where is this personal data being kept? And, if transferred at any time, where to?
Why is the personal data under your control? Should it be? Does it need to be? And if not under your control, who does own the data?
When are you releasing or destroying those personal records? How long are you keeping them? Are there situations where you’re sharing this data?
What mechanisms do you have in place to protect personal data? This should go without saying, but we’ll say it anyway.
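On that last point, one common mechanism is pseudonymizing direct identifiers before records land anywhere durable. A minimal sketch, assuming a salted hash suits your use case; the fields are hypothetical, and in practice the salt would live in a secrets manager:

```python
# One common protection mechanism: pseudonymize direct identifiers before
# records land anywhere durable. Fields are hypothetical; a real deployment
# would keep the salt in a secrets manager and match the scheme to its
# legal requirements.
import hashlib

SALT = b"rotate-me"  # placeholder only; never hard-code a salt in production

def pseudonymize(value: str) -> str:
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

record = {"email": "jane@example.com", "plan": "pro", "spend": 120.0}
safe = {**record, "email": pseudonymize(record["email"])}
print(safe)  # analytics keeps the shape of the data, not the identity
```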
It’s important to remember that there is often more metadata about the data than there is data itself. Using the right tools and the optimal partner for data storage also shifts aspects of data security and data integrity away from being entirely your responsibility, with data lineage documentation auto-generated from a metadata repository. Data quality should go hand in hand with data governance, always kept up to date as you scale.
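As a sketch of what auto-generated lineage might capture, here is the kind of record a pipeline step could emit to a metadata repository after each run; the field names and paths are illustrative only:

```python
# The kind of record a pipeline step might emit to a metadata repository so
# lineage documentation stays current. Names and paths are illustrative only.
import json
from datetime import datetime, timezone

def lineage_record(source, target, transform, row_count):
    return {
        "source": source,        # where the data came from
        "target": target,        # where it landed
        "transform": transform,  # what was done along the way
        "rows": row_count,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }

rec = lineage_record("s3://raw/orders.csv", "warehouse.orders",
                     "clean_and_type", 10_000)
print(json.dumps(rec, indent=2))
```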
A final, holistic example, one analogous to the importance of properly managing data: consider the significance of, and dependence upon, water in our day-to-day lives. We need it to drink, to produce food, to manufacture goods, and to keep ourselves and our surroundings clean. Because the resource is so important, municipalities dedicate considerable effort to the cleanliness, storage, delivery, and consumption of water. Similarly, organizations depend on data to operate and thrive. To the extent that we need to manage our water systems, organizations need to manage their data systems.
Please check out part one of this series, Data Management Concepts and Techniques in a Cloud-Based World.