What Do Data Engineers Do That Keeps Them Awake At Night

Written by Yaniv Leven | Mar 23, 2017 12:09:10 PM

As a data engineer, I enjoy my work, but when it comes to integrating and managing data there are quite a few tasks that are downright tedious. From data remodeling to juggling a hodgepodge of software to address disaster recovery, sometimes, we are simply drowning. As they say, it’s a tough job, but somebody's gotta do it!

Data remodeling

When you start building a solution from scratch, you are dealing with a clean slate. The infrastructure is built, all the data has been modeled perfectly in an almost artistic manner and is now flowing in an elegant and clean fashion.

However, problems start real soon. Basically the minute analysts and data scientists get their hands on your magnificent creation. They keep asking for data augmentation, integration of new sources of data and different data models for asking different kinds of questions. It falls on your shoulders, and you feel like you are somehow personally at fault for preventing your company from data-driven value creation, while they hold no appreciation for the remarkable work of art that you’ve created.

Data modeling progresses from conceptual model to logical model to physical schema, establishing a robust world of data objects and their relationships to other data objects. Mainly it provides the analysts the ability to ask the questions they need in a fast cost-effective and simple manner. It is precisely its robust nature, combined with the frequency of changes needed, that creates the never ending cycle of integration, data preparation and modeling.

Scaling clusters

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive, cumbersome, and time-consuming. They also stretch physical infrastructure capacity, creating risk for under-provisioned resources.

The problematic nature of scaling clusters is even greater with RDBMS, given their architecture is run on a single server. In practical terms, this means that when a RDBMS needs to scale, you must buy bigger, more complex, and expensive proprietary hardware with more processing power, memory, and storage capacity. All this involves lengthy downtime and configurations to make the change.

Scaling clusters in cloud infrastructure generally solves the scalability issue. BigQuery’s most significant advantage, for instance, is the seamless and fast resizing of a cluster, up to petabyte scale. Unlike Redshift, there is no need to constantly track and analyze the cluster size and growth in an effort to optimize its size to correspond to the current dataset requirements.

Redshift architecture scalability allows users to enjoy an increase in performance when resources including memory and I/O capacity increase. To get the best for their buck, Panoply seamlessly scales clusters on demand, ensuring the data warehouse performance is well balanced with costs.

Backup and disaster recovery

Traditionally, businesses have cobbled together multiple software solutions to address disaster recovery (DR), backup and archival as a part of the larger data protection practices. This approach is mind-bogglingly inefficient for the data engineer tasked with managing each of these solutions. It is also a relatively expensive approach. A different way of dealing with DR is to leverage cloud architecture for secondary workflows such as backup, archival and disaster recovery (DR).

BigQuery automatically replicates data to ensure its availability and durability. However, complete loss of data due to disaster is less common than the need for fast, instant restore of, for example, a specific table or even a specific record. For both purposes, Redshift automatically stores backups to S3 and enables you to revisit data at any point in time over the last ninety days. In all cases, retrieval includes a series of actions that can make the instant recovery a cumbersome, lengthy operation.

ETL and data prep

Loading your data in a clean and logical way to the analytics architecture is not only the first step in data management but definitely one of its more important stages. The difference in loading methodologies can affect scalability of the process and the richness of the data loaded, and in most cases parts of raw data are actually lost.

When loading data we usually think to load only the important data, meaning the actual data that we want to analyze. The problem with this is that in most cases (most cases being all), you never know on day one which data is going to be important for analysis in the future, and thus long thought experiments and tests of how and which data to load become worthless the minute after analysis begins.

Essentially, the ETL and data warehousing must fulfill the function of being a screening protocol that ultimately produces clean data sets that are easy for analytics programs to scrutinize.

Very frequently you end up running supporting software on large numbers of servers so they can warehouse data from multiple sources, including different OLTPs.

Due to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest challenges in data warehousing. A number of tools of varying functionality are available to support ETL tasks. Even with these tools, a significant portion of the cleaning and transformation work has to be done manually or by low-level programs.

Snowflake supports simultaneous users’ queries through different virtual warehouses. you can run your overnight ETL processes run on slower and less expensive warehouse resources, and then enable real-time ad-hoc queries through a more powerful warehouse during business hours.” However, with Redshift scale and operational efficiency, ETL can be referred to as a rigid and outdated paradigm.

Leveraging Redshift scalability with Panoply, you can forget about overnight ETL processes. Panoply follows the ELT process, whereby all of the raw data is instantly available in real-time and transformations happen asynchronously on query time.

Performance troubleshooting

A bottomless pit, an abyss, the void…. Should I keep going? If the system is slow, just about anything could be the problem. It could be poor data modeling, improper indexing, network issues, storage hardware, basically anything.

Then, once you begin, especially in the case of an issue with a private cloud or unmanaged solutions, the data-engineer clears one bottleneck only to bump into another one. That troubleshooting the data warehouse can be a bit intimidating. Sometimes even figuring out where to begin to look can be difficult.

More often than none, and more frustrating than anything, at least from my personal experience, is when the trouble lays in the type of queries being run by irresponsible analysts, like some SELECT * over an enormous table. I try raising knowledge and awareness of these as part of my tasks. Other frequent issues I see are around the way data is structured in the warehouse, for instance in a case where data is modeled with many subtables, analysis on even a small set of data can be lengthy.

It's time for a better way

Panoply is a cloud data platform designed to make it easier to get your hands on the data you need. You can connect data sources, automatically store your data, and send it to a BI tool, all in just a few clicks. That frees up your time so you can focus on what really matters: driving your business forward.

View full post