
What is the difference between a data engineer and a data scientist?

Written by Yaniv Leven | June 22, 2017

Putting it bluntly, “Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity,” notes Urthecast’s David Bianco.

Data engineers are curious, skilled problem-solvers who love both data and building things that are useful for others. Together with data scientists and business analysts, they are part of the team effort that transforms raw data in ways that provide their enterprises with a competitive edge.

In this blog post, I will discuss what differentiates data engineers from data scientists, what unites them, and how their roles complement each other.


Data Engineers vs Data Scientists

There is significant overlap between data engineers and data scientists when it comes to skills and responsibilities. The main difference is one of focus. Data engineers focus on building the infrastructure and architecture for data generation. Data scientists, in contrast, focus on applying advanced mathematics and statistical analysis to that generated data.

Data scientists interact constantly with the data infrastructure that data engineers build and maintain, but they are not responsible for building and maintaining it themselves. Instead, they are internal clients, tasked with conducting high-level market and business operations research to identify trends and relations, work that requires them to use a variety of sophisticated tools and methods to interact with and act upon data.

Data engineers, for their part, work to support data scientists and analysts, providing infrastructure and tools that can be used to deliver end-to-end solutions to business problems. Data engineers build scalable, high-performance infrastructure for delivering clear business insights from raw data sources; implement complex analytical projects with a focus on collecting, managing, analyzing, and visualizing data; and develop batch and real-time analytical solutions.

Simply put, data scientists depend on data engineers. Whereas data scientists tend to work in analysis tools such as R, SPSS, and Hadoop, and in advanced statistical modelling, data engineers focus on the products that support those tools. For example, a data engineer’s arsenal may include SQL, MySQL, NoSQL stores such as Cassandra, and other data organization services.
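To make this division of labor concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a description of any real system: the `events` table, its columns, and the sample rows are all hypothetical. The first function stands in for the data engineer's side (loading raw records into a SQL store, here an in-memory SQLite database), and the second for the data scientist's side (querying that store and computing a summary statistic).

```python
import sqlite3
import statistics

# Hypothetical raw records: (user_id, purchase_amount). Illustrative only.
RAW_ROWS = [(1, 20.0), (2, 35.5), (1, 12.5), (3, 50.0)]


def build_store(rows):
    """Engineer's side: create the table and load the raw rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    return conn


def mean_spend_per_user(conn):
    """Scientist's side: query the store and summarize the result."""
    cursor = conn.execute(
        "SELECT SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
    )
    totals = [row[0] for row in cursor]
    return statistics.mean(totals)


conn = build_store(RAW_ROWS)
print(mean_spend_per_user(conn))
```

The analysis function never touches how the data was loaded. It only assumes the table exists with the agreed columns, which is the sense in which the scientist is an internal client of the engineer's work.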

As noted at the beginning of this post, data engineers are the plumbers in the data value-production chain. And, as with any infrastructure, while plumbers are not frequently paraded in the limelight, without them nobody can get any work done.

Data engineers and data scientists complement one another

Leveraging Big Data is no longer a “nice to have”; it is a “must have”. Both skill sets, that of a data engineer and that of a data scientist, are critical for the data team to function properly. It is highly improbable that you will be able to land a “unicorn”: a single individual who is both a skilled data engineer and an expert data scientist. Therefore, you will need to build a team in which each member complements the others’ skills. And it is critical that they work together well.

For this to happen, it is important to recognize the different, complementary roles that data engineers and data scientists play in your enterprise’s big data efforts. It is impossible to overstate how important communication between a data engineer and a data scientist is, and equally how important it is to ensure that both data engineering and data science roles and teams are well envisioned and resourced. This is because data “needs to be optimized to the use case of the data scientist. Having a clear understanding of how this handshake occurs is important in reducing the human error component of the data pipeline.”
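One way to picture this handshake is as an explicit, checkable agreement about what the data looks like when it leaves the pipeline. The Python sketch below is only an illustration of that idea, not a method described in the quoted source: the `EXPECTED_SCHEMA` contract, its column names, and the `contract_violations` helper are all hypothetical.

```python
# Hypothetical contract agreed between engineer and scientist:
# column name -> expected Python type.
EXPECTED_SCHEMA = {"user_id": int, "amount": float, "country": str}


def contract_violations(record, schema=EXPECTED_SCHEMA):
    """Return a list of readable problems; an empty list means the record conforms."""
    problems = []
    for column, expected_type in schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            problems.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    return problems


good = {"user_id": 7, "amount": 19.99, "country": "CA"}
bad = {"user_id": "7", "amount": 19.99}

print(contract_violations(good))
print(contract_violations(bad))
```

Running a check like this at the boundary between the pipeline and the analysis catches mismatches automatically instead of leaving them to be discovered by hand, which is one plausible reading of reducing the human error component.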

Failing to prepare adequately for this collaboration from the very beginning can doom your enterprise’s big data efforts. A situation to avoid is one in which data scientists are onboarded before a data pipeline has been adequately established. This leaves them in the uncomfortable (and expensive) position of either digging into the hardcore data engineering that is needed or remaining idle. Neither option is a good use of their capabilities or your enterprise’s resources.

To learn how Panoply uses machine learning and natural language processing (NLP) to learn, model, and automate the standard data management activities performed by data engineers, subscribe to our blog.
