I recently sat down with Aron Clymer to talk about his approach to architecting and building data stacks. Aron has built data analytics stacks and developed analytics programs at Salesforce, PopSugar, Tact.ai, The Latest Scoop, and many others. He now helps organizations of all sizes develop analytics programs and build data stacks as the founder of Data Clymer, an analytics consulting firm based in San Francisco.
Building a Data Stack
Data Clymer has helped a lot of organizations build their data analytics stacks. Tell us about the typical scenario where an organization comes to you to build a stack.
There are two main scenarios. Some small clients don't have anything and they are just starting with a clean slate. Other times we help organizations replace a poorly architected or failing system with something new.
These companies have done their strategic planning and know how data can help them, but they realize that the systems they're using, which are usually just the systems they use to run their business, are not powerful enough or don't provide the level of detail their analytics practice needs.
Situation 1: Building a data stack from scratch
Tell us about the first situation where an organization is starting from scratch.
The Latest Scoop is a great example. The company had fairly powerful reports from their POS vendor and they could do some interesting things. They are a boutique lifestyle retailer so they could already report on standard stuff like sales trends by brand, product, store, segment, etc. However, they needed more real-time data. Their vendors were giving them 24-hour batches of data but they wanted to be nimble throughout the day and modify their sales strategy in real time. In the end, we built a system for them that provided reporting with near real-time data.
Give us an example of a real-time retail problem.
The retailer has daily sales targets per store. By 11:00 a.m. if the store manager realizes they're not selling enough product in a category, let’s say dresses, they might redirect a few of the sales people on the floor to a certain area or they might move product around the floor. They had all sorts of strategies to increase sales in the right categories to achieve their ultimate sales goals.
That's interesting. It sounds like analytics insights should be delivered at the rate the organization can act on them.
Exactly, for a lot of companies, real-time insight isn't critical but for some, latency is a big problem. The other challenge is bringing multiple data sets together in one place to see the whole 360-degree view of the business. That's one of the biggest reasons why companies choose to build their own data stack.
Situation 2: Rebuilding a data stack
Can you tell us about the problems of an organization that is deciding to rebuild a data stack?
A lot of companies understand the value of an analytics environment to direct their corporate strategy, product strategy, or marketing strategy but data management is not their strong point. The biggest challenge is that they don’t have the skills in-house.
These companies will try to leverage internal technical resources to write data scripts, but for those folks it's not their primary role. They are often software engineers building the product, or sometimes I.T. folks who can write code. They do their best and hack together a data stack, but it's usually far from ideal. Not by any fault of their own, but because they simply don't have the expertise or experience. Unfortunately, they don't understand how hard it can be to keep data jobs running efficiently without failure, or to recover elegantly from failure, so that engineers don't have to be up late at night trying to get reports and analytics available the next morning.
That sounds like a nightmare.
Yeah, and it happens all the time.
Data stack technology, including data warehouses, BI, and analytics vendors, has improved by leaps and bounds over the past decade. But the fundamental difficulties of dealing with data haven't really gone away.
Data is messy. It's structured in very different ways. Understanding what keys are used to join different data sets together is still difficult. These challenges don't go away even when you throw modern technology at them. And people don't recognize how challenging it can be. As a result, it ends up becoming a second job for somebody on nights and weekends.
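The point about keys is worth making concrete. The join itself is usually trivial; the hard-won knowledge is which field links two systems. A minimal sketch (the field names and values here are made up for illustration):

```python
# Two hypothetical source systems: a sales feed and a store directory.
sales = [{"store_id": 1, "amount": 500}, {"store_id": 2, "amount": 300}]
stores = {1: "Downtown", 2: "Mall"}

# The hard part in practice is discovering that `store_id` is the key
# linking the two systems; once known, the join is a one-liner.
joined = [{**row, "store_name": stores[row["store_id"]]} for row in sales]
```

In real stacks the same work happens in SQL or an ETL tool, but the challenge is identical: someone has to understand the data well enough to know the join key.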
Yep. No fun.
When is it time to build a data stack?
When does an organization recognize it’s time to invest in a more comprehensive data stack?
It's clear: when the consumers of the data, often the executive team, become frustrated because they can't access the data in a timely manner. They might need daily reports, and it might take several days because the data wasn't properly loaded into their data warehouse. This can cause operational problems, and then they realize, “OK, we've got to fix this.”
What are the fail points of a data stack?
When you say “fix this,” what are the fail points of a sub-optimal stack?
There are so many different kinds of failure.
We've got time.
Sometimes the schema of the source has changed: a new column has been added, or something's changed in the structure of the source data in such a way that the downstream systems aren't handling it correctly.
A schema change is easy to understand. If new columns are added somewhere along the way and you haven't used the right tools, that can break your Extract, Transform, Load (ETL) process, or, in the best case, the new column simply doesn't make it into your end users' hands. Better ETL tools will handle that, and if you architect your stack right, it will happen automatically.
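To illustrate the kind of check a well-architected pipeline performs, here is a minimal sketch of schema-drift detection against a CSV source. The column names and the expected-columns list are hypothetical, not from any specific tool mentioned here:

```python
import csv

# Hypothetical columns the downstream load expects.
EXPECTED_COLUMNS = ["order_id", "store_id", "category", "amount"]

def check_schema(path):
    """Compare a source file's header against the expected columns
    and report any drift (columns added or missing)."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    added = [c for c in header if c not in EXPECTED_COLUMNS]
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    return added, missing

# A robust stack surfaces `added` columns automatically instead of
# silently dropping them, and fails loudly when `missing` is non-empty.
```

Modern ETL tools do this kind of comparison continuously and propagate new columns on their own; home-grown scripts usually don't, which is where the late-night breakage comes from.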
Other times it's more about data volume. As volume increases, it stresses the system to the point where data jobs fail or run out of memory because there isn't enough processing capacity.
As companies scale, their data stacks don’t always scale with them.
Money vs Process
Scale is often a blessing and a curse. With scale, is it a question of throwing money at the problem or finding a better way?
It's about doing it better. You can throw more money at some problems and it will fill the gap for a while but usually, the answer is better solution architecture and a completely different set of tools.
One of our clients hadn't used any ETL tools to build their data pipelines. Instead, they were running a bunch of Python scripts to extract data.
For example, they had been extracting and transforming a large amount of data from Salesforce so that their senior strategist could perform some complex analytics on it. But the engineer tasked with doing this had to run the script manually, and it would sometimes take all day.
Ideally, the analyst would have access to this data daily, but the engineer was so busy with other things that it might run once a week or even less often. Then, half the time the job would fail because it took so long.
In that case, the client needed a better solution. So, we implemented a modern ETL tool and automated the process to run daily. Now it runs smoothly and everyone gets more sleep, and the engineer can get back to his day job of building the client’s core product.
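The fix here is less about any one product and more about the pattern: scheduled, unattended jobs that retry on transient failure and alert when they finally give up. A minimal sketch of that pattern in Python (`extract_from_salesforce` is a stand-in placeholder, not a real API call):

```python
import time

def run_with_retries(job, max_attempts=3, backoff_seconds=60):
    """Run a data job, retrying on failure so a transient error
    doesn't cost a whole day of reporting."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # surface the failure for alerting
            time.sleep(backoff_seconds * attempt)

def extract_from_salesforce():
    # Placeholder for the real extraction logic.
    return [{"account": "Acme", "amount": 1200}]

rows = run_with_retries(extract_from_salesforce, backoff_seconds=0)
```

A scheduler (cron, or the orchestration built into a modern ETL tool) then runs this daily, so no engineer has to babysit it.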
Are people aware of the “data stack” as a solution?
Are people just unaware that there are solutions out there that aren't home baked?
I think a lot of times this comes down to culture and mindset. It’s the classic “if you give engineers a problem, they will want to build it.”
Yep. I’ve seen a situation where I proposed a vendor solution that would cost $200 a month and solve our problem ten times over, and engineering wanted to redraw the roadmap to build the solution themselves.
That's right. And this happens all the time. Engineers are often leery of any vendor because it could be a black box they can't access. They want to explore, get as low-level into the code as possible, and ultimately learn something. To be fair, they are doing it for their benefit and their careers, and they are engineers for a reason! But often their incentives aren't well aligned with those of the business overall.
Subscribe to our blog to get the next post in this series: “Architecting Trust into the Data Stack.”