How To Become a Big Data Architect: A Guide

By Cheryl Adams | August 24, 2017 | Updated On: January 30, 2021 | Working in Data

Organizations looking to leverage big data impose a larger and different set of job requirements on their data architects than organizations in traditional environments do.

If you want to become a great big data architect with a strong understanding of data warehouse architecture, start by becoming a great data architect or data engineer. In any data environment, big or otherwise, the data architect is responsible for aligning all IT assets with the goals of the business. Just as a homeowner employs an architect to envision and communicate how all the pieces will ultimately come together, business owners employ data architects to fill a similar role in their domain. But instead of lumber, concrete, and tradespeople, a data architecture encompasses data, software, hardware, networks, cloud services, developers, testers, sysadmins, DBAs, and all other resources of an IT infrastructure.

An ideal data architecture correctly models both how the infrastructure and its components will align with business requirements and how an implementation plan will realize that model in day-to-day operations, recognizing that requirements change constantly. That model includes the resources themselves, optimized data formats and structures, and the best policies for handling data by systems and people. This means that great data architects, just like their home-building counterparts, must have in-depth technical knowledge. But they must also know how to apply that knowledge in the context of what owners want (or would want if they had the technical knowledge themselves). So architects must be able to converse comfortably with an organization’s leaders. Nor can they rely on the business people alone to tell them what’s important. Data architects should also bring to these conversations their own knowledge of the business: its priorities, processes, politics, strategy, and market environment.

Three Special Job Requirements

That is a very big role already, so what makes big data architects special?

What’s special are the data, the systems, the tools, and management’s expectations. Organizations that look to leverage big data are qualitatively different from those that don’t. That’s because: 1) they simply have much more data to deal with (typically petabytes, not terabytes), 2) that data comes from many different sources in many different formats, and 3) all that data serves one or possibly two core strategies. One strategy is to generate critical insights at near real-time speed. The other is to automate massively scaled operations in real time (think Netflix videos or GE’s remote predictive maintenance on its customers’ jet and locomotive engines). In both strategies, big data enables a business model differentiated by speed, scale, agility, and intelligence.

So special job requirement #1, then, is the ability to understand and communicate how big data drives the business — whether operationally or through better, faster management insights, or both. Special job requirement #2 is the ability to work with highly diverse data. That is data from a wide variety of sources, in a wide variety of formats, and employed by a wide variety of what are likely to be highly siloed systems.

A big data architect might be tasked with bringing together any or all of the following: human resources data, manufacturing data, web traffic data, financial data, customer loyalty data, geographically dispersed data, etc., etc. — each of which may be tied to its own particular system, programming language, and set of use cases. Some of those use cases may no longer be relevant to the current business, although many will likely still be relevant. Why programs were written a certain way, or why data is formatted a certain way (e.g., why a customer loyalty number has 18 digits, not 15) may not be obvious or even documented. All of which means that big data architects are more likely than other data architects to encounter ETL challenges and risks. So they need to be better at performing forensic system analysis, at knowing the right questions to ask without necessarily being prompted, and at applying best practices for streamlining complex ETL processes while mitigating data loss.
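As a rough illustration of that kind of forensic check, the sketch below profiles the length of an ID field before any transformation and then routes every record to either a loaded or a rejected list, so that no row disappears silently. The field name `loyalty_id`, the sample records, and the left-padding rule are all hypothetical; whether padding a 15-digit ID to 18 digits is actually correct depends on why the two formats exist, which is exactly the question the architect has to ask.

```python
from collections import Counter


def profile_lengths(records, field):
    """Count how many values of `field` have each string length."""
    return Counter(
        len(str(r[field])) for r in records if r.get(field) is not None
    )


def normalize_loyalty_id(value, width=18):
    """Left-pad a digit-only ID to `width` (an assumed rule, not a standard).

    Returns None when the value cannot be normalized safely.
    """
    text = str(value).strip()
    if not text.isdigit() or len(text) > width:
        return None
    return text.zfill(width)


def transform(records, field="loyalty_id"):
    """Split records into loaded and rejected lists; nothing is dropped."""
    loaded, rejected = [], []
    for r in records:
        normalized = normalize_loyalty_id(r.get(field, ""))
        if normalized is None:
            rejected.append(r)
        else:
            loaded.append({**r, field: normalized})
    return loaded, rejected


# Hypothetical extract mixing a 15-digit format, an 18-digit format,
# and one malformed value.
source = [
    {"loyalty_id": "123456789012345"},
    {"loyalty_id": "123456789012345678"},
    {"loyalty_id": "ABC-001"},
]

print(profile_lengths(source, "loyalty_id"))
loaded, rejected = transform(source)

# Reconciliation: every source row must be accounted for.
assert len(loaded) + len(rejected) == len(source)
```

Profiling first makes the undocumented mix of formats visible, and the reconciliation check at the end is one simple way to catch data loss during ETL.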

Which brings up special job requirement #3: deep skills in big data tools and technologies (like those listed in most big data architect job postings). Those include data storage and processing technologies like Accumulo, Hadoop, Panoply, Redshift, MapReduce, Hive, HBase, MongoDB, and Cassandra, as well as query, workflow, machine learning, ingestion, and coordination tools like Impala, Oozie, Mahout, Flume, ZooKeeper, and Sqoop. Relevant programming languages include Java, PHP, and Python, along with a working knowledge of Linux. BI and visualization tools include Apache Zeppelin, Chartio, RStudio, and Tableau. A big data architect should also be experienced in designing and implementing large on-prem and cloud-based data warehouse solutions that use clustered and parallel RDBMS and NoSQL architectures.

Getting Qualified

So how do you become that architect, fulfilling those three special job requirements, if you are already working as a data architect? A good start is getting certified in the types of products listed above, where certification opportunities exist, which you can do on your own. But you’ll also need experience, which you can also gain on your own if you have to. Seek out assignments in your current position where you map multiple data sources into a single warehouse to support big data analytics. Or, if that’s not possible, build your own big data solution in a free AWS account. That demonstrates the kind of drive that big-data-driven organizations love to see.
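For practice on a small scale, mapping multiple sources into a single warehouse table might look something like the sketch below. It uses an in-memory SQLite database as a local stand-in for a real warehouse, and the two source extracts, their field names, and the `person` table are invented for illustration only.

```python
import sqlite3

# Hypothetical extracts from two siloed systems with different field names.
hr_rows = [{"emp_id": 1, "full_name": "Ada Lopez", "dept": "Ops"}]
web_rows = [{"userId": "u-77", "name": "Ben Osei", "landing_page": "/pricing"}]


def from_hr(row):
    """Map an HR record onto the shared (source, source_key, name) shape."""
    return ("hr", str(row["emp_id"]), row["full_name"])


def from_web(row):
    """Map a web record onto the same shared shape."""
    return ("web", row["userId"], row["name"])


def load(conn, rows):
    """Create the target table if needed and upsert the mapped rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person ("
        "source TEXT, source_key TEXT, name TEXT, "
        "PRIMARY KEY (source, source_key))"
    )
    conn.executemany("INSERT OR REPLACE INTO person VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [from_hr(r) for r in hr_rows] + [from_web(r) for r in web_rows])

# Per-source row counts, useful for checking the load against each extract.
counts = dict(conn.execute("SELECT source, COUNT(*) FROM person GROUP BY source"))
print(counts)
```

Keeping the originating system and its key on every row preserves lineage, so a record in the warehouse can always be traced back to the silo it came from.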

It’s also the best part about becoming a great big data architect. If you want to become a big data architect, no one can stop you. Opportunities are expanding at a pace proportionate to the growth of data itself. And now there are more tools and resources than ever available to help you become an expert.
