When we say “big data”, many people think of the Hadoop technology stack. But is that still the big data stack?
Not anymore. Big data concepts are changing, and the data community has diversified, with big data initiatives now built on a range of other technologies.
The common denominator of these technologies is that they are lightweight and easier to use than Hadoop with HDFS, Hive, ZooKeeper, and the rest of its ecosystem. Some are offered as managed services, letting you get started in minutes.
We propose a broader view of big data architecture, one not centered on a specific technology: how do organizations today build an infrastructure to support storing, ingesting, processing, and analyzing huge quantities of data? This is the stack:
At the bottom of the stack are technologies that store masses of raw data. That data comes from traditional sources like OLTP databases, as well as newer, less structured sources like log files, sensors, web analytics, and document and media archives.
Increasingly, storage happens in the cloud or on virtualized local resources. Organizations are moving away from legacy storage toward commoditized hardware, and more recently to managed services like Amazon S3.
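To make the data layer concrete, here is a minimal sketch of the date-partitioned layout commonly used for raw data in such stores, using the local filesystem as a stand-in for an object store like Amazon S3. The `source=`/`date=` prefix scheme, the `events.jsonl` filename, and the record fields are illustrative assumptions, not a prescribed standard:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_raw_events(root: Path, source: str, records: list[dict]) -> Path:
    """Append raw records to a date-partitioned path such as
    root/source=web/date=2024-01-15/events.jsonl, mirroring the
    prefix layout commonly used in object stores like S3."""
    today = datetime.now(timezone.utc).date().isoformat()
    part_dir = root / f"source={source}" / f"date={today}"
    part_dir.mkdir(parents=True, exist_ok=True)
    target = part_dir / "events.jsonl"
    with target.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # one JSON record per line
    return target

if __name__ == "__main__":
    path = write_raw_events(Path("datalake"), "web",
                            [{"user": "u1", "event": "click"},
                             {"user": "u2", "event": "view"}])
    print(path)  # e.g. datalake/source=web/date=<today>/events.jsonl
```

Keeping raw data in append-only, partitioned files like this is what lets later layers reprocess it without touching the original sources.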
To create a big data store, you’ll need to import data from its original sources into the data layer. In many cases, to enable analysis, you’ll also need to ingest data into specialized tools such as data warehouses. None of this happens without a data pipeline. You can leverage a rich ecosystem of big data integration tools, including powerful open source options, to pull data from sources, transform it, and load it into a target system of your choice.
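The extract-transform-load flow described above can be sketched minimally in Python, assuming a CSV source and using SQLite as a stand-in for the target system (the `orders` table and its columns are hypothetical):

```python
import csv
import io
import sqlite3

def etl(csv_text: str, conn: sqlite3.Connection) -> int:
    """Extract rows from a CSV source, transform them (normalize
    currency to integer cents, drop malformed rows), and load them
    into a target table. Returns the number of rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT, amount_cents INTEGER)")
    loaded = 0
    for row in csv.DictReader(io.StringIO(csv_text)):      # extract
        try:
            amount_cents = round(float(row["amount"]) * 100)  # transform
        except (KeyError, ValueError):
            continue  # skip malformed rows instead of failing the load
        conn.execute("INSERT INTO orders VALUES (?, ?)",
                     (row["id"], amount_cents))             # load
        loaded += 1
    conn.commit()
    return loaded

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    n = etl("id,amount\nA1,19.99\nA2,bad\nA3,5.00\n", conn)
    print(n)  # 2 — the malformed row is skipped
```

Real integration tools add scheduling, retries, and schema management on top, but the extract/transform/load skeleton is the same.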
Thanks to the plumbing, data arrives at its destination. You now need a technology that can crunch the numbers to facilitate analysis. Analysts and data scientists want to run SQL queries against your big data, some of which will require enormous computing power to execute. The data processing layer should optimize the data to facilitate more efficient analysis, and provide a compute engine to run the queries.
Data warehouse tools are optimal for processing data at scale, while a data lake is more appropriate for storage, requiring other technologies to assist when data needs to be processed and analyzed.
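One common way the processing layer “optimizes the data” is by pre-aggregating raw records into summary tables that analysts can query cheaply. A minimal sketch, using SQLite as a stand-in compute engine (the `raw_events` and `daily_views` tables and their columns are hypothetical):

```python
import sqlite3

def build_daily_summary(conn: sqlite3.Connection) -> None:
    """Roll raw page-view events up into a per-day summary table,
    so analyst queries scan a handful of aggregated rows instead
    of the full raw event set."""
    conn.executescript("""
        DROP TABLE IF EXISTS daily_views;
        CREATE TABLE daily_views AS
        SELECT day,
               COUNT(*)                AS views,
               COUNT(DISTINCT user_id) AS users
        FROM raw_events
        GROUP BY day;
    """)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (day TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                     [("2024-01-01", "u1"), ("2024-01-01", "u1"),
                      ("2024-01-02", "u2")])
    build_daily_summary(conn)
    print(conn.execute("SELECT * FROM daily_views ORDER BY day").fetchall())
    # [('2024-01-01', 2, 1), ('2024-01-02', 1, 1)]
```

A warehouse runs the same kind of aggregation, just distributed across many nodes and over far larger tables.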
You’ve bought the groceries, whipped up the batter, and baked the cake; now you get to eat it. The data layer collected the raw materials for your analysis, the integration layer mixed them together, and the data processing layer optimized and organized the data and executed the queries. The analytics and BI layer is the payoff: using the data to enable data-driven decisions.
Using the technology in this layer, you can run queries to answer the questions the business is asking, slice and dice the data, build dashboards, and create beautiful visualizations with one of many advanced BI tools. Your objective? To answer business questions and provide actionable data that helps the business.
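To make “slice and dice” concrete, here is a tiny sketch that pivots a list of sales records into a region-by-quarter cross-tab, the basic operation a BI tool performs behind a dashboard. The records, field names, and numbers are purely illustrative:

```python
from collections import defaultdict

def crosstab(records: list[dict], row_key: str,
             col_key: str, value: str) -> dict:
    """Pivot flat records into a nested dict {row: {col: summed value}},
    the slice-and-dice primitive behind most BI cross-tab views."""
    table: dict = defaultdict(lambda: defaultdict(float))
    for rec in records:
        table[rec[row_key]][rec[col_key]] += rec[value]
    return {r: dict(cols) for r, cols in table.items()}

if __name__ == "__main__":
    sales = [
        {"region": "EMEA", "quarter": "Q1", "revenue": 120.0},
        {"region": "EMEA", "quarter": "Q2", "revenue": 80.0},
        {"region": "APAC", "quarter": "Q1", "revenue": 200.0},
        {"region": "EMEA", "quarter": "Q1", "revenue": 30.0},
    ]
    print(crosstab(sales, "region", "quarter", "revenue"))
    # {'EMEA': {'Q1': 150.0, 'Q2': 80.0}, 'APAC': {'Q1': 200.0}}
```

Swapping `row_key` and `col_key`, or filtering the records first, gives the other “slices” of the same data.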
Should you pick and choose components and build the big data stack yourself, or take an integrated solution off the shelf? Until recently, to get the entire data stack you’d have to invest in complex, expensive on-premise infrastructure. Today a new class of tools is emerging, which offers large parts of the data stack, pre-integrated and available instantly on the cloud.
Another major change is that the data layer is no longer a complex mess of databases, flat files, data lakes, and data warehouses requiring intricate integration to work together. Cloud-based data integration tools help you pull data, at the click of a button, into a unified, cloud-based data store such as Amazon S3. From there, data can easily be ingested into cloud-based data warehouses, or even analyzed directly by advanced BI tools.
Panoply's cloud data platform covers all three layers at the bottom of the stack:
Data—Panoply's cloud-based storage can hold petabyte-scale data at low cost.
Integration/Ingestion—Syncing data takes just a few minutes thanks to Panoply's intuitive UI.
Data Processing—Panoply lets you perform on-the-fly queries on the data to transform it to the desired format, while keeping the original data intact. It connects to all popular BI tools, which you can use to perform business queries and visualize results.
Most importantly, Panoply does all this without requiring data engineering resources, as it's fully integrated, right out of the box.
For a long time, big data has been practiced in many technical arenas beyond the Hadoop ecosystem. Big data lives in data warehouses, NoSQL databases, and even relational databases scaled to petabyte size via sharding. Our simple four-layer model can help you make sense of all these different architectures—this is what they all have in common:
By infusing this framework with modern cloud-based data infrastructure, organizations can move more quickly from raw data to analysis and insights. Data engineers can leverage the cloud to whip up data pipelines at a tiny fraction of the time and cost of traditional infrastructure.
As an analyst or data scientist, you can use these new tools to take raw data and move it through the pipeline yourself, all the way to your BI tool—without relying on data engineering expertise at all.