As a species we’ve invented some pretty nifty things. It’s practically consensus that the wheel, the printing press and the internet top the list, but we don’t need to look that far back to find innovations with the potential to change our lives. You may have heard of the ultra-cool Metallic Microlattice which, while around since 2011, has been all the hype for the past few months. As one of the lightest structural materials known to science, it has impressive applications in aerospace engineering and the automotive industry. That said, no one in their right mind would expect humanity to chuck out all other forms of metal and work exclusively with Metallic Microlattice from now on. Unfortunately, this is exactly the fallacy the data industry has fallen for with Hadoop.
The Hadoop frenzy
The data industry is in a frenzy over Hadoop, bending over backwards to come up with new applications, techniques and use cases where Hadoop will perform phenomenally at solving yet another complex task. Don’t get me wrong, I’m not knocking it. Hadoop is a powerful data tool capable of handling massive amounts of semi-structured data with relatively low effort. I’ve used Hadoop for years in many different projects and I’ll definitely use it again. But it’s just one hammer in a huge toolbox of varied, inter-connected technologies that can be employed to construct innovative and robust data architectures. Seriously, if someone were to walk up to you now and say Python is the only programming language you will ever need and you can forget about everything else, would you believe them? No.
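To make “relatively low effort” a bit more concrete, here’s a minimal sketch of a Hadoop Streaming job written in plain Python. Everything specific in it is made up for illustration: the file name, the tab-separated log layout and the event_type field are assumptions, not a real pipeline.

```python
#!/usr/bin/env python3
# job.py -- a hypothetical Hadoop Streaming job in a single file.
# Run as the mapper with "job.py map" and as the reducer with "job.py reduce".
# The log layout (tab-separated: user_id, event_type, payload) is an assumption.
import sys


def mapper():
    # Emit (event_type, 1) for every log line we manage to parse.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")


def reducer():
    # Streaming hands the reducer its lines sorted by key, so a running sum works.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{count}")
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Shipped to a cluster with the hadoop-streaming jar and flags along the lines of -input, -output, -mapper "job.py map" and -reducer "job.py reduce", the framework takes care of splitting, shuffling and retries, which is exactly the low-effort part.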
In no field has any innovation completely and utterly wiped out what came before it, or negated the need to keep expanding the horizons of innovation. The wheel has not stopped us from walking, nor did it prevent us from inventing flight.
History has shown us that innovation is never a one-stop solution, and the same holds true for data management.
Columnar databases
One obvious example of this is the rise of columnar MPP (Massively Parallel Processing) data stores, like Amazon Redshift, Vertica and Greenplum. All are state-of-the-art technologies designed for fast analysis of highly structured datasets. Greenplum, for example, which has recently been open sourced, coordinates an array of columnar Postgres-based instances working in unison across the cluster to achieve high throughput for analytical queries.
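To give a feel for the workload these engines shine at, here’s a hedged sketch of an aggregation query issued from Python through psycopg2 (Greenplum and Redshift both speak the Postgres protocol). The cluster address, the events table and its columns are all hypothetical.

```python
# A minimal sketch of the aggregation-heavy queries columnar MPP stores excel at.
# Connection details, the events table and its columns are hypothetical.
import psycopg2

conn = psycopg2.connect(host="analytics-cluster.example.com",
                        dbname="warehouse", user="analyst", password="secret")

# A columnar engine reads only the three columns this query touches,
# skipping the rest of a possibly very wide fact table.
QUERY = """
    SELECT country,
           DATE_TRUNC('day', created_at) AS day,
           SUM(revenue) AS revenue
    FROM events
    GROUP BY country, day
    ORDER BY day, revenue DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for country, day, revenue in cur.fetchall():
        print(country, day, revenue)
```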
Zettabyte Database (ZBDB), a lesser-known columnar database, has an interesting twist: it uses the GPU for query computations. Unlike a general-purpose CPU, a GPU typically contains hundreds or thousands of small cores working in parallel on dedicated tasks, usually numerical calculations. While textual manipulation gains less from the GPU, many analytical queries involving numerical computations enjoy vast performance benefits from this architecture.
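ZBDB’s internals aren’t something I can show here, but the underlying intuition is easy to demonstrate: numeric, column-at-a-time work maps naturally onto data-parallel hardware. Below is a rough sketch using CuPy (a GPU counterpart to NumPy) purely to illustrate the principle; it is not ZBDB’s API, and the column sizes are arbitrary.

```python
# Illustration only: why numeric, column-at-a-time operations fit the GPU so well.
# This is NOT ZBDB's API, just CuPy (GPU) standing in next to NumPy (CPU).
import numpy as np

try:
    import cupy as xp      # GPU arrays, if a CUDA device and CuPy are available
except ImportError:
    xp = np                # fall back to the CPU so the sketch still runs

# Two "columns" of ten million float32 values each (sizes chosen arbitrarily).
revenue = xp.random.random(10_000_000).astype(xp.float32)
discount = xp.random.random(10_000_000).astype(xp.float32)

# Element-wise math plus a reduction: each of the GPU's thousands of cores
# chews through its own slice of the column in parallel.
net = revenue * (1.0 - discount)
print(float(net.sum()), float(net.mean()))
```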
Textual search
A more exotic example is databases like Snowflake, or even Elasticsearch. While the latter isn’t usually considered an obvious choice for big data analysis, it has unique benefits. Think of the creative ways we can harness its power for instant textual or tagged searches across big datasets: searching through aggregated data, looking up specific dimensions or users, or even free-text searches for data exploration in use cases where users don’t necessarily know what they’re looking for and want to explore the available data. While we’re on the subject, check out Kueri, which is relevant in a similar way.
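As a hedged sketch of what that exploratory search can look like, here’s the idea expressed with the official Elasticsearch Python client (8.x-style API). The index name, the fields and the misspelled query text are invented for illustration.

```python
# A minimal sketch of exploratory, fuzzy text search over aggregated data.
# The index name, field names and query text are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "I'm not sure exactly what I'm looking for": a fuzzy free-text match
# across daily aggregates, narrowed to the last 30 days.
response = es.search(
    index="daily-aggregates",
    query={
        "bool": {
            "must": [
                {"match": {"campaign_name": {"query": "summr sale",
                                             "fuzziness": "AUTO"}}}
            ],
            "filter": [
                {"range": {"day": {"gte": "now-30d/d"}}}
            ],
        }
    },
    size=20,
)

for hit in response["hits"]["hits"]:
    print(hit["_source"])
```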
Traditional row-oriented databases
Finally, there are the good old row-oriented databases, like Postgres, which are still superior in many use cases: for example, when we want to analyze full rows, aggregate data per row (think of a users table), or even support operational queries such as reviewing a specific user, payment or session. In these cases, you definitely don’t want to run the queries on your production database, and sometimes not even on its replicas. You need a separate analytical data platform for your organization.
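Here’s a quick hedged sketch of that operational pattern with Postgres and psycopg2, pointed at a dedicated analytical copy rather than production; the host, the tables and the user id are made up for illustration.

```python
# A minimal sketch of the row-at-a-time lookups row-oriented stores handle well.
# The replica host, tables, columns and user id are hypothetical.
import psycopg2

# Point at the analytical copy or replica, never at the production master.
conn = psycopg2.connect(host="analytics-replica.example.com",
                        dbname="app", user="support", password="secret")

with conn, conn.cursor() as cur:
    # Fetch the entire row for one user, exactly the access pattern where
    # row-oriented storage beats a columnar layout.
    cur.execute("SELECT * FROM users WHERE user_id = %s", (42,))
    user = cur.fetchone()

    # And the user's recent payments, for an operational "review this account" view.
    cur.execute(
        "SELECT * FROM payments WHERE user_id = %s "
        "ORDER BY created_at DESC LIMIT 10",
        (42,),
    )
    print(user, cur.fetchall())
```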
Choose your options well
When designing your organization’s data platform you should think of it as a data backbone for the entire organization, not a database for analysts. In essence, you need to architect a robust system that will become the single hub the organization uses for its day-to-day analytical operations. Of course you don’t want to over-engineer, and every tool adds its own set of complexities, caveats and maintenance, but you also don’t want to go to the other extreme of dumbing down your selection of tools to a single technology that’s reverse engineered to cover all use cases.
Hadoop is overshadowing tons of other exciting technologies. Even dominant technologies like Spark and Kafka are receiving much less exposure than you would expect, and than they deserve. We’ve been down this road before, a few years back, with HTML5, Ruby on Rails and Node.js. Then, too, the industry fell madly in love with the concept of a single technology to rule them all. But like everything else in tech, reality turned out to be far more complex than that ideal. Our job as engineers is to assemble the best possible tools, in interesting ways, to produce outstanding results.
We need to look at technologies as opportunities to learn new paradigms and techniques, applying them in innovative architectures that fit our needs. This is why I think every engineer should learn Lisp; not necessarily to use it in production, but to understand the beautiful programming concepts Lisp provides. The same goes for Erlang and its take on parallelism. Linux embodies the beautiful Unix philosophy that every tool should do one thing and do it extremely well. To be more than the sum of our parts we must embrace the diversity of data technology. Only through unique combinations of these tools and technologies can we build truly powerful data platforms.