Long before there was big data, there were computing solutions designed to store, organize, and analyze data. But just as the job of a big data architect is both bigger and different from that of other data architects, so too is the job of the solutions on which those architects rely. That means adopters should consider factors beyond those they might have weighed in the past when selecting a specific big data solution. That applies to big data requirements in general and to the specific requirements of an individual use case, whether it's a particular data warehouse, storage system, BI tool, data analytics tool, or other solution you are looking to select.
As with almost any product, one useful way to approach selection criteria is simply as a list of questions. So here’s our top 10:
Consider adopting a big data solution if your use case requires very deep data (e.g., terabytes or petabytes) or extremely wide data (e.g., millions of columns). Another good indicator is that your organization employs or contracts with a data scientist or data engineer. And a third is if the big data use case is recurring and not just a one-off.
Before you can figure out what to buy, you need to figure out what you need. For example, marketing analytics, bioanalytics, and predictive maintenance have very different requirements for computing power, storage size, latency, security, required skill sets, software functionality, and software interface. Only by closely defining your use case first can you define the environment to support it.
A good way to answer this question is to identify the key potential solution bottlenecks: functions that, if the solution doesn't provide them well, will cause the use case to fail or severely degrade. That might mean supporting a distinct format (e.g., object, file, SAS, relational), a programming style (e.g., R, Python, drag-and-drop), or specific integrations between various tools.
Again, if you don’t have a data scientist or engineer on board already, you likely don’t need a big data solution. The next most important skill — the one most analysts spend the most time on — is preprocessing, i.e., integrating different formats and cleaning the data. Big data usually means big diversity in data formats and data condition.
One way to overcome the preprocessing bottleneck is with a data management platform like Panoply that does a lot of that integration and data cleaning automatically. Another way is with tools that all support the same formats out of the box. If users do the data preprocessing manually, R and Python can handle virtually any format (e.g., SAS, CSV, MongoDB). The human factor should also be considered. One reason some users adopt Microsoft's Power BI, for example, is that they are already familiar with its user interface and can easily move tables and charts between Power BI, Excel, and PowerPoint.
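To make the preprocessing step concrete, here is a minimal sketch in Python of the kind of cleaning analysts spend their time on: normalizing column names, dropping duplicate rows, and filling missing values. It assumes pandas is available and CSV input; the column names and values are purely illustrative.

```python
import io

import pandas as pd


def preprocess(csv_text: str) -> pd.DataFrame:
    """Clean a raw CSV extract: normalize headers, drop exact
    duplicates, and fill missing numeric values."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Normalize headers so downstream tools see consistent names
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Remove exact duplicate rows, a common artifact of merged extracts
    df = df.drop_duplicates()
    # Fill missing numeric values with the column median
    for col in df.select_dtypes("number"):
        df[col] = df[col].fillna(df[col].median())
    return df


raw = "Customer ID,Spend\n1,100\n1,100\n2,\n3,300\n"
clean = preprocess(raw)
print(list(clean.columns))  # ['customer_id', 'spend']
print(len(clean))           # 3
```

Real pipelines would of course also reconcile formats (SAS files, MongoDB exports, and so on), which is exactly where a platform that automates this step saves time.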
On this one, a lot depends on your users. Those who like to code may prefer open source tools because they can “look under the hood” and tweak the software. Others may prefer the drag-and-drop style more likely found in proprietary tools. Other factors to consider are extensibility and type of support (see questions 6 and 10).
Do you expect your needs to grow, such as in storage, table size, or compute capacity? If so, how much, how fast, and how easily will the solution you're considering scale? One big advantage of the cloud, of course, is that capacity can grow (a lot) automatically and only as you need it. But even some cloud tools have limits. For example, Amazon's Redshift warehouse enforces maximum limits on tables per cluster (9,900) and columns per table (1,600), while open source Hadoop does not.
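A simple way to stay ahead of hard limits like these is to check your headroom periodically. The sketch below encodes Redshift's documented per-cluster caps and flags when usage nears them; in practice the current counts would come from Redshift's system views rather than being passed in by hand, and the 90% threshold is an illustrative choice.

```python
# Redshift's documented hard limits (the 9,900-table cap applies to
# large node types; smaller node types have a lower cap).
REDSHIFT_MAX_TABLES_PER_CLUSTER = 9_900
REDSHIFT_MAX_COLUMNS_PER_TABLE = 1_600


def scaling_headroom(table_count: int, widest_table_columns: int) -> dict:
    """Return remaining capacity before hitting Redshift's hard limits,
    and flag usage above 90% of either limit."""
    return {
        "tables_remaining": REDSHIFT_MAX_TABLES_PER_CLUSTER - table_count,
        "columns_remaining": REDSHIFT_MAX_COLUMNS_PER_TABLE - widest_table_columns,
        "near_limit": (
            table_count / REDSHIFT_MAX_TABLES_PER_CLUSTER > 0.9
            or widest_table_columns / REDSHIFT_MAX_COLUMNS_PER_TABLE > 0.9
        ),
    }


# A cluster with 9,000 tables is already over 90% of the table cap
print(scaling_headroom(9_000, 1_200))
```

If a check like this fires regularly, that is a strong signal the tool's scaling model does not fit your growth curve.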
Many big data use cases call for adding functionality to off-the-shelf solutions. So you may want to select solutions that support scripting languages like R or Python. You may also want open APIs as well as support for RESTful APIs that let solutions consume external cloud-based services. Open source solutions tend to be more extensible by definition.
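As a taste of what that RESTful extensibility looks like, here is a hedged sketch of calling an external cloud service from Python using only the standard library. The endpoint, token, and response shape are hypothetical; substitute your vendor's actual API.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/enrich"  # hypothetical endpoint


def build_request(record: dict, token: str) -> urllib.request.Request:
    """Build an authenticated JSON POST for the (hypothetical)
    enrichment service."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def parse_response(body: bytes) -> dict:
    """Decode the service's JSON reply into a plain dict."""
    return json.loads(body.decode("utf-8"))


req = build_request({"customer_id": 42}, token="...")
print(req.get_method())                    # POST
print(parse_response(b'{"score": 0.87}'))  # {'score': 0.87}
```

A solution with open APIs lets you bolt on this kind of glue code; a closed one forces you to wait for the vendor to build the integration.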
If you’re going to start running solutions that handle a lot of data, particularly if it’s sensitive (like patient data), you want to make sure elements of your infrastructure — like networks, firewalls, system software, server resources — are sufficiently robust, balanced, and partitioned. Otherwise, you may face performance and compliance issues both within the big data use case itself and in other IT operations as well.
If you need immediate phone or online chat support (“Why isn’t this script running?”), then you probably want a proprietary solution that offers it. Open source solutions probably won’t. On the other hand, open source solutions can’t exist without huge communities of experts ready to volunteer their help. So there tend to be lots of community chat rooms, searchable discussion threads, and other extensive resources online that you can leverage for free, provided you have the technical expertise and the time.
If your organization is looking for a new tool to support a big data use case, it may be because things aren’t working out with your existing tools — or you expect that they won’t — perhaps for one of the reasons cited in these ten questions. Thoroughly understanding your use case first, and the functionality you need to support it second, are the important first steps in selecting any big data solution. Once that’s done you are well on your way to filling in the gaps. If you would like to learn about a specific technology, like Redshift clusters, more information is available.