Long before there was big data, there were computing solutions designed to store, organize, and analyze data. But just as the job of a big data architect is both bigger than and different from that of other data architects, so too is the
As with almost any product, one useful way to approach selection criteria is simply as a list of questions. So here’s our top 10:
- Do you need a big data solution?
Consider adopting a big data solution if your use case requires
- What is your big data use case?
Before you can figure out what to buy, you need to
- What specific functionality is required?
A good way to answer this question is to identify the
- What skills does your team have (or will need to acquire)?
Again, if you don’t have a data scientist or engineer on board already, you likely don’t need a big data solution. The next most important skill — the one most analysts spend the most time on — is preprocessing, i.e., integrating different formats and cleaning the data. Big data usually means
- How well does the solution play with others?
One way to overcome the preprocessing bottleneck is with a tool like Panoply that does a lot of that integration and data cleaning automatically. Another way is with tools that all support the same formats out of the box. If users do the data preprocessing manually, R and Python can handle virtually any format or source (e.g., SAS files, CSV files, MongoDB collections). The human factor should also
- Open source or proprietary?
On this one, a lot depends on your users. Those who like to code may prefer open source tools because they can “look under the hood” and tweak the software. Others may prefer the drag-and-drop style more likely found in proprietary tools. Other factors to consider are extensibility and type of support (see questions 8 and 10).
- How scalable?
Do you expect your needs to grow, for example in storage, table size, or compute capacity? If so, how much, how fast, and how easily will the solution you’re considering scale? One
- How extensible?
Many big data use cases call for adding functionality to off-the-shelf solutions. So you may want to select solutions that support scripting languages like R or Python. You may also want open APIs as well as support for RESTful APIs that let solutions consume external cloud-based services. Open source solutions tend to be more extensible because their source code can be inspected and modified.
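To make the extensibility point concrete, here is a minimal Python sketch of a script that enriches records by calling an external RESTful service. The endpoint URL, the `text` request field, and the `score` response field are all hypothetical, chosen only for illustration; a real service will define its own contract, authentication, and error handling.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration.
SCORING_URL = "https://api.example.com/v1/score"


def build_request(url, payload):
    """Build a JSON POST request for an external REST service."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def http_send(request, timeout=10.0):
    """Send the request over the network and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def enrich(records, send=http_send, url=SCORING_URL):
    """Return copies of the records with a 'score' field added by the service.

    `send` is injectable so the logic can be exercised without a network,
    for example with a stub that returns a canned JSON reply.
    """
    enriched = []
    for record in records:
        request = build_request(url, {"text": record["text"]})
        reply = json.loads(send(request).decode("utf-8"))
        # Assumes the (hypothetical) service replies with {"score": <number>}.
        enriched.append({**record, "score": reply["score"]})
    return enriched
```

In practice you would call `enrich(rows)` against a real service. Passing a stub as `send` lets you check the script before any external dependency is in place, which is worth doing when you evaluate how easily a solution can be extended.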
- Is your infrastructure big data ready?
If you’re going to start running solutions that handle a lot of data, particularly if it’s sensitive (like patient data), you want to make sure that elements of your infrastructure, such as networks, firewalls, system software, and server resources, are sufficiently robust, balanced, and partitioned. Otherwise, you may face performance and compliance issues both within the big data use case itself and in other IT operations.
- What about support services?
If you need immediate phone or online chat support (“Why isn’t this script running?”), then you probably want a proprietary solution that offers this. Open source solutions probably won’t. On the other hand, popular open source solutions usually depend on large communities of experts ready to volunteer their help. So there tend to be lots of community chat rooms, searchable discussion threads, and other extensive resources online that you can leverage for free, provided you have the technical expertise and the time.
If your organization is looking for a new tool to support a big data use case, it may be because things aren’t working out with your existing tools — or you expect that they won’t — perhaps for one of the reasons cited in these