This is the first part in a 101 series covering Big Data concepts, terminology and technology, starting with the data warehouse.
Over the past few months I have pored over a wide variety of articles and sources on data warehousing and the paraphernalia of concepts and technologies associated with big data. Despite the fact that I consider myself to be technologically inclined, the vast majority of them left me with a mild headache and 17 new tabs opened for each source read. I’m not a big fan of explanations that leave you more confused than when you began, so while I’m without a doubt a data padawan, I’ll nonetheless take a stab at explaining some of the concepts and technologies in this space, starting with data warehousing.
A few of the “beginner” articles I read on data warehousing explained the concept by comparing it to brick ‘n’ mortar warehouses. While the comparison holds true, scale gets lost in translation. When most of us think of a warehouse we think of the self-pickup zone in IKEA or packing blessing boxes. The point is that we scale back to something that we can wrap our minds around. We simplify the problem and frame it from the perspective of our experience. We just don’t think of warehouses on the scale of football fields. So instead of attempting to describe a data warehouse as a warehouse full of boxes, let’s put everything in the context of a library.
A data warehouse is an organizational methodology, not a technology. Today we have technologies like Redshift, Hadoop, etc., that power data warehouses, but if you strip all that away, it is methodology that differentiates data warehouses from databases. A building with piles of books in it is not a library. A building with books (data) and shelves (tables), labeled with call numbers (metadata), organized by a classification system (indexing) and searchable (queries) by a catalog (keys) is a library. Once upon a time we didn’t even use technology to do all this. We just had row upon row of index cards. It is its architecture that makes a data warehouse more than a collection of databases.
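To make the analogy a little more concrete, here is a minimal sketch of the library as a tiny database, using Python's built-in sqlite3 module. Everything in it is my own illustration: the table name `books`, the column names, the call numbers and the titles are all made up, and a real warehouse would of course be far larger and would not run on SQLite.

```python
import sqlite3


def build_library():
    """Build a toy in-memory 'library'. All names and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    # The shelf (table): every book has a call number that acts as its key.
    conn.execute(
        "CREATE TABLE books ("
        "call_number TEXT PRIMARY KEY, "
        "title TEXT NOT NULL, "
        "subject TEXT NOT NULL)"
    )
    # The classification system (index): lets us look books up by subject.
    conn.execute("CREATE INDEX idx_books_subject ON books (subject)")
    # The books themselves (data); made-up examples.
    conn.executemany(
        "INSERT INTO books VALUES (?, ?, ?)",
        [
            ("QA76.9 D3", "Database Basics", "databases"),
            ("QA76.9 D37", "Warehouse Design", "databases"),
            ("PR4034 E5", "Emma", "fiction"),
        ],
    )
    conn.commit()
    return conn


def find_by_subject(conn, subject):
    """The catalog search (query): return (call_number, title) pairs."""
    rows = conn.execute(
        "SELECT call_number, title FROM books "
        "WHERE subject = ? ORDER BY call_number",
        (subject,),
    )
    return rows.fetchall()
```

Calling `find_by_subject(build_library(), "databases")` returns the two database books in call-number order. Take away the key, the index and the query, and all that is left is a pile of books.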
Now let’s look at the difference between a bookstore and a library. In other words, a building with books in it that is organized for transactional purposes versus one organized for analytical purposes. Bookstores are optimized to help you find what you want and then buy it. So when a book is scanned at the register, one simple thing happens: a transaction. The more people you have in line, the longer it takes for transactions to be processed and the higher the chance that a customer will abandon their purchase. For that exact reason a manager wouldn’t start tallying each cashier’s transactions during work hours. The same premise holds true for transactional databases: the more non-optimized operations you run on them, the slower their core operations become. The main operation of a library, just like that of an analytical data warehouse, is the query. Both are organized explicitly and optimally for this purpose.
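The two kinds of work in the bookstore can be sketched in a few lines of Python with the built-in sqlite3 module. This is only an illustration under my own assumptions: the `sales` table, its columns and the function names are hypothetical, and one small SQLite database stands in for what would be separate systems in real life.

```python
import sqlite3


def open_store():
    """Create a toy in-memory sales ledger. Table and columns are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "id INTEGER PRIMARY KEY, "
        "cashier TEXT NOT NULL, "
        "title TEXT NOT NULL, "
        "price_cents INTEGER NOT NULL)"
    )
    return conn


def record_sale(conn, cashier, title, price_cents):
    """Transactional work: one scan at the register writes exactly one row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales (cashier, title, price_cents) VALUES (?, ?, ?)",
            (cashier, title, price_cents),
        )


def tally_by_cashier(conn):
    """Analytical work: reads every sale to build a per-cashier summary."""
    rows = conn.execute(
        "SELECT cashier, COUNT(*), SUM(price_cents) "
        "FROM sales GROUP BY cashier ORDER BY cashier"
    )
    return rows.fetchall()
```

`record_sale` touches a single row, while `tally_by_cashier` has to read the whole table. That second kind of query is the manager's tally, and it is the one you would rather run somewhere other than the register.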
The Coralville community library and the Library of Congress were not created equal. Yes, they both have books and yes, they both are libraries, but their organizational methodologies are different. Marts, warehouses and lakes are organizational concepts for data. They are similar yet not identical. For simplicity’s sake let’s take a look at one library with a mart, a warehouse and a lake. A library can be made up of several departments (main, reference, sorting, media, storage, etc.), depending on its size and needs. Likewise, analytical data can be organized into marts, warehouses, lakes or any hybrid of them.
We already explained warehouses, our main library, so let’s skip ahead to reference departments, our data marts. Some books are in such demand that not only will we not let them be checked out, we will also give them their own special section with its own separate catalog in order to make them easier to access. This is a data mart: data that is so important, such as the last 24 hours of a store’s activity, that we allocate it a separate section of our data infrastructure. By doing so we are able to isolate real-time analytical activity and respond in real time.
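Here is a rough sketch of the "last 24 hours" mart in Python with the built-in sqlite3 module. The table names `warehouse_sales` and `mart_last_24h`, the columns and the functions are all hypothetical, and I am assuming timestamps are stored as plain seconds. In practice a mart would often live on separate infrastructure; here it is just a separate table.

```python
import sqlite3

DAY_SECONDS = 24 * 60 * 60


def build_warehouse(sales):
    """Load (sold_at, title, price_cents) rows into a toy warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE warehouse_sales ("
        "sold_at INTEGER NOT NULL, "  # timestamp in seconds (assumption)
        "title TEXT NOT NULL, "
        "price_cents INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)", sales)
    conn.commit()
    return conn


def refresh_mart(conn, now):
    """Rebuild the mart so it holds only sales from the 24 hours before `now`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mart_last_24h ("
        "sold_at INTEGER NOT NULL, "
        "title TEXT NOT NULL, "
        "price_cents INTEGER NOT NULL)"
    )
    with conn:  # clear and refill in one transaction
        conn.execute("DELETE FROM mart_last_24h")
        conn.execute(
            "INSERT INTO mart_last_24h "
            "SELECT sold_at, title, price_cents FROM warehouse_sales "
            "WHERE sold_at > ?",
            (now - DAY_SECONDS,),
        )


def mart_revenue(conn):
    """Answer a 'right now' question from the mart without touching the warehouse."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(price_cents), 0) FROM mart_last_24h"
    ).fetchone()
    return total
```

Once `refresh_mart` has run, questions about today's activity only ever read the small mart table, which is the reference desk with its own catalog.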
Many libraries also have storage facilities. Duplicate volumes, unsorted material, etc., are held there until needed or sorted. These are our data lakes. We more or less know what is stored there, but for a variety of reasons we have yet to restructure it to fit into the organizational system that we are using. Organizationally, data lakes contain vast amounts of structured, semi-structured and unstructured data.
Data infrastructure is a concept: it is optimized for analytics, powered by technology, and can be built from any mix or hybrid of these organizational concepts. It all depends on the business, its needs and the type of data at its fingertips.
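As a small sketch of the storage room idea, the Python snippet below keeps raw items exactly as they arrived and only later sorts what it can into a catalog. It is purely illustrative: the function name, the `title` and `subject` fields and the rule "only JSON objects with a title get cataloged" are my own assumptions, not a standard.

```python
import json


def sort_into_catalog(lake):
    """Split raw lake items into cataloged records and still-unsorted material.

    `lake` is a list of raw strings in whatever shape they arrived in.
    Only JSON objects with a "title" field are cataloged (an assumption
    made for this sketch); everything else stays in storage untouched.
    """
    catalog, unsorted = [], []
    for raw in lake:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            unsorted.append(raw)  # not JSON: leave it in storage as-is
            continue
        if isinstance(record, dict) and "title" in record:
            catalog.append(
                {
                    "title": record["title"],
                    "subject": record.get("subject", "unknown"),
                }
            )
        else:
            unsorted.append(raw)
    return catalog, unsorted
```

Given a mix of JSON strings, comma-separated lines and free-text notes, the function catalogs the JSON records and hands everything else back unchanged, which is roughly what "we have yet to restructure it" means in practice.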
The next part in this series will discuss types of data, including data sources, structured vs. unstructured data, ETL processes and how they impact your data infrastructure.