This is the second part in a 101 series covering Big Data concepts, terminology and technology.
By this point it should be clear that there is much more to a data driven company than technology. I remember complaining throughout high school that solving for X was an exercise in stupidity as it had no real world applications. I promptly ate that shoe later in life when I took my first finance class and discovered that while perhaps high school had nothing to do with the real world, X most certainly does. The second shoe fell upon discovering that all the equation sheets and financial calculators in the world mean precisely squat if you do not understand how to apply the theory. As in math so in data; It is your understanding of which theory and methodology will best serve your business’s needs that will dictate your data strategy’s success or failure.
We’ve all been taught that in order to solve equations we need to reduce them to the lowest common denominator, balance them and solve for variables, and data is no different. In mathematics, in order to evaluate expressions for rigor a precise language of notations must be agreed upon. In data, in order to analyze, discrepancies in naming conventions, structure, etc. must be resolved. Today we know how to solve many of these data equation’s but not all of them. In order to understand why, there are two fundamental truths that must be recognized. First, we are working with data on a scale that makes it impossible to manually and/or intuitively arrive at conclusions and/or recognize patterns. To do so we must use machines. Period. Second, machines and humans do not speak the same language. There is a huge difference between reading and understanding and machines ability to understand is limited by our ability to provide them with vocabulary, grammar and syntax.
Structured data is the easiest type of data for machines to interact with as they have a high degree of organization. Data resides in fixed fields within records or files according to its data model. Which defines what fields of data will be stored, how that data will be stored, and any restrictions on the data input. The obvious forms of structured data are relational databases; a model where data is organized into one or more tables. Therefore, there is a relation between the columns and the rows and relations can also be created between tables using foreign keys. Foreign keys are what makes the database understandable in the same context that a human being knows that two tables are related to each other. Excel, can be an exception. Theoretically, excels are relational tables but because there are no restrictions on the method of input they can easily turn into semi structured data that needs to be normalized for machine reading. Simple things such as changing a column name from “Monthly Average Users” to “MAU” can break the table relationships.
Unstructured data has not been organized into a format that makes it easier to access and process. Photos and graphic images, videos, PDF files, PowerPoint presentations, emails, etc. are more akin to how we speak than how machine language. Human beings do not communicate and think in terms of databases. The vast majority of our communication is textual and at times riddled with discrepancies, omissions, misspellings and 101 other things that we intuitively understand yet would stump a computer.
Given that unstructured information will account for 90% of all data created over the next decade, it is no surprise that companies are investing in NLP (Natural Language Processing) and data mining technologies in order to tap into, and one day fully interpret, this well of data.
Semi structured data is somewhere in the middle (Hence the name…). In the case of XML, JSON and object oriented databases they are not truly structured because they are flexible. Let’s look at the following two tweets and see how their structure differs between relational table and a JSON.
The main difference is that in a relational table fields will continue to be written regardless of null values while JSON will write only relevant values. Both structures are correct but they have different implications on storage and compute. Generalizing, if you want to know a specific value your query will run faster on JSON but if you want to know the sum of a column of values your query will run faster on a table.
Extract, Transform, and Load
ETL refers to a process in database usage and especially in data warehousing that extracts data from data sources, transforms the data for storing it in the proper format or structure for the purposes of querying and analysis and loads it into the final target destination.
The extraction process is mainly focused on getting all the data into a single format, balancing the equation, and the load process deals with uploading all the data to the destination, usually a data warehouse. On the point of transformation, I am going to veer off from traditional ETL definition. Note the following sentence, “…transforms the data for storing it in the proper format or structure for the purposes of querying and analysis…”, in essence, to load data to your warehouse you must first understand the business logic that drives your queries and analysis and apply it to your data preload. Table joins, aggregations, value translations must all be addressed in advance. The glaring flaw apparent here is the assumption that a) business logic can be fully understood in advance, b) business logic is fixed. Suffice to say that neither is true. The result is a patchwork of unaligned extractions and transformations that tend to break with every change in business logic and technical update of data sources.
Getting data into the warehouse can be a tedious headache but it is the assumption that transformations must be performed in advance that makes data warehouse management an ongoing migraine.
Data infrastructure is a concept, optimized for analytics, powered by technologies which can be constructed from any mix or hybrid mix of organizational concepts. Now we also understand the different types of data that organizations have at their beck and call and if they can be analyzed in data warehouses.
The next part in this series will discuss how analysis, BI and data visualization can impact your data infrastructure. Or more importantly why they shouldn’t.