One of the significant challenges to data-driven companies comes when integrating different application systems.
Major public cloud providers, such as Amazon or Google, offer robust products that are ready for your analytics needs. But when it comes to exploring your data, things might not be that simple.
Your source data often comes from files with unknown formats of data, which makes an analyst's job a nightmare.
Some cases could be more fluid in this integration, such as when your data has nested values. Exploding complex structures, such as JSON files, into a tabular format, can consume most of your time when you're exploring new data.
That's where Spectrum and Athena shine. They let you use SQL to analyze data without changing it from the source. There's no need for complex Python code if you don't want to use it on initial data profiling tasks.
Pretty neat, right?
This article will show you how to explore your data on Amazon S3 using Athena and Redshift Spectrum. Below, you'll find the necessary steps to create a table on the AWS Glue catalog and use it to access your data in Amazon S3.
How do Spectrum, Athena & S3 fit together?
While Amazon's data products aren't quite as extensive as its famous ecommerce shop, there's still a lot going on.
Spectrum is a Redshift component that allows you to query files stored in Amazon S3.
Your team can narrow its search by querying only the necessary columns for your analysis. Also, it's possible to consult existing tables from your Redshift cluster, which means that instead of querying the full table all the time, you can select the required columns for your report using SQL.
So when you're querying your data, you get only the needed columns from your data instead of returning unnecessary fields and rows. The also opens up the possibility of querying data stored directly on Amazon S3.
Athena makes it easier to create shareable SQL queries among your teams—unlike Spectrum, which needs Redshift. You can then create and run your workbooks without any cluster configuration.
Athena makes it possible to achieve more with less, and it's cheaper to explore your data with less management than Redshift Spectrum.
Amazon S3 is the managed object storage option that Amazon offers. It's the best option to store your semi-structured data, such as server logs from your applications.
S3 also allows "delete protection" and "version control" of your objects, making your data safer and easier to track back to its original source.
How to create tables from files
Now that you have an overall idea of each product, it's time to get your hands dirty and create some tables!
We'll also assume that you have your Redshift cluster ready with the necessary IAM roles attached to it (when using Redshift Spectrum).
Okay, so far, so good! Let's move on to creating the tables.
First, you need to create the database where the tables will be stored.
For this tutorial, we'll count on AWS Glue Data Catalog for this job. Just remember that other options are available, such as the Hive metastore.
Glue Data Catalog is a better option if you want to have fluid integration with additional data sources without starting extra services. This is a good resource for details about the different options for your needs.
The script below creates a database on Athena if it doesn't exist already:
CREATE DATABASE IFNOT EXISTS ext_data_suppliers COMMENT 'Landing Zone for S3 buckets loaded by external Data Suppliers' LOCATION 's3://test-12343210/';
Here's the Spectrum version of it:
create external schema ext_data_suppliers from data catalog database 'ext_data_suppliers' iam_role 'arn:aws:iam::123456789012:role/RSRoleApiData' create external database ifnot exists;
As you can see in both cases, your code will create a Glue catalog database if one doesn't exist already.
Once you have that, you'll need the table definition, which will let you query the data directly from the file.
At this stage, I recommend not doing any transformations on the data because a minor modification—even a simple datatype conversion—can result in the loss of data. So let's avoid that, especially in the early stages.
Now, let's create a table definition that'll contain the data.
Below, you can see the Athena version:
CREATE EXTERNAL TABLE IF NOT EXISTS ext_data_suppliers.zillow_sample_file ( `index` int, `liv_space_in_sqft` int, `beds` int, `baths` int, `zip` int, `year` int, `list_price_in_usd` int ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'field.delim' = ',') LOCATION 's3://test-12343210/' TBLPROPERTIES ('has_encrypted_data'='false');
And here's the Spectrum version:
CREATE EXTERNAL TABLE ext_data_suppliers.zillow_sample_file ( index int, liv_space_in_sqft int, beds int, baths int, zip int, year int, list_price_in_usd int ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'field.delim' = ',') LOCATION 's3://test-12343210/';
So, which should you choose: Spectrum or Athena?
Most data from data suppliers isn't optimized to answer the questions you may have. As this data is in a raw state, tools like Athena or Spectrum will make your exploration tasks much more straightforward.
Both tools allow you to explore your data without loading it into a database. All you need to do is say what your data structure is and where it resides. After that, you're good to go—no more delays on your data pipelines to start creating your dashboards. As your data becomes available on your Amazon S3 bucket, your team can consume it right away.
My advice? Choose Athena if you don't have a Redshift cluster already in place.
With Athena, it becomes easier to create shareable queries among your team without managing extra services and raising your cloud bill unnecessarily.
Summing it up and going deeper
To recap, we've covered 2 important topics:
- The benefits of having a data exploration tool that allows your analysts to run SQL commands on top of your object-storage-type solution.
- Running SQL commands on files stored in Amazon S3, using Athena and Redshift Spectrum.
As you saw, both scripts are very similar. In both, we used serialization/deserialization (
SerDe for short) to create a table-like structure correctly. This structure lets you access the data in the CSV format without loading it into native tables.
Here's another thing I'd like you to remember: Grant only necessary permissions to services. Narrowing access to your services will help you sleep better.
How Panoply can help
Tools such as Athena and Redshift Spectrum can be priceless from an analytical standpoint. But these tools have some limitations concerning data access and fewer integration options, to name a few.
By using Panoply, you can enhance these capabilities with your cloud data platform solution. You can enable more data sources thanks to more built-in connectors, which means fewer worries as you manage your data pipeline.
In addition, you're in charge of who has access to which object in a hassle-free environment, deployed smoothly with little coding required. Thanks to this centralized control, your data exploration tasks get done with fewer concerns about the nitty-gritty of how things work under the hood.
If that perks up your ears, you should consider a 14-day free trial of Panoply.