Connecting Tableau to AWS DynamoDB
One thing we see a lot of here at Panoply is customers looking to use Tableau to analyze data stored in DynamoDB instances. If you’ve ever looked into this, or tried it yourself, you probably know that it’s not particularly easy. Tableau doesn’t currently natively support connecting to DynamoDB, so those looking to make this happen will have to track down an ODBC driver that connects DynamoDB to Tableau, install it, and get everything in place before they can get started with analysis.
How we do it: the easy way
If you’re a Panoply user, though, things are a bit easier. Because Panoply has native connectors for DynamoDB and built-in support for Tableau, the process is pretty streamlined in comparison. We’re going to lay it out in a step-by-step guide below; if you want to follow along, make sure you track down your AWS Access Key and AWS Access Secret before getting started. If you’ve got one you are ready to use for this, skip to Pulling data from DynamoDB into Panoply below. If you’d like a quick walkthrough on how to set up a secure connection between DynamoDB and Panoply using IAM roles, we’re going to lay that out next.
Creating an IAM role for Panoply
If you want to generate a set of Panoply-specific AWS Access Keys and AWS Access Secrets (recommended), head to the AWS IAM console and choose Users from the navigation pane. Next, choose Add user in the upper left (For the purposes of this guide, we’ll just create a new user on the account, but you can always add these privileges to an existing user if you’d like).
Choose a name for the new account and check the box for programmatic access so that you can generate an access key ID and secret access key.
After you’ve entered the name you’d like to set for the account, click Next: Permissions. You can now add this user to an existing group, set up a new group for the Panoply user account, or just add IAM policies directly to the account you created for this guide. We’ll choose Attach existing policies directly from the options up top. Filter the existing policies by entering DynamoDB into the search box, then select the policy for read only access to DynamoDB as pictured below:
Click Next and set tags on the user if you want, but we’ll skip that step for now, bringing us to the Review step. If everything looks right, click Create user in the bottom right.
The next page will show your access key ID and secret access key. Make sure to store them securely (i.e. by downloading as a .csv), as you won’t be able to get them from Amazon again. You can always generate new keys, but you won’t be able to recover these particular keys if you lose track of them. Your keys should look something like this:
- Access key ID: AKIAIOSFODNN7EXAMPLE
- Secret access key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Keep these keys safe, and don’t share your secret key with anyone outside your organization!
Pulling data from DynamoDB into Panoply
The first step is to connect your DynamoDB instance to Panoply (note: if you’ve already done this, feel free to skip ahead to a later section). From your Panoply dashboard, click on Data Sources in the left pane, then hit Add Data Source in the upper right:
Which will take you into the data source selection section. DynamoDB will be under Databases (in the leftmost column):
Selecting DynamoDB will open a new pane, which will require you to enter your credentials:
At this point, you might also need to set up IP whitelisting, depending on how your AWS access settings are configured. Restricting the IP ranges that are allowed to connect to your AWS instances is a straightforward way of managing your databases’ security, and if you’re following standard security practices with your AWS instance, you’ll definitely need to set up IP whitelisting. Whitelisting Panoply’s IP range will add Panoply to the list of approved connections for your database.
See the link above for the IP ranges Panoply’s data connectors use, and enter those in your DynamoDB instance’s IP whitelist.
When you’re all set, hit Next, choose the tables you want to import, and set any other advanced options. Note that you can take advantage of Panoply’s job parallelization features here to set up multiple, separate imports. So, for example, rather than importing every single table in one go, you might find it more efficient to set up multiple connectors for different segments of your data. Since we’re using Amazon’s movie sample data for this guide, we’ll select that in the next pane:
If you expand the Advanced options, you’ll see that you can fine-tune the data import by indicating a destination, primary key, and incremental key.
- Destination: allows you to give a name to the table that will be created in Panoply after importing, or a table-specific prefix if you’re planning on importing multiple tables
- Primary Key: allows you to set a primary key that will act as a unique identifier for each record in your table(s), and will also allow you to link related data across tables. If you set a primary key when setting up your DynamoDB instance, you can just re-use that here. This is the most important Advanced setting to pay attention to. If you already have a field in your tables called “id” or something similar, Panoply will use it as a primary key if no other parameter is set. If you don’t set a primary at all, you won’t see any deduplication in your data imports, so every single piece of data from the collection run will be added to your tables--you might not want that.
- Incremental Key: make sure to set this up if you’re planning on making multiple, recurring pulls from this data source. Doing so will make future imports more efficient, as Panoply will only collect data that has been updated since the last time you pulled from that source. Note that incremental keys will only work properly if you are pulling from a single table with this connector.
- Exclude: you can use this field to exclude specific data elements from your import. If there are specific types of data (e.g. irrelevant or sensitive data) that you would like to exclude from your Panoply data warehouse, you can use this setting to manage that.
When you’re all set, hit Collect. The Data sources - DynamoDB pane will go gray, and a green bar will appear above it to indicate progress. You can or check the “Jobs” pane to monitor its progress if you want, but you’’ll get an alert from Panoply once the collection has finished or check the “Jobs” pane to monitor its progress, but if you need to get data from another source while you’re waiting, you can just start that up as well. When your data collection is finished, head to Tables on the navigation pane to check up on your data. Because we used Amazon’s movie sample data on our end, our data went from a JSON object containing data that looked like this:
To a set of tables that look like this:
Panoply will often create multiple tables from a single table in DynamoDB because of the way it’s set up to deal with nested data. In order to flatten the nested objects that make up a NoSQL database, Panoply will create multiple, related tables to fit into the relational database structures it uses.
Connecting Tableau to your DynamoDB data
Now that you’ve collected the data from your DynamoDB and gotten it sorted into tables in Panoply, you can start in on analysis. To do that, you’ll need to connect Tableau to your Panoply warehouse. In this case, we’ll use Tableau Public, but the process should be the same if you’re using the paid version.
Finding your Panoply connection parameters
First, you’ll need to get the hostname, port and authentication details that Tableau will use to connect to your Panoply warehouse. You can find this by clicking Connect on the navigation pane in Panoply, which will bring up your warehouse’s details, like so:
Setting up the connection in Tableau
Next, you’ll need to enter this information on your Tableau client, so that it can access the data you want to analyze. There are a variety of ways you can get to the add new data source dialog in the Tableau Public client, but the most straightforward is to select Data > New Data Source from the menu:
Next, choose Amazon Redshift. This will bring up a dialog box where you can enter the data you got from the Panoply end:
This will take you to a section in the Tableau client where you can select which data you’d like to work with. Select the Public schema on the left hand side, then drag the tables you’d like to analyze to the top top pane (indicated by the text “Drag tables here”). You should see the tables you selected laid out in a relational graph now. Since we used Amazon’s movie sample data on our end, our relational graph will look like this:
Tableau will automatically set up joins for your various tables, but you may have to tweak them depending on how you have your keys set up. So, for instance, if you’re using Amazon’s movie sample data like we did, you’ll want to make sure that the keys in the two different tables match. In our example data warehouse, that means using “Moviedata Info Id” as the key on which to join the two tables.
Analyzing your DynamoDB data in Tableau
Now that you’ve got your DynamoDB > Panoply > Tableau pipeline all set up, you can start getting busy on the analysis side of things (i.e. the fun part). Let’s dig into the movie data sample set a little bit more. Suppose you had a feeling that movies were getting longer, and wanted to investigate it. Let’s take a look at average movie length by year in our sample dataset. Once you’ve connected Panoply to Tableau and selected the tables you want to pull your data from, you should have the option of starting a new Worksheet. You might even see a little popup in the lower left corner, like this:
Go to that worksheet. In the side pane, you’ll see your tables laid out with the various fields they contain. Drag your two data fields into the sheet to collect the data so you can chart it. We want to look at average running time for movies in the data set, so we’ll have to change the default measure (SUM) to AVERAGE.
Let’s look at this data as a continuous line graph:
Ok, so it looks like we’re actually seeing a downward trend in average running time of movies since a peak in the late 70s--they’re getting shorter, not longer. And what’s going on with that dip in the early 2010s? But let’s suppose we don’t put a ton of faith in the average as a good descriptor of the data (we shouldn’t). What if we wanted to get a fuller picture of what the distribution of running times looked like? Let’s make a box plot. In order to do this, we’ll need to disaggregate the running times measure, so that each individual running time can be binned by year. That will produce something like this:
First, you can see that the mystery of the early 2010s dip is solved--it’s just an artifact caused by the fact that there’s only a single movie from the year 2012 in our dataset, and it seems to have had a relatively short running time compared to the rest of the set. Secondly, looking at median running times, it still looks like movies are getting shorter overall since the late 70s. Of course, the data isn’t fully balanced across decades, so you might want to refine your analysis even further and dig deep on these trends, but that’s all the time we have for this guide for now. Stay tuned for more guides to analyzing other data sources using Panoply and your favorite BI tool.
Ready to get started? Try Panoply for free for 21 days.