Connecting Power BI to AWS DynamoDB
One thing we see a lot of here at Panoply is customers looking to use Power BI to analyze data stored in DynamoDB instances. If you’ve ever looked into this, or tried it yourself, you probably know that it’s not particularly easy. Power BI doesn’t currently natively support connecting to DynamoDB, so those looking to make this happen will have to track down an ODBC driver that connects DynamoDB to Power BI, install it, and get everything in place before they can get started with analysis.
How we do it: the easy way
If you’re a Panoply user, though, things are a bit easier. Because Panoply has native connectors for DynamoDB and built-in support for Power BI, the process is pretty streamlined in comparison. We’re going to lay it out in a step-by-step guide below; if you want to follow along, make sure you track down your AWS Access Key and AWS Access Secret before getting started. If you’ve got one you are ready to use for this, skip to Pulling data from DynamoDB into Panoply below. If you’d like a quick walkthrough on how to set up a secure connection between DynamoDB and Panoply using IAM roles, we’re going to lay that out next.
Creating an IAM role for Panoply
If you want to generate a set of Panoply-specific AWS Access Keys and AWS Access Secrets (recommended), head to the AWS IAM console and choose Users from the navigation pane. Next, choose Add user in the upper left (For the purposes of this guide, we’ll just create a new user on the account, but you can always add these privileges to an existing user if you’d like).
Choose a name for the new account and check the box for programmatic access so that you can generate an access key ID and secret access key.
After you’ve entered the name you’d like to set for the account, click Next: Permissions. You can now add this user to an existing group, set up a new group for the Panoply user account, or just add IAM policies directly to the account you created for this guide. We’ll choose Attach existing policies directly from the options up top. Filter the existing policies by entering DynamoDB into the search box, then select the policy for read only access to DynamoDB as pictured below:
Click Next and set tags on the user if you want, but we’ll skip that step for now, bringing us to the Review step. If everything looks right, click Create user in the bottom right.
The next page will show your access key ID and secret access key. Make sure to store them securely (i.e. by downloading as a .csv), as you won’t be able to get them from Amazon again. You can always generate new keys, but you won’t be able to recover these particular keys if you lose track of them. Your keys should look something like this:
- Access key ID: AKIAIOSFODNN7EXAMPLE
- Secret access key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Keep these keys safe, and don’t share your secret key with anyone outside your organization!
Pulling data from DynamoDB into Panoply
The first step is to connect your DynamoDB instance to Panoply (note: if you’ve already done this, feel free to skip ahead to a later section). From your Panoply dashboard, click on Data Sources in the left pane, then hit Add Data Source in the upper right:
Which will take you into the data source selection section. DynamoDB will be under Databases (in the leftmost column):
Selecting DynamoDB will open a new pane, which will require you to enter your credentials:
At this point, you might also need to set up IP whitelisting, depending on how your AWS access settings are configured. Restricting the IP ranges that are allowed to connect to your AWS instances is a straightforward way of managing your databases’ security, and if you’re following standard security practices with your AWS instance, you’ll definitely need to set up IP whitelisting. Whitelisting Panoply’s IP range will add Panoply to the list of approved connections for your database.
See the link above for the IP ranges Panoply’s data connectors use, and enter those in your DynamoDB instance’s IP whitelist. When you’re all set, hit Next, choose the tables you want to import, and set any other advanced options. Note that you can take advantage of Panoply’s job parallelization features here to set up multiple, separate imports. So, for example, rather than importing every single table in one go, you might find it more efficient to set up multiple connectors for different segments of your data. Since we’re using Amazon’s movie sample data for this guide, we’ll select that in the next pane:
If you expand the Advanced options, you’ll see that you can fine-tune the data import by indicating a destination, primary key, and incremental key.
- Destination: allows you to give a name to the table that will be created in Panoply after importing, or a table-specific prefix if you’re planning on importing multiple tables
- Primary Key: allows you to set a primary key that will act as a unique identifier for each record in your table(s), and will also allow you to link related data across tables. If you set a primary key when setting up your DynamoDB instance, you can just re-use that here. This is the most important Advanced setting to pay attention to. If you already have a field in your tables called “id” or something similar, Panoply will use it as a primary key if no other parameter is set. If you don’t set a primary at all, you won’t see any deduplication in your data imports, so every single piece of data from the collection run will be added to your tables--you might not want that.
- Incremental Key: make sure to set this up if you’re planning on making multiple, recurring pulls from this data source. Doing so will make future imports more efficient, as Panoply will only collect data that has been updated since the last time you pulled from that source. Note that incremental keys will only work properly if you are pulling from a single table with this connector.
- Exclude: you can use this field to exclude specific data elements from your import. If there are specific types of data (e.g. irrelevant or sensitive data) that you would like to exclude from your Panoply data warehouse, you can use this setting to manage that.
When you’re all set, hit Collect. The Data sources - DynamoDB pane will go gray, and a green bar will appear above it to indicate progress. You can check the “Jobs” pane to monitor its progress if you want, but you’ll get an alert from Panoply once the collection has finished or . If you need to get data from another source while you’re waiting, you can just start that up as well. When your data collection is finished, head to Tables on the navigation pane to check up on your data. Because we used Amazon’s movie sample data on our end, our data went from a JSON object containing data that looked like this:
To a set of tables that look like this:
Panoply will often create multiple tables from a single table in DynamoDB because of the way it’s set up to deal with nested data. In order to flatten the nested objects that make up a NoSQL database, Panoply will create multiple, related tables to fit into the relational database structures it uses.
Connecting Power BI to your DynamoDB data
Now that your data is all loaded into Panoply and organized into standardized columnar tables, you’ll find it’s a lot easier to analyze with a BI tool like Power BI. Connecting Power BI to your Panoply data warehouse is a pretty simple process, since it uses Power BI’s native Redshift connector.
Finding your Panoply connection parameters
Before you start, though, you’ll need to gather your connection details from Panoply. You can find this by clicking BI Connection on the navigation pane in your Panoply dashboard, which will open a new pane with all of the relevant connection details, like so:
Note down the Host name, Database name, Port and User/password combination.
Setting up the connection in Power BI
From Power BI desktop, click Get Data in the Home tab.
Type ‘amazon’ in the search box, click Amazon Redshift, and proceed in clicking the Connect button.
A new dialog box will appear. Fill out the fields for Server and Database. Putting values in the Batch Size field is optional. You can use it if you want to batch load the data in specific chunk sizes, whether by number of bytes or by rows. For Data Connectivity Mode, we recommend using Import for faster processing of data.
Once you click OK, fill out the fields for username and password on the next dialog box. Afterwards, click Connect.
Choose tables that you want to import. Click the table to see a preview of what it contains or check it to be ready for loading or editing. Next, click the Load button if you want to dive directly into visualizing your data--or select Transform if your data needs to be further cleaned or prepared prior to analysis (this will usually be the case).
Analyzing your DynamoDB data in Power BI
For this analysis, we will use the moviedata table from our dataset. First, let’s take a closer look at our dataset.
Our dataset contains only 1,000 rows, movies from selected years, and 2 useful fields (year and title) for our analysis. With this limited resource, let’s be creative with our analysis using this dataset only.
In the raw dataset, the only part of a date is the Year. Power BI reads the Year column as an integer and there’s no quick way to convert it to year type. The workaround is to create a custom column in Power Query Editor like the screenshot above. Once created, change the data type of the column into Date.
We will also be needing a separate column that counts the number of words for each movie title. See the screenshot below for an example formula.
Additionally, we duplicated the Title column and split it by the first character. After our data transformation, analysis and data visualization made, here’s the final look of the dashboard:
As mentioned earlier, we have 1,000 rows for this dataset but the count of unique movie titles is just 996. This is because in our dataset, four movies appeared in 2 different years. How can you check for validation? Just create a table with title and title count next to each other.
Standard deviation (years) is the measure of the amount of variation or dispersion of a set of values (year).
As you can see in the above screenshot, movies that only have two words in their titles dominate our dataset, and the most frequent first character is letter T. Probably due to the fact that many movies start with the word ‘The’ followed by a 1-word subject.
If you want to explore more of this dashboard, click this link.
This is just a taste of what you can do with your new DynamoDB > Panoply > Power BI workflow, so make sure to play around and see what else you can do.