How To Analyze MongoDB Data In Tableau

The speed, flexibility and ease of setup of NoSQL-style databases have all contributed to the rapid rise in popularity of MongoDB in recent years. Despite its advantages in speed and flexibility, though, MongoDB can still be a bit of a pain when it comes to actually analyzing your data, because so many of the analytic approaches and techniques used by data analysts and data scientists currently depend on the more structured, tabular format common to traditional relational databases. At Panoply, we get a lot of customers who are looking for ways to make analyzing their MongoDB data easier, especially when it needs to be integrated with the rest of their data in an analytics warehouse. Today, we’re going to explore how you can use Panoply to collect, process and prepare your MongoDB-stored data for analysis in just a few steps by taking advantage of our automated data warehousing approach, and we’ll do it using Tableau, one of the most popular BI suites.

If you want to follow along, we’ll be using a pre-made dataset of city inspection data available from Github that you can load into your own MongoDB instance before starting--or just use your own data.

Pulling data from MongoDB into your Panoply Data Warehouse

First, you’ll need to connect your MongoDB instance to Panoply (note: if you’ve already done this, feel free to skip ahead to a later section). From your Panoply dashboard, click on Data Sources in the left pane, then hit Add Data Source in the upper right:

Which will take you into the data source selection section. MongoDB will be under Databases (in the leftmost column):

Selecting MongoDB will open a new pane where you can enter your connection details, which will look something like this:

Note that you’ll need a couple different pieces of information to connect Panoply to your MongoDB database:

  • Username: the username you use to connect to your MongoDB database remotely
  • Password: the password associated with that username
  • Host: the address where your MongoDB database is hosted
  • Port: the port your MongoDB database uses for connections
  • Database name: the name of your MongoDB database

As you can see in the screengrab above, you’ll need to plug all that information into a MongoDB connection URI, which will have the general form of:

mongodb://[username:password@]host1:[port1],[...hostN:portN]][/[database][?options]]

Or, more concretely, it should look something like this:

mongodb://myDBReader:D1fficultP%40ssw0rd@mongodb0.example.com:27017/admin

Note that Panoply encrypts all login information provided by our users. Once you’ve entered your URI, you can expand the Advanced Options section and set your preferences. In the Advanced Options section, you’ll be able to set the following:

  • Destination: allows you to give a name to the table that will be created in Panoply after importing, or a table-specific prefix if you’re planning on importing multiple tables
  • Primary Key: allows you to set a primary key that will act as a unique identifier for each record in your table(s), and will also allow you to link related data across tables. If you already have a field in your tables called “id” or something similar, Panoply will use it as a primary key if no other parameter is set. If you don’t set a primary at all, you won’t see any deduplication in your data imports, so every single piece of data from the collection run will be added to your tables--you might not want that. Since your are using MongoDB, the default will be _id.
  • Incremental Key: make sure to set this up if you’re planning on making multiple, recurring pulls from this data source. Doing so will make future imports more efficient, as Panoply will only collect data that has been updated since the last time you pulled from that source. Note that incremental keys will only work properly if you are pulling from a single table with this connector.
  • Exclude: you can use this field to exclude specific data elements from your import. If there are specific types of data (e.g. irrelevant or sensitive data) that you would like to exclude from your Panoply data warehouse, you can use this setting to manage that. You can exclude nested fields with dot notation. So if my "users" top level object had a field in it called "email", you could exclude that by putting “users.email” in the exclude section in Panoply. Learn more here.

Once you’ve set all that up, click Collect. The Data sources - MongoDB pane will go gray while the process is underway, and a green status bar will appear at the top of the pane. You don’t need to stay and watch it, though. Panoply is designed to allow for multiple, parallel data collection processes, so you can move on to your next data source--and the one after that--while your MongoDB data is being ingested. Panoply will send you an alert when your data collection run is finished, but you can also monitor progress directly or cancel jobs in the Jobs pane.

Once all your data collection is finished, you can head to the Tables pane and get a bird’s eye view of all your fresh, processed data. If you used the same sample data we did, your data should have gone from a collection called Inspections looking something like this:

{"_id":{"$oid":"56d61033a378eccde8a8354f"},"id":"10021-2015-ENFO","certificate_number":9278806,"business_name":"ATLIXCO DELI GROCERY INC.","date":"Feb 20 2015","result":"No Violation Issued","sector":"Cigarette Retail Dealer - 127","address":{"city":"RIDGEWOOD","zip":11385,"street":"MENAHAN ST","number":1712}}

{"_id":{"$oid":"56d61033a378eccde8a83550"},"id":"10057-2015-ENFO","certificate_number":6007104,"business_name":"LD BUSINESS SOLUTIONS","date":"Feb 25 2015","result":"Violation Issued","sector":"Tax Preparers - 891","address":{"city":"NEW YORK","zip":10030,"street":"FREDERICK DOUGLASS BLVD","number":2655}}

{"_id":{"$oid":"56d61033a378eccde8a83551"},"id":"10084-2015-ENFO","certificate_number":9278914,"business_name":"MICHAEL GOMEZ RANGHALL","date":"Feb 10 2015","result":"No Violation Issued","sector":"Locksmith - 062","address":{"city":"QUEENS VLG","zip":11427,"street":"214TH ST","number":8823}}

to something more like this:

flattening-nested-json-mongodb

In the Tables section of your Panoply dashboard, the resulting tables will be listed like this:

Exploring your Data in Panoply

For a quick check of the data’s structure you can explore the new tables inside Panoply. The query would look like:

SELECT 
  inspections.business_name, 
  address.city, 
  address.street 
FROM 
  inspections 
  JOIN inspections_address address on inspections.id = address.inspections_id 
LIMIT 10

image14Connecting Tableau to your MongoDB data

Now that your data is all loaded into Panoply and organized into standardized columnar tables, you’ll find it’s a lot easier to analyze with a BI tool like Tableau. Connecting Tableau to your Panoply data warehouse is a pretty simple process, since it uses Tableau’s native Redshift connector.

Finding your Panoply connection parameters

Before you start, though, you’ll need to gather your connection details from Panoply. You can find this by clicking Connect on the navigation pane in your Panoply dashboard, which will open a new pane with all of the relevant connection details, like so:

Note down the Host name, Database name, Port and User/password combination.

Setting up the connection in Tableau

Now open a new Tableau session, and start a new connection by going to Data > New Data Source from the menu:

Choose Amazon Redshift. This will open a dialog where you can enter the connection details you got from Panoply:

 

Once Tableau is connected to your Panoply data warehouse, you will be able to select which data tables you want to use for analysis. Select the Public schema on the left hand side, then drag the tables you’d like to analyze to the top top pane (indicated by the text “Drag tables here”). You should see the tables you selected laid out in a relational graph now. Since we used the city inspections dataset from the Github collection mentioned above, our data from Panoply now looks like this in Tableau:

Tableau will try to automatically identify the keys on which to join your data. In this case, it worked perfectly, correctly inferring that we would want to join the two tables on Id = Inspections Id. But if your data isn’t joining right, you can tweak the keys here by clicking on the blue portion of the join diagram, which will bring up a dialog that looks like this:

 

And use the dropdown menus on either side of the join to change keys.

Analyzing your MongoDB data in Tableau

You’ve now created an analysis pipeline by linking your MongoDB database to your Panoply instance and are all set up to analyze your MongoDB in Tableau. Let’s take a look at the dataset we pulled from Github. A brief glance at the raw JSON shows that it’s got records of New York City inspections from all over the city representing a number of different business types. It also shows whether a violation was issued after the inspection, but it doesn’t show the type of violation. Let’s start by looking at the geographical distribution of violations using one of Tableau’s pretty mapping features. Once you’ve connected to your Panoply data source, you can head to your first worksheet. At this point, Tableau has probably even popped up a little flag in the lower left corner of your screen to help you along:

Once you’re there, if you’re using the same data source we are, you can start by double clicking the zip code data on the lefthand side of the screen. This will start the process of making a map-based view of your data. Next, because the result column in our data contains multiple possible values, we’ll filter it down to just violations, since that’s what we want to look at.

Under the Dimensions section in the left pane, click result, then Show Filter. Result should now appear in the Filters box. Click on result in the Filters box and then click Edit Filter. Clear all selections, then select violation issued. Now create a Count measure for result, and drag it to both the color and size marks.

If you’ve set everything up right, the main window in your worksheet should now be filled with a nicely colored map of the greater NYC area, with zipcodes shaded by count of violations, like this:

 

If you’re familiar with NYC geography, you can probably spot a few hotspots that may or may not be surprising to you. Looks like the Brooklyn Heights / Downtown Brooklyn area is a real hotspot for inspection violations! Someone should tell the mayor.

This is just a taste of what you can do with your new MongoDB > Panoply > Tableau workflow, so make sure to play around and see what else you can do.

Get a free consultation with a data architect to see how to build a data warehouse in minutes.
Request Demo
Read more in:
Share this post:

Work smarter, better, and faster with weekly tips and how-tos.