The speed, flexibility and ease of setup of NoSQL-style databases have all contributed to the rapid rise in popularity of MongoDB in recent years.
Despite its advantages in speed and flexibility, though, MongoDB can still be a bit of a pain when it comes to actually analyzing your data, because so many of the analytic approaches and techniques used by data analysts and data scientists currently depend on the more structured, tabular format common to traditional relational databases.
At Panoply, many of our customers are looking for ways to make analyzing their MongoDB data easier, especially when it needs to be integrated with the rest of their data in a data warehouse. Today, we’re going to explore how you can use Panoply to collect, process and prepare your MongoDB data, then analyze it in Power BI, a Leader in Gartner’s Magic Quadrant for Analytics and BI Platforms.
If you want to follow along, we’ll be using a pre-made dataset of city inspection data (available on GitHub) that you can load into your own MongoDB instance before starting, or you can just use your own data.
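If you’d rather script that load than import the file by hand, here’s a minimal sketch using pymongo. It assumes you’ve saved the GitHub dataset locally as a newline-delimited JSON file called city_inspections.json and want it in a database called city and a collection called inspections; all of those names are placeholders, so adjust them to match your setup:

```
# Minimal sketch: load the sample city inspections dataset into MongoDB.
# Assumes the dataset was saved locally as "city_inspections.json"
# (one JSON document per line) and that MongoDB is running on localhost:27017.
from pymongo import MongoClient
from bson import json_util  # parses extended JSON such as {"$oid": ...}

client = MongoClient("mongodb://localhost:27017")
collection = client["city"]["inspections"]

with open("city_inspections.json") as f:
    docs = [json_util.loads(line) for line in f if line.strip()]

result = collection.insert_many(docs)
print(f"Inserted {len(result.inserted_ids)} documents")
```

The mongoimport command-line tool will get you the same result if you’d rather not write any code.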
First, you’ll need to connect your MongoDB instance to Panoply (note: if you’ve already done this, feel free to skip ahead to a later section). From your Panoply dashboard, click on Data Sources in the left pane, then hit Add Data Source in the upper right:
This will take you into the data source selection section. MongoDB will be under Databases (in the leftmost column):
Selecting MongoDB will open a new pane where you can enter your connection details, which will look something like this:
Note that you’ll need a few different pieces of information to connect Panoply to your MongoDB database: the username and password of a database user with read access, the host (or hosts) and port your MongoDB instance is listening on, and, optionally, the database name and any connection options.
As you can see in the screengrab above, you’ll need to plug all that information into a MongoDB connection URI, which will have the general form of:
`mongodb://[username:password@]host1[:port1][,...hostN[:portN]][/[database][?options]]`
Or, more concretely, it should look something like this:
`mongodb://myDBReader:D1fficultP%40ssw0rd@mongodb0.example.com:27017/admin`
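Before pasting the URI into Panoply, it can be worth checking that it actually connects. Here’s a quick sketch using pymongo; the URI shown is just the placeholder example from above, so substitute your own details:

```
# Quick connectivity check for a MongoDB connection URI before handing it to Panoply.
# The URI below is the placeholder example from above; substitute your own.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

uri = "mongodb://myDBReader:D1fficultP%40ssw0rd@mongodb0.example.com:27017/admin"

try:
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # raises if the server is unreachable or auth fails
    print("Connection OK")
except PyMongoError as err:
    print(f"Connection failed: {err}")
```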
Note that Panoply encrypts all login information provided by our users. Once you’ve entered your URI, you can expand the Advanced Options section to set your preferences, including the following:
Once you’ve set all that up, click Collect. The Data sources - MongoDB pane will go gray while the process is underway, and a green status bar will appear at the top of the pane. You don’t need to stay and watch it, though. Panoply is designed to allow for multiple, parallel data collection processes, so you can move on to your next data source—and the one after that—while your MongoDB data is being ingested. Panoply will send you an alert when your data collection run is finished, but you can also monitor progress directly or cancel jobs in the Jobs pane.
Once all your data collection is finished, you can head to the Tables pane and get a bird’s-eye view of all your fresh, processed data. If you used the same sample data we did, your data should have gone from a collection called Inspections that looked something like this:
```
{"_id":{"$oid":"56d61033a378eccde8a8354f"},"id":"10021-2015-ENFO","certificate_number":9278806,"business_name":"ATLIXCO DELI GROCERY INC.","date":"Feb 20 2015","result":"No Violation Issued","sector":"Cigarette Retail Dealer - 127","address":{"city":"RIDGEWOOD","zip":11385,"street":"MENAHAN ST","number":1712}}
{"_id":{"$oid":"56d61033a378eccde8a83550"},"id":"10057-2015-ENFO","certificate_number":6007104,"business_name":"LD BUSINESS SOLUTIONS","date":"Feb 25 2015","result":"Violation Issued","sector":"Tax Preparers - 891","address":{"city":"NEW YORK","zip":10030,"street":"FREDERICK DOUGLASS BLVD","number":2655}}
{"_id":{"$oid":"56d61033a378eccde8a83551"},"id":"10084-2015-ENFO","certificate_number":9278914,"business_name":"MICHAEL GOMEZ RANGHALL","date":"Feb 10 2015","result":"No Violation Issued","sector":"Locksmith - 062","address":{"city":"QUEENS VLG","zip":11427,"street":"214TH ST","number":8823}}
```
to something more like this:
In the Tables section of your Panoply dashboard, the resulting tables will be listed like this:
For a quick check of the data’s structure, you can explore the new tables right inside Panoply. The query would look something like this:
```
SELECT inspections.business_name, address.city, address.street
FROM inspections
JOIN inspections_address address ON inspections.id = address.inspections_id
LIMIT 10;
```
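If you’d rather run that sanity check from a script instead of Panoply’s query interface, here’s a rough sketch using psycopg2 against Panoply’s Redshift-compatible endpoint. The host, database and credentials below are placeholders; use the connection details from your own warehouse (we cover where to find them in the next section):

```
# Rough sketch: run the same structure check against Panoply's
# Redshift-compatible endpoint. Host, database and credentials are placeholders;
# use the connection details from your own Panoply warehouse.
import psycopg2

conn = psycopg2.connect(
    host="your-warehouse.panoply.io",  # placeholder hostname
    port=5439,                         # typical Redshift-compatible port
    dbname="your_database",
    user="your_username",
    password="your_password",
)

query = """
    SELECT inspections.business_name, address.city, address.street
    FROM inspections
    JOIN inspections_address address ON inspections.id = address.inspections_id
    LIMIT 10;
"""

with conn.cursor() as cur:
    cur.execute(query)
    for row in cur.fetchall():
        print(row)

conn.close()
```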
Now that you’ve collected the data from your MongoDB and gotten it sorted into tables in Panoply, you can start in on analysis. To do that, you’ll need to connect Power BI to your Panoply warehouse.
First, you’ll need to get the hostname, port and authentication details that Power BI will use to connect to your Panoply warehouse. You can find this by clicking Connect on the navigation pane in Panoply, which will bring up your warehouse’s details, like so:
After opening the Power BI Desktop, go to the Home tab and click Get Data.
Find the Amazon Redshift connector in the list, or use the search box to locate it quickly. Once you’ve found it, click the Connect button.
Next, fill out the Server and Database fields in the dialog box that appears. You can also use the Batch size field if you want to load the data in batches of a specific number of rows. Choose Import as the Data Connectivity mode for faster processing of the data.
Once you click OK, fill out the username and password fields in the next dialog box, then click Connect.
Choose the tables you want to import. Click a table to preview its contents and make sure it’s ready for loading or editing. Then click the Load button if you want to dive straight into visualizing your data, or select Transform if your data needs further cleaning and preparation before analysis (which will be the case most of the time).
For this analysis, we will use the inspections and inspections_address tables from our dataset. First, let’s check out our dataset.
With 80,148 rows and 9 useful fields (id, date, business_name, certificate_number, sector, result, city, street, zip), we can certainly extract something valuable from our data.
Let’s start by transforming the data to make it easier to work with. We’ll do the following transformations:
With all these transformations set up, we can put together a dashboard like the one in the screenshot below. If you want to explore more, you can play around with an interactive version of the dashboard at this link.
So, what can we see from this data? Certificate issuance rises toward the middle of the year and then gradually declines, with the exception of a sudden spike in December. In fact, it looks like certificates are issued at the end of every quarter, with more issued in the first half of the year. If you’re a business, it seems like you’re best off trying to get a certificate in Q1 or Q2, based on this data.
In terms of inspection results, more than half of the businesses in our dataset don’t have any violations issued. Only four of the inspection result types (denoted by a light yellow background) are positive, and Pass results number less than half of the inspections that ended in No Violation Issued.
From our dataset, overall, it’s clear that the Cigarette Retail Dealer sector has the most businesses, as well as the highest number of issued inspection certificates. But which sector has the highest rate of inspection failure? Find out by exploring the interactive dashboard (hint: in the inspection results table, click the Fail result).
As far as our map display goes, there’s room for improvement here due to the limitations of the dataset itself, as well as the limits of Power BI’s geospatial visualization capabilities. If you want to build a better map view, you can enrich your data by adding more geographical features from external data sources; try using exact latitude and longitude for more accurate plotting of points.