Connecting A Jupyter Notebook With PostgreSQL For Python Data Analysis

Jupyter notebooks have quickly become one of the most popular ways, if not the most popular way, to write and share code in the data science and analytics community. Jupyter allows you to write code, add text and images, analyze data and generate visualizations all in one simple, easy-to-use interface. When connected to remote data sources, such as a Postgres database running on an AWS instance or Panoply's data warehouse, Jupyter notebooks can become powerful tools for streamlining, documenting and replicating your data analysis workflow.

This article will show you how to connect any PostgreSQL database to a local Jupyter notebook. And since Panoply lives on top of Redshift, you’ll also be able to connect any notebook directly to your Panoply data warehouse with the same code and get up and running quickly with tools you’re probably already familiar with.

If you want to follow along, get the Jupyter notebook here.

Requirements

To start off, you’ll need to have installed Jupyter and set up a Postgres database. If you don’t have Jupyter installed on your local machine, you can get it up and running with pip. If you’ve previously installed Anaconda, you already have Jupyter installed. Check out our other guides if you need help setting up an AWS Redshift or open source PostgreSQL database.

You will also need to install two Python packages to connect to your database through your notebook:

  • SQLAlchemy - generates SQL statements from Python
  • Psycopg2 - the database driver that sends your SQL statements to Postgres

Note: You can do everything we’re about to do with Psycopg2 alone, but SQLAlchemy (which uses Psycopg2 as its driver for Postgres) makes things a bit easier when working in Python. Also, this guide is written for Python 3 users, but you can easily adapt the code if you’re working in Python 2.

Both of these packages can be installed from the command line with pip or conda:

pip install sqlalchemy psycopg2
conda install -y sqlalchemy psycopg2

After that, you’re all set to pull in, analyze and visualize your data from your Postgres db in Jupyter with your visualization library of choice (e.g., matplotlib or seaborn).

Getting Started

We’ll start by importing all the packages we’ll need:
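A minimal version of that cell looks something like this (create_engine from SQLAlchemy gets imported in the connection cell below):

# pandas gives us dataframes and the read_sql_query function
import pandas as pd
# matplotlib renders the histograms and scatter plots further down
import matplotlib.pyplot as plt
# tell Jupyter to render plots inline in the notebook
%matplotlib inline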

Next we’ll set up the connection to our Postgres db. In order to connect, you’ll need:

  • The hostname or IP address where the database is located
  • The port your database is listening on (5432 by default for Postgres; Panoply and Redshift use 5439)
  • Your username
  • Your password
  • The name of the database

If you connected your Postgres instance to Panoply, you would use db.panoply.io for the hostname, 5439 for the port, the username and password from your Panoply account, and the name you gave your database when you set up your account.

In the next cell of your notebook, you’ll enter:

from sqlalchemy import create_engine
# Postgres username, password, and database name
POSTGRES_ADDRESS = 'db.panoply.io' ## INSERT YOUR DB ADDRESS IF IT'S NOT ON PANOPLY
POSTGRES_PORT = '5439'
POSTGRES_USERNAME = 'username' ## CHANGE THIS TO YOUR PANOPLY/POSTGRES USERNAME
POSTGRES_PASSWORD = '*****' ## CHANGE THIS TO YOUR PANOPLY/POSTGRES PASSWORD
POSTGRES_DBNAME = 'database' ## CHANGE THIS TO YOUR DATABASE NAME
# A long string that contains the necessary Postgres login information
postgres_str = ('postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}'
  .format(username=POSTGRES_USERNAME,
   password=POSTGRES_PASSWORD,
   ipaddress=POSTGRES_ADDRESS,
   port=POSTGRES_PORT,
   dbname=POSTGRES_DBNAME))
# Create the connection
cnx = create_engine(postgres_str)

create_engine() creates an engine object that allows you to connect to your database and send queries using Python.

From here, we’ll use some sample data available through Panoply. More specifically, we’ll be using the Pokemon data set that you may have imported as part of Panoply’s getting started guide.

Loading Data From Postgres to a Pandas Dataframe

Let’s take a look at the pokemon data we imported to our Panoply database. We can take advantage of the fact that pandas has a method for easily reading in data from SQL queries to make a nice, readable dataframe right in our notebook with the following code:

pd.read_sql_query('''SELECT * FROM pokemon LIMIT 5;''', cnx)

Which gives us the following output (with a few columns cropped out for the screenshot):

Analyzing Your Postgres Data in Jupyter

If you’re following along in your own Jupyter notebook, you’ll see that there’s actually quite a bit of data here for each Pokemon. Let’s drill down on this data a bit. There’s something amusing about the fact that the weights of these imaginary creatures have been calculated out to the centigram, so let’s start by looking at the weight distributions. Using our Pandas/SQLAlchemy stack, we’ll run the following:
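A cell along these lines builds the weights-only dataframe (the name column is included here just for readability and is assumed to exist in the sample table):

# query just the weights (plus names) into a new dataframe
poke_weights = pd.read_sql_query('''SELECT name, weight FROM pokemon;''', cnx)
# peek at the first 5 rows
poke_weights.head()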

Which will show us the first five values of our new weights-only dataframe:

Now, let’s see what the general weight distribution looks like in the Pokemon population. Using some of the built-in visualization capabilities in Pandas, we can generate a simple histogram of our pokemon weights using poke_weights.weight.hist(), which will output the following through matplotlib:

If we wanted to make the plot a little more fine-grained, we could use poke_weights.weight.hist(bins=500) to increase the number of weight bins, giving us a plot that looks like this:
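Both histogram calls together in a single cell would look something like this (plt.show() keeps the two plots from landing on the same axes):

# default binning gives a coarse view of the weight distribution
poke_weights.weight.hist()
plt.show()
# bumping the bin count makes the plot more fine-grained
poke_weights.weight.hist(bins=500)
plt.show()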

OK, now we have a pretty good idea of the weight distribution of our Pokemon. Let’s make a new dataframe with some more data from the main Pokemon table and see if we can find any meaningful relationships between the different attributes. Since we also have a speed attribute in our table, let’s see how well our Pokemon obey the laws of physics by looking at a scatter plot of weight against speed:
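The cell behind that scatter plot would look something like the following, pulling weight and speed into a new dataframe and using pandas’ built-in plotting:

# grab weight and speed for every Pokemon
poke_stats = pd.read_sql_query('''SELECT weight, speed FROM pokemon;''', cnx)
# scatter plot: weight on the x-axis, speed on the y-axis
poke_stats.plot.scatter(x='weight', y='speed')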


Looks like speed and weight are not highly correlated in the Pokemon universe. Curious! Try playing around with some of the other data in the table and see if you discover anything new about the Pokemon universe.

Wrapping Up

Now that you’ve got the basics down, you can start writing more complex queries to your Postgres database and analyzing your data with your preferred techniques in a Jupyter notebook. Using Panoply as your data warehouse, you’ll have a fast, reliable and easily accessible repository of any data that can be stored in a PostgreSQL-style database. Check out our other guides to see how you can do these same analyses directly through Panoply’s platform and connect your warehoused data to over 100 other BI tools like Tableau, Looker and Chartio.
