For the final entry in our SQL and pandas series, we’re going to be talking today about closing the loop. We’ve talked about the difference between pandas and SQL, how to fit each of them into your workflow, and how to load the results of a SQL query into a pandas dataframe.
Following the workflow we outlined in our introductory post, the next logical step is loading data back into your SQL database after manipulating/analyzing/whatever-ing the data in pandas. Luckily, pandas has several helper functions that make this relatively painless—but let’s not get ahead of ourselves here. Let’s talk requirements first. If you’ve already gone through the whole process in our last post, feel free to skip ahead to the good part:
In case you ended up here without going through the other two posts first, there are a couple of packages (aside from pandas itself, obviously) that you’ll need to install before doing anything here:
Install these via pip
or your favorite Python package manager before trying to follow along here.
Note: This is a Postgres-specific set of requirements. I prefer PostgreSQL (so do Amazon’s Redshift and Panoply’s cloud data platform). If you favor another dialect of SQL, you can easily adapt this guide and make it work by installing a different adapter that will allow you to interact with MySQL, Oracle, and others directly through your Python code.
And, of course, in addition to all that you’ll need access to a SQL database, either remotely or on your local machine. We’ll use Panoply’s sample data, which you can access easily if you already have an account (or set up a free trial), but again, these techniques are applicable to whatever data you might have on hand.
The last step before we get started is to set up the database engine, which will allow your Python code to talk to your SQL database. Since we’re using PostgreSQL here, we’ll define a PostgreSQL connection string, like so:
from sqlalchemy import create_engine
# Defining our connection variables
username = YourUserName # replace with your username
password = YourPassWord # replace with your password
ipaddress = db.panoply.io # change this to your db’s IP address is if not Panoply
port = 5439 # this is the standard port for Postgres, but change it to your port if needed
dbname = YourCoolDb # change this to...you guessed it...the name of your db
# A long string that contains the necessary Postgres login information
postgres_str = f'postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}'
# Create the connection
cnx = create_engine(postgres_str)
Now you're ready to get started!
Let’s assume you pulled data down from a Postgres database, cleaned it, transformed it, and did some calculations on your local machine. Now you want to load it back into the SQL database as a new table. pandas makes this incredibly easy. For a given dataframe (df
), it’s as easy as:
df.to_sql(‘my_cool_table’, con=cnx, index=False)
# set index=False to avoid bringing the dataframe index in as a column
This short line of code:
Which is kind of magic if you’re someone like me who finds defining a Postgres table schema by hand to be about as enjoyable as falling down a long flight of stairs, especially when the table is decently wide. The only drawback is that it’s what my colleague calls a “go and get a cup of coffee” function. That is to say, it’s super slow. Luckily, there are ways to speed it up.
If you want to speed things up, but still don’t want to have to name and describe every single column in your dataframe like they’re your own children, there’s a way to have your cake and eat it too. What you’ll do is use pandas to_sql()
to create and define the table, then use the slightly less easy but still very good and streamlined SQLAlchemy
package to load the data into the table you created, like so:
from io import StringIO
df.head(0)to_sql(‘my_cool_table’, con=cnx, index=False) # head(0) uses only the header
# set index=False to avoid bringing the dataframe index in as a column
raw_con = cnx.raw_connection() # assuming you set up cnx as above
cur = raw_con.cursor()
out = StringIO()
# write just the body of your dataframe to a csv-like file object
df.to_csv(out, sep='\t', header=False, index=False)
out.seek(0) # sets the pointer on the file object to the first line
contents = out.getvalue()
cur.copy_from(out, 'table_name', null="") # copies the contents of the file object into the SQL cursor and sets null values to empty strings
raw_con.commit()
I like this method because it uses the fastest and most convenient methods from both worlds: you get the convenience of using pandas
to define your table schema automatically, but you can load the contents of the table much more quickly by taking advantage of PostgreSQL’s lightning fast COPY
.
There are a number of other ways you can go about this speed-up, and the most recent versions of pandas have added support for speedier SQL upload, so be sure to poke around and find the method that works best for you if this example isn’t your favorite. We are talking about optimization here, after all.
We’ve covered the creation and population of new SQL tables from pandas, but another obvious use case is updating entries in an existing table in a database after doing some fine manipulations in pandas. Unfortunately, there isn’t a totally straightforward method like df.to_sql()
for simply updating tables. One option is to simply replace the entire old table with an updated table generated from your dataframe:
df.to_sql(‘new_cool_table’, con=cnx, if_exists=’replace’)
And of course, if you just want to append new data to an existing table, you can change the if_exists
argument to ’append’
, like so:
df.to_sql(‘new_cool_table’, con=cnx, if_exists=’append’)
However, this can become impractical pretty quickly if you have a table with, say, millions of rows and only want to update ~1,000 entries. In that case, you can create a temporary table to hold your updates and then use SQL’s native update methods to update the larger table. That would look something like:
# Create the temporary table using pandas (can speed this up with method above if needed)
df.to_sql('temporary_table', con=cnx, if_exists='replace')
# Define a string that will run as a SQL command when executed
# uses SQL UPDATE to update table_to_be_updated from temporary_table
sql = """
UPDATE table_to_be_updated AS f
SET col1 = t.col1
FROM temporary_table AS t
WHERE f.id = t.id
"""
with cnx.begin() as connection:
connection.execute(sql)
And there you have it! Your SQL table should update with the new data from pandas.
That’s pretty much it for the other end of the pandas-SQL pipeline. As you can see, there isn’t one easy method to do everything quickly and efficiently through pandas, but there plenty of approaches that are both straightforward and efficient enough to make it worth your time to experiment with and find what works best for you. This also marks the end of our 3-part pandas and SQL series, so if you missed the earlier posts, be sure to go back and check them out!