Welcome back, data folk, to our 3-part series on managing and analyzing data with SQL, Python and pandas. In our first post, we went into the differences, similarities, and relative advantages of using SQL vs. pandas for data analysis.
One of the points we really tried to push was that you don’t have to choose between them. SQL and pandas both have a place in a functional data analysis tech stack, and today we’re going to look at how to use them both together most effectively. We suggested doing the really heavy lifting directly in the database instance via SQL, then doing the finer-grained data analysis on your local machine using pandas—but we didn’t actually go into how you could do that.
Today, we’re going to get into the specifics and show you how to pull the results of a SQL query directly into a pandas dataframe, how to do it efficiently, and how to keep a huge query from melting your local machine by managing chunk sizes.
Requirements to follow along
Before we dig in, there are a couple different Python packages that you’ll need to have installed in order to replicate this work on your end. There are other options, so feel free to shop around, but I like to use:
- SQLAlchemy - an ORM package for Python. Basically, it lets you interact with SQL databases directly in your Python code
- Psycopg2 - A PostgreSQL adapter for Python. You’ll need this in addition to SQLAlchemy; it’s the driver SQLAlchemy uses to actually talk to a Postgres database.
Install these via pip or whatever your favorite Python package manager is before trying to follow along here.
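For example, with pip (psycopg2-binary is the precompiled distribution of Psycopg2, which is usually the easier install; plain psycopg2 works too if you have the build tools for it):
pip install SQLAlchemy psycopg2-binary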
I’ll note that this is a Postgres-specific set of requirements, because I prefer PostgreSQL (I’m not alone in my preference: Amazon’s Redshift and Panoply’s cloud data platform also use Postgres as their foundation).
If you favor another dialect of SQL, though, you can easily adapt this guide and make it work by installing an adapter that will allow you to interact with MySQL, Oracle, and other dialects directly through your Python code.
And, of course, in addition to all that you’ll need access to a SQL database, either remotely or on your local machine. We’ll use Panoply’s sample data, which you can access easily if you already have an account (or if you've set up a free trial), but again, these techniques are applicable to whatever data you might have on hand.
Setting up the connection
Once you’ve got everything installed and imported and have decided which database you want to pull your data from, you’ll need to open a connection to your database source. To do that, you’ll create a SQLAlchemy engine that manages the connection, like so:
from sqlalchemy import create_engine
# Postgres username, password, and database name
POSTGRES_ADDRESS = 'db.panoply.io' ## INSERT YOUR DB ADDRESS IF IT'S NOT ON PANOPLY
POSTGRES_PORT = '5439'
POSTGRES_USERNAME = 'username' ## CHANGE THIS TO YOUR PANOPLY/POSTGRES USERNAME
POSTGRES_PASSWORD = '*****' ## CHANGE THIS TO YOUR PANOPLY/POSTGRES PASSWORD
POSTGRES_DBNAME = 'database' ## CHANGE THIS TO YOUR DATABASE NAME
# A long string that contains the necessary Postgres login information
postgres_str = ('postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}'
                .format(username=POSTGRES_USERNAME,
                        password=POSTGRES_PASSWORD,
                        ipaddress=POSTGRES_ADDRESS,
                        port=POSTGRES_PORT,
                        dbname=POSTGRES_DBNAME))
# Create the connection
cnx = create_engine(postgres_str)
Note that if your database isn’t hosted on Panoply, you may need to change the port (5439 is what Panoply uses; the Postgres default is 5432) and update POSTGRES_ADDRESS as well.
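As a side note, if you’re using a different SQL dialect, the only part of this setup that really changes is the connection string. Here’s a minimal sketch for MySQL, assuming you’ve installed the PyMySQL driver (any SQLAlchemy-supported MySQL driver would work, and the host and port below are just placeholders):
mysql_str = ('mysql+pymysql://{username}:{password}@{ipaddress}:{port}/{dbname}'
             .format(username='username',
                     password='*****',
                     ipaddress='your.mysql.host',
                     port='3306',
                     dbname='database'))
cnx = create_engine(mysql_str)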
Pulling the results of a basic SQL query into a pandas dataframe
Now that we’ve got the connection set up, we can start to run some queries. Let’s use the pokemon
dataset that you can pull in as part of Panoply’s getting started guide.
The simplest way to pull data from a SQL query into pandas is to make use of pandas’ read_sql_query()
method. So if you wanted to pull all of the pokemon
table in, you could simply run
import pandas
df = pandas.read_sql_query('''SELECT * FROM pokemon''', con=cnx)
As the name implies, this bit of code will execute the triple-quoted SQL query through the connection we defined with the con
argument and store the returned results in a dataframe called df
.
Of course, there are more sophisticated ways to execute your SQL queries using SQLAlchemy, but we won’t go into that here. Basically, all you need is a SQL query you can fit into a Python string and you’re good to go.
And those are the basics, really. If you only came here looking for a way to pull a SQL query into a pandas dataframe, that’s all you need to know. If you want to learn a bit more about slightly more advanced implementations, though, keep reading.
pandas read_sql_query vs read_sql
You might have noticed that pandas has two “read SQL” methods: pandas.read_sql_query
and pandas.read_sql
. Either one will work for what we’ve shown you so far. read_sql
was added to make it slightly easier to work with SQL data in pandas, and it combines the functionality of read_sql_query
and read_sql_table
, which—you guessed it—allows pandas to read a whole SQL table into a dataframe.
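For instance, here’s a quick sketch using the same pokemon table and the cnx engine from above. read_sql() dispatches on its first argument, so a bare table name behaves like read_sql_table() and a query string behaves like read_sql_query():
df_from_table = pandas.read_sql('pokemon', con=cnx)                # reads the whole table
df_from_query = pandas.read_sql('SELECT * FROM pokemon', con=cnx)  # runs the query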
Can pandas read a SQL view?
Yes! It’s the same as reading from a SQL table. If you’ve saved your view in the SQL database, you can query it using pandas using whatever name you assigned to the view:
df = pandas.read_sql_query('''SELECT * FROM my_view''', con=cnx)
Here, my_view is whatever name you assigned to the view when you created it. You could even create a view as part of a query that you ran through pandas.read_sql_query()
if you wanted, though that’s probably madness.
Parameterized queries with SQL and pandas
Now suppose you wanted to make a generalized query string for pulling data from your SQL database so that you could adapt it for various different queries by swapping variables in and out. Since we’ve set things up so that pandas is just executing a SQL query as a string, it’s as simple as standard string manipulation.
For example, if we wanted to set up some Python code to pull various date ranges from our hypothetical sales
table (check out our last post for how to set that up) into separate dataframes, we could do something like this:
query = '''SELECT product_name, product_id, quantity, unit_price, total_price
           FROM sales
           WHERE product_id = 123456
           AND timestamp BETWEEN '{}' AND '{}'
        '''  # Using triple quotes here allows the string to have line breaks
dstart = 'YYYY-MM-DD HH:MI:SS'   # Enter your desired start date/time in the string
dfinish = 'YYYY-MM-DD HH:MI:SS'  # Enter your desired end date/time in the string
time_df = pandas.read_sql_query(query.format(dstart, dfinish), con=cnx)
Now you have a general purpose query that you can use to pull various different date ranges from a SQL database into pandas dataframes.
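For instance, to pull a single date range into its own dataframe (the dates here are made up purely for illustration):
q1_df = pandas.read_sql_query(
    query.format('2018-01-01 00:00:00', '2018-03-31 23:59:59'),
    con=cnx,
)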
Controlling the amount of data you pull down
If you’re working with a very large database, you may need to be careful with the amount of data that you try to feed into a pandas dataframe in one go. Dataframes are stored in memory, and processing the results of a SQL query requires even more memory, so not paying attention to the amount of data you’re collecting can cause memory errors pretty quickly.
Luckily, pandas has a built-in chunksize
parameter that you can use to control this sort of thing. The basic implementation looks like this:
chunks = pandas.read_sql_query(sql_query, con=cnx, chunksize=n)
Where sql_query is your query string and n is the number of rows you want in each chunk. Note that once you pass chunksize, read_sql_query returns an iterator of dataframes rather than a single dataframe.
Of course, if you want to collect multiple chunks into a single larger dataframe, you’ll need to collect them into separate dataframes and then concatenate them, like so:
dfs = []
for chunk in pandas.read_sql_query(sql_query, con=cnx, chunksize=n):
    dfs.append(chunk)
df = pandas.concat(dfs)
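And if you don’t actually need every row in memory at once, you can process each chunk as it arrives and keep only the result you care about. Here’s a small sketch, assuming all you want is a count of rows per product_id (borrowing the column from the sales example above):
from collections import Counter

counts = Counter()
for chunk in pandas.read_sql_query(sql_query, con=cnx, chunksize=n):
    # Add this chunk's per-product row counts to the running total
    counts.update(chunk['product_id'].value_counts().to_dict())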
Optimizing your pandas-SQL workflow
In playing around with read_sql_query
, you might have noticed that it can be a bit slow to load data, even for relatively modestly sized datasets. Managing your chunk sizes can help make this process more efficient, but it can be hard to squeeze out much more performance there.
If you really need to speed up your SQL-to-pandas pipeline, there are a couple tricks you can use to make things move faster, but they generally involve sidestepping read_sql_query
and read_sql
altogether.
If you’re using Postgres, you can take advantage of the fact that pandas can read a CSV into a dataframe significantly faster than it can read the results of a SQL query in, so you could do something like this (credit to Tristan Crockett for the code snippet):
import tempfile
import pandas

def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        # Ask Postgres to dump the query results as CSV (with a header row)
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        # Stream the CSV output into the temp file, then read it back with pandas
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df
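Calling it looks much like the earlier examples, just with the engine passed in explicitly:
df = read_sql_tmpfile('SELECT * FROM pokemon', cnx)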
Doing things this way can dramatically reduce pandas memory usage and cut the time it takes to read a SQL query into a pandas dataframe by as much as 75%. This sort of thing comes with tradeoffs in simplicity and readability, though, so it might not be for everyone.
Wrapping up
That’s it for the second installment of our SQL-to-pandas series! Hopefully you’ve gotten a good sense of the basics of how to pull SQL data into a pandas dataframe, as well as how to add more sophisticated approaches into your workflow to speed things up and manage large datasets.
Check back soon for the third and final installment of our series, where we’ll be looking at how to load data back into your SQL databases after working with it in pandas.