SQL and pandas: A Rosetta Stone for Analysts

By Peter Weinberg | December 1, 2020 | Updated On: January 29, 2021 | Data Analysis

SQL and pandas are both invaluable tools for any analyst or data scientist. They tend to be mentioned in the same breath when talking about data manipulation and analysis—and can do a lot of the same things—but they have pretty distinct roles in the data workflow.

Today we’re going to take a look at when to use SQL vs pandas, what the differences are between them, and how to translate basic queries from SQL into Python-based pandas queries and vice versa.

Background

If you somehow found your way here without knowing much about SQL, or know pandas mostly as an autoimmune disorder in children, we’ll break it down super quickly here. If you’re already familiar with the basics, feel free to skip ahead. We won’t mind.

What is SQL?

SQL, first and foremost, is its own language. In fact, SQL stands for Structured Query Language. For data analysis and management, SQL makes for a nice, straightforward, and easy-going language.

Once you get the hang of some of SQL’s idiosyncrasies, it becomes incredibly easy to pull data points out of their respective tables, join them, manipulate them and move them to new tables. Most databases you’re likely to come across as an analyst or engineer are going to be some flavor of SQL.

What is pandas?

pandas is a Python package that has become an essential tool for data scientists, data analysts, and Python devs who have to interact with data. It’s built on top of some powerful Python libraries, like numPy, which allows it to squeeze extra power and performance out of what might otherwise be fairly slow pure Python code.

With pandas, you can easily create, manipulate and combine data frames right in your Python scripts, and its direct integration with matplotlib, Python’s original data visualization library, makes it easy to plot your data once you’ve got it where you want it.

When to use SQL...and when to use pandas

Now that we’ve covered the basics, let’s talk about where these two tools fit in your tech stack. Obviously, these choices are subject to a strong degree of personal preference, but a good general framework for figuring out where SQL and pandas fit in your life is to think about it in terms of the size of the data you’re going to be working with.

SQL, being a language designed to work directly with a relational database, is better suited for work with the large, separate tables you might find there, and doesn’t really have an upper limit when it comes to the amount of data it can reliably manipulate. Pandas, on the other hand, will most likely choke when asked to deal with the full contents of anything bigger than a modestly large database—but that’s fine, that’s not really what you want it for!

At the most basic level, a good way to use both SQL and pandas effectively in your data workflow would be something like this:

Use SQL to do the big, coarse manipulations first: Join your (hopefully) giant sales records table to your (hopefully) massive customer data table, and filter for the specific years you’re interested in directly in SQL.
Do your finer-grained analysis and visualization in pandas second: Once you’ve selected out the specific data you need for analysis, you can load it into pandas and take advantage of its easy integration with the rest of Python’s analysis libraries. You can do all sorts of complex calculations directly in SQL, but pandas makes it a lot easier and cleaner.

Of course, this is just a rule of thumb. There’s plenty of overlap, in terms of capabilities, between the two, and plenty of analysts do their work entirely in SQL or pandas. But used together, SQL and pandas can make for a powerful and efficient tech stack for dealing with large, complex datasets.

Translating SQL to pandas and vice versa

For those of you who are familiar with SQL but have never used pandas, pandas might still seem a bit foreign. For those of you who are fluent in pandas but don’t know a lick of SQL: what are you doing with your life? Either way, this next section will show you how to translate commands between SQL and pandas to help you get set up with both of these powerful analytic tools.

Getting the first n records in SQL vs pandas

Let’s look at basic data exploration first. Suppose you have a hypothetical table, sales, and you want to just get a general feel for what’s in it. You don’t remember exactly which columns are in this table, and you just want to see the first 15 or so records. In MS SQL Server, you might run:

SELECT TOP 15 *
FROM sales

In PostgreSQL or MySQL, you would run:

SELECT *
FROM sales
LIMIT 15

But in pandas, assuming you loaded your sales data into a dataframe called sales, the easiest way to do it is simply to run:

sales.head(15)

The .head() method in pandas is a built-in means of getting the top n records in a dataframe. You can also run .tail() which will show you the last n records in the dataframe as well.

Filtering data with SQL vs pandas

Now that we’ve covered how to scope out the top of the table, let’s dig into how to filter your data for the specific records you’re looking for. In SQL, this is done using the WHERE clause. So, if we return to our hypothetical sales table, here’s how we could select specific sales records by product ID in SQL:

SELECT product_name, product_id, quantity, unit_price, total_price
FROM sales
WHERE product_id = 123456

This query will return the product name, product ID, number sold, price per unit, and total price of the transaction for each transaction record where the product ID matches the one you provide. Note that, in the above example, product_id is stored as a number. If you were trying to match a string, you would need to enclose the filter criteria in quotes.

In pandas, things look a bit different:


sales[sales[‘product_id’] == 123456]
[[‘product_name’, ‘product_id’, ‘quantity’, ‘unit_price’, ‘total_price’]]

Will return the same data as the SQL query above. You can increase the complexity of your filters by adding conditions with the AND clause in SQL:

SELECT product_name, product_id, quantity, unit_price, total_price
FROM sales
WHERE product_id = 123456 AND quantity > 5

Or the & operator in pandas:


sales[(sales[‘product_id’] == 123456) & (sales[‘quantity’] > 5)]
[[‘product_name’, ‘product_id’, ‘quantity’, ‘unit_price’, ‘total_price’]]

As you can see, you can do a lot in a single line in pandas, which is great! It’s not always super easy to read, though, so if you’re new to pandas, you can think about the code here like this:

The first portion, sales[(sales[‘product_id’] == 123456) & (sales[‘quantity’] > 5)] creates a logical mask of the entire dataframe, tagging every entry as either True or False according to whether or not it meets the criteria in your query.
The second portion of the command, [[‘product_name’, ‘product_id’, ‘quantity’, ‘unit_price’, ‘total_price’]], selects the specific columns you want to see in the output dataframe and pulls out all the rows marked True by the filter mask.

Don’t worry, even if it’s not as immediately intuitive as a SQL query, you get used to pandas syntax pretty quickly.

Joining data in SQL vs pandas

We touched on joins earlier in the article, but it’s time to take a closer look at how it works when you want to combine data from separate sources in SQL and pandas. In SQL, this comes down to the JOIN function, which is designed to merge data from different tables into a single table by using a unique key that is common across both tables. There are a couple different types of JOIN in SQL, and we won’t go too in depth on that topic here, but there are plenty of resources out there if you’re interested in learning more.

Let’s look at an example now, turning back to our hypothetical sales table for reference, and pulling in another hypothetical table, customers, for a join. Suppose our sales table has a customer_id column, which allows us to link it to our customers table, which has address and demographic information about our customers. If we wanted to know the geographic distribution of sales of product SKU 123456, we could do something like this:

SELECT city, state, country, total_price
FROM customers
JOIN sales
ON customers.customer_id = sales.customer_id
WHERE sales.product_id = 123456

This query will create a table with the locations of customers who bought item #123456, as well as the total sale price for their purchases. If you wanted to go a step further, you could look at the total sales by country like this:

SELECT country, SUM(total_price) as total_sales
FROM customers
JOIN sales
ON customers.customer_id = sales.customer_id
WHERE sales.product_id = 123456
GROUP BY country 
ORDER BY total_sales DESC

Which would give you the total sales of the product by country, listed in descending order by sales total.

If you wanted to do the same join in pandas, you would need to take advantage of the merge() function. So let’s suppose you have two dataframes, sales and customers. You could create a new dataframe that incorporated data from both by doing:


new_df = customers.merge(sales, on=’customer_id’)

Which would give you new_df, a new dataframe with the data from both in it. If you wanted to do just a subset of the data, like above, you could do something like


geo_sales = customers[[‘city’, ‘state’, ‘country’, ‘total_price’, ‘customer_id’]]
.merge(sales[[‘product_id’, ‘total_price’, ‘customer_id’]], on=’customer_id’)

As you can see, though, the basic implementation of a pandas join isn’t quite as flexible and dynamic as it is in SQL, where you can call up columns at will, use them in joins and produce new tables with a few lines of code. With pandas, joins and merges can be a little bit clunkier.

But if you wanted to, say, group your data by multiple columns for further analysis, it’s quite a bit easier in pandas. These are the sorts of things that make a SQL-first, pandas-second analysis flow most productive—let each tool do the thing it’s best at and your analysis will be a lot smoother for it.

Takeaways

So that’s our brief introduction to how to translate SQL queries into pandas queries. Both SQL and pandas can do a ton more than we’ve demonstrated here, so make sure to explore some more in-depth resources on this if you want to learn more. We’re also going to cover some more examples of how to use both SQL and pandas in your data analysis workflow over the next few posts, so check those out as well. For now, though, let’s go back to the big picture:

Use SQL for storing, merging, moving and querying the big data directly in your database. SQL is very good at dealing with big datasets, and is optimized for moving data points around, joining related tables on keys and doing rough calculations quickly (among other things, of course). Once you start getting into advanced analysis, though, things can get complicated pretty quickly if you try to do it all in SQL. Not only can it be cumbersome to write out the SQL query for certain types of intricate calculations, running that query can also put unnecessary load on databases in production, unless you’re using a dedicated analytics warehouse.

Use pandas for finer analyses of smaller datasets. Using pandas for analysis not only gives you access to a whole suite of built-in methods and visualization capabilities, it also allows you to bring in various other powerful (and easy to use) math, statistical, and machine learning packages.

Most SQL queries can be translated relatively painlessly into pandas commands—and vice versa. It’s not always pretty, but the degree of overlap in functionality between SQL and pandas means that just about anything you can do in one language can be done in the other.

How Panoply can help make your data workflow even better

You might have noticed us mentioning dedicated analytics warehouses above, and how they can make your SQL-based analytics life easier. But it doesn’t just apply to SQL. Dedicated analytics warehouses like Panoply can be just as useful in a pure Python/pandas workflow—or one that mixes both approaches, like we talked about above.

Whatever data you’re trying to analyze, and however you’re trying to do it, Panoply’s managed ETL pipelines and data warehousing can make it easier and lower maintenance. In the next post, we’re going to show exactly how you can combine Panoply, SQL, and pandas into a super effective analytics stack.