SQL and pandas are both invaluable tools for any analyst or data scientist. They tend to be mentioned in the same breath when talking about data manipulation and analysis—and can do a lot of the same things—but they have pretty distinct roles in the data workflow.
Today we’re going to take a look at when to use SQL vs pandas, what the differences are between them, and how to translate basic queries from SQL into Python-based pandas queries and vice versa.
If you somehow found your way here without knowing much about SQL, or know pandas mostly as an autoimmune disorder in children, we’ll break it down super quickly here. If you’re already familiar with the basics, feel free to skip ahead. We won’t mind.
SQL, first and foremost, is its own language. In fact, SQL stands for Structured Query Language. For data analysis and management, SQL makes for a nice, straightforward, and easy-going language.
Once you get the hang of some of SQL’s idiosyncrasies, it becomes incredibly easy to pull data points out of their respective tables, join them, manipulate them and move them to new tables. Most databases you’re likely to come across as an analyst or engineer are going to be some flavor of SQL.
pandas is a Python package that has become an essential tool for data scientists, data analysts, and Python devs who have to interact with data. It's built on top of some powerful Python libraries, like NumPy, which allows it to squeeze extra power and performance out of what might otherwise be fairly slow pure Python code.
With pandas, you can easily create, manipulate and combine data frames right in your Python scripts, and its direct integration with matplotlib, Python’s original data visualization library, makes it easy to plot your data once you’ve got it where you want it.
Now that we’ve covered the basics, let’s talk about where these two tools fit in your tech stack. Obviously, these choices are subject to a strong degree of personal preference, but a good general framework for figuring out where SQL and pandas fit in your life is to think about it in terms of the size of the data you’re going to be working with.
SQL, being a language designed to work directly with a relational database, is better suited for work with the large, separate tables you might find there, and doesn't really have an upper limit when it comes to the amount of data it can reliably manipulate. pandas, on the other hand, has to hold your data in memory, so it will most likely choke on anything much bigger than what fits comfortably in your machine's RAM—but that's fine, that's not really what you want it for!
At the most basic level, a good way to use both SQL and pandas effectively in your data workflow would be something like this: use SQL to do the heavy lifting of pulling, joining, and aggregating data inside your database or dedicated analytics warehouse, then load a manageable subset of the results into pandas for finer-grained analysis and visualization.
Of course, this is just a rule of thumb. There’s plenty of overlap, in terms of capabilities, between the two, and plenty of analysts do their work entirely in SQL or pandas. But used together, SQL and pandas can make for a powerful and efficient tech stack for dealing with large, complex datasets.
For those of you who are familiar with SQL but have never used pandas, pandas might still seem a bit foreign. For those of you who are fluent in pandas but don’t know a lick of SQL: what are you doing with your life? Either way, this next section will show you how to translate commands between SQL and pandas to help you get set up with both of these powerful analytic tools.
Let's look at basic data exploration first. Suppose you have a hypothetical table, sales, and you want to just get a general feel for what's in it. You don't remember exactly which columns are in this table, and you just want to see the first 15 or so records. In MS SQL Server, you might run:
SELECT TOP 15 * FROM sales
In PostgreSQL or MySQL, you would run:
SELECT * FROM sales LIMIT 15
But in pandas, assuming you loaded your sales data into a dataframe called sales, the easiest way to do it is simply to run:
sales.head(15)
The .head() method in pandas is a built-in means of getting the top n records in a dataframe. You can also run .tail(), which will show you the last n records in the dataframe.
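If you want to follow along, here's a rough sketch of how you might get the hypothetical sales table into a dataframe in the first place. The file name and connection string below are placeholders, not anything from a real project:

import pandas as pd

# Load the hypothetical sales data from a CSV export...
sales = pd.read_csv('sales.csv')

# ...or pull it straight out of a SQL database with read_sql
# from sqlalchemy import create_engine
# engine = create_engine('postgresql://user:password@host:5432/dbname')
# sales = pd.read_sql('SELECT * FROM sales', engine)

sales.head(15)  # first 15 rows, like SELECT TOP 15 / LIMIT 15
sales.tail(15)  # last 15 rows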
Now that we've covered how to scope out the top of the table, let's dig into how to filter your data for the specific records you're looking for. In SQL, this is done using the WHERE clause. So, if we return to our hypothetical sales table, here's how we could select specific sales records by product ID in SQL:
SELECT product_name, product_id, quantity, unit_price, total_price FROM sales WHERE product_id = 123456
This query will return the product name, product ID, number sold, price per unit, and total price of the transaction for each transaction record where the product ID matches the one you provide. Note that, in the above example, product_id is stored as a number. If you were trying to match a string, you would need to enclose the filter criteria in quotes.
In pandas, things look a bit different:
sales[sales['product_id'] == 123456][['product_name', 'product_id', 'quantity', 'unit_price', 'total_price']]
This will return the same data as the SQL query above. You can increase the complexity of your filters by adding conditions with the AND clause in SQL:
SELECT product_name, product_id, quantity, unit_price, total_price FROM sales WHERE product_id = 123456 AND quantity > 5
Or the & operator in pandas:
sales[(sales['product_id'] == 123456) & (sales['quantity'] > 5)][['product_name', 'product_id', 'quantity', 'unit_price', 'total_price']]
As you can see, you can do a lot in a single line in pandas, which is great! It's not always super easy to read, though, so if you're new to pandas, you can think about the code here like this: the expression (sales['product_id'] == 123456) & (sales['quantity'] > 5) builds a logical mask, tagging every row as either True or False according to whether or not it meets the criteria in your query. Wrapping that mask in sales[...] pulls out all the rows marked True, and the trailing [['product_name', 'product_id', 'quantity', 'unit_price', 'total_price']] selects the specific columns you want to see in the output dataframe. Don't worry, even if it's not as immediately intuitive as a SQL query, you get used to pandas syntax pretty quickly.
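If the one-liner still feels dense, you can break it into named steps. This is just an illustrative sketch, with the intermediate variable names made up for clarity:

# Build the boolean mask: a Series with one True/False value per row
mask = (sales['product_id'] == 123456) & (sales['quantity'] > 5)

# The columns we want in the output
cols = ['product_name', 'product_id', 'quantity', 'unit_price', 'total_price']

# Keep only the rows where the mask is True, then select those columns;
# .loc lets you do both in a single step
filtered = sales.loc[mask, cols]

# Just like in SQL, if product_id were stored as text you would quote it:
# mask = sales['product_id'] == '123456'

filtered.head()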
We touched on joins earlier in the article, but it's time to take a closer look at how they work when you want to combine data from separate sources in SQL and pandas. In SQL, this comes down to the JOIN clause, which merges data from different tables into a single result by matching on a key column that both tables share. There are a couple different types of JOIN in SQL, and we won't go too in depth on that topic here, but there are plenty of resources out there if you're interested in learning more.
Let's look at an example now, turning back to our hypothetical sales table for reference, and pulling in another hypothetical table, customers, for a join. Suppose our sales table has a customer_id column that lets us link it to our customers table, which holds address and demographic information about our customers. If we wanted to know the geographic distribution of sales of product ID 123456, we could do something like this:
SELECT city, state, country, total_price FROM customers JOIN sales ON customers.customer_id = sales.customer_id WHERE sales.product_id = 123456
This query will create a table with the locations of customers who bought item #123456, as well as the total sale price for their purchases. If you wanted to go a step further, you could look at the total sales by country like this:
SELECT country, SUM(total_price) AS total_sales FROM customers JOIN sales ON customers.customer_id = sales.customer_id WHERE sales.product_id = 123456 GROUP BY country ORDER BY total_sales DESC
This would give you the total sales of the product by country, listed in descending order by sales total.
If you wanted to do the same join in pandas, you would need to take advantage of the merge() method. So let's suppose you have two dataframes, sales and customers. You could create a new dataframe that incorporates data from both by doing:
new_df = customers.merge(sales, on='customer_id')
This would give you new_df, a new dataframe with the data from both in it. If you only wanted a subset of the columns, like in the SQL example above, you could do something like this:
geo_sales = customers[['city', 'state', 'country', 'customer_id']].merge(
    sales[['product_id', 'total_price', 'customer_id']], on='customer_id')
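The SQL version above also filtered on product_id, which the merge by itself doesn't do. Here's a rough sketch of how you might finish the job, along with a note on join types:

# Keep only the rows for product 123456, mirroring the WHERE clause in the SQL query
geo_sales = geo_sales[geo_sales['product_id'] == 123456]

# merge() does an inner join by default; pass how='left', how='right',
# or how='outer' if you need one of the other JOIN flavors mentioned earlier,
# e.g. customers.merge(sales, on='customer_id', how='left')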
As you can see, though, the basic implementation of a pandas join isn’t quite as flexible and dynamic as it is in SQL, where you can call up columns at will, use them in joins and produce new tables with a few lines of code. With pandas, joins and merges can be a little bit clunkier.
But if you wanted to, say, group your data by multiple columns for further analysis, it’s quite a bit easier in pandas. These are the sorts of things that make a SQL-first, pandas-second analysis flow most productive—let each tool do the thing it’s best at and your analysis will be a lot smoother for it.
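To make that concrete, here's a rough sketch of what the country-level rollup from the SQL example might look like in pandas, plus a multi-column version, reusing the geo_sales dataframe from the merge above:

# Roughly the pandas equivalent of the GROUP BY query above:
# total sales of product 123456 by country, largest first
total_by_country = (
    geo_sales.groupby('country')['total_price']
    .sum()
    .sort_values(ascending=False)
)

# Grouping by multiple columns is just a matter of passing a list of names
total_by_region = (
    geo_sales.groupby(['country', 'state'])['total_price']
    .sum()
    .reset_index()
)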
So that’s our brief introduction to how to translate SQL queries into pandas queries. Both SQL and pandas can do a ton more than we’ve demonstrated here, so make sure to explore some more in-depth resources on this if you want to learn more. We’re also going to cover some more examples of how to use both SQL and pandas in your data analysis workflow over the next few posts, so check those out as well. For now, though, let’s go back to the big picture:
You might have noticed us mentioning dedicated analytics warehouses above, and how they can make your SQL-based analytics life easier. But it doesn’t just apply to SQL. Dedicated analytics warehouses like Panoply can be just as useful in a pure Python/pandas workflow—or one that mixes both approaches, like we talked about above.
Whatever data you’re trying to analyze, and however you’re trying to do it, Panoply’s managed ETL pipelines and data warehousing can make it easier and lower maintenance. In the next post, we’re going to show exactly how you can combine Panoply, SQL, and pandas into a super effective analytics stack.