Sometimes, the only thing standing between an analyst and a catastrophic management decision could be knowing the difference between the tables:
_web\_user\_behavior\_v2_
and
_web\_user\_behavior\_new_
Effective naming conventions for your data warehouses allow data analysts to do an exponentially better job—but it’s far from easy.
In this article, you’ll learn the best practices for data warehouse naming conventions and generate sample names to demonstrate them.
First, it’s important to define what a data warehouse is. On the surface, an enterprise data warehouse (EDW) looks almost identical to a database—both are accessible via a structured query language (SQL).
But that’s where the similarities end; a data warehouse is fundamentally different from a database.
A database is responsible for managing, storing, retrieving, and updating data in a structured, tabular context. It often supports the automation of software applications’ activities and processes. In this context, a database is most relevant for software engineers.
On the other hand, when data analysts need data, they’re rarely interested in creating new records. Instead, their primary concern is consuming existing data to support the decision-making process.
That’s where the EDW comes in. It’s not suited for quick-write operations, but it’s excellent for reading data at mind-boggling speed.
But speed isn’t the only thing that matters.
For a data professional to be truly productive, they structure the data comprehensively—this is where naming conventions come into play.
When setting up data pipelines, it’s tempting for data engineers to add mental cues to tables and columns. But the truth is that business users don’t care about versions, partitions, or table owners—let alone cryptic internal acronyms.
For them, a table named _wub\_v2\_date\_johndoe_
is useless.
It would be best if you always created a data warehouse’s naming conventions for analysts’ productivity first and foremost.
Here are some helpful guidelines to follow.
The following naming conventions are relevant for all data objects:
Consistency
The single rule all analytics engineers and database architects should live by is consistency. Set naming conventions, and don’t ever deviate from them. That’s how you make it easy for analysts to find the data assets they need.
Without consistency, everything else is irrelevant.
Beginner-friendly naming
Corporate meetings and reports are bloated with acronyms. While acronyms might save experienced analysts a few keystrokes, it makes a beginner’s job unnecessarily difficult.
Only use approved acronyms that are expected to be known inside your organization.
Casing
This topic has generated considerable conflict among analysts—for example, Snowflake stores all objects by their uppercase version, by default. BigQuery, on the other hand, is case insensitive—its object names default to lowercase. But let’s be honest: SQL keywords are designed with uppercase in mind.
With this in mind, give your data objects a lowercase name if you have the choice. The humans reading the queries afterward will thank you for it.
Separation of words
Underscores, hyphens, spaces?
Microsoft’s tabular data ecosystem (SQL Server, MS Access, Excel) supports column names with spaces in them, and they can be referred to by using square brackets. Snowflake supports it too, and space-separated column names can be referred to using double quotes.
On the other hand, BigQuery doesn’t, and neither does Amazon Redshift.
The same goes for hyphens: most EDWs support them, but they have to be encapsulated. That means extra keystrokes, and every keystroke that doesn’t aid query descriptiveness should be avoided. The takeaway: use underscores.
The following naming conventions are relevant for databases, tables, and views:
Compartmentalization
A data warehouse often contains data from many data sources, supporting users of all expertise levels. Compartmentalization can be achieved on the database(s) level or a level below.
_schema_
_dataset_
There are use cases for both compartmentalization on a database or schema-level, and it often has to do with how a company is organized:
_malwart_.sales_.transactions\__<further specification in table name\>
_sales_._transactions_.<further specification by table name\>
Ordinality
In an ELT context, when using tools such as dbt or Panoply, the “stage” of the data object is also relevant.
Some tables only act as sinks for an extraction job.
Other tables are only intermediate because multiple output tables rely on them. These tables aren’t relevant for analysts, but they are for analytics engineers because the essential compartmentalization also represents the ordinality of the tables—even when they’re hidden from the final consumers of the data.
Granularity
The granularity of a table refers to what each record represents.
In a time series, it could be an aggregation per second, day, or quarter. In retail, transactions and line items are very relevant—but it could be clicks, heartbeats, tickets, stars, etc.
The table name should reflect that granularity:
_sales.transactions.line\_items_
_sales.transactions.transactions_
_sales.transactions.returns_
Prefixes vs. suffixes
Traditionally, OLAP cubes worked with fact tables and dimension tables. However, since most EDWs are columnar, all data is flattened, meaning facts and dimensions are part of the same tables.
For this reason, dim\_
and
fact\_
tables are disappearing.
On the other hand, _views_ and _materialized views_
often have a _\_v_ or _\_mv_
suffix.
Using suffixes instead of prefixes ensures related tables are grouped when ordered alphabetically.
Singular vs. plural
Of all naming conventions, this is the only one where personal taste is the sole determinant.
While plural naming makes more sense in that a table’s records describe multiple data objects, singular naming results in more intuitive queries when the query engine requires you to specify the table name in some clauses.
For example:
SELECT
WHERE
customer.country\_of\_residence = 'Denmark'
The following naming conventions are relevant for columns:
Let’s summarize the core data warehouse naming conventions you’ve learned:
You should now have a good understanding of best practices for data warehouse naming conventions.
While each naming convention practice we’ve discussed has its place, consistency is the most important. The other components should generally fall into place if you’re consistent with your data warehouse naming conventions.
Panoply is a seamless, all-in-one solution for data organization and synchronization. In addition, Panoply can help take care of your data warehouse maintenance, so you and your team can focus on data analysis.