Sometimes, the only thing standing between an analyst and a catastrophic management decision could be knowing the difference between the tables:
Effective naming conventions for your data warehouses allow data analysts to do an exponentially better job—but it’s far from easy.
In this article, you’ll learn the best practices for data warehouse naming conventions and generate sample names to demonstrate them.
Why are data warehouses important?
First, it’s important to define what a data warehouse is. On the surface, an enterprise data warehouse (EDW) looks almost identical to a database—both are accessible via a structured query language (SQL).
But that’s where the similarities end; a data warehouse is fundamentally different from a database.
A database is responsible for managing, storing, retrieving, and updating data in a structured, tabular context. It often supports the automation of software applications’ activities and processes. In this context, a database is most relevant for software engineers.
On the other hand, when data analysts need data, they’re rarely interested in creating new records. Instead, their primary concern is consuming existing data to support the decision-making process.
That’s where the EDW comes in. It’s not suited for quick-write operations, but it’s excellent for reading data at mind-boggling speed.
But speed isn’t the only thing that matters.
For a data professional to be truly productive, they structure the data comprehensively—this is where naming conventions come into play.
Data warehouse naming conventions
When setting up data pipelines, it’s tempting for data engineers to add mental cues to tables and columns. But the truth is that business users don’t care about versions, partitions, or table owners—let alone cryptic internal acronyms.
For them, a table named
_wub\_v2\_date\_johndoe_ is useless.
It would be best if you always created a data warehouse’s naming conventions for analysts’ productivity first and foremost.
Here are some helpful guidelines to follow.
Relevant for all data objects
The following naming conventions are relevant for all data objects:
The single rule all analytics engineers and database architects should live by is consistency. Set naming conventions, and don’t ever deviate from them. That’s how you make it easy for analysts to find the data assets they need.
Without consistency, everything else is irrelevant.
Corporate meetings and reports are bloated with acronyms. While acronyms might save experienced analysts a few keystrokes, it makes a beginner’s job unnecessarily difficult.
Only use approved acronyms that are expected to be known inside your organization.
This topic has generated considerable conflict among analysts—for example, Snowflake stores all objects by their uppercase version, by default. BigQuery, on the other hand, is case insensitive—its object names default to lowercase. But let’s be honest: SQL keywords are designed with uppercase in mind.
With this in mind, give your data objects a lowercase name if you have the choice. The humans reading the queries afterward will thank you for it.
Separation of words
Underscores, hyphens, spaces?
Microsoft’s tabular data ecosystem (SQL Server, MS Access, Excel) supports column names with spaces in them, and they can be referred to by using square brackets. Snowflake supports it too, and space-separated column names can be referred to using double quotes.
The same goes for hyphens: most EDWs support them, but they have to be encapsulated. That means extra keystrokes, and every keystroke that doesn’t aid query descriptiveness should be avoided. The takeaway: use underscores.
Relevant for databases, tables, and views
The following naming conventions are relevant for databases, tables, and views:
A data warehouse often contains data from many data sources, supporting users of all expertise levels. Compartmentalization can be achieved on the database(s) level or a level below.
- In Snowflake, this is referred to as a
- In BigQuery, it’s referred to as a
There are use cases for both compartmentalization on a database or schema-level, and it often has to do with how a company is organized:
_malwart_.sales_.transactions\__<further specification in table name\>
_sales_._transactions_.<further specification by table name\>
In an ELT context, when using tools such as dbt or Panoply, the “stage” of the data object is also relevant.
Some tables only act as sinks for an extraction job.
Other tables are only intermediate because multiple output tables rely on them. These tables aren’t relevant for analysts, but they are for analytics engineers because the essential compartmentalization also represents the ordinality of the tables—even when they’re hidden from the final consumers of the data.
The granularity of a table refers to what each record represents.
In a time series, it could be an aggregation per second, day, or quarter. In retail, transactions and line items are very relevant—but it could be clicks, heartbeats, tickets, stars, etc.
The table name should reflect that granularity:
- This table’s rows contain the individual line items of the transactions. Each row in the table represents a physical row on a receipt.
- This table’s rows contain the individual transactions and represent a physical ticket.
- This table contains returned items (e.g. because they were broken, or because of the return policy).
Prefixes vs. suffixes
Traditionally, OLAP cubes worked with fact tables and dimension tables. However, since most EDWs are columnar, all data is flattened, meaning facts and dimensions are part of the same tables.
For this reason,
fact\_ tables are disappearing.
On the other hand,
_views_ and _materialized views_ often have a
_\_v_ or _\_mv_ suffix.
Using suffixes instead of prefixes ensures related tables are grouped when ordered alphabetically.
Singular vs. plural
Of all naming conventions, this is the only one where personal taste is the sole determinant.
While plural naming makes more sense in that a table’s records describe multiple data objects, singular naming results in more intuitive queries when the query engine requires you to specify the table name in some clauses.
customer.country\_of\_residence = 'Denmark'
Relevant for columns
The following naming conventions are relevant for columns:
- Relevancy: Only include meaningful columns. There’s no point adding 3 primary keys from various source systems; that will only confuse analysts or reduce trust in the data. If something isn’t relevant for analytical purposes, don’t include it.
- Standardization: Frustration is ensured when a column is called _macroregion_ in one table and _region\_level\_1_ in another. That’s why a data object should have only one name; columns in different tables describing the same thing should be named identically.
- Descriptiveness: Columns named _weight_, _name_, or _city_ are useless outside the context of the table they’re in. Without knowing the granularity of the table, it’s impossible to know exactly what is weighed or named. Even then, a column named _city_ could describe the city of residence, birth, delivery, etc. For this reason, reduce confusion by adding context: use _body\_weight_, _first\_name_, and _city\_of\_residence._.
- Decoded flags: Forcing analysts to use Boolean (1/0 or true/false) or numerical flags is a recipe for disaster. If your company sells 3 plans, is 3 the most expensive plan, or 1? Should 0 be included? What if a plan’s content changes: Is it still the same plan? It’s advisable to rename flags to meaningful values: “free plan,” “standard plan,” and “enterprise plan” can be understood by all.
Let’s summarize the core naming conventions you’ve learned:
- Use descriptive naming that’s beginner-friendly
- Keep object names lowercase
- Use underscores as separators
- Compartmentalize domains and transformation stages in databases and schemas
- Object names should reflect their granularity
- Use suffixes instead of prefixes
- Only include relevant columns
- Standardize column names
- Contextualize columns to make them understandable outside their table
- Decode flags to meaningful values
You should now have a good understanding of best practices for data warehouse naming conventions.
While each naming convention practice we’ve discussed has its place, consistency is the most important. The other components should generally fall into place if you’re consistent with your naming conventions.
Panoply is a seamless, all-in-one solution for data organization and synchronization. In addition, Panoply can help take care of your data warehouse maintenance, so you and your team can focus on data analysis.