A data dictionary defines all the critical metrics and terms your business uses. It's a simple idea that dramatically impacts how effectively your business can use its data.
But how do you create a data dictionary?
I'll be answering that and more in this article.
But first, let's go over the reasons you need a data dictionary.
Simply put, a data dictionary gets everyone on the same page regarding metric and term definitions in your company. Without a data dictionary, there is no single source of truth for employees to refer to, and different metrics may be interpreted and used in opposing ways.
The finance team may use a definition of 'cost of acquisition' that is different from the product team, which may use a different definition of 'engagement' from the data science team, and so on.
Unsurprisingly, teams can get territorial about their definitions (especially if changing them would make their numbers look worse).
It is best practice to create a data dictionary and thus a single source of truth for all your definitions. This process is one of the most valuable things the data team can provide for the company.
Let's dive into the best (and worst) methods you can use to create your data dictionary.
In this section, I'll focus on comparing data dictionary tools; if you want to know what info your data dictionary should contain, skip ahead to the final section titled "Best Data Dictionary Guide."
At first glance, our old friend, the spreadsheet seems like a straightforward method.
A spreadsheet is pretty fast to set up, but that is about the only positive we can think of; once you have created the first version, the nightmare begins.
Let's look at some of the problems.
How do you manage it?
Having the data team control everything could create a bottleneck, but allowing everyone to edit could lead to incorrect changes.
Some other problems to think about:
And on and on it goes.
Spreadsheets are better than nothing but will cause you headaches if you use them for any prolonged period. Even though they are speedy to set up, managing them will take much more time than using a tool designed for the purpose.
So, we recommend using spreadsheets as a last resort or as an intermediary while you get a more sophisticated tool set up.
All database management systems (DBMSs) give you the option to annotate your data, which is like a mini data dictionary service built into the system.
You can write comments or descriptions about all the data within the database. Plus, you can modify and track the comments (so maintenance is much easier than with a spreadsheet).
In addition to this, the data and its explanation are close together. This closeness reduces friction and means more users will serve themselves rather than opening tickets with the Data team. Great!
But what are the issues with database annotations?
For one thing, your DBMS provider limits the amount of information you can include in comments. You can only use the fields your database gives you, which may not be what you want.
If you have everything stored in one place, this issue is probably manageable, but if you use multiple databases, you will not have a single source of truth. Users would then have to hunt through databases to find definitions which adds a ton of friction; the editors are also pretty unwieldy.
The biggest problem is the limited space that DBMS comments give you, which is why you should note the difference between database annotations and data dictionaries.
The former provides a brief (often technical) overview of the term, perhaps using SQL syntax; the latter is often wordier and written in less technical language for a more general audience.
For example, how do you explain that the NaN values in September 2020 are there because of a known server error? How do you tell others you already know the query is inefficient, but you are happy with the speed right now for X and Y reasons?
These wordier questions are where database annotations stop, and data dictionaries take over.
While database annotations are a vital part of your data architecture, they are not a replacement for a data dictionary; they complement them.
Due to the unwieldy nature of the built-in database annotation tools discussed above, some companies have created database documentation tools.
These are purpose-built and designed to make documenting your database a breeze.
Most of them can connect to multiple databases from different providers simultaneously, read the data and automatically generate documentation. They also have write access; thus, you can update database annotations from within the tool.
This combination makes maintaining your docs a million times easier since whenever there is a change to the database, the docs are automatically updated and vice versa.
Moreover, you have a single source of truth because your metadata is stored in a separate, searchable repo from which all users can get answers. You also have total control over how many fields you want to include and what you want them to be titled.
Goodbye database annotation limitations!
It’s also considered best practice to create Entity Relationship Diagrams (ERDs) alongside your data dictionaries. Thankfully, many of these tools automatically generate ERDs so you can easily see how tables are related to each other.
Older tools (like RedGate SQL Doc) look quite outdated, but the new ones (like dbdocs and Dataedo) look slick and professional and have a great UX.
A few companies that find these tools useful are Coca-Cola, Deloitte, and Fujitsu which use Dataedo.
Are these tools free?
So, the one 'downside' is that database documentation tools cost money, but so do terrible tools!
You may not pay a monthly subscription, but your productivity falls, and analytical chaos could start reigning supreme if you’re not careful.
Database documentation tools are a great all-around choice to create a data dictionary (and more!) in one place. However, different users may require different levels of detail about the data.
For example, CEOs probably want high-level KPIs, and data engineers want as much info as possible.
So, let’s look at some additional tools you can use to cater to these different business users.
Most, if not all, of the best BI tools, let you:
These additions are not a data dictionary per se, but they fill that role for non-technical users.
Technical users may be happy to go hunting through SQL code to find answers, but non-technical users will not. Their goal is to play around in the BI tool with data and get answers.
We think it's vital that your BI tool is well-annotated with up-to-date information.
Non-technical users should be able to hover over things and get pop-ups telling them what they need to know. They don't need to know the complex data structures underneath the tables they are viewing; they're just looking for enough info to be able to use them!
However, more information is not better in this case. Keep it terse, or non-techies may get overwhelmed.
Note that these tooltips live with the data, not in a super detailed, separate repository. This closeness reduces friction with non-technical users and empowers them to do more work and get more insights from their own data analysis; this is the outcome you want!
The most technical users (I'm looking at you Data Scientists/Engineers) will want more than a data dictionary.
Since their work is inherently experimental, they'll benefit from a knowledge repo containing the code results of their experiments. Seeing how someone else used a dataset is the fastest way to learn how to use it yourself.
Finally, they may also want access to Airflow DAGs and the database architecture design docs to understand every design decision's reasoning.
We don't think they'll be looking at the latter very often, but it's good to keep in mind!
Now that you’ve seen all the tool options available: let’s look at some guidelines for developing a data dictionary.
First, here’s a summary of your options:
Here are some extra tips to make variable names easier to create and manage:
Whatever tool you choose, here is a checklist of info you want your data dictionary to contain:
Interested in learning more? Get in touch.