Designing and creating a data dictionary for your company is a big process with big returns. It's something you need to do if you want deep insights from your data and have a high-functioning analytics team.
While you can use one of several tools to create a data dictionary, in this article, we'll discuss the best practices for developing one to ensure your project is successful.
But first, let's go over the reasons you need a data dictionary.
A data dictionary gets everyone on the same page regarding metric and term definitions in your company. Without a data dictionary, there is no single source of truth for employees to refer to, and different metrics may get interpreted and used in opposing ways.
The finance team may use a definition of 'cost of acquisition' that is different from the product team, which may use a different definition of 'engagement' from the data science team, and so on.
Unsurprisingly, teams can get territorial about their definitions (especially if changing them would make their numbers look worse).
A good guideline is to distinguish between database annotations and data dictionaries.
For example, how do you explain that the NaN values in September 2020 are there because of a known server error?
Or, how do you tell others you already know the query is inefficient, but you are happy with the speed right now for X and Y reasons?
These wordier questions are where database annotations stop, and data dictionaries take over. Any junior, non-technical employee should be able to read it and understand what is going on.
It's best practice to create a data dictionary and thus a single source of truth for all your definitions. It's one of the most valuable things a data team can provide for the company.
There are many ways to make a data dictionary, but there is one process that we know works (because we've implemented it ourselves), and we think you would benefit from it.
The main ideas come from Carl Anderson, but we have modified them slightly to fit our needs (particularly the final steps).
The first step is to gather all the terms your business uses.
You are not collecting definitions (that will come later) but a list of all jargon and terms living in your BI tools and databases used by each team.
Go to every team and ask to look at their dashboards and the data they work with. Look at their charts and KPIs and note everything you see on every axis and table heading.
Store them in a table like this:
Team |
Term |
Data Type |
Measure / Dimension |
Example |
Data Source |
Notes |
Sales |
Sales Qualified Lead |
Integer |
Measure |
25 |
Salesforce |
|
Marketing |
CPC |
Float |
Measure |
$1.42 |
FB Ads Manager |
Cost-Per-Click |
Note that this is not the time for detailed descriptions; you are just gathering terms. The time for explanations and definitions will come, but doing it now will slow you down unnecessarily.
It makes sense to group terms by their team, e.g., keep all the sales team metrics close together.
There will probably be generic metrics like 'year' or 'date' or 'city' that you can put to one side as everyone uses them. But only put them aside once you know they are the same across the business!
You don't want one team using the 'date' based on UTC and another on PT.
Make the business team review the list because there may be terms they haven't seen before. This process is an excellent opportunity to get everyone on the same page.
Now that you have collected all the terms, it's time for you (the Data / BI Team) to create the first draft of the definitions.
Look at the documentation that already exists and put together as many as you can. There are two reasons to do this.
First, it saves other teams from having to do it, and second, it gives you a complete overview of the company's documentation and how each team manages their data knowledge.
Maybe the finance team has excellent documentation, but it's on a system siloed away from everyone else that no one can understand. Or perhaps HR uses no data at all.
This knowledge will help you to ease adoption further down the road.
Once you have done all you can, then (and only then!) should you sit down with the teams and go through the definitions you've created. You will make more progress more quickly if you start with an explanation of some kind rather than an empty space.
Ask the teams, "how should this term be defined?" Do not ask for the current definition.
You are building a data dictionary to get your company closer to the perfect data setup. Don't let them settle for current definitions if they would not be defined that way in an ideal world.
This is the moment you have been waiting for, the reason you need data dictionaries.
At this point, each team has approved the definitions you have written down. Now it's time for you to sit down, go through every definition, and find the ones that conflict with one another.
Make a list and arrange meetings with the teams that conflict.
Get all the teams in a room that have conflicting definitions about a particular term. We will assume that, at minimum, there are two teams in conflict (though there could be more).
Get them to talk about what they differ on and their reasons for doing so.
There are three possible outcomes of the meeting:
Nothing is going to happen unless upper management agrees to it.
It would be pointless putting in all this effort to create a unified data dictionary if upper management thinks it's a waste of time. Without support from above, there's a risk that teams will make their own data dictionaries, and now you're back to the drawing board.
Getting upper management on board increases the chances of your project's success dramatically.
The simplest option for publishing a data dictionary is a Google Doc, and a little better option is a searchable online Wiki, like Confluence. The best option is to use a data dictionary documentation tool such as Dataedo (paid) or dbdocs (free).
Whatever tool you choose, make sure the whole company has access to it and that they all hear about it.
From this point forward, everyone in the company needs to be using these definitions whenever they work with data. If someone disagrees or wants to modify a definition, there must be a process for them to follow.
You also need to make sure to write database annotations and add comments to your BI tool.
These additions provide different layers of complexity to your users.
The most technical users (Data Engineers/Data Scientists) require the highest level of sophistication. In contrast, non-technical business users will want more basic explanations that enable them to perform their own data analysis.
It's best practice to update everything simultaneously.
Ideally, you will set up an automatic data dictionary generation pipeline that's built-in with most database documentation tools. For other methods, use something like Jenkins.
New terms will be defined, and old terms will get updated.
Data dictionaries are dynamic documents, and you need a robust system to manage the regular updates that will occur.
You'll need to involve several teams:
You must clearly communicate all these changes to the company to ensure everyone stays on the same page. You also need to log these changes to see how the dictionary evolves.
Once you have made the first version of the data dictionary, the work does not stop; it has just begun!
Keeping everything in sync is critical.
If you let systems get out of sync, you will be back at square one and have different teams using different definitions, and be left without a single source of truth.
So there you have it; the best practices we recommend you follow to create your data dictionary.
This process will likely take several months as it requires working with literally every team in the company, organizing talks between them, and getting top-level sign-off.
We recommend doing all of this in one go.
Do not make separate data dictionaries for each team. One of the main reasons you are doing this is to identify conflicts between teams.
Suppose you already created a sales data dictionary three months ago, and there is a conflict (or many!) with the marketing team. They won't be happy sitting down and hashing it out over definitions they thought were already approved. When the whole company is working towards this goal simultaneously, having these discussions is much easier.
Finally, set a date in the future where everything will change.
On that date, you will move from inconsistent and conflicting interpretations to consistent and clear definitions from which the whole company will benefit. It's a big moment, one when you step up your data game and become a truly data-driven company ready to go out and change the world.
Good luck, we know you can do it!