A data dictionary defines all the critical metrics and terms your business uses. It's a simple idea that dramatically impacts how effectively your business can use its data.
But how do you create a data dictionary?
- Are there data dictionary best practices or guidelines to follow?
- What are the pros and cons of each?
- And can you show me the best data dictionary template so that I'm not entirely lost?
I'll be answering that and more in this article.
But first, let's go over the reasons you need a data dictionary.
What is a data dictionary and why do you need one?
Simply put, a data dictionary gets everyone on the same page regarding metric and term definitions in your company. Without a data dictionary, there is no single source of truth for employees to refer to, and different metrics may be interpreted and used in opposing ways.
The finance team may use a definition of 'cost of acquisition' that is different from the product team, which may use a different definition of 'engagement' from the data science team, and so on.
Unsurprisingly, teams can get territorial about their definitions (especially if changing them would make their numbers look worse).
It is best practice to create a data dictionary and thus a single source of truth for all your definitions. This process is one of the most valuable things the data team can provide for the company.
How to create a data dictionary using various methods
Let's dive into the best (and worst) methods you can use to create your data dictionary.
In this section, I'll focus on comparing data dictionary tools; if you want to know what info your data dictionary should contain, skip ahead to the final section titled "Best Data Dictionary Guide."
At first glance, our old friend, the spreadsheet seems like a straightforward method.
- The rows of your excel file will be the columns/terms/variables from your database. All you need to do is copy/paste them over.
- The columns of your excel file will describe and explain each of the terms
A spreadsheet is pretty fast to set up, but that is about the only positive we can think of; once you have created the first version, the nightmare begins.
Let's look at some of the problems.
How do you manage it?
Having the data team control everything could create a bottleneck, but allowing everyone to edit could lead to incorrect changes.
Some other problems to think about:
- How do you stop people from making a local copy and adding their own custom changes?
- How do you track changes?
- How do you alert people to new changes?
- How do you ensure that the right level of detail gets shown to the right users?
- Data Scientists want data in all its gory detail, but non-technical users simply want KPIs.
And on and on it goes.
Spreadsheets are better than nothing but will cause you headaches if you use them for any prolonged period. Even though they are speedy to set up, managing them will take much more time than using a tool designed for the purpose.
So, we recommend using spreadsheets as a last resort or as an intermediary while you get a more sophisticated tool set up.
All database management systems (DBMSs) give you the option to annotate your data, which is like a mini data dictionary service built into the system.
You can write comments or descriptions about all the data within the database. Plus, you can modify and track the comments (so maintenance is much easier than with a spreadsheet).
In addition to this, the data and its explanation are close together. This closeness reduces friction and means more users will serve themselves rather than opening tickets with the Data team. Great!
But what are the issues with database annotations?
For one thing, your DBMS provider limits the amount of information you can include in comments. You can only use the fields your database gives you, which may not be what you want.
If you have everything stored in one place, this issue is probably manageable, but if you use multiple databases, you will not have a single source of truth. Users would then have to hunt through databases to find definitions which adds a ton of friction; the editors are also pretty unwieldy.
The biggest problem is the limited space that DBMS comments give you, which is why you should note the difference between database annotations and data dictionaries.
The former provides a brief (often technical) overview of the term, perhaps using SQL syntax; the latter is often wordier and written in less technical language for a more general audience.
For example, how do you explain that the NaN values in September 2020 are there because of a known server error? How do you tell others you already know the query is inefficient, but you are happy with the speed right now for X and Y reasons?
These wordier questions are where database annotations stop, and data dictionaries take over.
While database annotations are a vital part of your data architecture, they are not a replacement for a data dictionary; they complement them.
Database documentation tools
Due to the unwieldy nature of the built-in database annotation tools discussed above, some companies have created database documentation tools.
These are purpose-built and designed to make documenting your database a breeze.
Most of them can connect to multiple databases from different providers simultaneously, read the data and automatically generate documentation. They also have write access; thus, you can update database annotations from within the tool.
This combination makes maintaining your docs a million times easier since whenever there is a change to the database, the docs are automatically updated and vice versa.
Moreover, you have a single source of truth because your metadata is stored in a separate, searchable repo from which all users can get answers. You also have total control over how many fields you want to include and what you want them to be titled.
Goodbye database annotation limitations!
It’s also considered best practice to create Entity Relationship Diagrams (ERDs) alongside your data dictionaries. Thankfully, many of these tools automatically generate ERDs so you can easily see how tables are related to each other.
What's a good data documentation tool?
A few companies that find these tools useful are Coca-Cola, Deloitte, and Fujitsu which use Dataedo.
Are these tools free?
- Dbdocs is currently in beta and is free; they plan to release a paid version.
- Dataedo is at least $59/user/month.
So, the one 'downside' is that database documentation tools cost money, but so do terrible tools!
You may not pay a monthly subscription, but your productivity falls, and analytical chaos could start reigning supreme if you’re not careful.
Database documentation tools are a great all-around choice to create a data dictionary (and more!) in one place. However, different users may require different levels of detail about the data.
For example, CEOs probably want high-level KPIs, and data engineers want as much info as possible.
So, let’s look at some additional tools you can use to cater to these different business users.
Using your BI/Analytics tools
Most, if not all, of the best BI tools, let you:
- add descriptions to your data, KPIs, and charts,
- specify data types,
- indicate if a field is a dimension or a measure,
- and do many other useful things.
These additions are not a data dictionary per se, but they fill that role for non-technical users.
Technical users may be happy to go hunting through SQL code to find answers, but non-technical users will not. Their goal is to play around in the BI tool with data and get answers.
We think it's vital that your BI tool is well-annotated with up-to-date information.
Non-technical users should be able to hover over things and get pop-ups telling them what they need to know. They don't need to know the complex data structures underneath the tables they are viewing; they're just looking for enough info to be able to use them!
However, more information is not better in this case. Keep it terse, or non-techies may get overwhelmed.
Note that these tooltips live with the data, not in a super detailed, separate repository. This closeness reduces friction with non-technical users and empowers them to do more work and get more insights from their own data analysis; this is the outcome you want!
Beyond data dictionaries
The most technical users (I'm looking at you Data Scientists/Engineers) will want more than a data dictionary.
Since their work is inherently experimental, they'll benefit from a knowledge repo containing the code results of their experiments. Seeing how someone else used a dataset is the fastest way to learn how to use it yourself.
Finally, they may also want access to Airflow DAGs and the database architecture design docs to understand every design decision's reasoning.
We don't think they'll be looking at the latter very often, but it's good to keep in mind!
Our best data dictionary guide
Now that you’ve seen all the tool options available: let’s look at some guidelines for developing a data dictionary.
First, here’s a summary of your options:
- Spreadsheet - Quick and easy but not a good solution for the long term.
- Database annotations - Add comments and quick explanations into your DBMS software. Essential for more technical users.
- Database documentation tools - Fully-fledged solution and well-working solution to document your database, e.g., Dataedo (paid) or dbdocs (free)
- In your BI/Analytics tool - Similar to database annotations but in your BI tool of choice. Your non-technical users will love you for this!
- ERDs, DAG diagrams, and architecture design docs - Extra detailed information for your most technical users.
Here are some extra tips to make variable names easier to create and manage:
- Use consistent naming standards - CamelCase or joined_with_underscores are the most common.
- Keep abbreviations to a minimum - e.g., prefer TotalRevenue to TotRev
- Favor long descriptive names over short and (potentially) confusing ones - Your data dictionary is there to shine light into the proverbial darkness that is your database. Don't make it more confusing than it already is.
- Auto-generate your data dictionary - It will be a nightmare for you and your company to set up everything by hand. Plus, maintenance will only get more difficult as your business scales. So, either use built-in automatic generation features or set up an automation pipeline with something like Jenkins.
Whatever tool you choose, here is a checklist of info you want your data dictionary to contain:
- Term - The term itself
- Department - The department(s) that own/use this term.
- Term Owner - Who owns this term? Individual or team?
- Contact Info (Term Owner) - The best way to contact the term owner, e.g., email or phone number.
- Data Type - The data type, e.g., string, int, float.
- Dimension / Measure - Is this a dimension or a measure?
- Description - An explanation of the term. What does it measure/describe?
- Example - An example value, e.g., ‘email@example.com’ or 13.45.
- Allowed values - The range of values this term can take, e.g., integers in the range [0, 150].
- Data source - Where the data came from, e.g. Salesforce, Google Analytics, Facebook Ads Manager.
- Calculated or Raw? - Is this term created from other terms (calculated), or is it a raw term from a data source?
- How Calculated? - Give basic business logic to calculate this term (if possible), e.g., Profit = Revenue - Costs.
- Last Updated - The date and time the data dictionary was last updated. It will change a lot, so you need to keep track of this!
- Change log - Log of past changes, when they happened, and what they were.
Interested in learning more? Get in touch.