Designing a database schema is the first step in building a foundation in data management. Ineffective schema design can creates databases that are heavy consumers of memory and other resources, poorly defined, difficult to maintain and administer. The objective of this discussion is to provide an example of an effective data schema for a data project based on PostgreSQL. This discussion will provide a general foundation in data schema design, illustrated by building the infrastructure of a fantasy universe.
Initially, we will create a database schema that will allow you to record data about a fantasy world. We will start by defining the primary components of this schema design.
After generating our framework we'll discuss how Panoply can pull data from multiple sources and allow you to query them from a single database.
Let's review some of the considerations that can be made when designing a schema:
1. Design a schema logically based on business requirements. This can be defined as building a logical model. This design step will play an important part in how the database is developed. These requirements can translate into useful information that programmers can utilize during the physical design of the database. Part of this process is also defining a set of rules, called constraints. These constraints define how the data in the various tables relate to each other.
2. Define the physical layout of the database. The physical schema defines the location of: data files, logs, controls and other related objects. The schema adds logic and clarity to the environment by clearly defining all of its attributes.
3. Define the security for the schema. Schema based security, bases itself on the user of the schema and determines the tables to which a given user, or given security level, have access.
4. Define and use the appropriate naming conventions. Understanding your data and the attributes surrounding each element is important. Each element of data will need to be considered as you are building out your scope. A successful build will avoid using collations, reserved words, or data types that will hinder development and query performance. Whether you are building based on existing data, or planning for additional data, this step is vital.
In this example let's consider the logical scope of the schema based on the requirements. The physical schema will vary based on the type of storage that is used.
Returning to our illustration, we can conclude that a planet would have a long list of data elements and attributes that need to be grouped. For our example, we will keep this scope relatively small, with a few defined data elements and constraints.
Let's start by defining a table called ‘planet’ and a schema called ‘universe’. This table will contain basic details about the planet including the name, number of continents, and size. The SQL statements shown below will generate the schema and table.
create schema universe;
CREATE TABLE universe.planet (
id int8 NOT NULL,
"name" varchar(255) NOT NULL,
"continents" varchar(255) NOT NULL UNIQUE,
"size" int8 NULL
This is a sample table script.
Let's take a closer look at the planet we are creating. The planet will contain continents and the names of tribes. In this fantasy world, Tribes (factions) can span across more than one continent. In this case we will not be limited to the constraints of a one to one ratio. Tribal data can grow based on a one to many relationship without conflicts. This is an example of how a well-designed schema can enable your data to grow exponentially.
Tribes have more elements we can track such as:
- name (name)
- what continents it's located on (continents)
- how big it is (member_count)
- how much of a given continent it controls (control_percent)
- what species it accepts, or if it accepts all species (species)
Each data element listed can be added to enhance the view of your planet data. As you continue to expand your data, you will be able to analyze all the data you are collecting in your growing schema. Additional source data can be collected such as tracking tribal control across several continents. For example, if you are collecting tribe data from various data sources you may need to have a repository or an additional database to ingest the data.
Once you have a clear understanding of the process you can continue to add data from various data sources. For example you can create additional tables to ingest geographic features found on various continents such as flora and fauna of the planet.
The goal of these steps is to identify as many elements of a dataset as we can and sort them into logical bulk associations. The following query will generate two additional tables, and should constrain the data in all three.
CREATE TABLE universe.continents (
id int8 NOT NULL,
"continent" varchar(255) NULL,
CONSTRAINT continents_planet_fk FOREIGN KEY (continent) REFERENCES universe.planet(continents) ON DELETE CASCADE ON UPDATE CASCADE
CREATE TABLE tribes (
"species" numeric(11) NOT NULL,
PRIMARY KEY (id),
CONSTRAINT factions_planet_fk FOREIGN KEY (continents) REFERENCES universe.planet(continents) ON DELETE CASCADE ON UPDATE CASCADE
These bulk associations become our tables (for example planet, tribes, geographic_features, flora, and fauna). Next, individual elements are used to create relationships between the tables. These relationships form our constraints and joins in the data.
Going back to our continents, one continent can connect to many factions, and vice versa. We can illustrate the one to many relationship as shown below:
(tribes are defined as factions in this illustration)
The ERD diagrams are created using Microsoft Visio. A SQL client tool can be obtained to write SQL scripts to create the schema, database and tables. The data elements needed would depend on the RDBMS used. A combined methodology would be to obtain an Entity Relation Diagram (ERD) tool, that allows you to design and convert diagrams into SQL files automatically.
Now that we have discussed the manual process of building a database with an associated schema, let's take a look at the capabilities of a data warehouse.
A data warehouse is a large collection of data from various sources that can be used for analytical purposes. A database contained within a data warehouse is specifically designed for OLAP (online analytical processing). For example, if you wanted to analyze all the data from various planets and continents this data management process is best suited for a data warehouse environment. What if each planet was storing different types of data in different data sources?
The solution would be the Smart Data Warehouse by Panoply. It has the ability to learn data as it is being queried. By selecting the appropriate integration for each source, data can be ingested into the data warehouse to be analyzed. It's intelligent algorithms methodically scans through all ingested data, and systematically performs optimization routines to improve performance.
Panoply analyzes databases for relevant data based on the schema and its own Natural Language Processing (NLP). It finds the most efficient methods to process the data and returns the results to the user. Panoply provides more than 80 integrations to ingest data and build your own data warehouse.
You can take a closer look at your data by creating a dashboard using your favorite analysis software. Or you can query your data inside Panoply's web interface using the Analyze workbench. Panoply can analyze an unlimited amount of data in minutes. Sign up for a free trial today.
Looking to squeeze more out of your data?