What is data architecture?
Data architecture defines information flows in an organization, and how they are controlled. A data architect is responsible for understanding business objectives and the existing data infrastructure and assets; defining data architecture principles; and shaping the enterprise data architecture to provide greater benefits to the organization.
A few basic concepts in data architecture:
- Conceptual/business data model—shows data entities such as customer, product and transaction, and their semantics.
- Logical/system model—defines the data in as much detail as possible, including relations between data elements, but without considering how data is stored or managed.
- Physical/technology data model—defines how the data is represented and stored, for example in a flat file, database, data warehouse, key-value store.
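To make the distinction between the logical and physical models concrete, here is a minimal sketch in Python. The entity names follow the customer and transaction examples above; the field names, the `sales_transaction` table name and the choice of SQLite as the physical store are assumptions made purely for illustration.

```python
import sqlite3
from dataclasses import dataclass

# Logical model: entities, attributes and relations, with no storage details.
# Field names are hypothetical.
@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str

@dataclass(frozen=True)
class Transaction:
    transaction_id: int
    customer_id: int   # relation: each transaction belongs to one customer
    amount_cents: int

# Physical model: one possible representation, here as SQLite tables.
PHYSICAL_SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sales_transaction (
    transaction_id INTEGER PRIMARY KEY,
    customer_id    INTEGER NOT NULL REFERENCES customer(customer_id),
    amount_cents   INTEGER NOT NULL
);
"""

def build_store(customers, transactions):
    """Create an in-memory database and store the logical entities in it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PHYSICAL_SCHEMA)
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(c.customer_id, c.name) for c in customers],
    )
    conn.executemany(
        "INSERT INTO sales_transaction VALUES (?, ?, ?)",
        [(t.transaction_id, t.customer_id, t.amount_cents) for t in transactions],
    )
    conn.commit()
    return conn

def total_spend(conn, customer_id):
    """Sum a customer's transactions; returns 0 when there are none."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM sales_transaction "
        "WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]
```

The same logical entities could just as well be stored in a flat file or a key-value store; only the physical layer would change.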
Who creates the data architecture—organizational roles
The following roles exist to help shape and maintain a modern data architecture:
- Data architect (sometimes called big data architect)—defines the data vision based on business requirements, translates it to technology requirements, and defines data standards and principles.
- Project manager—oversees projects that modify data flows or create new data flows.
- Solution architect—designs data systems to meet business requirements.
- Cloud architect or data center engineer—prepares the infrastructure on which data systems will run, including storage solutions.
- DBA or data engineer—builds data systems, populates them with data and takes care of data quality.
- Data analyst—an end-user of the data architecture, uses it to create reports and manage an ongoing data feed for the business.
- Data scientist—also a user of the data architecture, leveraging it to mine organizational data for fresh insights.
The data architect role
Which unique skills help someone become a data architect?
- Understanding and communicating how data drives the business.
- Working with diverse data sources and understanding their structure, content and significance.
- Deep knowledge of data tools and technologies.
How are data architects trained?
Data architect training usually happens on the job, in data-related roles such as data engineer, data scientist or solution architect. There is no industry-standard certification or training program for data architects, but it’s valuable for architects to have certification in the primary data platforms used by their organization.
Data architect roles and responsibilities
- Translates business requirements to technical specifications—data streams, integrations, transformations, databases, and data warehouses.
- Defines data architecture framework, standards and principles—modelling, metadata, security, reference data such as product codes and client categories, and master data such as clients, vendors, materials, and employees.
- Defines a reference architecture—a pattern others in the organization can follow to create and improve data systems.
- Defines data flows—which parts of the organization generate data, which require data to function, how data flows are managed, and how data changes in transition.
- Collaboration and coordination—data projects often span multiple departments and stakeholders, as well as partners and external vendors; the data architect is a focal point that coordinates all parties around organizational objectives.
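One way to reason about the data flows described in the list above is to record them as a directed graph of producers and consumers. The sketch below is only an illustration: the system names in `FLOWS` are hypothetical, and a real inventory of flows would come from the organization's own systems.

```python
from collections import defaultdict, deque

# Hypothetical data flows as (producer, consumer) pairs.
FLOWS = [
    ("web_shop", "orders_db"),
    ("orders_db", "data_warehouse"),
    ("crm", "data_warehouse"),
    ("data_warehouse", "finance_reports"),
    ("data_warehouse", "marketing_dashboard"),
]

def downstream(flows, source):
    """Return every system that directly or indirectly receives data from source."""
    graph = defaultdict(list)
    for producer, consumer in flows:
        graph[producer].append(consumer)

    seen = set()
    queue = deque([source])
    while queue:                      # breadth-first walk over the flow graph
        node = queue.popleft()
        for consumer in graph[node]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)
```

Calling `downstream(FLOWS, "crm")` lists the systems that would be affected by a change to the CRM data, which is the kind of question an architect asks before modifying a flow.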
Technology and data automation
Many years ago, data infrastructure was monolithic. Organizations poured millions into large systems that would store and process all organizational data. With the advent of open source technology and agile methodologies, data systems are becoming simpler and more lightweight, and at the same time more performant and more flexible.
A few elements typically found in a modern data architecture:
- Data warehouse—a cornerstone of old school data infrastructure, data warehouses are still important, but are moving to the cloud and interacting with data lakes, traditional databases and unstructured data sources.
- Relational database—the old incumbents like Oracle and SQL Server are still in use, but open source alternatives like MySQL and PostgreSQL are everywhere.
- NoSQL database—stores massive amounts of semi-structured and unstructured data. Popular solutions are MongoDB, CouchDB, Cassandra and Redis; Memcached, an in-memory key-value cache, is often grouped with them.
- Real-time streaming—new tools such as Apache Kafka, Flume and AWS Kinesis help stream large volumes of data from system logs and production systems.
- Containers—platforms like Docker and Kubernetes help spin up and deploy data infrastructure at the click of a button, and orchestrate complex systems in a flexible and scalable manner.
- Microservices and serverless computing—data systems built using microservices or functions as a service (FaaS) are independent units that expose a standard interface, allowing data architects to compose and arrange data environments to suit business needs.
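The idea of independent units that expose a standard interface, from the last item in the list above, can be sketched with plain Python functions. This is an assumption-laden illustration, not a microservice framework: each stage is a hypothetical function that takes an iterable of records and yields records, and `compose` chains them in order.

```python
from functools import reduce

# Standard interface (an assumption for this sketch): every stage takes an
# iterable of dict records and yields dict records.

def drop_incomplete(records):
    """Skip records that have no user_id."""
    for record in records:
        if record.get("user_id") is not None:
            yield record

def add_amount_usd(records):
    """Add a derived amount_usd field computed from amount_cents."""
    for record in records:
        yield {**record, "amount_usd": record["amount_cents"] / 100}

def compose(*stages):
    """Chain stages left to right into a single pipeline function."""
    def pipeline(records):
        return reduce(lambda stream, stage: stage(stream), stages, records)
    return pipeline

# Usage: stages can be rearranged or swapped without changing each other.
clean_and_enrich = compose(drop_incomplete, add_amount_usd)
```

Because the stages are generators, records are processed one at a time as they arrive, which also suits streaming input.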
Data architecture best practices
The following best practices can help you achieve an effective, holistic data architecture:
- View data as a shared asset—eliminate organizational silos and view customer data holistically, combining data from all parts of the organization.
- Provide the right interfaces for users to consume data—data has little value if it can’t be consumed in convenient ways. Interfaces could be web-based dashboards, BI, SQL queries, R, or anything else that business users or analysts use to derive insights.
- Ensure tiered security and access control—classify data according to its sensitivity and business significance, and carefully design access controls to ensure data is easily available, but only to those who require it.
- Encourage data stewardship—data stewards are subject matter experts who can help clean, verify and add to organizational data. For example, a product line manager may have special insight into thousands of products. Build a community of data stewards, contributors and data citizens who can enhance data quality for everyone.
- Data curation—consider which data is really actionable for different organizational roles, and build processes to curate the most relevant data. Beyond top-down curation, enable users to perform easy filtering and querying to get to relevant data more quickly.
- Eliminate copies and data movement—in large organizations, it’s difficult to standardize data without strict rules that stifle creativity. Strive to create data formats and structures that encourage users to collaborate on the same data entities, rather than create multiple competing versions of the same entity.
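The tiered security practice in the list above can be sketched as a small classification and clearance check. The tier names, datasets and roles below are hypothetical; an actual scheme would follow the organization's own classification policy.

```python
from enum import IntEnum

class Sensitivity(IntEnum):
    """Hypothetical sensitivity tiers, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

# Hypothetical classification of datasets.
DATASET_CLASSIFICATION = {
    "product_catalog": Sensitivity.PUBLIC,
    "sales_by_region": Sensitivity.INTERNAL,
    "customer_contacts": Sensitivity.CONFIDENTIAL,
    "payroll": Sensitivity.RESTRICTED,
}

# Hypothetical maximum tier each role may read.
ROLE_CLEARANCE = {
    "guest": Sensitivity.PUBLIC,
    "data_analyst": Sensitivity.INTERNAL,
    "data_steward": Sensitivity.CONFIDENTIAL,
    "hr_admin": Sensitivity.RESTRICTED,
}

def can_read(role, dataset):
    """Allow access only when the role's clearance covers the dataset's tier."""
    clearance = ROLE_CLEARANCE.get(role)
    level = DATASET_CLASSIFICATION.get(dataset)
    if clearance is None or level is None:
        return False  # unknown roles or datasets are denied by default
    return clearance >= level
```

Denying unknown roles and datasets by default keeps unclassified data from being exposed by accident, while every classified dataset stays readable by the roles that need it.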
Automation: the key to an efficient data architecture
In recent years data pipelines have become increasingly agile, flexible and automatic. Organizations used to spend months building rigid processes to extract data from sources, transform it into specific formats and load it into repositories. Today the same types of processes can be achieved in hours using automated cloud-based tools.
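As a rough illustration of the extract, transform and load steps mentioned above, here is a minimal sketch using only the Python standard library. The sample CSV data, the `orders` table and the cleaning rules are assumptions for the example; automated cloud-based tools perform the same kinds of steps at much larger scale.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data; the last row is missing its amount.
RAW_CSV = """order_id,customer,amount
1,ada,19.99
2,grace,5.00
3,ada,
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, normalize names, use integer cents."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue
        clean.append((
            int(row["order_id"]),
            row["customer"].title(),
            round(float(row["amount"]) * 100),
        ))
    return clean

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_pipeline(text, conn):
    """Run the three steps in order against the given connection."""
    load(transform(extract(text)), conn)
```

Keeping each step a separate function makes it straightforward to swap the source or the target repository without touching the other steps.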
Automation is having a huge impact on data architectures, because it gives data architects far more flexibility in determining what is right for the organization right now. Previously, data architects were “stuck” with legacy data monoliths that took tremendous effort to change.
Today things are very different: if a certain business unit needs different data, an architect can easily design a pipeline to deliver that data. If the organization is generating new types of data, data architects can identify it, convert it to usable form and deliver it to users in a matter of hours or days, rather than weeks or months.
As an example of next-generation automated data infrastructure, consider Panoply, the world’s first automated data warehouse. Panoply can automatically ingest data from a large variety of data sources, and uses natural language processing and machine learning techniques to automatically prepare, clean, optimize, and transform data for analysis, making it possible to go from raw data to insights in minutes. For many organizations, Panoply is a main enabler of a new, automated data architecture.