Data Architecture is a blueprint to follow when building a data solution. It defines an high-level architectural approach and concept to follow, outlines a set of technologies to use, and states the flow of data that will be used to build your data solution to capture big data.

Data Architecture refers to the overall design and organization of data within an information system.

1. Evolution of Data Architectures

A Relational Database stores data in a structured manner, with relationships between the data elements defined by keys. Data is typically organized into tables, with each table consisting of rows and columns. Each row represents a single instance of data, while each column represents a specific attribute of the data.

Relational Database are designed to handle structured data, and they provide a framework for creating, modifying, and querying data using a standardized language known as SQL. The relational model was first proposed by Edgar F. Codd in 1970, and it becomes the dominant model for database management system since the mid -‘70 because most operational applications need to permanently store data.

In relational database, data are organized with an approach called schema-on-write. Schema refers to the forma structure that defines the organization of - and relationship between - tables, fields, data types, and constraints. Schema allows to ensure consistency, integrity, and efficient organization within the database. If we are working with relational database (and relational data warehouse) we must create the database and its tables, fields, and schema, then write the code to transfer the data into the database. With a schema-on-write approach, the data schema is defined and enforced when data is written or ingested into the database and data must adhere to the predefined schema, including data types, constrains, and relationship, before it can be stored.

By contrast, schema-on-read is an approach where the schema is applied when data is read or accessed, rather than when it’s written. Data can be ingested into the storage system without conforming to a strict schema. This approach offers of course more flexibility in storing unstructured or semi-structured data, and it’s commonly used in Data Lakes.

At high-level, Data Architectures provide a framework for organizing and managing data in a way that supports the needs of an organization. This involves defining how data is collected, stored, processed, and accessed, as well as maintaining data quality, security, and privacy. While data architectures can take many different form, some common elements include:

Data Storage: how data is stored, including the physical store medium (such as hard drives or cloud storage) and the data structures used to organize data.
Data Processing: how data is processed, including any transformations performed before data is stored.
Data Access: how to access data, including user interfaces and APIs that enable data to be queried.
Data Security and Privacy: how to ensure the security and privacy of data, such as access controls, encryption, and data masking.
Data Governance: data architectures need to provide frameworks for managing data, including quality standards, lineage tracking, and retention policies.

Data Architecture

Overall, the main goal of Data Architecture is to enable an organization to manage and leverage its data assets effectively in order to support its business objectives and decision-making processes.

Here’s an high-level comparison of the data architectures we’re going to talk about:

2. Relational Data Warehouse

Designing Data-Intensive Applications by Martin Kleppmann (p. 91).

At first, the same databases were used for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for OLTP- type queries as well as OLAP-type queries. Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their OLTP systems for analytics purposes, and to run the analytics on a separate database instead. This separate data‐ base was called a data warehouse.

Data Warehouses (or Relational Data Warehouse [aka RDW] or Traditional Data Warehouse) became more popular in the late 1980s thanks largely to both Barry Devlin and Bill Inmon. This architecture tries to overcome the limitations of relational databases.

RDW is a specific type of relational database that is designed for data warehousing and business intelligence applications, with optimized query performance and support for large-scale data analysis. A relational data warehouse uses relational model to organize data and it’s typically large in scale and is optimized for analytical queries.

RDWs have both a compute engine (it’s the processing power used to query the data) and a storage (it holds the data that is structured via tables, rows, and columns).

Some of the most important features of RDWs:

transaction supports: it ensures the data is processed reliably and consistently
audit trails: it keeps a record of all activity performed on the data in the system
schema enforcement: it ensures the data is organized and structured in a predefined way.

In the 1970s and 1980s, organizations were using relational databases for the so-called Online Transaction Processing (OLTP) applications such as order entry and inventory management. OLTP systems can make create, update, delete changes to data in a database. These operations are known as CRUD operations and they are the foundation of data manipulation and management in data architectures. You can run queries and generate reports on a relational database, but doing so uses a lot of resources and can conflict with other CRUD operations running at the same time and this can slow the system.

RDWs were invented in part to solve this problem. The data from the relational database is copied into a data warehouse, and users can run queries and generate reports against the data warehouse instead the relational database and this avoids the problem we mentioned before. RDWs also centralize data from multiple applications in order to improve reporting as shown in this picture:

3. Data Lake

Data Lake is appeared around 2010. It’s a simply storage and you can think of it as a filesystem (very similar to filesystem on our laptop). Some specs and differences with Data Warehouses:

It has no compute engine associated to it (unlike relational data warehouses), but there are many compute engines that can work with data lakes, so compute power is cheaper for a data lake than a relational data warehouse.
It uses object storage, which does not need the data to be structured into rows and columns (unlike data warehouses that use relational storage).
It’s schema-on-read, meaning that you can simply copy the files into it, like you would with folders on your laptop (unline data warehouse which is schema-on-write). You probably need to define the schema in a separate file
Data is stored in its natural (or raw) format. So it can include structured, semi-structured, unstructured, and binary data.

Data Lake storage technology started with Apache Hadoop Distributed File System (HDFS), a free ioen source technology hosted almost exclusively on-prem that was very popular in the early 2010s. HDFS is a scalable, fault-tolerant distributed-storage system designed to run on commodity hardware. It is a core component of the Apache Hadoop ecosystem. As cloud computing continued to grow in importance, data lakes were built in the cloud using a different type of storage, and most lakes now exists in the cloud.

Data Lakes technology was proposed as the solution to all problems with Relational Data Warehouses, including high cost, limited scalability, poor performance, data preparation overhead, and limited support for complex data types. They were proposed as “one technology to do everything”. The problem was that querying data lakes isn’t actually that easy because it requires some advanced skill sets: IT copies the data a user needs into the data lake and the user itself must use Hive and Python to build reports. Furthermore data lakes did not have some of the most important features that data warehouses had, such as transaction support, schema enforcement, and audit trails.

Data Lakes purpose has turned into another one: staging and preparing data.

4. Modern Data Warehouse

Since around 2011, Modern Data Warehouse technology tried to put together the benefits offered from both Data Warehouse and Data Lake technology:

the data lake is for staging and preparing data (and data scientist use it to build machine learning models);
the data warehouse is for serving, security, and compliance, and business users do their querying and reporting with it.

Quartz 4

Explorer

02. Types of Data Architectures

1. Evolution of Data Architectures

2. Relational Data Warehouse

3. Data Lake

4. Modern Data Warehouse

Graph View

Table of Contents