0. Resources

F2023 #03 - Database Storage Part 1 (CMU Intro to Database Systems)

Course Outline

The way we think about a DataBase Management System is like a bunch of layers and each layer provides different functionalities:

Query Planning
Operator Execution
Access Methods
Buffer Pool Manager
Disk Manager

We start from the bottom one: Disk Manager.

Disk-based Architecture

For the methods we’ll discuss in the course, we’re going to assume that the architecture we’re trying to build is called Disk-based Database System, meaning that the DBMS assumes that the primary storage location of the database in on non-volatile disk (SSD, Hard Drive, Cloud [S3]). All the things we’re going to build in the DBMS are really designed to coordinate the movement of data back and forth from Disk into Memory (that is the classical Von-Neumann architecture, where data at rest on disk cannot be operated on by the CPU until it’s brought into memory).

This is extremely hard work, and if you’re an application developer, you don’t want to handle this yourself in application code. You want a database system that knows how to do this reliably, safely, correctly, and efficiently.

The way to think about what Storage looks like is in therm of this hierarchy:

At the top of the storage hierarchy, you have the devices that are closest to the CPU. This is the fastest storage, but it is also the smallest and most expensive. The further you get away from the CPU, the larger, but slower the storage devices get. These devices also get cheaper per GB.
From our perspective the only thing we care about is the division between Volatile and Non-Volatile storage. We assume that data is going to be persistent.

Volatile Devices:

Non-persistent: if you pull the power from the machine, then the data is lost.
Random Access & Byte-Addressable: In memory, you can query each individual byte. If you have a one-megabyte file in memory and want only 64 bits of it at some random offset, you can access it directly. (There are cache lines that make this not exactly true, but we can ignore that for now.)
For our purpose, we will always refer to this storage class as “Memory”.

Non Volatile Decives:

Persistent: the storage devices does not require continuous power in order for the device to retain the bits that it is storing.
Sequential Access & Block-Addressable: With an SSD, you cannot fetch exactly 64 bits from a one-megabyte file. You must fetch the entire block that contains those 64 bits, bring that block into memory, and then extract what you need.
We will refer to this as “Disk”. We will not make a (major) distinction between solid-state storage (SSD) and spinning hard drives (HDD).

Regarding Persistence property, database systems don’t trust hardware or the OS. They implement additional safeguards like writing to multiple locations or creating backups to ensure true non-volatility. But from an architectural design perspective, the assumption is volatile versus non-volatile storage.

Since our DBMS architecture assumes that the database is stored on disk, the components of the DBMS are responsible for figuring out how to move data between non-volatile disk and volatile memory since the system cannot operate on the data directly on disk (again, we’re following Von Neumann architecture.

*. In-Depth Comparison of Byte and Word Addressing:

Source:

https://www.youtube.com/watch?v=QOrAFDZABsM&ab_channel=LearningSimplified

Generally we can store in form of 0s and 1s. Suppose a memory device that can store either a 0 or 1. If we group eight such devices, they can store 8 Bits and we call this group of 8 cells as Byte:

→1 byte = 8 bits

In the same way when we group several Bytes, then we call it as Words. But there is not fixed value of how many bytes comprise a one word:

→1 words = ? bytes

When you see 32-bit machines and 64-bit machines, these numbers generally denote the size of the Word: 32-bit machine means the size of the word is 32 bits (or 4-bytes):

→32-bit: 1 word = 4 bytes = 32 bits

Let’s draw a memory of size 4096x32:

Each layer is a word, so we have 4096 words (each identified by an address i from 0 to 4095) and each word is of 32 bits (or 4 bytes).

Let’s draw another memory of size 4096x32 which shows even the bytes as well representing by the green lines:

So, while before we had the address of each word (i from 0 to 4095), now we have address of each byte.

Let’s find a relationship between the two memories:

the address of the first word is the same as the address of the first byte → i=j;
the address of the second word is the same as the address of the fifth byte → i+1=j+4;
…

Notice that to get the next successive location of a word you have to add 4 to the address of the first bytes of the previous word:

the address of the first word (i) corresponds to the address of the first byte → i = j
to get the address of the second word (i+1) you have to add 4 to j → i+1 = j+4
to get the address of the third word (i+2) you have to add 4 to j → i+2 = j+8

The first memory is called Word-Addressable storage, meaning that to read the next word we add 1 to the current location.

The second memory is called Byte-Addressable storage, meaning that to read the next word location we add the number of bytes that each word comprises to the current location.

How many bytes are required to represent 4096 locations? The binary form of 4096 is 100000000000 (1 with 11 0s) that is equal to 2^12. So we need 12 bits to represent each word address.

But how do we address if we want to address each byte? $4096 \cdot 4 = 2^{12} \cdot 2^{2} = 2^{14}$ . So 14 bits are required to represent each byte address.

Interesting Memory Playlist, youtube channel (to do)

https://www.youtube.com/playlist?list=PLBlnK6fEyqRjdT1xkkBZSXKwFKqQoYhwy

Disk-based Architecture (continuing)

Accessing adjacent blocks is significantly cheaper than random access. Database systems want to maximize sequential access as much as possible. Example: If you need 10 megabytes of data stored as ten 1-megabyte blocks:

Random access: If blocks are scattered in different locations, you must issue multiple fetch commands to jump to each location
Sequential access: If blocks are contiguous and aligned together, you can theoretically issue one fetch command to retrieve all ten blocks at once

At the hardware level, consider a spinning hard drive (laptops don’t have these anymore, but they still exist in enterprise environments). There’s a physical arm spinning around a platter like a vinyl record. Moving the arm is expensive—you’re physically moving something through helium. If you move the arm once and read a bunch of data without moving it again (sequential access), that’s much faster than repeatedly picking up and moving the arm (random access). Even with SSDs, due to how they work underneath with ASICs, compaction, and other internal operations, batched sequential reads and writes are more efficient than random access. This is fundamentally different from introductory algorithm classes, which assume all memory access has the same cost. In the real world with real hardware, we cannot make that assumption. The database system will choose algorithms that maximize sequential access even if those algorithms might seem less optimal in a theoretical sense.

Here’s another visualization of the storage hierarchy showing other characteristics:

The performance gap between different storage types is dramatic:

L1 cache miss: 1 nanosecond
DRAM access: 100 nanoseconds
SSD access: 16 microseconds (16,000 nanoseconds)
Spinning disk hard drive: 2 milliseconds (2,000,000 nanoseconds)
EBS (elastic block storage): 50-500 milliseconds (fluctuates depending on data “hotness”)
Tape archives: Glacial speeds

As humans, it’s difficult to reason about nanoseconds. A simple trick to understand these differences: change one nanosecond to one second. The following picture shows these results:

Some hardware devices span different layers of the hierarchy:

Disaggregated storage/memory: Fast network storage that might be byte-addressable but accessed over a physical network, so it straddles the line between memory and disk.
Persistent memory: Hardware that combines the best benefits of both worlds—byte-addressability of DRAM and persistence of SSDs. It would sit in DIMM slots like memory but retain data when power is lost.
The persistent memory story: About ten years ago (when the instructor started at CMU), there was significant research interest in persistent memory. If this technology existed, much of what’s covered in the upcoming lectures—particularly Project 1 about moving data between disk and memory—would become unnecessary. The database system wouldn’t need to worry about moving data back and forth; memory would simply be persistent.
Intel Optane was actual persistent memory using phase change memory technology. HP had memristors, IBM had rumored projects, but Intel was the only company that actually manufactured and sold persistent memory devices.
However, Intel killed Optane last year. When they hired a new CEO, they cut several divisions including Optane. This is unfortunate because it could have been a game changer. The problem: Intel couldn’t make money from it, and now no one will likely attempt this for another decade. Several database companies had projects building systems specifically around persistent memory, which have now been abandoned.

System Design Goals

The database system should:

Create the illusion of operating entirely in memory, even though data actually resides on disk. For most databases under 10 gigabytes, this is achievable. For massive databases in the terabytes or petabytes range, various tricks can hide disk latency and create the appearance that everything is in memory.
Avoid prolonged stalls since reading and writing to disk is so expensive. The system should avoid appearing unresponsive, which:
- Frustrates users who think the system is stuck (when it’s actually fetching from disk)
- Creates cascading problems when holding locks while stalled on disk I/O, slowing down other operations waiting behind
Maximize sequential access since it’s much faster than random access for both reads and writes.

All of this sounds like virtual memory—having the appearance of more memory than actually exists. The next section explains why the OS’s virtual memory system isn’t sufficient for database needs.

Before doing that, let’s describe point by point the high-level diagram about what we want to build:

Database File: We break our Database file on the Disk into Pages;
Buffer Pool: The DBMS allocates this kind of memory and we use it as staging area where we bring pages in from the disk;
Engine Execution: It runs our queries. Suppose we want page #2:
1. Bring in the page directory that’s going to tell us where the pages are on the disk;
2. Then it’ll make a call to the OS and bring that page into memory;
3. The Buffer Pool will give back the Execution Engine a 64-bit pointer in memory of where this page exists;
4. It’s up to the Execution Engine to interpret what’s inside that page;
5. Let’s say once you do a bunch of updates, it makes changes to page#2;
6. Then the DBMS is responsible for writing this back to disk to make sure any changes are persistent.

DBMS vs. OS - The Problem with Memory-Mapped Files (mmap)

(TODO: I didn’t understand this chapter so well)

The operating system provides a system call for virtual memory: memory-mapped files (mmap). This allows taking the contents of a file on disk and mapping it into virtual memory in your process’s address space. The process can then jump to any offset in that memory space, and the OS handles fetching pages from disk as needed.

How mmap Works? Here’s the conceptual flow:

You have a disk file with multiple pages
Call mmap() to open it
The file appears in your virtual memory address space
Virtual memory must be backed by physical memory pages
When your process touches a page not in physical memory:
- Page fault occurs
- OS recognizes the page isn’t in physical memory
- OS fetches the page from disk
- OS updates virtual memory tables to point to the physical page
- Your process can now access it

When physical memory is full: If you want page 2 but memory is full, the OS must:

Decide which page to evict (page 1 or page 3)
Block your process while evicting and fetching
Use its own internal statistics about page access patterns
Make eviction decisions without understanding your database operations

The OS only sees reads and writes—it has no knowledge of SQL queries, table structures, or access patterns. It cannot make intelligent decisions about what to cache or evict.

Critical Problems with mmap for Databases

1. Transaction Safety

The fundamental problem: If a transaction updates multiple pages, those pages must be written to disk in the correct order. The OS doesn’t understand this ordering requirement—it only sees dirty pages.

The failure scenario:

Transaction updates pages A, B, and C
Page B must be written before page C (for consistency)
OS writes page C first, then system crashes
On restart: page C has changes that shouldn’t exist without page B
Database is now in an inconsistent state

You can use mlock() to prevent the OS from swapping pages out, but this doesn’t prevent the OS from writing them out. The OS may write dirty pages in the wrong order, causing corruption after a crash.

2. Prolonged Thread Stalls

When you access data not in memory with mmap:

Major page fault occurs
OS blocks your thread completely
Thread is descheduled
Disk scheduler fetches the page
When page is available, thread is rescheduled

Your thread is blocked doing nothing while there may be other queries that could run. You could build a dispatcher or scheduler with multiple threads where only one accesses mmap, but now you’re building infrastructure to work around mmap’s limitations—essentially reimplementing what the database should do itself.

3. Error Handling Complexity

Hardware errors with mmap don’t return exceptions—they generate SIGBUS interrupts.

This means you need signal handlers throughout your entire system because:

Interrupts can occur anywhere, including critical sections
You may be in the middle of an operation you don’t want interrupted
You need handlers to safely return to whatever you were doing

This requires significant engineering effort to handle safely. In user-space code, you’d normally get an error code you can check. With mmap, you get an interrupt that breaks your control flow.

4. Performance Issues and Contention

The OS maintains its own internal data structures about:

What’s in memory and what’s not
What’s scheduled and what’s not scheduled
Page access statistics for eviction decisions

These data structures must be protected with mutexes or locks. This creates contention points that become performance bottlenecks.

The database system is in a better position because:

SQL is declarative—we know what queries want to do
We have the physical query plan
We know what data operations will access
We can make better scheduling decisions than the OS

Question About Kernel Bypass

Can databases work with hardware directly, bypassing the OS?

Yes, there are kernel bypass methods like:

NVMe for direct storage access
DPDK (Data Plane Development Kit) from Intel for networking

DPDK example: Lets you hook directly into network hardware and get raw packets without using the OS TCP stack. But now you receive raw packets—if it’s a TCP connection, you must handle TCP headers, connection state, retransmission, etc. yourself.

Very few database systems do this. Only two known examples:

Yellowbrick: They essentially rewrote their own OS layer. They only use the OS to boot the system, then never call malloc again. They allocate all memory at startup, wrote their own PCIe drivers, and do amazing engineering. Very few people do this level of work.
ScyllaDB: They tried using DPDK but found it so difficult to handle that it became a huge pain point and they abandoned it.

The answer: For some things you must go through the OS. In the 1980s, people tried extreme approaches like writing their own storage layers on top of block devices. From an engineering perspective, you often have to rely on the OS, but you want to minimize contact with it because the OS is your enemy.

Real-World Evidence: Systems That Used mmap

Full users (entirely relied on mmap):

PoloDB
MonetDB (from CWI, same place that built DuckDB)
LMDB - The developer is the exact opposite of the instructor’s position: he advocates “always use mmap” and has been banned from several database mailing lists for spamming them about using LMDB with mmap

Partial users (used mmap but got rid of it):

MongoDB - Should really be listed as a full user historically
RocksDB
WiredTiger

The MongoDB story is the most illustrative: MongoDB was the hot database in the 2010s, raised enormous amounts of venture capital, and had really talented engineers. They started with an mmap-based storage engine. If mmap was the right choice, MongoDB with its resources and talent could have made it work. Instead, they threw it all away and acquired WiredTiger, a storage manager that doesn’t use mmap, replacing their entire storage engine.

Why Companies Initially Choose mmap

When startups and database companies visit CMU, surprisingly many mention using mmap. When asked why, they say “because it’s quick and easy to use.” A few years later, they return saying “that was a huge mistake, we should not have used mmap.”

You can get something up and running quickly with mmap because you don’t have to build a buffer pool manager—the OS handles moving data between disk and memory. But because you’re relying on the OS, it makes horrible decisions about eviction, scheduling, and data placement.

The Definitive Statement on mmap

Never use mmap for database storage. If the instructor dies, his tombstone should read: “Never use mmap for your database.”

There’s a paper explaining why you don’t want to use mmap, and a 10-minute YouTube cartoon video explaining all the problems.

Database System Design Philosophy

Database systems should minimize reliance on the OS for critical operations:

Don’t use the OS for memory management (as discussed with mmap)
Don’t use the OS for scheduling - The database knows query patterns better
Don’t use the OS for caching rights
Minimize use of OS for networking - Some kernel bypass is possible but often not worth the engineering effort

The OS is always a problem and the database system must be designed to deal with it.

Database Storage: Two Fundamental Problems

How the DBMS represents the database in files on disk - Today’s lecture.
How the DBMS handles data transfer between disk and memory - Upcoming lectures (lecture 6 and beyond)

How the DBMS represents the database in files on disk

There are 3 layers of what data is going to look like:

File Storage (Database files) - What do the actual files look like?
Page Layout (Pages within files) - How are files broken into chunks?
Tuple Layout (Tuples within pages) - How is the actual table data stored?

File Storage

The DBMS stores a database as one file (SQLite, DuckDB) or more files (Postgres, MySQL) on disk typically in proprietary format. The OS doesn’t know anything about the contents of these files; only the DBMS knows how to decipher their contents, since it is encoded in a way specific to the DBMS. We’ll discuss portable file formats (Parquet, Avro, Orc, Arrow) next week. DuckDB has connectors to read SQLite files because they want portability, but this is an exception.

Historical Note: Custom File Systems

In the 1980s, many early database systems decided to create not just custom file formats but custom file systems. They didn’t rely on whatever file system the OS provided—they wanted complete control.

This is a lot of engineering work and nobody does it today with marginal benefit. The only systems still doing this are Enterprise systems (Oracle, DB2, Teradata)—million-dollar database systems trying to extract maximum performance. These systems support custom file systems in addition to running on generic OS file systems.

The Storage Manager

The storage manager (also called the storage engine—same concept) is the part of the database system responsible for:

Maintaining and coordinating different database files
Communicating with hardware or storage devices
Communicating either through the OS or using direct hardware access to retrieve data

Disk scheduling: Some systems maintain their own disk scheduler or dispatcher that decides when to read pages and in what order. Otherwise, making random system calls to the OS lets the OS figure out ordering. But the database system is in a better position to know what it needs and in what order.

Why custom scheduling matters: You want to minimize thrashing (bringing pages into memory then immediately throwing them out). If the DBMS knows that two operators in a query plan (or two concurrent queries) will soon need the same page, it can prefetch it once and keep it in the buffer pool long enough for both to use it. The database system’s knowledge of query patterns enables these optimizations.

Database files are broken into fixed-size blocks called pages. The database system is responsible for:

Keeping track of what data has been read and written to pages
Tracking how much space is available in each page (important for inserts—need to find a page with space)
Maintaining a directory that maps page IDs to physical locations

The storage manager layer does not maintain multiple copies of pages for redundancy or replication. This is assumed to happen both above and below this layer: Above: Something at a higher level in the stack might receive a write query and send it to multiple physical boxes/nodes, having both machines execute the write. Below: RAID arrays or storage appliances that know how to replicate pages handle redundancy at the hardware level.

Typically, database systems don’t maintain multiple copies themselves because it’s significant extra work that ideally you don’t want to do at this layer.

Database Pages

The DBMS organizes the database across one or more files in fixed-size blocks of data called Pages:

It can contain data from any part of the database: tuples/records (today’s focus), indexes, log information, additional metadata, catalog information, statistics.
Most systems don’t mix page types. A one-megabyte page won’t contain data from multiple tables plus index data plus metadata. For simplicity, one page belongs to one database object (a table or index) and contains only data for that particular object.
Some systems require every page to be self-contained, meaning all metadata needed to understand the page’s contents must be included in the page itself. Oracle is the most famous example. Within each page, Oracle stores: what table it belongs to, column definitions, data types, schema information. Why self-containment? If there’s corruption in the database files and a page containing table metadata gets destroyed, you can still understand other pages because they contain their own metadata. You don’t lose access to all data just because one metadata page is corrupted.
- Trade-off: This adds overhead to every single page. Hardware has become more reliable over time (hard drives were super flaky before, they’re still not ideal but much better), so replication can solve corruption problems. Self-containment may be less critical today than when Oracle made this early design choice.
Every page in the database file is given a unique identifier—a page ID (typically a 64-bit or 32-bit integer). There must be some method or mechanism that maps a page ID to a physical location on the storage device. This could be:
- A file name and offset within a directory
- An S3 bucket and offset
- Any storage addressing scheme
The specific mechanism doesn’t matter for conceptual understanding—just know that given page ID 123, there’s a way to determine exactly where to find it.

There are three different concepts of what a “page” means in database systems, which can be confusing.

Hardware Page (typically 4 kilobytes)

It represents the smallest size block that hardware can guarantee atomic writes. What “atomic write” means: If you tell the hardware to write 4 kilobytes and get back an acknowledgment, you can assume the entire 4KB made it to disk. Put in another way:

If the hardware page is 4 KB and the system tries to write 4 KB to the disk, either all 4 KB will be written, or none of it will. If you need to write 8 kilobytes and send it as two 4KB blocks: - You might write the first 4KB successfully - System crashes - The second 4KB never makes it - No guarantee you can write both atomically (all-or-nothing)

This means that if our database page is larger than our hardware page, the DBMS will have to take additional measures to ensure that the data is written safely, since the program may be in the middle of writing a database page to disk when the system crashes.

Operating System Page (typically 4 kilobytes)

The OS’ has its own notion of a page for mapping hardware storage to virtual memory:

In Linux by default: 4 kilobytes
Huge pages in x64: The OS also supports larger page sizes—2 megabytes and 1 gigabyte. These are called “huge pages.”
Important clarification: Hardware still cannot guarantee atomic writes for 2MB or 1GB. Huge pages exist to reduce bookkeeping overhead in the OS for managing pages brought into memory. They don’t change atomicity guarantees.

Database System Page (typically 512 bytes to 64 kilobytes)

The database system has its own notion of a page independent of hardware and OS pages.

Typical range: 512 bytes (SQLite) up to 32-64 kilobytes
Common defaults:
- SQLite: 512 bytes
- PostgreSQL: 8 kilobytes
- SQL Server: 8 kilobytes
- MySQL: 16 kilobytes
- DB2: Up to 64 kilobytes (can be configured per-table)

The page ID combined with page size determines physical location. For a file with a given page size, finding page 103 means:

offset = page_ID × page_size

Jump to that offset in the file to find the data you’re looking for.

How page size affect the performance

Larger pages enable better sequential access optimization (as discussed at the beginning). For example, if you need to read 16KB of data:

If data is organized in 16KB blocks on disk:
- One call to the dispatcher
- One call to the OS
- Fetch contiguous 16KB
If data is organized in 4KB blocks on disk:
- Multiple separate calls
- Potentially reading from random locations
- More overhead

Pre-allocation trick: There are system calls to the OS or device that let you allocate space contiguously. You can say “allocate me a 10-megabyte block,” then the database divides that into 8KB chunks. This ensures sequential layout even with smaller page sizes.

The write cost trade-off: Larger page sizes make writes more expensive. Example: If you only need to write 1 kilobyte but you’re using 16KB pages, you must write all 16KB.

There is no free lunch. This is a recurring theme throughout the semester and computer science in general. Every approach has pros and cons:

Read-heavy workloads with large sequential scans: Larger page sizes perform better.
Write-heavy workloads: Smaller page sizes reduce unnecessary write amplification.

Thread about page size performance

https://dba.stackexchange.com/questions/294587/why-is-the-default-page-size-for-databases-so-small

Question: If you do sixteen 4KB reads for contiguous data, don’t you still get sequential access benefits? And when writing, can’t you write individual 4KB blocks while still having atomic writes?

Answer: Absolutely yes! This is exactly what database systems do. The system will choose algorithms and methods that write data contiguously so these batched fetches are possible.

The key difference: The database system can do this because it knows:

What the query is
What data will potentially be read
Access patterns

The system can read ahead (prefetch) data before you actually need it. The OS can also prefetch, but only for data that’s already contiguous. The OS can’t prefetch scattered data or data it doesn’t know you’ll need. It can’t predict based on query semantics.

Page Storage Architecture

How do we keep track of the mapping from page IDs to physical locations? There’s no single “best” approach—different methods have different trade-offs. The heap file approach is the most common, but other systems do different things with valid reasons.

At this point in the discussion, we don’t need to know what’s inside pages (whether it’s indexes or tuples). We just need to know: for a given page ID, how do I find it? How do I keep track of what pages I have? There are different directory mechanisms:

Heap file Approach.
Tree files - Store page locations in a tree structure (leaf nodes contain actual pages or pointers to them)
Hash tables - Use hashing for page ID lookups
Sequential/sorted files - From the 1970s; MySQL used to do this by default; not common anymore
Hashing files - Another hash-based approach

For log-structured storage (covered next class): The directory still needs to keep track of where things are, and heap files or potentially hash tables can be used. The distinction is that we don’t care what’s inside the pages at this layer—just where to find them.

Heap Files

A heap file is a collection of unordered pages where tuples are stored in random order. The relational model allows this because it doesn’t require data to be in any specific order. Some systems might pre-sort data to make certain operations faster, but the relational model doesn’t mandate ordering.

The storage manager needs to support these operations for heap files:

Create pages - Allocate new pages
Get a page - Retrieve a specific page
Write to a page - Modify page contents
Delete a page - Remove a page
Iterator API - Sequentially read through all pages

The iterator is important because access methods like sequential scan operators need to iterate over every tuple in a table. The page directory must expose an API that lets you iterate through all page IDs sequentially.

Simple Case: Single-File Databases

Managing heap files is easy for single-file databases like DuckDB or SQLite:

offset = page_number × page_size

Since all pages are the same size, you just:

Take the page ID you’re looking for
Multiply by page size
Jump to that offset in the file

Consider the scenario where the page size is 4KB and I want Page#3: the physical location is at offset $3 \cdot 4 = 12$ . Here’s a visual explanation:

Complex Case: Multi-File Databases

Most systems use multiple files (PostgreSQL, MySQL, Oracle). Now you need a way to answer: For page number 2, what file and what offset contains that page? This is what a Heap File Page Directory provides. Think of it as a hash table mapping page IDs to pages in data files. The Page Directory is typically:

A special file
Or stored in the header of the database file
Or at some special location within the database system

Think of it as a database within the database—it’s the database keeping track of what’s in your database. It’s not the full catalog, but it tracks physical locations of data. The directory must be kept in sync with actual files on disk. If you create a bunch of pages but don’t update your page directory, then crash and restart, the directory won’t know about those pages and they become inaccessible. Extra mechanisms (covered in the recovery lecture) ensure these stay synchronized.

Beyond just location mapping, the page directory can track per-page metadata:

Number of free slots - How many tuple slots are available
Free space - How many bytes are free
Page status - Is it in use, deleted, etc.

Graphically:

Why this matters for inserts: When inserting a tuple, you need a page with free space. Without this metadata, you’d have to scan all pages to find one with space. With the directory metadata, you query it to find available pages immediately.

When you run out of pages: The system knows how to allocate new pages and updates the directory accordingly.

Think of it as a hash table written to disk that keeps track of pages you have. But it must also support iteration—being able to scan through sequentially (page 1, page 2, page 3) because sequential scan operators need this capability.

Question: The directory must be in non-volatile memory, so if we crash we don’t lose it. Does that mean any changes to it must be written to disk immediately?

Answer: Yes, absolutely. But it’s not as bad as it sounds—you’re not updating this constantly. Practical approach: When you run out of space in database files, you don’t allocate just one page (that would happen on nearly every query). You’d allocate something like a gigabyte of data at once, update your page directory, write that to disk once, ensure it’s persistent and safe, then proceed with running queries. Any directory update must be written to disk because otherwise you don’t know what pages exist. If you allocate pages but the directory update isn’t persisted, those pages are lost after a crash.

Question: Is the directory stored in a special database file or along with other pages?

Answer: Varies by system. Some systems use a separate file. SQLite stores it in the file header. Typically it’s stored separately.

Question: When you say a system uses multiple files, do you mean one table uses multiple files, or the database uses multiple files for different tables?

Answer: It depends on the system. Different systems do different things. This is the beauty of SQL—you don’t know or care in your SQL queries whether you have one file or a thousand files. The database system decides how to organize storage and just knows how to run your query.

DuckDB and SQLite: One file for the entire database (all tables)
PostgreSQL: If you look in the data directory, there are many files with numeric names. Various files are for different indexes, different tables, etc.

Question: How do you know whether you want to use multiple files for one table?

Answer: It depends on your use case:

Example with large columns: Say you have a table with a BLOB field or TEXT field containing 10 megabytes of data. You might want to:
- Store that in separate pages (maybe compressed)
- Keep regular columns (integers, floats) in one file
- Store large variable-length data in another file
Various approaches: You could have some space in a single file where the top contains fixed-length data and the bottom contains variable-length data.
Historical MySQL example: Up to version 5.6, MySQL stored (if I recall correctly) one file for all tables or all databases. They changed to one file per table with separate files for indexes.

Page Layout

Now that we understand how files are laid out and how we track where pages exist, let’s discuss what’s actually inside the pages themselves.

Every page has a header containing metadata about the page contents. Common header information:

Page size
Checksum - Computed whenever fetching from disk to verify data isn’t corrupted
Database version - Which version of the database system created this page (for backward compatibility)
Transaction visibility (covered after midterm) - What thread/transaction wrote data, whether it’s visible to current queries
Compression/encoding metadata (covered next week) - How data is compressed or encoded
Schema information - What table, what columns, data types (Oracle’s self-contained approach)
Statistical information - Min/max values for columns (useful for predicate evaluation without reading all data)

For any page storage architecture, we need to decide how to organize the data inside of the page (the tuple data). Assumptions for today’s lecture:

We’re only storing tuples on pages (not indexes)
We’re storing tuples in a row-oriented manner (if you have five attributes, they’re stored contiguously before the next tuple)

Next week we’ll cover column stores, which organize data differently. But for now: row-oriented, tuple-only pages.

There are three different approaches to what could actually be in our pages:

Tuple-oriented storage - Store tuples with their exact values (today’s focus)
Log-structured approach - Store deltas of what changed since last update (next week)
Index-organized storage - Tree structure where leaf nodes contain actual data (next week)

Tuple-oriented Storage

Let’s think about how to store tuples on pages.

Strawman Approach

Here’s a simple strawman approach:

Page header tracks the number of tuples
Tuples are appended to the end
Data is fixed-length

To insert a new tuple:

Look at header to get number of tuples
Multiply by tuple size
That offset tells you where to write

offset = tuples_number * tuple_size

Consider this scenario:

If you want to add a new tuple, look at the Num Tuples, which is 3, then multiply it by the tuple_size, then you’ll get the offset where the fourth tuple will be stored. This approach is a bad idea.

Problem 1: Deletion creates gaps

If you delete tuple 2, now there’s a gap. You want to insert a new tuple and reuse that spot, not just append to the end. But tracking just “number of tuples” isn’t enough—you need to track where free space exists.

Problem 2: Variable-length data

What happens if we have a variable-length attribute? Most data isn’t fixed-length:

Email addresses vary in size
Names vary in size
Android IDs aren’t always the same size

You could use CHAR type (pre-allocated space), but what does that do? If the largest email is 1 kilobyte, you must allocate 1KB for every email address, even if most are only 20 bytes. Massive space waste.

Problem 3: Can’t move tuples without system-wide updates

Say you delete tuple 2, creating a gap. Tuple 4 can’t fit in that gap, so you want to move tuple 3 up to reclaim space.

The problem: Now you must tell the rest of the system that you moved tuple 3. We haven’t discussed how things point to tuple 3 yet, but assume it’s via some offset within the page. If you move tuple 3, you must update every index that points to it—extremely expensive.

This approach clearly doesn’t work. We need additional metadata.

Slotted Pages: The Standard Solution

The most common approach for row-oriented database systems using tuple-oriented pages (not log-structured) is Slotted Pages. What’s described here isn’t exactly how every system implements it, but at a high level, everyone’s doing something like this.

From top to bottom of the page:

Page header (at the top) - Contains all the metadata discussed earlier
Slot array (grows from beginning toward end) - Each slot stores a fixed-length offset pointing to a tuple’s starting location
Free space (in the middle) - Unused space between slot array and tuple data
Tuple data (grows from end toward beginning) - The actual fixed-length and variable-length tuple data

Here’s a graphical representation:

Important assumption for now: Everything is stored together. If you have a really large value, it’s not stored in a separate page—the entire tuple is stored inside this page. (We’ll relax this assumption next week.)

The slot array stores fixed-length offsets to tuple starting locations. You might also store tuple size in the slot entry if desired.

As you update the table and add new tuples:

Slot array grows from beginning → end
Tuple data grows from end → beginning
When you add a new tuple, update the corresponding slot to point to its location

The page is considered full when the slot array and the tuple data meet

Here’s a graphical representation:

Remember the earlier problem: we deleted tuple 3 and wanted to reclaim space. With slotted pages, this is fine:

Delete tuple 3
Slot 3 now points to nothing (or marks the space as free)
Slot 4 still points to tuple 4’s location
No other part of the system needs to know anything moved

How to deal with the fragmentation due do the empty space left by Tuple#3? It depends on the implementation. Some options are:

you could just leave the tuples where they were:
you can slide the old tuples over in order to compact them and all you have to do is to update the slot array with the new physical offset of the tuples you moved:

Some systems use the first approach, some others the second one.

This is the key advantage: The slot array provides a level of indirection. You can reorganize physical tuple locations within the page by only updating the slot array, not every index and reference throughout the database system.

Q&A

Question: Wouldn’t this waste space if tuples are variable length? The tuples grow from end → beginning, and the slot array grows from beginning → end. At some point they’ll meet and you’re out of space. There might be small unusable gaps in the middle.

Answer: Yes, there’s potential waste—the free space in the middle between slot array and tuple data might have small gaps that can’t be used. But this is acceptable because the advantage gained (not having to update the entire system when reorganizing pages) far outweighs the small amount of wasted space.

Systems using slotted pages (or variations):

PostgreSQL - Uses this approach
SQL Server - Uses this approach
Oracle - Uses this approach
Most major database systems - Use some variant of this

SQLite - Actually, it’s unclear if SQLite uses exactly this approach, but the principle is similar. This is the de facto standard for page organization in row-oriented database systems.

Record IDs: Identifying Tuples

Now that we’re using slotted pages, we need a way to uniquely identify tuples. The DBMS assigns each logical tuple a unique record identifier that represents its physical location in the database. A Record ID (also called Row ID or Row Number) is a way to uniquely identify a logical tuple based on its physical location inside a file and page. Typical composition:

File number or ID (which file)
Page number (which page within that file)
Slot number (which slot in the page’s slot array)

When you need a specific tuple, DBMS follows these steps: Lookup process with a Record ID:

Look in the page directory to find which page contains it
Fetch that page into memory
Use the slot number to index into the slot array
The slot array entry gives you the offset within the page
Jump to that offset to read the tuple data

Most databases do not store the Record ID as actual data—it’s synthesized/materialized based on the page directory and physical location. Exception: SQLite stores the Record ID as a separate column (you’re not supposed to see it, but you can access it). SQLite uses this as the primary key to identify tuples. SQLite’s approach: For secondary indexes (indexes that aren’t the primary key), when you look up a key, the value returned is that Row ID. You then use that Row ID in the primary key index to find the actual data.

Different systems do different things—no universal approach.

Record ID Sizes:

PostgreSQL: 6 bytes (ctid)
SQLite: 8 bytes (64 bits)
SQL Server: 8 bytes
Oracle: 10 bytes

You can see these directly in the database system through system columns or functions.

Warning About Using Record IDs

Applications should not rely on Record IDs being stable. Even though a Record ID uniquely identifies a tuple at a moment in time, it could change.

Why Record IDs can change:

Compaction or garbage collection
Page reorganization
Vacuum operations (in PostgreSQL)

Example failure scenario:

Insert a tuple, get Record ID (file=1, page=5, slot=3)
Application stores that Record ID
Database runs compaction/vacuum
Tuple moves to (file=1, page=5, slot=2)
Application’s stored Record ID is now wrong

Record IDs are a physical aspect of the database system not meant for application use. They’re exposed for database administration and maintenance, but should not be stored in application code as tuple references.

Live Demonstrations: Viewing Record IDs

Let me show you how different systems expose Record IDs through SQL.

PostgreSQL Example

This record ID is a tuple that gives us the page number and the slot number (page_number, slot_number). Specifically, this example shows page 0, slots 1, 2, and 3.

Delete the second tuple:

DELETE FROM r WHERE id = 101;

and then show again r:

PostgreSQL deleted the tuple but didn’t move anything. The data stays where it physically lived, but slot 2 is now marked as free.

Insert a new tuple:

INSERT INTO r VALUES (103, "xxx")

The result is:

PostgreSQL didn’t reuse slot 2—it appended to slot 4.

Run garbage collection (vacuum): Now let’s run what in Postgres is called Vacuum, a process that Postgres uses to compact every single page and write out new pages:

VACUUM FULL r;

Then let’s see the table again:

Now PostgreSQL compacted the page, moving tuples to fill gaps and creating a new version of the page.

You can even query specific ctid values (though you’re not supposed to):

SELECT * FROM r WHERE ctid = '(0,1)';

This works, showing the tuple at page 0, slot 1.

SQLite Example

SQLite has a rowid column:

CREATE TABLE r (id INT);
INSERT INTO r VALUES (100), (101), (102);
SELECT rowid, * FROM r;

Results:

rowid | id
------+-----
1     | 100
2     | 101
3     | 102

The rowid is a 64-bit integer actually stored as part of the tuple data. SQLite uses this as the primary key.

Delete a tuple:

DELETE FROM r WHERE id = 101;
SELECT rowid, * FROM r;

Results:

rowid | id
------+-----
1     | 100
3     | 102

SQLite doesn’t reuse row IDs because it’s a physical identifier, essentially the primary key.

SQL Server Example

SQL Server uses a special syntax with percent signs:

SELECT , * FROM r;

This returns hexadecimal data representing the physical location.

To decode it, there’s an undocumented function:

SELECT sys.fn_PhysLocFormatter() AS [Physical Location], *
FROM r;

Results:

Physical Location | id
------------------+-----
(1:100:0)         | 100
(1:100:1)         | 101
(1:100:2)         | 102

This shows (file_number:page_number:slot_number).

After deleting tuple 101 and inserting 103:

DELETE FROM r WHERE id = 101;
INSERT INTO r VALUES (103);
SELECT sys.fn_PhysLocFormatter(), * FROM r;

Results:

Physical Location | id
------------------+-----
(1:100:0)         | 100
(1:100:1)         | 102  -- Moved from slot 2 to slot 1!
(1:100:2)         | 103

SQL Server automatically compacted the page—it moved tuple 102 from slot 2 to slot 1 to fill the gap, then put the new tuple in slot 2.

Is this wrong? No. Is it better? Who knows—it’s a design choice.

Why SQL Server can do this: When you fetch a page and insert a tuple, you’re holding a latch/lock on that page. The database system knows no other thread can write to that page simultaneously, so it can decide whether to compact or optimize as it writes. PostgreSQL doesn’t compact on insert; SQL Server does. Both are valid approaches.

Oracle Example

Oracle has a rowid:

SELECT rowid, * FROM r;

This returns binary data.

To decode (from Stack Overflow, not official documentation):

SELECT 
    DBMS_ROWID.ROWID_OBJECT(rowid) AS object_id,
    DBMS_ROWID.ROWID_RELATIVE_FNO(rowid) AS file_number,
    DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number,
    DBMS_ROWID.ROWID_ROW_NUMBER(rowid) AS row_slot
FROM r;

Oracle stores:

Object ID (table identifier)
File number
Block number (page number)
Row slot (slot in the array)

This is more information than other systems—Oracle includes the object ID as part of the physical locator.

Tuple Layout

We’re running short on time, so this will segue into next class, but here’s the high-level concept:

A tuple is just a sequence of bytes. There’s a header followed by byte data:

The database system must know how to interpret those bytes based on:

The schema (column types)
The values you’re querying

Tuple header contains:

Null bitmap (which columns have NULL values)
Transaction visibility information
Other metadata

The execution engine knows how to jump to different offsets within the tuple based on the schema to extract column values.

Quartz 4

Explorer

03. Database Storage Part 1

0. Resources

Course Outline

Disk-based Architecture

*. In-Depth Comparison of Byte and Word Addressing:

Interesting Memory Playlist, youtube channel (to do)

Disk-based Architecture (continuing)

System Design Goals

DBMS vs. OS - The Problem with Memory-Mapped Files (mmap)

Critical Problems with mmap for Databases

1. Transaction Safety

2. Prolonged Thread Stalls

3. Error Handling Complexity

4. Performance Issues and Contention

Question About Kernel Bypass

Real-World Evidence: Systems That Used mmap

Why Companies Initially Choose mmap

The Definitive Statement on mmap

Database System Design Philosophy

Database Storage: Two Fundamental Problems

How the DBMS represents the database in files on disk

File Storage

Historical Note: Custom File Systems

The Storage Manager

Database Pages

Hardware Page (typically 4 kilobytes)

Operating System Page (typically 4 kilobytes)

Database System Page (typically 512 bytes to 64 kilobytes)

How page size affect the performance

Page Storage Architecture

Heap Files

Simple Case: Single-File Databases

Complex Case: Multi-File Databases

Page Layout

Tuple-oriented Storage

Strawman Approach

Problem 1: Deletion creates gaps

Problem 2: Variable-length data

Problem 3: Can’t move tuples without system-wide updates

Slotted Pages: The Standard Solution

Record IDs: Identifying Tuples

Warning About Using Record IDs

Live Demonstrations: Viewing Record IDs

PostgreSQL Example

SQLite Example

SQL Server Example

Oracle Example

Tuple Layout

Graph View

Table of Contents