0. Resources

F2023 #04 - Database Storage Part 2 (CMU Intro to Database Systems)

Review: Tuple-Oriented Storage Architecture

Last class established the framework for building a disk-oriented database management system. The key premise: the primary storage location when data is at rest is non-volatile disk (SSD, spinning hard drive, cloud storage). All system components orchestrate moving data between disk and memory, following the Von Neumann architecture—you cannot operate on data while it’s at rest on disk.

Sequential I/O optimization is critical because disk access is slow. Various techniques hide disk access latency: maximizing sequential I/O (discussed today), using filters and indexes to reduce data access, and implementing smart caching strategies.

The slotted page architecture discussed last class allows storing tuples of arbitrary, variable-length sizes across heap files. The system expands tuple size as needed within page constraints.

This is what we’ll call a tuple-oriented storage scheme—the fundamental concept is “I have a tuple, I need to put it somewhere.” Page layouts are designed around storing individual tuples.

Operations in Tuple-Oriented Storage:

Insert operation:
1. Look in the page directory to find a page with a free slot
2. If the page isn’t in memory, fetch it from disk (covered next week)
3. Look in the slot array for the next free slot
4. Update the slot array
5. Store the tuple in the page
Update operation:
1. Get the Record ID of the tuple (page ID + slot number)
2. Use the page directory to find the page location
3. Fetch the page if not in memory
4. Find the offset using the slot array
5. If the updated tuple is the same size, overwrite it in place
6. If not, find another page that can accommodate it

This is the core idea of heap files with page-oriented architecture—essentially how most systems work at a fundamental level.

Problems with Tuple-Oriented Storage

1. Fragmentation

Efficiency for reads: Potentially okay—if you need an entire tuple, you go to one page and get it.

The problem with writes: As you perform inserts, updates, and deletes, you end up with fragmentation—pages that aren’t fully utilized. You may have small amounts of empty space insufficient for new tuples, just sitting there wasted.

Example: Insert one tuple into an empty table. You allocate a page and insert that single tuple. Depending on page size (which varies by system), there’s a bunch of empty space that’s simply not being used.

2. Useless Disk I/O

The core problem: If you need to update one tuple and it’s not in memory, you must fetch the entire page from disk. But what else are you getting?

Database systems don’t store one tuple per page (you could, but you typically don’t). If you fetch a page to update one tuple, you’re bringing in potentially 20 other tuples you don’t need.

The write amplification: When you update one tuple but had to bring 20 tuples into memory, you now must write all 20 tuples back out to disk, even though only one changed.

3. Random Disk I/O

The “it depends” caveat: As always in databases, the answer to “which approach is better?” depends on your workload.

Single tuple updates: If your workload only updates a single tuple at a time per query, this architecture isn’t terrible.

Multiple tuple updates: If you’re updating 20 tuples at once and those 20 tuples are scattered across 20 separate pages:

You must read 20 separate pages from disk into memory
You must update them all
You must write out 20 different pages from memory to disk
This is random I/O and it’s significantly slower than sequential I/O

4. In-Place Update Limitations

Not necessarily a problem with the architecture itself, but some environments cannot do in-place updates—you cannot fetch a page from disk, bring it into memory, update it, and write it back to the original location.

Examples where in-place updates are impossible:

S3 - You can work around this with versioning, but fundamentally can’t do in-place updates
Cloud database systems - Many don’t support in-place updates
HDFS (Hadoop File System) - It’s an append-only file system; you cannot modify pages and write them back where you got them

This means the slotted page architecture wouldn’t work in these environments because it fundamentally relies on the ability to modify a page and write it back to its original location.

Log-Structured Storage: An Alternative Approach

Remember that there are three different approaches to how I could actually organize the data inside of the pages:

Tuple-oriented Storage (Previous lesson), where we’re only storing tuples and the exact values that those tuples have.
Log-structured Storage (Today), where we store deltas of what changes since the last time time tuple was updated.
Index-organized Storage (Today), where we use a tree structure where in the leaf node we store the data itself.

All the problems just discussed are solved with log-structured storage. Beyond heap files with slotted pages, log-structured storage is probably the second most common approach in database systems. It’s even more common today because of embedded storage managers like RocksDB, which is log-structured. Any database using RocksDB is inherently log-structured.

Log-structured storage is an old idea, loosely related to log-structured file systems from the 1980s. The log-structured storage approach for databases was first proposed in the mid-1990s.

Terminology note: The textbook calls these “log-structured merge trees” (LSM trees). The discussion here describes the same fundamental idea without the tree details, because those make things more complicated than necessary for understanding the core concept.

Core Concept: Log Records of Changes

Instead of storing individual tuples, we maintain a log record of changes to those tuples. Think of this as a key-value store system.

Two operations only:

PUT - Insert or update (blind write, overwrites whatever was there)
DELETE - Remove a record

No separate INSERT because that’s just a PUT. No separate UPDATE because that’s also just a PUT doing a blind write over whatever existed before.

Each log record contains:

Operation (PUT or DELETE)
Key (tuple identifier—not the Record ID from before since we don’t have pages and slots)
Payload (for PUT: the actual tuple data; for DELETE: just the key)

How it works: As the application inserts data and makes changes, we append new log entries to an in-memory buffer in the order they arrive. At some point, the buffer gets full and we write it out to disk sequentially. When the in-memory buffer gets full, we write it out to disk by literally taking the entire contents of the memory page and writing it to disk pages. Then we clear the memory buffer and start filling it with new log entries.

This is different than the tuple-oriented architecture, where we have to fetch the page that has the original tuple and then I can update.

Simple example flow:

Application: PUT record 103, value = “1, A1”
Log entry appended to memory buffer
Application: PUT record 104, value = “2, B2”
Log entry appended to memory buffer
Application: DELETE record 102
Log entry appended to memory buffer

The buffer is organized oldest to newest: Beginning of the buffer contains oldest entries; we keep appending new entries to the end.

Critical Difference from Tuple-Oriented Storage

You don’t need to read the original record to update it. At this lowest level of the system, you don’t need to know what the original tuple’s value was.

Important clarification: Obviously, if you’re executing a SQL query like UPDATE table SET value = value + 1, you need to know the original value—that’s essentially a read followed by a write at the query execution layer. But at this lowest storage layer, we don’t need to fetch the page containing the original tuple before updating it. We just append a new log entry.

This is fundamentally different from tuple-oriented architecture, where you must fetch the page containing the original tuple to update it.

Two Critical Properties

Sequential I/O

The in-memory page might be 1 megabyte or 10 megabytes. When full, we write out those 10 megabytes sequentially to the file on disk. No random access.

Contrast with tuple-oriented storage: You might have 20 tuples scattered across 20 different pages, requiring 20 separate random I/O operations. In log-structured storage, those 20 tuples are always on the same page when written out because they’re sequential log records.

Immutability

Once a page is written to disk, it’s immutable—we never go back and do in-place updates. We’ll compact pages (essentially garbage collection, as we’ll see in a while), but we never overwrite a log record that was already written to disk.

We’re not discussing distributed systems yet, but there’s an advantage to making files immutable. If you can’t do in-place updates on cloud storage, this becomes essential. But even beyond that practical concern, it makes the architecture easier. Appending to a log is essentially what Paxos or Raft consensus algorithms do—adding log records and never going back to make changes. Just add new log entries. This makes the architecture much simpler once data is on disk.

For now, we’re ignoring what happens if we need to write the memory buffer out before it’s completely full—for instance, when a running query or transaction wants to ensure changes are written to disk before telling the outside world data is safely persisted. In that case, you might write the log buffer to a separate location (like local disk where you can do these kinds of writes), but we’ll ignore this complexity for now.

Performance Characteristics

Writes are really fast in log-structured architecture—much faster than tuple-oriented storage because we’re just appending log records and writing them out sequentially.

Reads are potentially slower. As always in computer science and database systems: there’s no free lunch. We’re making writes faster but reads potentially slower.

Reading from Log-Structured Storage

To perform a read (assuming something in the system has figured out the key/ID of the log record we want—102, 103, 104, etc.):

Check the in-memory page first
Start at the end (newest records)
Scan sequentially in reverse order back to the beginning
Stop when you find the log entry you want
If not in memory, go to disk (discussed next)

Is this efficient? No. Scanning backwards through the entire in-memory buffer for every read would be terrible.

The Index Solution

This is where the log-structured merge tree in the textbook comes in, but again, we don’t need to worry about tree details.

The system maintains an index that, for every record ID, tells you:

Where in the in-memory buffer page it’s located, OR
If it’s not in memory, where it is on disk

To find record 104: Just do a lookup in this index. We’re not specifying what data structure it is—it typically will be a B+ tree, but some systems use tries, some use skip lists, but the data structure doesn’t matter. The index tells you the offset in the memory buffer where the data you’re looking for resides.

If you want record 103 and it’s not in memory: The index tells you which disk page to fetch.

It’s like the glossary page at the end of a textbook: you look at the keyword you want, it tells the page number, and you can jump to that page without scanning through the page one after another.

Q&A

Question: Is it possible to implement the index in an append-only file system?

Answer: Yes. You treat the index itself as a log and in memory you build a data structure on top of it. In a B+ tree, when you write pages out to disk, you maintain the data structure (pointers between children and parents). In this environment, you would reconstruct the in-memory index by replaying the log. You can absolutely do this in a read-only or append-only file system. (It’s unclear exactly what RocksDB does specifically.)

Question: Is this index the same as what you get when you run CREATE INDEX in SQL?

Answer: Not exactly. You’re asking if this is the same as a primary key index. Potentially yes, but not always. SQLite is complicated—they have indexed-organized pages and non-table indexes versus primary key table indexes. But think of this index as internal bookkeeping to find records—almost like the page directory. It’s not something you would necessarily expose to SQL queries directly, although you could use it for that purpose.

Question: It seems like we’re only looking at the index that points into the memory page. Why don’t we just maintain pointers to the latest update rather than storing everything that’s happened?

Clarification: The index doesn’t only point to things in memory—it can point to disk as well. And crucially, it only points to the latest version of each key.

Why store everything? Let’s say we have record 102 on disk from an earlier PUT. Then we DELETE 102. If we don’t store that DELETE log record and someone looks for 102, they’d find the old PUT on disk and think it still exists. We can’t just go back and find the page where 102 is stored and pull it out—we can’t do in-place modifications. So we append a DELETE log record. If you go back in time and see 102, you’ll also see it’s been deleted. Then we’ll compact them later to remove the redundant entries.

Question: Why do we need to store the latest log entry for DELETE if absence in the index means it doesn’t exist?

Answer: There might be a PUT for 102 somewhere earlier in the log that got written to disk. If we don’t have the DELETE record, then 102 does exist (on disk from that earlier PUT). We can’t go back and find the disk page where 102 is stored and remove it—we can’t do that kind of in-place modification. So we just append a log record that says “if you see 102, know that it’s been deleted.” Then we’ll compact them later to clean up the extra entries of deleted records.

Compaction: Garbage Collection for Logs

Some log records we don’t need to maintain forever—DELETE records and multiple PUTs over the same key are examples. Log-structured database systems periodically run a background job that compacts pages to coalesce them and reduce redundant operations.

Consider two disk pages (oldest to newest). Compaction identifies the latest entries for each key:

103: Use the PUT from Page 1
104: Use the PUT from Page 2
105: Use the PUT from Page 2 (newer), discard from older entries in Page 2 and the ones in Page 1
101, 102: Deleted

Compacted result contains only:

PUT 103 (from Page 1)

PUT 104 (from Page 2)
PUT 105 (from Page 2)

Here’s a visual representation of compaction:

As mentioned, at some point we may not need to store DELETEs anymore. There’s an upper part of the system that has removed records 101 and 102 from the index. If anybody does a lookup, they’ll get “key not found,” so we don’t need the DELETE log entry anymore. That can be removed during compaction.

This is called Compaction—essentially garbage collection for the log. No free lunch: Log-structured storage makes inserts much faster by just appending to the log, but at some point we have to go clean things up.

The idea is: we compact pages into a compressed form of the log records. This only happens on disk—we can’t do in-place updates, so we’re literally taking one disk page and another disk page, compacting them, and writing out a new page. We can’t overwrite existing pages.

Sorted String Tables (SSTables)

Another important optimization: once a page is on disk and compacted, we know it’s going to be older than anything in memory. After compaction (removing redundant operations for the same key), each key is only referenced once in that disk page. At this point, we don’t care about temporal ordering anymore in the log—we don’t care about newest to oldest for disk pages.

If the operation we need to support is “find me key 103, 104, or 105” in disk pages, temporal ordering doesn’t help us. What we actually want is to sort the disk pages by key values.

Here’s a representation of before/after sorting by key:

Benefits of sorting:

When you fetch a disk page, you know relative ordering of pages (this page is older than that page)
Within each log record, you don’t need to know whether one is older than another
You can build an index or filter to quickly jump to the record you’re looking for rather than having to do binary search across the entire file.

When you do this compaction and sort by key values, these are called Sorted String Tables or SSTables. This term was coined by Jeff Dean and Sanjay Ghemawat when they wrote LevelDB at Google for Bigtable in the mid-2000s.

Each SSTable file (comprised of multiple pages when they get large) will have metadata in the header that keeps track of where offsets are for different keys. This allows quick lookup without scanning the entire file.

Q&A

Question: Wouldn’t the index already point to the corresponding location in the SSTable?

Answer: Not necessarily at the exact offset. You may want to keep a more coarse-grained index that says “here’s the file that has it” rather than the exact offset. Once you get to that file, it will tell you where to find it within the file.

Important clarification: When I say “disk page,” this could be multiple pages for an SSTable file, typically because these things get large (megabytes, gigabytes). So it’s not going to be a single page.

Question: Does the index for SSTables also get written to disk, and does the in-memory index get written to disk?

Answer: Yes, because you don’t want to recreate it upon restart. As mentioned before, either you could write the index pages themselves to disk, or you could maintain a log record that says “here’s how to rebuild the index” by replaying operations.

Question: Why don’t we write the pages in sorted order in the first place? If it’s in memory we could sort before writing.

Answer: That’s exactly what they do! Yes, the in-memory buffer is sorted before being written out.

Question: Wouldn’t compaction have an impact on read performance because you might have to take locks?

Answer: Absolutely yes. Compaction affects read performance. You’re not just taking locks—you’re doing disk I/O. You’re potentially bringing gigabytes of files into memory, compacting them, and writing them back out. There are different types of compaction (discussed next) that try to minimize this impact, but yes, no free lunch.

Log-Structured Compaction Strategies

There are two main ways to do compaction. The terminology here is what RocksDB uses.

1. Universal Compaction

The simplest form: Take adjacent sorted log files on disk (remember, these are multiple pages—think megabytes, gigabytes, or even terabytes) and compact them together.

Process:

Take two adjacent sorted log files
Do a sort-merge (they’re already sorted, so just merge)
Determine which entries are subsumed by others
If you see PUT for key 103 in both files, and one file is older, keep the newer one and discard the older one

You can keep compacting any possible combination of sorted log files into more compact forms, progressively reducing redundancy.

2. Level Compaction

This is where the “level” in LevelDB comes from.

RocksDB is Facebook’s fork of LevelDB. Google wrote LevelDB. The very first thing Facebook did after forking: remove mmap. Then they expanded it and added many other features.

Level compaction structure:

Level 0: Sorted files of a certain size
Keep adding more sorted files at Level 0
At some point, run compaction to combine them into a larger file at Level 1
Keep making more files at the top level
When you have enough at Level 1, run compaction to produce something at Level 2
Cascading down: You get larger and larger files as you go down levels

This creates a hierarchical structure where files get progressively larger and more compacted at lower levels.

RocksDB and Industry Adoption

RocksDB has become the default choice for many database vendors and people building data systems as the underlying storage manager. These systems are essentially log-structured, but they build on top of RocksDB:

SQL parsing layer
SQL execution engine
Indexes
All additional functionality discussed throughout the semester

What RocksDB provides: A key-value API. In the examples shown, we just have “here’s the value, here’s the payload I’m storing in the log”—RocksDB has no notion of attributes or columns.

Implication: Even if your table has 10 columns and you only update one of them, your PUT record must contain all 10 columns. (We’ll see with multi-version concurrency control later in the semester how we can be smarter about this, which essentially looks a lot like log-structured storage. This is actually how Postgres was originally envisioned in the 1980s.)

RocksDB-based systems:

Many modern database systems
CockroachDB (originally used RocksDB, then threw it away and wrote Pebble in Go)
TiDB uses TiKV
Dgraph uses BadgerDB

Custom log-structured implementations:

Cassandra - Has its own log-structured storage
LevelDB - The original from Google
Many others

This is just a sampling—log-structured storage is extremely popular, especially for embedded storage engines.

Problems with Log-Structured Storage Revisited

We’ve already established:

Reads are slower than tuple-oriented storage
Compaction is expensive

What’s another core issue?

The Third Core Issue: Write Amplification

What are we doing during compaction? Earlier we had log records in memory, wrote them out to disk. Now for compaction, we’re reading them back into memory and writing them back out to disk again.

The concept of Write Amplification: For every logical write in your application (insert a tuple, update a single tuple), how many times do we read and write it back to disk? In log-structured storage: Potentially infinite. If you just keep compacting over and over again, for a single logical write you could do dozens of physical writes because you’re bringing data back into memory and writing it out again repeatedly.

In page-oriented architecture with slotted pages, we don’t have this problem. When you update a single tuple:

Bring the page into memory
Update it
Write it back out
If you never update it again, you never write it out again

(We’re ignoring backups and the write-ahead log, which will be covered later in the semester. But if you’re not reading it or using it, you’re not bringing it into memory and writing it back out.)

In log-structured storage, you must keep rewriting data even if it’s not being actively updated because compaction processes keep moving it around.

This is a fundamental tradeoff—the cost of getting fast sequential writes and avoiding the fragmentation problems of tuple-oriented storage.

Index-Organized Storage: A Third Approach

We’ve now discussed two approaches:

Tuple-oriented storage (slotted pages)
Log-structured storage

In both approaches, we rely on indexes to find individual tuples, separate from the core storage of the tables themselves:

In tuple-oriented storage: Pages are unordered. To get a Record ID (page number + slot number), some magical data structure or index provides that information.
In log-structured storage: We need an index that tells us, for a given record ID, where to find the data.

What if we just kept tuples automatically sorted by putting them inside the index itself? Then you don’t have a separate distinction between “here’s the log-structured storage and here’s the index” or “here’s the slotted pages and here’s the index.” It’s all just indexes.

How Index-Organized Storage Works: Assuming we have a tree data structure (could be a hash table, but we’ll assume trees), instead of having leaf nodes with values that provide the Record ID telling you where to find the page, the leaf nodes themselves are the data pages with the tuples.

When you want to do a lookup for key 102:

Follow the index tree
Arrive at the leaf node
The leaf node IS the data page containing the tuple you’re looking for

No separate lookup step to go from “index tells me page 5, slot 3” to actually fetching that page—the index leaf node IS the page.

This looks like a rough diagram of a B+ tree (covered in more detail soon):

Inner nodes: Guide posts that tell you, for a given key, should you go left or right
Leaf nodes: Look like slotted pages, but sorted by key (not random location based on where free space was)

Here’s a visual representation:

When you want key 102:

Traverse the index tree to a leaf node
Arrive at that leaf node (which is a page)
Do binary search on the list of keys in that page
Get an offset to find where the data is

Systems Using Index-Organized Storage

MySQL with InnoDB engine: This is what you get by default—index-organized storage.

SQLite: Also uses index-organized storage. SQLite has an internal primary key called the rowid (visible through SQL but different from the primary key you define in your table). This rowid is the key used for lookups in the index-organized structure. For your logical primary key (like a student email address), there’s a separate index that maps email address → rowid, then you do a lookup in the primary key index using that rowid to get the actual tuple.

SQL Server and Oracle: You can get index-organized storage, but not by default—you have to explicitly request it. In SQLite, you get this by default and cannot turn it off.

Q&A

Question: There are two keys—one for the page and one inside the page?

Answer: No. If it’s SQLite and you’re looking for rowid = 1, you traverse the index (the entire tree is sorted by rowid), land in a page, then find within that page where rowid = 1 is located using binary search on the sorted keys in that page. The entire tree is sorted—it has to be because it’s a balanced tree.

Question: Does this approach still suffer from downsides like fragmentation and random I/O?

Answer: Great question. Yes, it does suffer from some of these issues. Fragmentation: Unavoidable because in a B+ tree, nodes need to be at least half full. You’re going to have leaf nodes with empty space—that’s unavoidable in the tree structure. Random I/O: If you’re doing updates to random locations in the leaf nodes, yes, random I/O is unavoidable. But there’s an important exception: If you’re just inserting (not updating), using SQLite’s rowid as an example—the rowid is just an internal counter that increments for every new tuple (1, 2, 3, 4, 5, 6, 7—monotonically increasing). If you just keep inserting to SQLite, you’re just appending to one side of the tree and not touching the other side.

Performance characteristics:

Not as good as sequential I/O in log-structured storage
Better than completely random I/O in tuple-oriented storage
The tree guides you to only update pages on one side for sequential inserts

There is some benefit to it for certain workload patterns.

Data Representation

Now that we understand file organization and page organization, let’s discuss what’s actually inside a tuple.

A tuple is just a sequence of bytes. It’s the database system’s job, based on the schema stored in its catalog (defined in CREATE TABLE with attributes and types), to interpret what those bytes are and how to perform operations on them.

Example: If you have column_a + column_b where column A is a 32-bit integer and column B is a 64-bit integer, the database system knows it needs to do the addition operator based on those two types.

Think of a basic Tuple Structure as a byte buffer or char array:

Tuple header - Contains metadata (size, nulls, transaction information)
First column - id column (e.g., 32 bits for an integer)
Second column - Value column (e.g., 64 bits for a bigint)
Additional columns follow

How the system interprets this:

Gets the starting location of the tuple (via slot array or however we access it)
Header is always the same size for every tuple—jump past it
Simple arithmetic: “Offset of first column is X bits/bytes after header”
For second column: “Offset is Y bits/bytes after header”

For VARCHAR, it’s more complicated—you have to store the length of the field (could be in the header or inline), but the principle is the same.

Essentially, you’re taking an address and doing reinterpret_cast to tell the system to treat that address as a 32-bit integer, 64-bit integer, or whatever the type is.

Q&A

Question: Is it the database engine doing the interpretation?

Answer: Yes, who else would be doing it? We’re writing SQL—there’s no reinterpret_cast in SQL. This is the implementation layer.

Data Alignment: A Critical Concern

One thing we need to be careful about as we store bits: dealing with alignment to ensure data is aligned to how the CPU actually wants to operate on it.

We need to ensure all attributes are aligned based on the word boundaries of the CPU or whatever architecture we’re running on. This ensures:

No unexpected behavior when doing operations on data
The CPU doesn’t have to do extra work

Misaligned Data Example: Consider a table with four columns:

CREATE TABLE foo (
	id INT PRIMARY KEY,
	cdate TIMESTAMP,
	color CHAR(2),
	zipcode INT
);

id (32-bit integer)
cdate (64-bit timestamp)
color (2-byte CHAR)
zipcode (32-bits integer)

Assume we break our char array (tuple) into 64-bit words. Cache lines are 64 bytes, but Postgres and other systems organize based on 64-bit words.

Storage without considering alignment:

Store id (32 bits) → First 32 bits of word 1
Store cdate (64 bits) → Remaining 32 bits of word 1 + first 32 bits of word 2
Continue with other fields

The problem: The cdate attribute spans two words—partly in word 1, partly in word 2.

What happens when you try to access memory that spans word boundaries in a CPU?

x86 (Intel) behavior: Intel likes to make life easy and hide complexity. They’ll do the extra reads for you—no error, but now your database runs slower. What should have been one register read or one cache line read to fetch something into a CPU register becomes two cache line reads. But there’s no error—Intel handles it transparently.
ARM behavior (older versions): They would reject it and throw an error, hoping you would catch it and fix your alignment. In newer ARM versions, they handle it like Intel does.
Rare but possible: The system does the reads for you but there’s no guarantee the bits land in the right order. You might do two reads to get the first word and the second word to assemble the date attribute, but it may put the last bits in front of the first bits. This sounds terrible, but older CPUs would do this—your program would have random errors and messed-up data, which people would notice and complain about. This is part of why Intel tries to hide it even though it makes things slower.

We must ensure none of the attributes in our tuple span word boundaries.

Solution 1: Padding

The basic idea: As you add attributes going across words, if you recognize the next attribute doesn’t fit within a single word, just add zeros (padding) and start the attribute in the next word.

Internal bookkeeping: The system knows that after the id, the cdate attribute is in the next word, so it ignores the padding bytes.

Example:

Word 1: id (32 bits) + padding (32 bits)
Word 2: cdate (64 bits)
Word 3: color (16 bits) + zipcode (32 bits) + padding (16 bits)

Solution 2: Reordering

The idea: Keep the logical view of the table (whatever you defined in CREATE TABLE), but underneath the covers, reorder attributes to pack things better.

Most systems don’t do this automatically. Some academic systems do (like ones built at the instructor’s lab), but most production systems lay out bits exactly as you tell them and add padding as needed.

How reordering works:

Logical view: id, cdate, color, zipcode
Physical storage: Reorder to pack efficiently
If necessary, add padding at the end
System maintains mapping between logical and physical layout

Q&A

Question: How are VARCHARs handled with padding and reordering?

Answer: In systems that do automatic reordering, you don’t store VARCHARs inline unless they’re 64 bits or less. Instead, you store a pointer to some other location (external overflow pages or oversized attribute tables). This way you can do reordering without worrying about variable-length data. Store fixed-length attributes optimally, and pointers to variable-length data.

Question: Do we need the last padding (at the end of the tuple)?

Answer: Generally no. Once you reach the end of the tuple, you don’t need padding after the last attribute unless there’s some system-level reason to align tuple sizes to specific boundaries.

PostgreSQL Demonstration: Seeing Alignment in Action

PostgreSQL will not do automatic reordering, but it will do padding. We can see how reordering the CREATE TABLE statement affects storage.

PostgreSQL has row(value1, value2, ...) function that creates a row (tuple) from comma-separated values:

SELECT row(1, 2, 3);

It’s possible use ::type to cast values to specific types (intX stands for X-bytes integer):

SELECT row(1::int2, 2::int4, 3::int8);

It’s possible to use pg_column_size function that tells the size of the row in bytes:

SELECT pg_column_size(row(1, 2, 3));
-- Result: 36 bytes
 
SELECT pg_column_size(row(1::int2, 2::int4, 3::int8));
-- Result: 40 bytes

Intermixing CHARs with integers:

SELECT pg_column_size(row('a'::char, 2::int2, 'b'::char, 4::int4, 'c'::char, 8::int8));
-- Result: 48 bytes

Reordering: integers first, then chars:

SELECT pg_column_size(row(8::int8, 4::int4, 2::int2, 'a'::char, 'b'::char, 'c'::char));
-- Result: 44 bytes

Why the difference? PostgreSQL pads to ensure 64-bit alignment, but when you intermix types, you end up with more padding. By putting all integers first (especially ordering by size—int, int, bigint) and then all chars at the end, you minimize padding requirements.

PostgreSQL doesn’t do this automatically—you have to structure your CREATE TABLE statements intelligently if you care about storage efficiency. Systems that can do automatic reordering: Some academic systems optimize this automatically, but it’s rare in production databases.

SQL Data Types and Storage

Now let’s discuss how database systems represent core SQL data types.

Integer Types

All integer data types are the same as C++ implementations—same representation you get when you allocate an int, long, or short variable in C++. Why? Because that’s what the hardware supports. There’s a standard (two’s complement) for representing signed or unsigned integers. What you get in C++ follows the standard, the hardware supports it, and that’s what you get in SQL.

No special database magic—just using hardware-native representations.

Floating-Point Numbers

Floating-point or real numbers: Defined in the IEEE 754 standard, which specifies how hardware should represent decimal numbers.

Every database system also provides fixed-point decimals: NUMERIC or DECIMAL types, where each implementation differs per system. We’ll see the performance difference between these approaches shortly.

VARCHAR, Binary Text, and BLOBs

These are typically stored with:

Header containing the length
Bytes of the actual value

If the value is too large to store inline (within the tuple in the page), there’s a pointer to some other page (overflow page) containing the data.

Whether it’s stored inline or externally: For everything under 64 bits, systems store it inline. Beyond that, whether it goes to an overflow page depends on the implementation**.

Timestamps, Dates, and Intervals

These are typically 32-bit or 64-bit integers representing the number of milliseconds or microseconds since the Unix epoch (January 1st, 1970).

Time zone storage: If you want to store with time zone information, typically they’ll store as UTC time (GMT zero) and additional metadata indicating what time zone you’re in. The system converts as needed. The system handles the conversion transparently—you don’t have to worry about it in SQL queries.

Hardware Dependence and Portability

For integer, float, and other types relying on hardware representation: You’re using whatever the hardware provides.

Portability issue: This typically means you cannot copy raw database files from one architecture to another. If one is big-endian and another is little-endian (x86 is little-endian; Power and ARM are big-endian), you can’t take binary files from a database system running on one and use them on another—the bits will be flipped and data will be corrupted.

SQLite avoids this problem: SQLite stores everything as VARCHARs and at runtime casts things based on the type in the attribute definition. This gives portability—no matter where you use the file, it will always be in the right order because it’s stored as text. The downside is performance—type casting at runtime for every operation.

Variable-Precision vs. Fixed-Precision Numbers

For variable-precision numbers (FLOAT, REAL, DOUBLE in SQL), we rely on the C++ implementation—same as float or double in C++. These are typically faster than fixed-point numbers because the hardware can natively support them.

The problem: They cannot guarantee correctness of values when doing larger calculations because of rounding issues. You cannot store decimals exactly in hardware—it’s always an approximation.

The Classic Floating-Point Precision Problem

Everyone has probably seen a simple test program like this when first learning C or C++:

#include <stdio.h>
 
int main(int argc, char* argv[]) {
	float x = 0.1;
	float y = 0.2;
	printf("x+y = %f\n", x + y);
	printf("0.3 = %f\n", 0.3);
}

Result:

x+y = 0.300000
0.3 = 0.300000

But if you increase precision:

#include <stdio.h>
 
int main(int argc, char* argv[]) {
	float x = 0.1;
	float y = 0.2;
	printf("x+y = %20f\n", x + y);
	printf("0.3 = %20f\n", 0.3);
}

Results:

x+y = 0.330000001192092895508
0.3 = 0.299999999999999998890

Why? The hardware cannot represent 0.3 exactly—it’s an approximation based on IEEE 754 format.

When This Matters?

Simple program: If you’re just doing x + y and printing to a human with default precision, maybe that’s not a big deal.
When it becomes critical:
- Complex calculations for landing something on the moon
- Satellite space trajectories
- Bank account interest calculations

Rounding errors accumulate over many operations. This becomes noticeable and people complain (especially with money).

Fixed-Precision Decimals: Database Solution

For this reason, database systems provide fixed-precision numbers or fixed-point decimals where the database system does extra work to ensure you don’t have rounding errors.

Available in other languages:

Java: BigDecimal
Python: Decimal type

All systems do something different, but at a high level, they store a variable-length representation of the number plus additional metadata:

Where the decimal point is
Whether it’s signed or unsigned
Whether it’s negative
Whether it’s NaN (not a number)

The system has to do this extra work because the hardware cannot guarantee correctness.

PostgreSQL’s NUMERIC Implementation

We can see from the actual source code that PostgreSQL represents the type of a numeric as some kind of struct with a bunch of additional metadata about what the number actually is:

But the core thing they’re storing internally along with this metadata is the NumericDigit array, that is just a typecast to an unsigned char array. They’re literally storing your decimal as a string value and then they use this metadata to figure out how to interpret that string to put it be the correct form.

The hardware knows nothing about this—this is what the database system has implemented.

You cannot just do x + y like in C++—you have to do more complicated arithmetic. Here’s a brief snippet of the addition function for two numerics in PostgreSQL:

This is obviously way more expensive than calling a single CPU instruction (add for x + y).

MySQL has the same issue. They store digits as a 32-bit integer array with additional metadata to track what the numeric type actually is. Just like PostgreSQL, they have their own implementations of addition with all the additional checks:

And just like Postgres we can see their own implementation of doing addition:

It’s not sexy, but you need it if you want correct decimal calculations, especially for financial data.

In the sake of time, if there’s time at the end we can do a demo. But the numeric versions (database-implemented decimals) are about 2× slower than hardware float/double versions.

The tradeoff:

Hardware FLOAT/DOUBLE: Fast but imprecise
Database NUMERIC/DECIMAL: Slower but correct

When to use each: Depends on your application. If precision matters (financial calculations, scientific measurements requiring exact values), use NUMERIC despite the performance cost. If approximate values are acceptable and performance is critical, use FLOAT.

Handling NULL Values

1. Null Column Bitmap Header

The most common way to handle NULLs: For every tuple, the header contains a bitmap that keeps track of which attributes are set to NULL for that tuple and the size of the bitmap varies based on the number of attributes. We know which columns can be NULL because it’s in the CREATE TABLE statement. The advantage is that, because we have a schema (not just storing JSON), we know whether a column has been declared as NOT NULL. If it has, we don’t need to store a bitmap entry for it. Only attributes that can be NULL need bitmap entries.

2. Special Values

Less common approach: Have special values within the range of each type that, if present, mean the value is NULL.

Example for 32-bit integers: Say the 32-bit minimum number (like -2,147,483,648) means NULL. If a value is that number, treat it as NULL.

Downside: One less value you can potentially store. Also requires extra logic throughout the system: “If looking at a 32-bit integer and it equals that min value, then it’s NULL—don’t let people insert it arbitrarily.”

Even more edge case logic needed for each type. This approach adds complexity.

3. Per Attribute Null Flag

The worst approach (only seen in one system): For every single attribute in the tuple, have a little flag in front of it indicating whether it’s NULL.

Why this is terrible: Remember alignment discussion—you can’t have a 32-bit integer with one bit in front of it without breaking alignment. You’d have to store another byte or possibly even pad to make things align properly.

Storage waste: To store a 32-bit integer and track if it’s NULL, if you’re putting a flag in front and need 64-bit alignment, you might have to store another 32 bits just to have one bit say it’s NULL. Extremely wasteful.

Only one system did this: MemSQL (the earlier name of SingleStore). Despite them sponsoring the class, this was one of the worst ideas ever. They got rid of it and now use the column header bitmap approach like everyone else.

“I don’t have a screenshot here—I’ll post it on Slack.”

Variable length values. These represent data types of arbitrary length. They are typically stored with a header that keeps track of the length of the string to make it easy to jump to the next value. It may also contain a checksum for the data. Most DBMSs do not allow a tuple to exceed the size of a single page. The ones that do store the data on a special “overflow” page and have the tuple contain a reference to that page. These overflow pages can contain pointers to additional overflow pages until all the data can be stored. Some systems will let you store these large values in an external file, and then the tuple will contain a pointer to that file. For example, if the database is storing photo information, the DBMS can store the photos in the external files rather than having them take up large amounts of space in the DBMS. One downside of this is that the DBMS cannot manipulate the contents of this file. Thus, there are no durability or transaction protections. Examples: VARCHAR, VARBINARY, TEXT, BLOB.
Dates/Times. Representations for date/time vary for different systems. Typically, these are represented as some unit time (micro/milli)seconds since the unix epoch. Examples: TIME, DATE, TIMESTAMP.

Because for integer types we’re relying on the hardware to store data, that typically means you can’t copy the raw database files that you generate from one architecture to another because the bits are going to be flip and it’ll get messed up. SQLite avoids this problem because it stores everything as VARCHAR, and at runtime they cast things base on the type in the attribute. In this way they can guarantee portability.

Large Values and Overflow Pages

For really large variable-length values, most database systems will not let you store them directly in the page itself. Remember: page size is defined by the database system, and every page within that table must have the same page size. You can’t have variable-length pages (there’s an experimental system in Germany that supports this, but nobody else does).

At some point, you must decide: Should I store this large VARCHAR/large string in my tuple page or not? When it exceeds a threshold, it goes into an overflow page.

PostgreSQL: Calls them TOAST (The Oversized-Attribute Storage Technique—or something like that). Any attribute larger than 2 kilobytes is stored as a separate page. Mechanism: in the actual tuple, you just have a pointer or Record ID and offset pointing to where to find the actual value.

Transparency: As the SQL programmer, you don’t know this and don’t care. You SELECT * from the query, and the database system:

Reads the base tuple
Recognizes it’s pointing to an overflow page
Goes and gets that data
Copies it into the buffer
Produces output for you

MySQL: Overflow size is half the current page size.

SQL Server: Surprisingly, the default is if it exceeds the size of the page, then it overflows. More precisely: If the data you’re trying to store in this oversized attribute plus the regular fixed-length data combined exceeds the page size, then they’ll put the oversized data to another page.

Chaining overflow pages: If you want to store (for whatever reason) a 1GB video or 10GB video, the database can let you do that. The overflow page (since they all have to be the same fixed length) could have a pointer to the next page. You follow that linked list to get all the data and put it back together.

Large Values and External File Storage

The last approach: External value storage, where the database system does not store the large data in pages it manages. Instead, it writes it out to your local file system and internally stores the URI or URL of where that data is located.

When you query, the database fetches that attribute, goes to the OS, gets that data, copies it into its buffer, and hands it back to you.

Systems supporting this:

Oracle: Called BFiles (Binary Files)
SQL Server: Called FileStreams

PostgreSQL: You can do this with foreign data wrappers—additional mechanisms to store data in cloud storage. Within a single SQL interface, you can fetch data that appears as if it’s in the table itself.

When we write to external files, the database system cannot make changes to it. It’s written out to the file system—the database won’t go make in-place updates. It can only read it in. If you delete the tuple pointing to this file, there are mechanisms to decide whether you want to delete the file as well.

Why Use External Storage?

Primary reason: You don’t want to store a 10GB file in your database system for management reasons. That becomes expensive—it’s a log record you have to manage, which is costly.
Storage cost: Database systems typically run on higher-end hardware, making storage expensive.
- Example—Amazon RDS: Charges about 4× more for storage than EBS (Elastic Block Storage). EBS itself is more expensive than locally attached disk.

You don’t want to store large files (maybe read-only, not frequently accessed) directly in database-managed files. Let the OS handle it on cheaper storage.

Guidelines: When to Use External Storage

Jim Gray’s paper (Turing Award winner for databases in the 90s, invented much of what we’re discussing): He wrote a paper at Microsoft about 15 years ago discussing whether you should store large data in a database system or not.

His recommendation: Anything larger than 256 kilobytes, store it externally.

This is outdated—written a while ago. Storage and hardware have changed significantly.

SQLite creator’s guidance (from a CMU talk five years ago): In his experience, it’s actually better to store things in SQLite if you’re running on something like a phone app.

Example: If your application has a bunch of thumbnail images, you’re better off storing them in the database system because:

Your application already has the handle to the database open
Files are already open
Much faster to retrieve thumbnails directly from the database versus doing a bunch of fopen and fread calls to many files on disk

Modern guideline (pure conjecture): 50 megabytes or less is probably okay. Anything beyond that, use external storage.

Django and other application frameworks: They have mechanisms to handle external large object storage transparently

Quartz 4

Explorer

04. Database Storage Part 2

0. Resources

Review: Tuple-Oriented Storage Architecture

Problems with Tuple-Oriented Storage

1. Fragmentation

2. Useless Disk I/O

3. Random Disk I/O

4. In-Place Update Limitations

Log-Structured Storage: An Alternative Approach

Core Concept: Log Records of Changes

Critical Difference from Tuple-Oriented Storage

Two Critical Properties

Sequential I/O

Immutability

Performance Characteristics

Reading from Log-Structured Storage

The Index Solution

Compaction: Garbage Collection for Logs

Sorted String Tables (SSTables)

Log-Structured Compaction Strategies

1. Universal Compaction

2. Level Compaction

RocksDB and Industry Adoption

Problems with Log-Structured Storage Revisited

The Third Core Issue: Write Amplification

Index-Organized Storage: A Third Approach

Systems Using Index-Organized Storage

Performance characteristics:

Data Representation

Data Alignment: A Critical Concern

Solution 1: Padding

Solution 2: Reordering

PostgreSQL Demonstration: Seeing Alignment in Action

SQL Data Types and Storage

Integer Types

Floating-Point Numbers

VARCHAR, Binary Text, and BLOBs

Timestamps, Dates, and Intervals

Hardware Dependence and Portability

Variable-Precision vs. Fixed-Precision Numbers

The Classic Floating-Point Precision Problem

Fixed-Precision Decimals: Database Solution

PostgreSQL’s NUMERIC Implementation

Handling NULL Values

1. Null Column Bitmap Header

2. Special Values

3. Per Attribute Null Flag

Large Values and Overflow Pages

Large Values and External File Storage

Guidelines: When to Use External Storage

Graph View

Table of Contents