0. Resources

F2023 #04 - Database Storage Part 2 (CMU Intro to Database Systems)

Last class discussed alternative approaches to the tuple-oriented (slotted page) storage scheme. The primary focus was log-structured storage, where instead of storing actual tuples, you store log entries of changes made to tuples.

The three storage approaches covered:

Tuple-oriented storage (slotted pages)
Log-structured storage
Index-organized storage

These approaches are ideal for write-heavy workloads—applications doing many inserts, updates, or deletes. Log-structured storage is obviously better for this because you’re just appending to the log.

For most applications: When you start off, you’re going to have a potentially write-heavy workload. But there are some applications, environments, or workloads where you don’t care about getting the best performance for writes—you want the best performance for reads. For those scenarios, these approaches may not be the best way to approach storage.

This lecture will discuss broad categories of database applications and provide motivation for why we might want an alternative storage scheme where we don’t store everything as rows (tuples with all attributes together).

Database Workloads

Let’s see some broad categories or database applications and some motivation for why we want to look at an alternative storage scheme:

On-Line Transaction Processing (OLTP)
On-Line Analytical Processing (OLAP)
Hybrid Transaction + Analytical Processing

On-Line Transaction Processing (OLTP)

OLTP applications are systems where you’re ingesting new data from the outside world and serving many users simultaneously. An OLTP workload is characterized by fast, short running operations, repetitive operations and simple queries that operate on single entity at a time. OLTP workloads typically handle more writes than reads, and only read/update a small amount of data each time.

The canonical example: Amazon. When you go to the Amazon website:

You look at products
Click things and add them to your cart
Purchase items
Go to your account information and update your mailing address or payment information

These are all OLTP-style workloads because you’re making changes to a small subset of the database—updating your cart, updating your payment information.

Other examples:

Posting things on Reddit or Hacker News
Making small changes to potentially a large database
Each query or operation makes a small change
The amount of data read is small
Operations target a single entity (one user, one order, one product)

OLAP: Online Analytical Processing

OLAP workloads are where you run queries that extract or extrapolate new information across the entire dataset. An OLAP workload is characterized by long-running, complex queries and reads on large portions of the database. In OLAP workloads, the database system is often analyzing and deriving new data from existing data collected on the OLTP side. It’s about running queries that extract new information across the entire dataset. OLAP operations are primarily read-heavy or read-only, so I’m not doing single updates.

Example OLAP query: Amazon running a query that says “Find me the number one sold product in the state of Pennsylvania on a Saturday when the temperature is above 80 degrees.”

Characteristics:

Not looking at a single person or entity
Looking across entire tables
Potentially doing many joins
Combining multiple sources of information
Similar to queries in Homework 1

These OLAP workloads are primarily read-heavy or read-only—you’re not doing single updates, you’re doing large scans and joins over big tables.

HTAP: Hybrid Transaction + Analytical Processing

HTAP is a buzzword from industry analysts (Gartner) describing applications where you want to do both OLTP-style workloads and OLAP workloads, potentially in the same system.

Instead of taking all transactional data and putting it into a separate data warehouse to do analytics, maybe you can do some analytics directly as data comes in.

This will be discussed throughout the semester, but the main two categories to focus on are OLTP and OLAP.

Visualizing Workload Types

Use a simple grid where along the x-axis we’re saying wether the workload is made read-heavy vs write-heavy and on the y-axis we’re saying how complex the queries are:

OLTP (bottom-left corner):

Potentially many updates
Queries are really simple
Example: SELECT * FROM account WHERE id = 'Andy'
Getting single things, single entities

OLAP (top-right corner):

Mostly reads
SELECT queries are much more complex than OLTP
Think about Q9 and Q10 from Homework 1
Complex aggregations, joins, and analytics

HTAP would be somewhere in the middle of this spectrum.

We’ll talk about why the things we talked about so far they’re gonna be good for OLTP, but not for OLAP, and then we’ll design a storage scheme that is better for OLAP. To understand why approaches discussed so far are good for OLTP but not OLAP, let’s use a real example: the Wikipedia database.

Historical Context Terminology

OLTP terminology: Goes back to the 1980s

OLAP terminology: Comes from the 1990s

Jim Gray (famous database researcher who invented much of what we discuss this semester) wrote an article in the early 1990s saying “there’s this new category of workloads called OLAP and we should pay attention to it.”

The controversy: It turns out he was actually getting paid by a company trying to sell an OLAP database system product in the early 1990s. The paper got retracted, but the name stuck around.

Jim Gray won the Turing Award for databases in 1996—he’s a very famous database researcher.

Jim Gray famously got lost at sea in the San Francisco Bay in 2006. He was out sailing by himself and his boat disappeared. This was one of the early examples of crowdsourcing: They moved satellites to take pictures of San Francisco Bay, and people looked at images trying to find the boat. They never found him.

The instructor never met him, but many of the concepts discussed (like the Pluto metaphor for disk access times versus reading a book in front of you) came from Jim Gray. He had many interesting analogies like that.

Example: Wikipedia Database

Wikipedia is open-source—it runs software called MediaWiki, runs on MySQL and PHP. You can go look at it. The schema roughly looks like this:

useracct - People making changes
pages - The articles in Wikipedia
revisions - Edits/changes to those articles

Relationships:

Foreign key from revision → user (who created the change)
Foreign key from revision → page (which page was edited)
Foreign key from page → revision (to find the latest revision)
All the text itself goes in the revision table

Important reminder: The relational model does not define or specify anything about how we should store data in a table. In all examples shown so far, we’ve been showing tuples with every table having all attributes one after another. Yes, there were overflow pages for large attributes, but in general, all the smaller attributes are stored together. But there’s nothing about the relational model that says you have to do that. It’s just what we as humans came up with first because it’s easy to conceptually think about. For OLAP workloads, this may not be the best approach.

OLTP queries

For OLTP, you have many simple queries that read or write a small amount of data relative to all data in the database.

Example 1: Get latest revision for a page (this retrieves one page and one revision):

SELECT P.*, R.*
FROM pages AS p 
INNER JOIN revision as R
ON P.latest = R.revID
WHERE P.pageID = ?

Example 2: Update user login timestamp (updating a single user account with the timestamp of when they logged in and the hostname):

UPDATE useracct
SET lastLogin = NOW(),
hostname = ?
WHERE user_id = ?

Example 3: Insert new revision (inserting a single row):

INSERT INTO revision VALUES (...)

This is what you usually end up with when building a brand new application. If you’re creating a startup and building an online service, you start with something looking like this because you don’t have any data initially—you need to get it. You make a website, and your website runs these kinds of queries.

OLAP queries

For OLAP, you do more complicated things requiring you to look at larger portions of the table.

Example: Congressional Wikipedia Editing Scandal. This is a rough approximation of a real query people were running:

- [ ] SELECT COUNT(U.lastLogin), EXTRACT(month FROM U.lastLogin) as month
FROM useracct AS U 
WHERE U.hostname LIKE '%.gov' 
GROUP BY EXTRACT(month FROM U.lastLogin)

This finds all login attempts from users with IP addresses or hostnames ending in .gov.

The scandal (late 2000s/early 2010s): People in Congress were having their staff update Wikipedia to say more flattering things about the congressman or congresswoman. Mike Pence did this, Joe Biden did this.

This query could find all people doing that—essentially paying government employees to update Wikipedia, which they shouldn’t be doing.

This is the type of query you’d execute on data after you’ve already collected it from the OLTP portion of the application—analytics on historical data.

SELECT COUNT(U.lastLogin),
		EXTRACT(month FROM U.lastLogin) AS month
FROM useracct AS U
WHERE U.hostname LIKE '%.gov'
GROUP BY
EXTRACT(month FROM. U.lastLogin)

You execute these workloads on the data you have collected from your OLTP application(s).

Storage Models

A DBMS’s storage model is how the database system physically organizes tuples on disk and in memory relative to other tuples and their own attributes.

Until now, we’ve been assuming all attributes are contiguous for a tuple—this is roughly called a row store. For OLAP, that may not be the best approach, and we’ll see why.

The reason why we have to discuss this part of as system is because there is a clear distinction in the database marketplace between:

Row store system - Use for OLTP
Column store system - Use for OLAP

If anybody tries to say “I have a fast row store you can use for analytics,” you should be very skeptical.

There are different ways to store tuples in pages:

N-ary Storage Model (NSM) - The row store
Decomposition Storage Model (DSM) - The column store
Hybrid approach: PAX (Partition Attributes Across) - The most common one for column stores

When people say they have a column store, they really have the PAX one, but it’s not a major difference from pure column stores.

N-ary Storage Model (NSM): Row Store

This is what we’ve assumed so far this semester: All attributes for a given tuple are stored contiguously in a single page, one after another.

The idea: You’re going across the page, laying out all data for a given tuple, and you don’t start laying down any bits for the next tuple until you finish the current tuple.

This approach is ideal for OLTP workloads because most OLTP queries access a single entity or single tuple. You can go to a single page and get all the data you need for that single entity—that’s really all you need to satisfy the query.

Page sizes: Always some multiple of hundreds of bytes (we’ve discussed this: 4KB, 8KB, 16KB, etc.).

NSM: Physical Organization

The layout is basically the same as before:

Database page with header
Slot array
As you scan through the table and insert data, append entries to the end
Keep adding more and more until the page fills up

When a query comes along (SELECT * FROM table WHERE id = ?):

Need to get one page
Jump to the offset defined in the slot array
Get all the data needed

Let’s see how this works in the Wikipedia example where we have someone wants to login with username and password:

SELECT * FROM useracct
WHERE userName = ?
AND userPass = ?

Checking whether credentials match—roughly how authentication works in database-backed applications. Ignore for now how we find the data (assume there’s some index, hash table, or B+ tree that gives us the Record ID and offset).

Process:

Go to page directory and find the page with the data
Look in slot array
Jump to offset
Have all the data needed for the query

This is ideal for OLTP because all data is contiguous.

Same for INSERT:

Look in page directory to find a page with a free slot
Bring it into memory (assume it’s a particular page)
Append to the end

Consider the earlier OLAP query: Finding number of logins per month for users with .gov hostnames:

SELECT COUNT(U.lastLogin),
		EXTRACT(month FROM U.lastLogin) AS month
FROM useracct AS U
WHERE U.hostname LIKE '%.gov'
GROUP BY
EXTRACT(month FROM. U.lastLogin)

What happens:

Must scan all pages in the table (need to look at all user accounts)
When bringing a page in, need to execute the WHERE clause on hostname
Need to look at tuples in the page where the hostname predicate is satisfied
The only data really needed is just the hostname attribute
Then need to do aggregation on last_login for the GROUP BY

The only attributes actually needed:

hostname (for the WHERE clause)
last_login (for the GROUP BY aggregation)

The obvious problem: You have to go through all the rows and brought in a bunch of data you don’t actually need. To get just the attributes needed, you had to bring in the entire page. The entire page contains attributes like:

user_id
username
user_password

You don’t need these for this query. You’re doing useless I/O—fetching data from disk that you don’t even need at all.

Not only is this slow, but in some systems (like Amazon Aurora), you pay per disk IOPS. You pay for the number of I/O operations for a query. In this case, you’d be paying for data you don’t actually need.

Summary: Why NSM is Bad for OLAP

Problem 1: Bringing in Unnecessary Data

NSM is great for:

Insert, update, delete operations
Queries that need all data for a single entity

NSM is bad for queries that:

Scan large portions of a table
Only need a subset of attributes (which most OLAP queries do)

It’s very rare to do SELECT * on a really wide, huge table because you’re basically dumping the whole thing out. There are utilities to make that faster, but typically OLAP queries are selective about which columns they need.

Problem 2: Non-Sequential Memory Access

This is a lower-level detail, but think about how you’d actually execute the query to run the predicate: you’re jumping around to different locations in memory to do the scan:

Read the header for the first tuple
Figure out how far to jump to get to the hostname
Maybe look at lastLogin if computing aggregates
Jump down to the next tuple
Jump over to its hostname attribute
Repeat…

In a modern superscalar CPU, this is terrible because there are:

Many non-sequential operations
Could be non-deterministic
Memory locations accessed aren’t random (always increasing order), but not reading contiguous strides of memory
Can’t crunch through data very quickly

This isn’t covered deeply this semester, but it’s worth understanding. We’ll see this when covering compression and query execution.

Problem 3: Poor Compression Opportunities

When trying to compress data to reduce amount or pack more data within a single page, all attributes for a given table are thrown together in that page. There’s less chance for:

Repeatability
Identifying that values are the same and can be compressed well

Think about the data domains:

userID: Integer (could be sequential or random)
userName: Random strings
userPassword: Random strings (hashed)

They’re different value domains mixed together and this is not ideal for compression because there’s less uniformity and predictability within a page.

Decomposition Storage Model (DSM)

In the Decomposition Storage Model, the DBMS, instead of storing all attributes for a single tuple together in a page, we store a single attribute (column) for all tuples in a single page. Thus, it is also known as a “column store.” This model is ideal for OLAP workloads with many read-only queries that perform large scans over a subset of the table’s attributes that we really need.

Example: That lastLogin field—instead of having it intermixed with other attributes within a single page, only have lastLogin in dedicated pages.

DSM is ideal for OLAP queries that scan the entire table and only need a subset of attributes:

When you fetch a page from disk, you’re only getting data for attributes you actually need
Not fetching other things that just get carried along for the ride

The benefit of declarative language (SQL): You don’t have to know or care whether you’re running on a row storage system versus a column storage system. Your same SQL query works just fine. The database system’s responsibility (us, the people building it): Take data, split it up into separate columns/attributes, and stitch it back together when needed to produce results.

DSM: Physical Organization

Think of it this way: For the first column (first attribute, Col A):

Separate file with a bunch of pages
Header telling us what’s inside the page
NULL bitmap for all values within that column
Contiguous values for all tuples in the table

Do the same for the next column, and the next one, and so on.

These files are still broken up as database pages (4KB, 8KB, whatever the system supports), but each file contains data for only a single attribute and metadata overhead for column files is much less than a row store because you don’t have to keep track of:

Every single column’s NULL status individually
Different information about offsets for finding different attributes
Where to find things within a heterogeneous tuple

All stored because:

It’s all the same value domain
All the same attribute type
Much less metadata than a row store

DSM Example: Wikipedia OLAP Query

Going back to the Wikipedia example, we take every column for the table and store it as a separate page:

For the hostname example: Within a single page, we’re only storing values for the hostname column. We have separate pages for all other attributes.

Executing the OLAP Query

Let’s take again the query we examined before that counts the logins per month with .gov hostnames:

SELECT COUNT(U.lastLogin),
		EXTRACT(month FROM U.lastLogin) AS month
FROM useracct AS U
WHERE U.hostname LIKE '%.gov'
GROUP BY
EXTRACT(month FROM. U.lastLogin)

First part of execution: Get the hostname data

Assuming one page per attribute (simplified), fetch that one page
Do the scan, ripping through the column
Identify all matches for the hostname predicate

There is 100% utilization of all data brought in—only bringing in data needed for this query, not attributes we don’t care about.

Second part: Now need lastLogin data

We’ll talk later (in query execution) about how to match things up
Assuming we keep track of a list of offsets of tuples within the hostname column that match the predicate
Go to the lastLogin page, fetch it
Again, only has data we need
Know how to jump to different offsets of matching hostnames to find the right lastLogin timestamp
Compute whatever needed for the query

Q&A

Question: If in the row store example there’s some index that magically gets you to the single record (using WHERE username = ?), how does that work in column stores?

Answer: The username index gives you the Record ID or Record IDs (if non-unique) that match. From the Record ID, you look in the page directory to find what page you need. If it’s in memory, great; if not, go get it from disk. Then you use the slot number from the Record ID to jump to that slot in the page.

Important distinction in column stores: You only need one username (assuming usernames are unique), so you fetch that one record. But you had to fetch all these other rows you don’t actually need if this is a row store.

In a column store: You’d have separate pages for each attribute, and you’d only fetch the pages for attributes you actually need.

Question: Just to understand—the only difference between column-based and row-based is slightly more implementation on the DBMS side? In row-based, you have access to the whole row and can compare hostname and login at a single instance. In column-based, you have to have more intelligence to understand what you’re putting out because you’re matching two different columns.

Answer: I would say they’re equally hard if you don’t care about other protections like transactions.

If you don’t care about those things, then yes, a row store would be easier to implement—everything’s there, you assume it all fits, and (ignoring overflows and assuming every record fits in a single page) a row store would potentially be easier.

But with all the other complexities (transactions, recovery, optimization), they’re roughly equally complex to implement correctly.

Question: Is the fact that a database should store as row store versus column store something configurable by table, or is it all-or-nothing?

Answer: Most systems are row-only or column-only. HTAP (hybrid) systems try to do both. Typically, you’d say: “I know I’m using this system, it’s a column store, so I’ll store everything there” because all tables will be columnar. Handling updates in column stores: Even though it’s a column store optimized for read-only queries, people obviously want to update data. The way you typically get around this:

These systems have a row store buffer area, typically log-structured

Any updates are applied to that row portion

Periodically, merge them into the column store

Oracle’s approach: Different—the row store is the primary storage location, but they make a copy of your tables in column store format and are responsible for keeping things updated. Different approaches, different solutions. Typically, if a system supports both, you could define it on a per-table basis, but most systems don’t do that

Question: If I have as many disks as I have columns (splitting a column store table so every attribute goes to a separate disk), would that be as fast as a row store?

Answer: No, because you’d still have to split apart and write it all out. You’d also have more pressure on memory because:

Say you have a thousand attributes

To update, you have a thousand pages in your buffer pool

Have to update each with a single attribute

Then write them all out

Doing updates in a column store without the buffer approach is always going to be slow. It’s not as bad as random I/O might be, but it’s not as good as sequential I/O in log-structured storage, and not as simple as row store updates.

Question: What is the NULL bitmap, and is there an equivalent in row stores?

Answer: Discussed last class. It’s a way to represent which attribute is NULL. In row stores: In the header of every row (tuple), we have a bitmap that says which attribute is NULL or not. That’s one approach—the most common one. Alternatives discussed:

Special values (one specific value in the type’s range means NULL)

A little flag in front of each attribute (wasteful)

In column stores: Instead of having that bitmap per tuple in the header, it’s just for the entire column—here’s the NULL bitmap for all values in this column.

Question: For an index in a column store, is it instead of “here’s where to find this one row,” it’s “here’s where to find these 10 columns”?

Answer: Some OLAP systems that are column stores don’t give you any indexes at all. Examples:

Snowflake: No user-created indexes (might have changed)

Vertica: Couldn’t have indexes

Why? They’re not trying to do point queries or single-thing lookups—they’re designed for complete scans. You could have indexes that are range indexes: “If your ID is between 0 and 100, go to this page.” Or inverted indexes: “Find me all records where the keyword ‘Andy’ exists.” Different types of indexes exist, but you wouldn’t have the index to do point query lookups like in OLTP systems.

DSM: Tuple Identification

I said “fetch the page with hostname, run WHERE clause, get matches, then fetch lastLogin page and magically find the matches there.” How do we do that?

Approach 1: Fixed-Length Offsets (Most Common)

Identify tuples not by slot number but by their offset within the table. The unique identifier for a tuple (why we say “tuple” instead of “row”—what does a row look like in a column store?) is its offset within the table.

If I’m at offset 3 in one column, I know how to jump to offset 3 in another column, then stitch that tuple back together. This only works if values are fixed-length.

What breaks this assumption? Variable-length data:

VARCHAR strings
BLOBs
TEXT

Approach 2: Embedded IDs (Legacy)

Less common, legacy approach: With every single value, have some unique tuple identifier (like log-structured storage—some counter incremented by one).

Then there’s some index structure (not shown) where, for a given Record ID for a given column, it tells you where to jump to.

This is rare—you probably shouldn’t worry about it. There was some old system that did this because they were contorting a row store to make it a column store. Everyone uses fixed-length offsets.

Here’s a visual comparison between Approach 1 and Approach 2:

DSM: Variable-Lenght Data

Since Fixed-Length Offsets approach is more common, the problem we have to deal is how to convert variable-length values into fixed-length values. Solution: use dictionary compression to replace repetitive variable-length values with fixed-length codes (usually 32 bits). You can:

Do predicates on the dictionary code
If necessary, if it matches something, do a lookup to find the actual value

DSM: System History

The idea is not new. According to literature:

the very first version goes back to the 1970s. Cantor (Swedish Defense Ministry project): More of a file system than a database system, but considered the first documented proposal for a column store system. Unknown if it exists today.
1980s: A paper mapped out the theoretical properties of what the Decomposition Storage Model looked like, but it was mostly only in academia.
SybaseIQ (1990s): Roughly considered the first commercial implementation, but not really a full-fledged database system—more like a query accelerator.
- Similar to Oracle’s approach: SybaseIQ would make a copy of your row store into an in-memory column store. Queries would show up, and Sybase would figure out: “Should I go to the row store or run the query on the in-memory column store?”
The 2000s is when column store stuff really took off. Three key systems in the space:
1. Vertica - Founded by Mike Stonebraker (the instructor’s PhD advisor) and Stan Zdonik
2. VectorWise - A fork out of MonetDB at CWI
3. MonetDB - A major academic project at CWI
DuckDB is from CWI, so there’s a lineage. The first version of DuckDB was actually called MonetDB Lite. They threw all the code away and started DuckDB from scratch after learning from MonetDB Lite.
VectorWise was started by someone who worked on MonetDB.
The two main people at VectorWise: One left and was a co-founder of Snowflake. A lot of the early ideas VectorWise developed are in Snowflake. The other guy, Peter Boncz, went back to CWI and helped advise the DuckDB project.

There were other systems at the time, but these are considered the three major pioneers in the column store space.

The Walmart Story

How this all came about (according to Mike Stonebraker). Mike was consulting for Walmart Labs in the early 2000s. They were struggling to scale their Teradata database (which at the time was a row store). Walmart’s database was multiple petabytes—every single transaction, every time somebody bought something at a store, every scan from the cash register went into that database.

They were struggling to get Teradata to run fast. Mike said, “We should just make this a column store.” He founded the C-Store project, which became Vertica.

Modern Column Stores (2010s and Beyond)

Now pretty much everybody does this. Here’s a sample of different database systems considered column stores:

Snowflake
Vertica
MonetDB
DuckDB
Many others

Two key developments in the 2010s: Open-source file formats:

Parquet - Came out of Dremio and others
ORC - Came out of Facebook

These are open-source columnar file formats. Now you can build database systems that read and write Parquet or ORC files.

Advantages of DSM (Column Store)

Greatly reduced wasted I/O for analytical queries because you’re only reading the exact data you need.
Better cache reuse and locality for access patterns:
- Literally ripping through columns one after another
- Not jumping around within memory
- Better for CPUs

Better compression (discussed next):
- Values from the same column tend to be similar
- Better compression ratios

Disadvantages of DSM (Column Store)

Slow for point queries: Looking up a single tuple requires accessing multiple column files.
Slow for inserts, updates, deletes: Have to split things up and write data to multiple locations. To reconstruct, have to bring it back in from multiple places.

Multi-Column Access Problem

Important point: In the earlier example, the query seemed to process one column (hostName scan), then move to another (lastLogin). In reality, queries often want to look at multiple columns simultaneously.

Your WHERE clause only referenced one attribute, but as you’ve seen in Homework 1, oftentimes you have multiple columns or attributes referenced in your WHERE clause. It would be expensive or cumbersome to maintain state as you’re scanning one column while fetching another column at the same time, trying to patch things together. This, we want a way to have attributes that are accessed together somewhat close to each other on disk or in files, but still get all the benefits of a column store layout. This is what the PAX model provides.

Hybrid Storage Model (PAX)

As mentioned, in most systems when they say they’re a column store, they’re using PAX. Even Parquet and ORC are really doing this.

The idea is: instead of having a separate file for every single column by itself, break them up into chunks (row groups) and have data for tuples in the same row close to each other, just spaced out in separate pages.

PAX: Physical Organization

Going back to the example, we horizontally partition the table into row groups, then within that row group, partition based on columns.

Look at the first three rows in the picture above:

In the giant file, define a row group
Header for that row group
All values for Column A together
All values for Column B together
All values for Column C together

If you have a WHERE clause needing both Column A and Column C, when you fetch pages for this row group, you’ll have all the data you need close to each other. You’ll also get benefit of sequential I/O because this row group is tens of megabytes instead of 4KB or 8KB pages

This is roughly how Parquet works. There are many diagrams and presentations of Parquet.

Default page size: 1 megabyte (they want to group things together and have as much sequential I/O as possible)
Row group size: 128 megabytes

Q&A

Question: Couldn’t this still have the problem of useless I/O if you’re doing a full table scan and bring in the entire row group?

Answer: The header tells you where things are. In real systems (Parquet and ORC), it’s actually in the footer (not header) because, assuming the file is immutable, you don’t know where everything’s going to be until you finish writing it. You don’t bring the whole row group in—you bring the header (or footer) in first. It says “here are the offsets of where my attributes are,” then you can fetch just those specific column sections.

Database Compression

Disk I/O is (almost) always the main bottleneck I/O. Whatever the exact size of a tuple in a table, every page brought in contains exactly that data. To speed up queries, you can:

Skip data: Column store approach helps—avoid reading attributes you don’t need
Make data you fetch bring more tuples into memory: Compression*

There’s a trade-off between speed and compression ratio.

Disk is potentially slower than CPU, especially in cloud settings. You’re willing to pay the extra cost of decompressing/compressing data because it reduces:

Number of IOPS
Time wasted on I/O to fetch things

The gap between disk speed and CPU speed is narrowing: In some cases, disk is getting so fast that maybe you don’t want things compressed.

Other benefits of keeping things compressed: The database system can actually run faster when processing things in memory (covered in a few weeks when discussing query execution).

In general, for most systems: Compressing things on disk is always going to be a win.

Goals of compression:

Any compression scheme must produce fixed-length values (as discussed before) because if you want to store in a column store, you want fixed-length offsets.
Delay decompression as long as possible during query execution (discussed more later). Idea: If you have a bunch of 1-megabyte strings in your table but can convert them to 32-bit integers:
- Want to process 32-bit integers for as long as possible
- As you copy data from one operator to the next during query execution (or over the network in distributed systems)
- Want to keep things compressed as long as possible
- Only decompress when you actually have to show something back (someone needs it decompressed or user needs output)
The most important thing: Ensure we’re using a lossless compression scheme.
- Lossless: No information is lost when you compress/decompress things.
- Lossy schemes: MP3, MP4, JPEG—doing tricks about how humans perceive audio/visual data to compress to much smaller sizes. If you have the raw image or sound file, when you compress it, you’re not going to get back the same values when you decompress.
  - We don’t want lossy in database systems because, as said before, people don’t like losing data. If you have $100 in yo u r banka cco u n t, co m p ress t h e d a t a, d eco m p ress i t, an d n o w ha v e$ 90—you’re going to notice and complain.
- You can do lossy compression yourself at the application level.
  - Example: Keeping track of room temperature every second for 10 years. Do you really need exact temperature at one-second intervals a year from now? No—you can compress it to average temperature per minute. You can’t get back original data because it’s been aggregated. That might be okay, but you as the user in the application have to know whether that’s acceptable. The database system doesn’t do it—the database system always uses lossless schemes.

What to Compress: Granularity Levels

Several choices for what to compress:

Block-level: Compress a block of tuples for the same table
Tuple-level: Compress the contents of the entire tuple (NSM-only)
Attribute-level: Compress a single attribute within one tuple (overflow); can target multiple attributes for the same tuple.
- Example: Overflow tables (discussed before)—if storing huge text attributes (like Wikipedia revisions, could be kilobytes of text), compress just that one entry. PostgreSQL does this, many systems do this.
Column-level: compress multiple values for one or more attributes stored for multiple tuples (DSM-only).

We’ll discuss block-level compression first, then spend most time on column-level compression because that matters most for columnar systems.

Block-Level Compression

To do block-level compression, you essentially need to use a naive compression scheme. By “naive”, we mean that the database system makes a call to a third-party library (like gzip—though you wouldn’t want to use that because it’s slow, there are faster alternatives:) that takes the page and compresses it to some binary form where the database system has no way to interpret or do any introspection into the compressed version of the block.

If you call gzip on a file, the database system doesn’t know how to read inside that compressed file—it has to decompress it to get back the original version. You wouldn’t want to use gzip. There are faster alternatives:

LZO (1990s): A big breakthrough
Snappy (Google): Fast compression
LZ4: Very fast
Zstandard (Facebook): Considered state-of-the-art compression scheme now. They’re working on a new version (not public yet) that’s even faster and better. Zstandard is what you should be using.

MySQL’s Table Compression

MySQL supports table compression (declare it on a per-table basis—not on by default).

How It works: Pages written to disk are compressed into some multiple of page size (multiples of 2, up to 8 kilobytes). Each page has a header portion called the “modification log” (mod log). You can do writes and make changes to the page without decompressing it first. It’s a little extra space at the beginning.

If your compressed page is 6 kilobytes, they’ll pad it up to the next highest value (8KB) within that range (1, 2, 4, 8). This ensures:

No fragmentation in layout on disk
When bringing things into memory, alignment is maintained

Say a query wants to read something on page zero: for a blind write (insert, delete, or even update assuming you have the values):

Don’t need to decompress the page
Just write that change to the mod log
It’s just log-structured storage (think of the mod log as log-structured storage.)

In some cases, can do reads on the mod log: If the data you need was just inserted and is in the mod log, don’t have to decompress the rest of the page.

If you need to read the page:

Decompress it
Store it as a regular 16KB page in memory in the buffer pool (default size for MySQL)
Can do whatever reads you want

Still keep the compressed version around. I think also when it gets decompressed, they apply the changes from the mod log to the page.

This is a decent idea. PostgreSQL doesn’t do it—not because PostgreSQL doesn’t do something means you shouldn’t do it. PostgreSQL has an amazing frontend, but the backend is actually pretty terrible because a lot of the design is remnants from the 1980s. It’s not how you’d build a modern system today. PostgreSQL doesn’t support compression for regular data pages (only for TOAST tables—overflow pages).

Why This Approach Has Limitations

Because MySQL is a row store, you have to use a naive compression scheme. You can’t do anything fancy because:

Values you’re storing in tuples/pages are from all different attributes
Not going to be able to do all the native compression schemes (discussed next)

Because we’re using a general-purpose compression algorithm (Snappy or Zstandard):

The database system doesn’t know how to interpret what the compressed bytes mean
Spoiler: All those compression algorithms are doing some variant of dictionary compression—building their own dictionary of repeated byte sequences
But MySQL doesn’t know how to read that dictionary, so it has to decompress the whole thing

For some workloads, this is actually a good idea. Kind of wish PostgreSQL did do some compression.

Operating Directly on Compressed Data

For OLAP, ideally we want to run queries directly on compressed data without decompressing first.

Assuming you have some compression algorithm (not saying what it is) and a compressed form of the database: if the query shows up wanting to get salary for ‘Andy’, do some magic that converts the query constant string ‘Andy’ into the compressed form. Then do a direct lookup on the compressed table using the compressed constant—don’t have to decompress every single page as you’re going along.

Benefits:

Reduce amount of I/O (fetching compressed pages)
Don’t have to decompress them to do lookups

This is ideally what we want, and the easiest way to do this is in a columnar system.

Compression Algorithms Overview

This is a quick overview of different compression algorithms you could possibly have:

Run-length encoding (RLE)
Bit-packing
Mostly encoding
Bitmap encoding
Delta encoding
Dictionary compression

Dictionary compression is the default choice for most systems. But you may want to compress a single column using these other schemes first and you can have a multiplicative effect: do compression one way, then run another compression algorithm on the compressed data, getting even better compression. Still done in a way where the database system can natively interpret what those bytes mean in compressed form without decompressing first. This is why you want the database system to do everything—don’t want the OS or anybody else to do anything. Because we can do this native compression.

Run-Length Encoding (RLE)

Basic idea: If you have contiguous runs of values that are the same thing (literally the same value), instead of storing that value over and over, store a compressed summary that says:

For this value at this offset
Here’s how many occurrences it has

This works great if your data is sorted based on whatever column you’re trying to compress. You can’t always do this, but if you sort things, you maximize the amount of repeated runs.

Consider a sample table with id field and isDead field saying whether somebody’s dead or not. Let’s see how to build the compressed form of this table: scan through the column, find contiguous tuples with the same value, convert into triplets:

Value
Offset
Run length

Example query:

SELECT isDead, COUNT(*)
FROM users
GROUP BY isDead

You can rip through the new isDead column and compute aggregation by just summing up the length of the run along with the value.

Can do even better: Say you have somebody not dead, then dead, then not dead—three triplets where run size is one. You’re actually doing worse (storing triplet when you could store single value). If you sort the data based on whether somebody’s dead or not:

Compressed column only has two entries
All dead people
All non-dead people

This method greatly reduces amount of data to store. Thinking in extremes: With a billion people, compress down to tracking who’s dead/not dead into a small number of bytes fitting on one page.

Why do you need the length in the triplet? Assuming fixed offsets, this allows you to figure out—if you need to find for a single tuple/entry if they’re dead or not—it allows you to do the math to reverse it back and say “I would be at this offset if uncompressed.” Simple arithmetic.

Bit-Packing

Idea: People oftentimes declare attributes/columns in a type larger than they actually need. For example, consider a column that keeps track of some number, declared as INTEGER type (in SQL it’s a 32-bit integer). Even if it’s a small value, we’re still allocating 32 bits to store it. For these numbers, none of them are very big, but, since each is 32-bit integer, we’re allocating so much space:

It’s clear that the only thing that matters here is the lower portion of bits (the actual data) represented by the first 8 bits. The other 24 bits are just wasted space:

Bit-Packing proposes to store as 8-bit integer, even though you declared it as 32-bit integer. Greatly reduces size by a factor of four—from 256 bits to 64 bits:

You can do tricks with bit-shifting operators (in C) to, as you’re scanning along trying to find matches on a certain number:

These are now 8-bit integers
Put them into a single 32-bit integer
Keep track internally: it’s really at this offset with these different values
With a single CPU instruction, operate on four values at once

What’s the problem with this solution? What happens if you add a 32-bit integer to the database that can’t fit in 8 bits? The solution for this is from Amazon Redshift and it’s called Mostly Encoding. The idea is as follows: most data in the column is going to be small enough, but in cases where it’s not, keep track of that and store separately in a dictionary.

Example: 32-bit numbers, but one value (99999999) is really big.

Still store as 8 bits
Have a special marker value (think of all bits set to one)
Separate table: for a given offset, here’s what the original value should be

As you’re scanning the column, if you see the special marker value, know you should look in this offset table to find what the real value should have been.

As example, consider a table with eight 32-bit values. The size of this table is: 256 bits. Now, let’s applying compression with mostly encoding:

8-bit mostly column: 64 bits
Lookup table (assuming 16 bits for offset, 32 bits for value): 48 bits
Total: 112 bits (not counting additional metadata overhead)

Bitmap Encoding

If you have an attribute with low cardinality (small number of unique values), instead of storing the actual value for every tuple in a column you can apply Maintain bitmaps where:

One bitmap for every possible value in the column
Bit is set to one if the tuple at that offset has that particular value

Some database systems provide bitmap indexes that do the same technique—you still have the original column, but they maintain bitmap indexes. On the other hand, there’s a company coming to talk about their database (either FeatureBase or FeatureForm—two different databases with “feature” in the name). One of them only stores bitmap indexes—you can’t actually store the base data.

Going back to the isDead column (only two possible values: yes or no), instead of storing actual values themselves:

Two bitmaps: one for “yes”, one for “no”
Bit set in these bitmaps corresponds to whether the original value has that value or not

So we need 16 bits to store yes/no, plus 18 bits for the bitmaps (nine values, two bits each). This compression approach reduces the size down from 72 bits to 34 bits:

An obvious problem with this approach is that, if your data is high cardinality, this is a terrible idea. For example, assume you have 10 million tuples and suppose you have zip_code column in a table (there are 43 thousand zip codes in US). Assuming we store the zip code with 32 bits:

the raw data is 10.000.000 x 32 bits = 40 MB
with bitmap encoding: 10 million size bitmap for every single ZIP code: 10.000.000 x 43.000 = 53 gigabytes.

Clearly a bad idea. Furthermore, every time somebody adds a new tuple, have to extend that bitmap because offsets have to match. Have to do that for every possible bitmap.

Bitmap indexes can make a huge difference, but really for when you have a small number of unique values (less than maybe 10). Most systems don’t do this by default.

Delta Encoding

Idea: If values from one tuple to the next are close enough to each other, maybe don’t need to store the entire value—just store the difference (delta) between the previous value.

For example, if you want to record the temperature of a room every minute, it is likely to vary slightly:

From one tuple to the next, what was the difference from the previous one?

Timestamp: Just +1 (adding a minute)
Temperature: Some fractional decimal difference

I can compress even further with RLE, getting finally a combination of Delta Encoding and RLE:

This method reduces the size of time64 column down from 320 bits to 96 bits (if you think of billions of records, this would be a massive savings)

Dictionary Compression

This is the most common approach and, as mentioned earlier, this is how most systems compress data, even for things that aren’t strings. In some cases, columnar systems will compress integer data and float data by putting them into dictionary codes.

Idea: If we have values we see over and over again, instead of storing that value repeatedly within a column, convert it to a 32-bit integer (dictionary code) and then we maintain a mapping data structure (the dictionary) that knows how to:

Take that dictionary code (32-bit integer)
Convert it back into the original value

Typically: One-on-one correspondence—for one value, have one dictionary code. There are advanced techniques (academic literature only—no commercial systems) where if you see multiple attributes in patterns together, convert the combination of two or three into a single dictionary code for even further compression, but this is rare.

We need a way to do fast encoding/decoding on the fly that allows both:

Range queries
Point queries

Point queries are obvious: String “Andy” maps to code 101, can do exact lookups. But, ideally, we want to do range queries on compressed data: want dictionary codes to have the same ordering that original values had.

As example, consider a bunch of names as original data. Then i want a compressed version:

Original column converted to 32-bit integers
Mapping table (dictionary) converts: given code → original value, or given value → code

Then:

SELECT * FROM users WHERE name = 'Andy'

Convert string Andy into dictionary code by doing lookup first in dictionary. Then scan through column and do comparisons based on integers.

How do we actually do encoding and decoding?

Encode/Locate: for a given uncompressed value, convert it into is compressed form.
Decode/Extract: for a given compressed value, convert it back into its original form.

There’s not a magic hash function that can do this. Any reversible hash function is going to generate something much larger than the original value (or not get it down to 32-bit integer). So, we’re going to build a data structure we maintain that allows us to do this. As mentioned want something that preserves the ordering of original values such that compressed data (dictionary codes) have the same ordering lexicographically as original data.

We want the dictionary generating codes such that:

If one original value comes before another in ordering
Its dictionary code should come before it as well

Dictionary is basically sorted. This allows queries like:

SELECT * FROM users WHERE name LIKE 'And%'

if we operate directly on compressed data, we can convert LIKE clause into BETWEEN clause:

That’s because we can:

Look up in dictionary, run LIKE portion just on dictionary values
Find ones that match
Find min and max values for matching values
Rewrite LIKE into BETWEEN clause
Rip through column while still compressed

We can do this because it’s SQL—we know exactly what the WHERE clause wants to do. Can be smart, intelligent, rewrite it, get better performance. As the application programmer (not you guys, but some JavaScript programmer), they don’t have to know what’s happening underneath. Just write the LIKE clause—database system is smart, rewrites it for you, gets better performance.

If you need output of name attribute, still have to rip through column and look at them. For example:

SELECT DISTINCT name FROM users WHERE name LIKE 'And%'

So, in this case, you still must perform scan on column.

But the database system can be even smarter and answer queries without looking at compressed data, and it can just operate directly on the dictionary. For example:

SELECT DISTINCT name FROM users WHERE name LIKE 'And%'

Here, instead of SELECT name, it’s SELECT DISTINCT name. So, we don’t need to get actual tuples: we just need unique values. After conversion to BETWEEN (converting wildcard to min/max values), we only need to know what values exist in the dictionary and we don’t need to look at actual column.

For this query with DISTINCT: Assuming only four names in table but a billion rows, only need to look at four rows in dictionary to answer it.

Again, we can do this because database is responsible for compressing data.

Parquet and ORC: One of the big limitations is they don’t actually expose the dictionary when using their libraries and utilities. You can’t do this trick if using Parquet or ORC—one of the biggest limitations of those two formats.

What is the data structure we’re going to use for our dictionary?

Array Approach (Most Common): Really simple array. Works great if files are immutable—build the array once, never have to resize, insert things in place, or move things around. Build it once, done.

Example: consider original data in your column:

The first thing to do is to build the dictionary, that will be a sorted list of values with length of string.

The dictionary code is going to be just an offset into this array.

Here’s the compressed data and these are just offsets (byte offset into the array):

When doing scan: Want second entry (offset 17)?

Jump to byte offset 17
Look in header to tell how big the string is afterwards

The dictionary itself is literally just an array of packed bytes.

If you need something dynamic that can support updates:

Hash table
B+ tree

These are less common. Most people do the array and assume compressed blocks will be immutable. Only if need to rebuild, then rebuild the array.

Summary: Storage Models and Compression

Row stores are important and will show up when discussing:

Query execution
Concurrency control
Recovery
Transactions
Query optimization

The distinction between row store and column store has ramifications throughout all parts of the database system.

It’s really important to understand this now—we’ll see trade-offs between the two approaches again and again throughout the entire semester.

To get the best compression ratio: Want to do it natively yourself. Dictionary coding is the most common approach.

Looking Ahead

We’ve covered (over last three lectures) one of the two problems in database storage:

✓ How we represent data on disk
When we bring things into memory, what do we do with it? How do we store it? How do we write things back out safely when we make changes?

Starting next week: We’ll tackle the second problem—buffer pool management and safely persisting changes.

Quartz 4

Explorer

05. Storage Models & Database Compression