0. Resources


Last class discussed alternative approaches to the tuple-oriented (slotted page) storage scheme. The primary focus was log-structured storage, where instead of storing actual tuples, you store log entries of changes made to tuples.

The three storage approaches covered:

  1. Tuple-oriented storage (slotted pages)
  2. Log-structured storage
  3. Index-organized storage

These approaches are ideal for write-heavy workloads—applications doing many inserts, updates, or deletes. Log-structured storage is obviously better for this because you’re just appending to the log.

For most applications: When you start off, you’re going to have a potentially write-heavy workload. But there are some applications, environments, or workloads where you don’t care about getting the best performance for writes—you want the best performance for reads. For those scenarios, these approaches may not be the best way to approach storage.

This lecture will discuss broad categories of database applications and provide motivation for why we might want an alternative storage scheme where we don’t store everything as rows (tuples with all attributes together).

Database Workloads

Let’s see some broad categories or database applications and some motivation for why we want to look at an alternative storage scheme:

  • On-Line Transaction Processing (OLTP)
  • On-Line Analytical Processing (OLAP)
  • Hybrid Transaction + Analytical Processing

On-Line Transaction Processing (OLTP)

OLTP applications are systems where you’re ingesting new data from the outside world and serving many users simultaneously. An OLTP workload is characterized by fast, short running operations, repetitive operations and simple queries that operate on single entity at a time. OLTP workloads typically handle more writes than reads, and only read/update a small amount of data each time.

The canonical example: Amazon. When you go to the Amazon website:

  • You look at products
  • Click things and add them to your cart
  • Purchase items
  • Go to your account information and update your mailing address or payment information

These are all OLTP-style workloads because you’re making changes to a small subset of the database—updating your cart, updating your payment information.

Other examples:

  • Posting things on Reddit or Hacker News
  • Making small changes to potentially a large database
  • Each query or operation makes a small change
  • The amount of data read is small
  • Operations target a single entity (one user, one order, one product)

OLAP: Online Analytical Processing

OLAP workloads are where you run queries that extract or extrapolate new information across the entire dataset. An OLAP workload is characterized by long-running, complex queries and reads on large portions of the database. In OLAP workloads, the database system is often analyzing and deriving new data from existing data collected on the OLTP side. It’s about running queries that extract new information across the entire dataset. OLAP operations are primarily read-heavy or read-only, so I’m not doing single updates.

Example OLAP query: Amazon running a query that says “Find me the number one sold product in the state of Pennsylvania on a Saturday when the temperature is above 80 degrees.”

Characteristics:

  • Not looking at a single person or entity
  • Looking across entire tables
  • Potentially doing many joins
  • Combining multiple sources of information
  • Similar to queries in Homework 1

These OLAP workloads are primarily read-heavy or read-only—you’re not doing single updates, you’re doing large scans and joins over big tables.

HTAP: Hybrid Transaction + Analytical Processing

HTAP is a buzzword from industry analysts (Gartner) describing applications where you want to do both OLTP-style workloads and OLAP workloads, potentially in the same system.

Instead of taking all transactional data and putting it into a separate data warehouse to do analytics, maybe you can do some analytics directly as data comes in.

This will be discussed throughout the semester, but the main two categories to focus on are OLTP and OLAP.

Visualizing Workload Types

Use a simple grid where along the x-axis we’re saying wether the workload is made read-heavy vs write-heavy and on the y-axis we’re saying how complex the queries are:

OLTP (bottom-left corner):

  • Potentially many updates
  • Queries are really simple
  • Example: SELECT * FROM account WHERE id = 'Andy'
  • Getting single things, single entities

OLAP (top-right corner):

  • Mostly reads
  • SELECT queries are much more complex than OLTP
  • Think about Q9 and Q10 from Homework 1
  • Complex aggregations, joins, and analytics

HTAP would be somewhere in the middle of this spectrum.

We’ll talk about why the things we talked about so far they’re gonna be good for OLTP, but not for OLAP, and then we’ll design a storage scheme that is better for OLAP. To understand why approaches discussed so far are good for OLTP but not OLAP, let’s use a real example: the Wikipedia database.

Example: Wikipedia Database

Wikipedia is open-source—it runs software called MediaWiki, runs on MySQL and PHP. You can go look at it. The schema roughly looks like this:

  1. useracct - People making changes
  2. pages - The articles in Wikipedia
  3. revisions - Edits/changes to those articles

Relationships:

  • Foreign key from revision → user (who created the change)
  • Foreign key from revision → page (which page was edited)
  • Foreign key from page → revision (to find the latest revision)
  • All the text itself goes in the revision table

Important reminder: The relational model does not define or specify anything about how we should store data in a table. In all examples shown so far, we’ve been showing tuples with every table having all attributes one after another. Yes, there were overflow pages for large attributes, but in general, all the smaller attributes are stored together. But there’s nothing about the relational model that says you have to do that. It’s just what we as humans came up with first because it’s easy to conceptually think about. For OLAP workloads, this may not be the best approach.

OLTP queries

For OLTP, you have many simple queries that read or write a small amount of data relative to all data in the database.

Example 1: Get latest revision for a page (this retrieves one page and one revision):

SELECT P.*, R.*
FROM pages AS p 
INNER JOIN revision as R
ON P.latest = R.revID
WHERE P.pageID = ?

Example 2: Update user login timestamp (updating a single user account with the timestamp of when they logged in and the hostname):

UPDATE useracct
SET lastLogin = NOW(),
hostname = ?
WHERE user_id = ?

Example 3: Insert new revision (inserting a single row):

INSERT INTO revision VALUES (...)

This is what you usually end up with when building a brand new application. If you’re creating a startup and building an online service, you start with something looking like this because you don’t have any data initially—you need to get it. You make a website, and your website runs these kinds of queries.

OLAP queries

For OLAP, you do more complicated things requiring you to look at larger portions of the table.

Example: Congressional Wikipedia Editing Scandal. This is a rough approximation of a real query people were running:

- [ ] SELECT COUNT(U.lastLogin), EXTRACT(month FROM U.lastLogin) as month
FROM useracct AS U 
WHERE U.hostname LIKE '%.gov' 
GROUP BY EXTRACT(month FROM U.lastLogin)

This finds all login attempts from users with IP addresses or hostnames ending in .gov.

The scandal (late 2000s/early 2010s): People in Congress were having their staff update Wikipedia to say more flattering things about the congressman or congresswoman. Mike Pence did this, Joe Biden did this.

This query could find all people doing that—essentially paying government employees to update Wikipedia, which they shouldn’t be doing.

This is the type of query you’d execute on data after you’ve already collected it from the OLTP portion of the application—analytics on historical data.

SELECT COUNT(U.lastLogin),
		EXTRACT(month FROM U.lastLogin) AS month
FROM useracct AS U
WHERE U.hostname LIKE '%.gov'
GROUP BY
EXTRACT(month FROM. U.lastLogin)

You execute these workloads on the data you have collected from your OLTP application(s).

Storage Models

A DBMS’s storage model is how the database system physically organizes tuples on disk and in memory relative to other tuples and their own attributes.

Until now, we’ve been assuming all attributes are contiguous for a tuple—this is roughly called a row store. For OLAP, that may not be the best approach, and we’ll see why.

The reason why we have to discuss this part of as system is because there is a clear distinction in the database marketplace between:

  • Row store system - Use for OLTP
  • Column store system - Use for OLAP

If anybody tries to say “I have a fast row store you can use for analytics,” you should be very skeptical.

There are different ways to store tuples in pages:

  • N-ary Storage Model (NSM) - The row store
  • Decomposition Storage Model (DSM) - The column store
  • Hybrid approach: PAX (Partition Attributes Across) - The most common one for column stores

When people say they have a column store, they really have the PAX one, but it’s not a major difference from pure column stores.

N-ary Storage Model (NSM): Row Store

This is what we’ve assumed so far this semester: All attributes for a given tuple are stored contiguously in a single page, one after another.

The idea: You’re going across the page, laying out all data for a given tuple, and you don’t start laying down any bits for the next tuple until you finish the current tuple.

This approach is ideal for OLTP workloads because most OLTP queries access a single entity or single tuple. You can go to a single page and get all the data you need for that single entity—that’s really all you need to satisfy the query.

Page sizes: Always some multiple of hundreds of bytes (we’ve discussed this: 4KB, 8KB, 16KB, etc.).

NSM: Physical Organization

The layout is basically the same as before:

  • Database page with header
  • Slot array
  • As you scan through the table and insert data, append entries to the end
  • Keep adding more and more until the page fills up

When a query comes along (SELECT * FROM table WHERE id = ?):

  • Need to get one page
  • Jump to the offset defined in the slot array
  • Get all the data needed

Let’s see how this works in the Wikipedia example where we have someone wants to login with username and password:

SELECT * FROM useracct
WHERE userName = ?
AND userPass = ?

Checking whether credentials match—roughly how authentication works in database-backed applications. Ignore for now how we find the data (assume there’s some index, hash table, or B+ tree that gives us the Record ID and offset).

Process:

  1. Go to page directory and find the page with the data
  2. Look in slot array
  3. Jump to offset
  4. Have all the data needed for the query

This is ideal for OLTP because all data is contiguous.

Same for INSERT:

  1. Look in page directory to find a page with a free slot
  2. Bring it into memory (assume it’s a particular page)
  3. Append to the end

Consider the earlier OLAP query: Finding number of logins per month for users with .gov hostnames:

SELECT COUNT(U.lastLogin),
		EXTRACT(month FROM U.lastLogin) AS month
FROM useracct AS U
WHERE U.hostname LIKE '%.gov'
GROUP BY
EXTRACT(month FROM. U.lastLogin)

What happens:

  • Must scan all pages in the table (need to look at all user accounts)
  • When bringing a page in, need to execute the WHERE clause on hostname
  • Need to look at tuples in the page where the hostname predicate is satisfied
  • The only data really needed is just the hostname attribute
  • Then need to do aggregation on last_login for the GROUP BY

The only attributes actually needed:

  • hostname (for the WHERE clause)
  • last_login (for the GROUP BY aggregation)

The obvious problem: You have to go through all the rows and brought in a bunch of data you don’t actually need. To get just the attributes needed, you had to bring in the entire page. The entire page contains attributes like:

  • user_id
  • username
  • user_password

You don’t need these for this query. You’re doing useless I/O—fetching data from disk that you don’t even need at all.

Not only is this slow, but in some systems (like Amazon Aurora), you pay per disk IOPS. You pay for the number of I/O operations for a query. In this case, you’d be paying for data you don’t actually need.

Summary: Why NSM is Bad for OLAP

Problem 1: Bringing in Unnecessary Data

NSM is great for:

  • Insert, update, delete operations
  • Queries that need all data for a single entity

NSM is bad for queries that:

  • Scan large portions of a table
  • Only need a subset of attributes (which most OLAP queries do)

It’s very rare to do SELECT * on a really wide, huge table because you’re basically dumping the whole thing out. There are utilities to make that faster, but typically OLAP queries are selective about which columns they need.

Problem 2: Non-Sequential Memory Access

This is a lower-level detail, but think about how you’d actually execute the query to run the predicate: you’re jumping around to different locations in memory to do the scan:

  1. Read the header for the first tuple
  2. Figure out how far to jump to get to the hostname
  3. Maybe look at lastLogin if computing aggregates
  4. Jump down to the next tuple
  5. Jump over to its hostname attribute
  6. Repeat…

In a modern superscalar CPU, this is terrible because there are:

  • Many non-sequential operations
  • Could be non-deterministic
  • Memory locations accessed aren’t random (always increasing order), but not reading contiguous strides of memory
  • Can’t crunch through data very quickly

This isn’t covered deeply this semester, but it’s worth understanding. We’ll see this when covering compression and query execution.

Problem 3: Poor Compression Opportunities

When trying to compress data to reduce amount or pack more data within a single page, all attributes for a given table are thrown together in that page. There’s less chance for:

  • Repeatability
  • Identifying that values are the same and can be compressed well

Think about the data domains:

  • userID: Integer (could be sequential or random)
  • userName: Random strings
  • userPassword: Random strings (hashed)

They’re different value domains mixed together and this is not ideal for compression because there’s less uniformity and predictability within a page.

Decomposition Storage Model (DSM)

In the Decomposition Storage Model, the DBMS, instead of storing all attributes for a single tuple together in a page, we store a single attribute (column) for all tuples in a single page. Thus, it is also known as a “column store.” This model is ideal for OLAP workloads with many read-only queries that perform large scans over a subset of the table’s attributes that we really need.

Example: That lastLogin field—instead of having it intermixed with other attributes within a single page, only have lastLogin in dedicated pages.

DSM is ideal for OLAP queries that scan the entire table and only need a subset of attributes:

  • When you fetch a page from disk, you’re only getting data for attributes you actually need
  • Not fetching other things that just get carried along for the ride

The benefit of declarative language (SQL): You don’t have to know or care whether you’re running on a row storage system versus a column storage system. Your same SQL query works just fine. The database system’s responsibility (us, the people building it): Take data, split it up into separate columns/attributes, and stitch it back together when needed to produce results.

DSM: Physical Organization

Think of it this way: For the first column (first attribute, Col A):

  • Separate file with a bunch of pages
  • Header telling us what’s inside the page
  • NULL bitmap for all values within that column
  • Contiguous values for all tuples in the table

Do the same for the next column, and the next one, and so on.

These files are still broken up as database pages (4KB, 8KB, whatever the system supports), but each file contains data for only a single attribute and metadata overhead for column files is much less than a row store because you don’t have to keep track of:

  • Every single column’s NULL status individually
  • Different information about offsets for finding different attributes
  • Where to find things within a heterogeneous tuple

All stored because:

  • It’s all the same value domain
  • All the same attribute type
  • Much less metadata than a row store

DSM Example: Wikipedia OLAP Query

Going back to the Wikipedia example, we take every column for the table and store it as a separate page:

For the hostname example: Within a single page, we’re only storing values for the hostname column. We have separate pages for all other attributes.

Executing the OLAP Query

Let’s take again the query we examined before that counts the logins per month with .gov hostnames:

SELECT COUNT(U.lastLogin),
		EXTRACT(month FROM U.lastLogin) AS month
FROM useracct AS U
WHERE U.hostname LIKE '%.gov'
GROUP BY
EXTRACT(month FROM. U.lastLogin)

First part of execution: Get the hostname data

  • Assuming one page per attribute (simplified), fetch that one page
  • Do the scan, ripping through the column
  • Identify all matches for the hostname predicate

There is 100% utilization of all data brought in—only bringing in data needed for this query, not attributes we don’t care about.

Second part: Now need lastLogin data

  • We’ll talk later (in query execution) about how to match things up
  • Assuming we keep track of a list of offsets of tuples within the hostname column that match the predicate
  • Go to the lastLogin page, fetch it
  • Again, only has data we need
  • Know how to jump to different offsets of matching hostnames to find the right lastLogin timestamp
  • Compute whatever needed for the query

DSM: Tuple Identification

I said “fetch the page with hostname, run WHERE clause, get matches, then fetch lastLogin page and magically find the matches there.” How do we do that?

Approach 1: Fixed-Length Offsets (Most Common)

Identify tuples not by slot number but by their offset within the table. The unique identifier for a tuple (why we say “tuple” instead of “row”—what does a row look like in a column store?) is its offset within the table.

If I’m at offset 3 in one column, I know how to jump to offset 3 in another column, then stitch that tuple back together. This only works if values are fixed-length.

What breaks this assumption? Variable-length data:

  • VARCHAR strings
  • BLOBs
  • TEXT

Approach 2: Embedded IDs (Legacy)

Less common, legacy approach: With every single value, have some unique tuple identifier (like log-structured storage—some counter incremented by one).

Then there’s some index structure (not shown) where, for a given Record ID for a given column, it tells you where to jump to.

This is rare—you probably shouldn’t worry about it. There was some old system that did this because they were contorting a row store to make it a column store. Everyone uses fixed-length offsets.

Here’s a visual comparison between Approach 1 and Approach 2:

DSM: Variable-Lenght Data

Since Fixed-Length Offsets approach is more common, the problem we have to deal is how to convert variable-length values into fixed-length values. Solution: use dictionary compression to replace repetitive variable-length values with fixed-length codes (usually 32 bits). You can:

  • Do predicates on the dictionary code
  • If necessary, if it matches something, do a lookup to find the actual value

DSM: System History

The idea is not new. According to literature:

  • the very first version goes back to the 1970s. Cantor (Swedish Defense Ministry project): More of a file system than a database system, but considered the first documented proposal for a column store system. Unknown if it exists today.
  • 1980s: A paper mapped out the theoretical properties of what the Decomposition Storage Model looked like, but it was mostly only in academia.
  • SybaseIQ (1990s): Roughly considered the first commercial implementation, but not really a full-fledged database system—more like a query accelerator.
    • Similar to Oracle’s approach: SybaseIQ would make a copy of your row store into an in-memory column store. Queries would show up, and Sybase would figure out: “Should I go to the row store or run the query on the in-memory column store?”
  • The 2000s is when column store stuff really took off. Three key systems in the space:
    1. Vertica - Founded by Mike Stonebraker (the instructor’s PhD advisor) and Stan Zdonik
    2. VectorWise - A fork out of MonetDB at CWI
    3. MonetDB - A major academic project at CWI
  • DuckDB is from CWI, so there’s a lineage. The first version of DuckDB was actually called MonetDB Lite. They threw all the code away and started DuckDB from scratch after learning from MonetDB Lite.
  • VectorWise was started by someone who worked on MonetDB.
  • The two main people at VectorWise: One left and was a co-founder of Snowflake. A lot of the early ideas VectorWise developed are in Snowflake. The other guy, Peter Boncz, went back to CWI and helped advise the DuckDB project.

There were other systems at the time, but these are considered the three major pioneers in the column store space.

The Walmart Story

How this all came about (according to Mike Stonebraker). Mike was consulting for Walmart Labs in the early 2000s. They were struggling to scale their Teradata database (which at the time was a row store). Walmart’s database was multiple petabytes—every single transaction, every time somebody bought something at a store, every scan from the cash register went into that database.

They were struggling to get Teradata to run fast. Mike said, “We should just make this a column store.” He founded the C-Store project, which became Vertica.

Modern Column Stores (2010s and Beyond)

Now pretty much everybody does this. Here’s a sample of different database systems considered column stores:

  • Snowflake
  • Vertica
  • MonetDB
  • DuckDB
  • Many others

Two key developments in the 2010s: Open-source file formats:

  1. Parquet - Came out of Dremio and others
  2. ORC - Came out of Facebook

These are open-source columnar file formats. Now you can build database systems that read and write Parquet or ORC files.

Advantages of DSM (Column Store)

  • Greatly reduced wasted I/O for analytical queries because you’re only reading the exact data you need.
  • Better cache reuse and locality for access patterns:
    • Literally ripping through columns one after another
    • Not jumping around within memory
    • Better for CPUs
  • Better compression (discussed next):
    • Values from the same column tend to be similar
    • Better compression ratios

Disadvantages of DSM (Column Store)

  • Slow for point queries: Looking up a single tuple requires accessing multiple column files.
  • Slow for inserts, updates, deletes: Have to split things up and write data to multiple locations. To reconstruct, have to bring it back in from multiple places.

Multi-Column Access Problem

Important point: In the earlier example, the query seemed to process one column (hostName scan), then move to another (lastLogin). In reality, queries often want to look at multiple columns simultaneously.

Your WHERE clause only referenced one attribute, but as you’ve seen in Homework 1, oftentimes you have multiple columns or attributes referenced in your WHERE clause. It would be expensive or cumbersome to maintain state as you’re scanning one column while fetching another column at the same time, trying to patch things together. This, we want a way to have attributes that are accessed together somewhat close to each other on disk or in files, but still get all the benefits of a column store layout. This is what the PAX model provides.

Hybrid Storage Model (PAX)

As mentioned, in most systems when they say they’re a column store, they’re using PAX. Even Parquet and ORC are really doing this.

The idea is: instead of having a separate file for every single column by itself, break them up into chunks (row groups) and have data for tuples in the same row close to each other, just spaced out in separate pages.

PAX: Physical Organization

Going back to the example, we horizontally partition the table into row groups, then within that row group, partition based on columns.

Look at the first three rows in the picture above:

  • In the giant file, define a row group
  • Header for that row group
  • All values for Column A together
  • All values for Column B together
  • All values for Column C together

If you have a WHERE clause needing both Column A and Column C, when you fetch pages for this row group, you’ll have all the data you need close to each other. You’ll also get benefit of sequential I/O because this row group is tens of megabytes instead of 4KB or 8KB pages

This is roughly how Parquet works. There are many diagrams and presentations of Parquet.

  • Default page size: 1 megabyte (they want to group things together and have as much sequential I/O as possible)
  • Row group size: 128 megabytes

Database Compression

Disk I/O is (almost) always the main bottleneck I/O. Whatever the exact size of a tuple in a table, every page brought in contains exactly that data. To speed up queries, you can:

  • Skip data: Column store approach helps—avoid reading attributes you don’t need
  • Make data you fetch bring more tuples into memory: Compression*

There’s a trade-off between speed and compression ratio.

Disk is potentially slower than CPU, especially in cloud settings. You’re willing to pay the extra cost of decompressing/compressing data because it reduces:

  • Number of IOPS
  • Time wasted on I/O to fetch things

The gap between disk speed and CPU speed is narrowing: In some cases, disk is getting so fast that maybe you don’t want things compressed.

Other benefits of keeping things compressed: The database system can actually run faster when processing things in memory (covered in a few weeks when discussing query execution).

In general, for most systems: Compressing things on disk is always going to be a win.

Goals of compression:

  • Any compression scheme must produce fixed-length values (as discussed before) because if you want to store in a column store, you want fixed-length offsets.
  • Delay decompression as long as possible during query execution (discussed more later). Idea: If you have a bunch of 1-megabyte strings in your table but can convert them to 32-bit integers:
    • Want to process 32-bit integers for as long as possible
    • As you copy data from one operator to the next during query execution (or over the network in distributed systems)
    • Want to keep things compressed as long as possible
    • Only decompress when you actually have to show something back (someone needs it decompressed or user needs output)
  • The most important thing: Ensure we’re using a lossless compression scheme.
    • Lossless: No information is lost when you compress/decompress things.
    • Lossy schemes: MP3, MP4, JPEG—doing tricks about how humans perceive audio/visual data to compress to much smaller sizes. If you have the raw image or sound file, when you compress it, you’re not going to get back the same values when you decompress.
      • We don’t want lossy in database systems because, as said before, people don’t like losing data. If you have 90—you’re going to notice and complain.
    • You can do lossy compression yourself at the application level.
      • Example: Keeping track of room temperature every second for 10 years. Do you really need exact temperature at one-second intervals a year from now? No—you can compress it to average temperature per minute. You can’t get back original data because it’s been aggregated. That might be okay, but you as the user in the application have to know whether that’s acceptable. The database system doesn’t do it—the database system always uses lossless schemes.

What to Compress: Granularity Levels

Several choices for what to compress:

  1. Block-level: Compress a block of tuples for the same table
  2. Tuple-level: Compress the contents of the entire tuple (NSM-only)
  3. Attribute-level: Compress a single attribute within one tuple (overflow); can target multiple attributes for the same tuple.
    • Example: Overflow tables (discussed before)—if storing huge text attributes (like Wikipedia revisions, could be kilobytes of text), compress just that one entry. PostgreSQL does this, many systems do this.
  4. Column-level: compress multiple values for one or more attributes stored for multiple tuples (DSM-only).

We’ll discuss block-level compression first, then spend most time on column-level compression because that matters most for columnar systems.

Block-Level Compression

To do block-level compression, you essentially need to use a naive compression scheme. By “naive”, we mean that the database system makes a call to a third-party library (like gzip—though you wouldn’t want to use that because it’s slow, there are faster alternatives:) that takes the page and compresses it to some binary form where the database system has no way to interpret or do any introspection into the compressed version of the block.

If you call gzip on a file, the database system doesn’t know how to read inside that compressed file—it has to decompress it to get back the original version. You wouldn’t want to use gzip. There are faster alternatives:

  • LZO (1990s): A big breakthrough
  • Snappy (Google): Fast compression
  • LZ4: Very fast
  • Zstandard (Facebook): Considered state-of-the-art compression scheme now. They’re working on a new version (not public yet) that’s even faster and better. Zstandard is what you should be using.

MySQL’s Table Compression

MySQL supports table compression (declare it on a per-table basis—not on by default).

How It works: Pages written to disk are compressed into some multiple of page size (multiples of 2, up to 8 kilobytes). Each page has a header portion called the “modification log” (mod log). You can do writes and make changes to the page without decompressing it first. It’s a little extra space at the beginning.

If your compressed page is 6 kilobytes, they’ll pad it up to the next highest value (8KB) within that range (1, 2, 4, 8). This ensures:

  • No fragmentation in layout on disk
  • When bringing things into memory, alignment is maintained

Say a query wants to read something on page zero: for a blind write (insert, delete, or even update assuming you have the values):

  • Don’t need to decompress the page
  • Just write that change to the mod log
  • It’s just log-structured storage (think of the mod log as log-structured storage.)

In some cases, can do reads on the mod log: If the data you need was just inserted and is in the mod log, don’t have to decompress the rest of the page.

If you need to read the page:

  • Decompress it
  • Store it as a regular 16KB page in memory in the buffer pool (default size for MySQL)
  • Can do whatever reads you want

Still keep the compressed version around. I think also when it gets decompressed, they apply the changes from the mod log to the page.

This is a decent idea. PostgreSQL doesn’t do it—not because PostgreSQL doesn’t do something means you shouldn’t do it. PostgreSQL has an amazing frontend, but the backend is actually pretty terrible because a lot of the design is remnants from the 1980s. It’s not how you’d build a modern system today. PostgreSQL doesn’t support compression for regular data pages (only for TOAST tables—overflow pages).

Why This Approach Has Limitations

Because MySQL is a row store, you have to use a naive compression scheme. You can’t do anything fancy because:

  • Values you’re storing in tuples/pages are from all different attributes
  • Not going to be able to do all the native compression schemes (discussed next)

Because we’re using a general-purpose compression algorithm (Snappy or Zstandard):

  • The database system doesn’t know how to interpret what the compressed bytes mean
  • Spoiler: All those compression algorithms are doing some variant of dictionary compression—building their own dictionary of repeated byte sequences
  • But MySQL doesn’t know how to read that dictionary, so it has to decompress the whole thing

For some workloads, this is actually a good idea. Kind of wish PostgreSQL did do some compression.

Operating Directly on Compressed Data

For OLAP, ideally we want to run queries directly on compressed data without decompressing first.

Assuming you have some compression algorithm (not saying what it is) and a compressed form of the database: if the query shows up wanting to get salary for ‘Andy’, do some magic that converts the query constant string ‘Andy’ into the compressed form. Then do a direct lookup on the compressed table using the compressed constant—don’t have to decompress every single page as you’re going along.

Benefits:

  • Reduce amount of I/O (fetching compressed pages)
  • Don’t have to decompress them to do lookups

This is ideally what we want, and the easiest way to do this is in a columnar system.

Compression Algorithms Overview

This is a quick overview of different compression algorithms you could possibly have:

  1. Run-length encoding (RLE)
  2. Bit-packing
  3. Mostly encoding
  4. Bitmap encoding
  5. Delta encoding
  6. Dictionary compression

Dictionary compression is the default choice for most systems. But you may want to compress a single column using these other schemes first and you can have a multiplicative effect: do compression one way, then run another compression algorithm on the compressed data, getting even better compression. Still done in a way where the database system can natively interpret what those bytes mean in compressed form without decompressing first. This is why you want the database system to do everything—don’t want the OS or anybody else to do anything. Because we can do this native compression.

Run-Length Encoding (RLE)

Basic idea: If you have contiguous runs of values that are the same thing (literally the same value), instead of storing that value over and over, store a compressed summary that says:

  • For this value at this offset
  • Here’s how many occurrences it has

This works great if your data is sorted based on whatever column you’re trying to compress. You can’t always do this, but if you sort things, you maximize the amount of repeated runs.

Consider a sample table with id field and isDead field saying whether somebody’s dead or not. Let’s see how to build the compressed form of this table: scan through the column, find contiguous tuples with the same value, convert into triplets:

  • Value
  • Offset
  • Run length

Example query:

SELECT isDead, COUNT(*)
FROM users
GROUP BY isDead

You can rip through the new isDead column and compute aggregation by just summing up the length of the run along with the value.

Can do even better: Say you have somebody not dead, then dead, then not dead—three triplets where run size is one. You’re actually doing worse (storing triplet when you could store single value). If you sort the data based on whether somebody’s dead or not:

  • Compressed column only has two entries
  • All dead people
  • All non-dead people

This method greatly reduces amount of data to store. Thinking in extremes: With a billion people, compress down to tracking who’s dead/not dead into a small number of bytes fitting on one page.

Why do you need the length in the triplet? Assuming fixed offsets, this allows you to figure out—if you need to find for a single tuple/entry if they’re dead or not—it allows you to do the math to reverse it back and say “I would be at this offset if uncompressed.” Simple arithmetic.

Bit-Packing

Idea: People oftentimes declare attributes/columns in a type larger than they actually need. For example, consider a column that keeps track of some number, declared as INTEGER type (in SQL it’s a 32-bit integer). Even if it’s a small value, we’re still allocating 32 bits to store it. For these numbers, none of them are very big, but, since each is 32-bit integer, we’re allocating so much space:

It’s clear that the only thing that matters here is the lower portion of bits (the actual data) represented by the first 8 bits. The other 24 bits are just wasted space:

Bit-Packing proposes to store as 8-bit integer, even though you declared it as 32-bit integer. Greatly reduces size by a factor of four—from 256 bits to 64 bits:

You can do tricks with bit-shifting operators (in C) to, as you’re scanning along trying to find matches on a certain number:

  • These are now 8-bit integers
  • Put them into a single 32-bit integer
  • Keep track internally: it’s really at this offset with these different values
  • With a single CPU instruction, operate on four values at once

What’s the problem with this solution? What happens if you add a 32-bit integer to the database that can’t fit in 8 bits? The solution for this is from Amazon Redshift and it’s called Mostly Encoding. The idea is as follows: most data in the column is going to be small enough, but in cases where it’s not, keep track of that and store separately in a dictionary.

Example: 32-bit numbers, but one value (99999999) is really big.

  • Still store as 8 bits
  • Have a special marker value (think of all bits set to one)
  • Separate table: for a given offset, here’s what the original value should be

As you’re scanning the column, if you see the special marker value, know you should look in this offset table to find what the real value should have been.

As example, consider a table with eight 32-bit values. The size of this table is: 256 bits. Now, let’s applying compression with mostly encoding:

  • 8-bit mostly column: 64 bits
  • Lookup table (assuming 16 bits for offset, 32 bits for value): 48 bits
  • Total: 112 bits (not counting additional metadata overhead)

Bitmap Encoding

If you have an attribute with low cardinality (small number of unique values), instead of storing the actual value for every tuple in a column you can apply Maintain bitmaps where:

  • One bitmap for every possible value in the column
  • Bit is set to one if the tuple at that offset has that particular value

Some database systems provide bitmap indexes that do the same technique—you still have the original column, but they maintain bitmap indexes. On the other hand, there’s a company coming to talk about their database (either FeatureBase or FeatureForm—two different databases with “feature” in the name). One of them only stores bitmap indexes—you can’t actually store the base data.

Going back to the isDead column (only two possible values: yes or no), instead of storing actual values themselves:

  • Two bitmaps: one for “yes”, one for “no”
  • Bit set in these bitmaps corresponds to whether the original value has that value or not

So we need 16 bits to store yes/no, plus 18 bits for the bitmaps (nine values, two bits each). This compression approach reduces the size down from 72 bits to 34 bits:

An obvious problem with this approach is that, if your data is high cardinality, this is a terrible idea. For example, assume you have 10 million tuples and suppose you have zip_code column in a table (there are 43 thousand zip codes in US). Assuming we store the zip code with 32 bits:

  • the raw data is 10.000.000 x 32 bits = 40 MB
  • with bitmap encoding: 10 million size bitmap for every single ZIP code: 10.000.000 x 43.000 = 53 gigabytes.

Clearly a bad idea. Furthermore, every time somebody adds a new tuple, have to extend that bitmap because offsets have to match. Have to do that for every possible bitmap.

Bitmap indexes can make a huge difference, but really for when you have a small number of unique values (less than maybe 10). Most systems don’t do this by default.

Delta Encoding

Idea: If values from one tuple to the next are close enough to each other, maybe don’t need to store the entire value—just store the difference (delta) between the previous value.

For example, if you want to record the temperature of a room every minute, it is likely to vary slightly:

From one tuple to the next, what was the difference from the previous one?

  • Timestamp: Just +1 (adding a minute)
  • Temperature: Some fractional decimal difference

I can compress even further with RLE, getting finally a combination of Delta Encoding and RLE:

This method reduces the size of time64 column down from 320 bits to 96 bits (if you think of billions of records, this would be a massive savings)

Dictionary Compression

This is the most common approach and, as mentioned earlier, this is how most systems compress data, even for things that aren’t strings. In some cases, columnar systems will compress integer data and float data by putting them into dictionary codes.

Idea: If we have values we see over and over again, instead of storing that value repeatedly within a column, convert it to a 32-bit integer (dictionary code) and then we maintain a mapping data structure (the dictionary) that knows how to:

  • Take that dictionary code (32-bit integer)
  • Convert it back into the original value

Typically: One-on-one correspondence—for one value, have one dictionary code. There are advanced techniques (academic literature only—no commercial systems) where if you see multiple attributes in patterns together, convert the combination of two or three into a single dictionary code for even further compression, but this is rare.

We need a way to do fast encoding/decoding on the fly that allows both:

  1. Range queries
  2. Point queries

Point queries are obvious: String “Andy” maps to code 101, can do exact lookups. But, ideally, we want to do range queries on compressed data: want dictionary codes to have the same ordering that original values had.

As example, consider a bunch of names as original data. Then i want a compressed version:

  • Original column converted to 32-bit integers
  • Mapping table (dictionary) converts: given code → original value, or given value → code

Then:

SELECT * FROM users WHERE name = 'Andy'

Convert string Andy into dictionary code by doing lookup first in dictionary. Then scan through column and do comparisons based on integers.

How do we actually do encoding and decoding?

  • Encode/Locate: for a given uncompressed value, convert it into is compressed form.
  • Decode/Extract: for a given compressed value, convert it back into its original form.

There’s not a magic hash function that can do this. Any reversible hash function is going to generate something much larger than the original value (or not get it down to 32-bit integer). So, we’re going to build a data structure we maintain that allows us to do this. As mentioned want something that preserves the ordering of original values such that compressed data (dictionary codes) have the same ordering lexicographically as original data.

We want the dictionary generating codes such that:

  • If one original value comes before another in ordering
  • Its dictionary code should come before it as well

Dictionary is basically sorted. This allows queries like:

SELECT * FROM users WHERE name LIKE 'And%'

if we operate directly on compressed data, we can convert LIKE clause into BETWEEN clause:

That’s because we can:

  • Look up in dictionary, run LIKE portion just on dictionary values
  • Find ones that match
  • Find min and max values for matching values
  • Rewrite LIKE into BETWEEN clause
  • Rip through column while still compressed

We can do this because it’s SQL—we know exactly what the WHERE clause wants to do. Can be smart, intelligent, rewrite it, get better performance. As the application programmer (not you guys, but some JavaScript programmer), they don’t have to know what’s happening underneath. Just write the LIKE clause—database system is smart, rewrites it for you, gets better performance.

If you need output of name attribute, still have to rip through column and look at them. For example:

SELECT DISTINCT name FROM users WHERE name LIKE 'And%'

So, in this case, you still must perform scan on column.

But the database system can be even smarter and answer queries without looking at compressed data, and it can just operate directly on the dictionary. For example:

SELECT DISTINCT name FROM users WHERE name LIKE 'And%'

Here, instead of SELECT name, it’s SELECT DISTINCT name. So, we don’t need to get actual tuples: we just need unique values. After conversion to BETWEEN (converting wildcard to min/max values), we only need to know what values exist in the dictionary and we don’t need to look at actual column.

For this query with DISTINCT: Assuming only four names in table but a billion rows, only need to look at four rows in dictionary to answer it.

Again, we can do this because database is responsible for compressing data.

Parquet and ORC: One of the big limitations is they don’t actually expose the dictionary when using their libraries and utilities. You can’t do this trick if using Parquet or ORC—one of the biggest limitations of those two formats.

What is the data structure we’re going to use for our dictionary?

  • Array Approach (Most Common): Really simple array. Works great if files are immutable—build the array once, never have to resize, insert things in place, or move things around. Build it once, done.

Example: consider original data in your column:

The first thing to do is to build the dictionary, that will be a sorted list of values with length of string.

The dictionary code is going to be just an offset into this array.

Here’s the compressed data and these are just offsets (byte offset into the array):

When doing scan: Want second entry (offset 17)?

  • Jump to byte offset 17
  • Look in header to tell how big the string is afterwards

The dictionary itself is literally just an array of packed bytes.

If you need something dynamic that can support updates:

  • Hash table
  • B+ tree

These are less common. Most people do the array and assume compressed blocks will be immutable. Only if need to rebuild, then rebuild the array.

Summary: Storage Models and Compression

Row stores are important and will show up when discussing:

  • Query execution
  • Concurrency control
  • Recovery
  • Transactions
  • Query optimization

The distinction between row store and column store has ramifications throughout all parts of the database system.

It’s really important to understand this now—we’ll see trade-offs between the two approaches again and again throughout the entire semester.

To get the best compression ratio: Want to do it natively yourself. Dictionary coding is the most common approach.

Looking Ahead

We’ve covered (over last three lectures) one of the two problems in database storage:

  1. ✓ How we represent data on disk
  2. When we bring things into memory, what do we do with it? How do we store it? How do we write things back out safely when we make changes?

Starting next week: We’ll tackle the second problem—buffer pool management and safely persisting changes.