Different Worlds of Data Capture and Data Analysis

Organizations use data for two main purposes: operational record keeping (where you put data in) and analytical decision-making (where you get data out):

Operational Systems:
- These systems are for the daily running of a business, like taking orders or signing up customers.
- They are built to handle one transaction at a time, very quickly.
- They usually focus on the current state of data and often don’t keep a full history.
Data Warehouse / Business Intelligence (DW/BI) Systems:
- These systems are for analyzing business performance and answering questions like “How many new orders did we get compared to last week?”
- They need to search and combine large amounts of data (hundreds or thousands of transactions) quickly, so they’re optimized for high-performance queries.
- They are designed to keep historical data, which is crucial for understanding performance over time.

It’s important to understand that DW/BI systems have very different needs and structures than operational systems. Just copying operational data to a separate system isn’t enough for a good DW/BI solution because it doesn’t meet the unique needs of analytical users.

Goals of Data Warehousing and Business Intelligence

The main goals of a DW/BI system come from common business needs. These goals have been consistent for a long time:

Easy Access: The DW/BI system must make information easy to find and understand for business users, not merely for developers. Data structures and names should match how business users think and speak. It also needs to be fast when users ask questions.
Consistency: Data in the DW/BI system must be reliable and trustworthy. This means data is carefully collected, cleaned, checked for quality, and has consistent names and definitions across all sources.
Adaptability to Change: The system must be able to handle changes in user needs, business conditions, and technology without breaking existing data or applications. When new data is added or new questions are asked, it shouldn’t disrupt what’s already there.
Timeliness: Information needs to be available quickly, sometimes within hours, minutes, or even seconds, especially for important operational decisions.
Security: The DW/BI system must protect sensitive business information because it often contains valuable and confidential data.
Authoritative and Trustworthy Foundation: The data warehouse should be the reliable source for decision-making. The most important result of a DW/BI system is the better decisions made based on the data.
Business Acceptance: Even if the system is technically excellent, it’s only successful if business users actually use it. They will use it if it’s “simple and fast” and provides useful information for their decisions.

To succeed with DW/BI, you need skills that combine both IT knowledge (like a Database Administrator or DBA) and business understanding (like a Master of Business Administration or MBA). You have to bridge the gap between technical details and business needs.

Publishing Metaphor for DW/BI Managers

The book compares the job of a DW/BI manager to that of a magazine editor-in-chief. Just like an editor, a DW/BI manager’s main goal is to serve their “readers” – who are the business users. This means focusing on the users’ needs rather than just the technology. Here’s how the responsibilities line up:

Understanding the Readers (Business Users):
- A magazine editor needs to know their readers: who they are, what they want, and how to attract new ones.
- Similarly, a DW/BI manager must understand business users’ jobs, goals, and the decisions they need to make. They also need to identify the most important users and show potential new users what the DW/BI system can do.
Ensuring the Magazine Appeals (Delivering High-Quality Information):
- An editor makes sure the magazine has interesting content, a good layout, high writing standards, and accurate information. They also adapt to changes in reader interests.
- A DW/BI manager needs to deliver useful, actionable, and accessible data and analysis. This means choosing the right data from many sources, making user interfaces simple and easy to understand, ensuring data is accurate and trustworthy, and adapting to changing user needs and new data.
Sustaining the Publication (Sustaining the DW/BI Environment):
- An editor must keep the magazine running profitably, publish regularly, maintain reader trust, and keep owners happy.
- A DW/BI manager’s role is to regularly update the system, maintain user trust, and keep all key stakeholders (users, executives, IT management) satisfied. They should also take credit for the positive business decisions made using the system to justify ongoing investment.

The key takeaway is that a successful DW/BI manager focuses outward on the business users and their needs, rather than just inward on the technical aspects of the system. While technology is used to build the system, the main job is to provide valuable information that helps the business make better decisions.

Dimensional Modeling Introduction

Dimensional modeling is a widely used and effective method for organizing analytical data. It focuses on two important things:

Making data easy for business users to understand.
Making sure data queries run very fast.

This approach has been popular for over 50 years because it makes databases simple, which is something people naturally prefer. Simplicity helps users easily grasp the data and allows software to quickly get and deliver results.

Imagine a business executive describing their work: “We sell products in different markets and track our performance over time.” A dimensional designer would pick up on these key ideas: product, market, and time. Most people can easily picture this as a “data cube” with these three edges. Inside the cube are the measurements, like sales volume or profit, for each combination of product, market, and time. This ability to visualize complex data simply is key to making it understandable. If it seems too simple, that’s a good sign! A design that starts simple is more likely to stay simple, leading to better performance and user acceptance. As Albert Einstein said, “Make everything as simple as possible, but not simpler.”

It’s important to know that dimensional models are different from Third Normal Form (3NF) models, even though both often use relational databases.

3NF Models (Normalized Models):
- These models aim to remove duplicated data by breaking it into many small, separate tables.
- They are great for operational systems because updating or adding data only affects one place in the database.
- However, they are too complex for business intelligence (BI) queries. Users find them hard to understand and navigate
- Databases struggle to run queries efficiently on such complex structures, leading to slow performance.
Dimensional Models:
- These models are designed for analytical data presentation.
- They contain the same information as 3NF models, but they organize it in a way that is easy for users to understand, provides fast query performance, and can handle changes well.

The industry sometimes refers to 3NF models as entity-relationship (ER) models. Entity-relationship diagrams (ER diagrams or ERDs) are drawings that communicate the relationships between tables. Both 3NF and dimensional models can be represented in ERDs because both consist of joined relational tables; the key difference between 3NF and dimensional models is the degree of normalization. Because both model types can be presented as ERDs, we refrain from referring to 3NF models as ER models; instead, we call them normalized models to minimize confusion.

In short, dimensional modeling solves the problem of overly complicated data structures for analytical purposes, making data accessible and performant for business users.

Star Schemas Versus OLAP Cubes

Dimensional models can be implemented in two main ways, which are often confused:

Star Schemas: These are dimensional models built using relational databases. They are called “star schemas” because their structure (a central fact table surrounded by dimension tables) looks like a star.
OLAP Cubes (Online Analytical Processing Cubes): These are dimensional models built using multidimensional database environments. They represent data as a cube, allowing for quick “slicing and dicing” (analyzing data from different angles).

Both star schemas and OLAP cubes use the same core idea of dimensions and facts. The main difference is how they are physically built and stored.

OLAP Cubes generally offer:

Better Query Performance: They often have pre-calculated summaries (aggregations) and special indexing, making queries very fast, especially for drilling up or down.
Richer Analytical Functions: They can do more advanced analysis than standard SQL queries in relational databases.

However, OLAP Cubes have a downside:

They can be slower to load, especially with large amounts of data.

Kimball’s Recommendation: The book generally recommends loading detailed, raw data into a star schema first. Then, if needed, OLAP cubes can be built from that star schema. This book will mostly focus on techniques for relational star schemas.

OLAP Deployment Considerations (Things to remember for OLAP Cubes):

A star schema in a relational database is a good stable base for an OLAP cube and for backup/recovery.
While OLAP cubes traditionally had a huge performance edge, advances in computer hardware (like in-memory databases) and relational databases (like columnar databases) have reduced this gap.
OLAP cube structures can vary more between different vendors than relational databases, making it harder to move your BI applications from one OLAP tool to another.
OLAP cubes often provide more advanced security (e.g., limiting access to detailed data but allowing more access to summaries).
They excel at complex analyses that are harder to do with standard SQL. This is often the main reason to use an OLAP product.
OLAP cubes handle Slowly Changing Dimension Type 2 (SCD2) changes well (where new rows are added for changes, keeping history). However, other types of SCDs that overwrite data might require reprocessing the cube.
They work well with transaction and periodic snapshot fact tables, but not typically with “accumulating snapshot” fact tables due to their data overwriting limitations.
They are good at handling complex, flexible hierarchies (like organizational charts) with their own query language, which is often better than relational database approaches.
They might have strict rules for how dimension keys are structured for drill-down hierarchies.
Some OLAP products may not support “dimensional roles” or “aliases,” meaning you might need to create separate physical dimensions for different uses of the same concept.

Fact Tables for Measurements

The fact table is the core of a dimensional model where measurements from business events are stored.

Purpose: It holds the numbers that tell you “how much” or “how many.”
Centralized Data: It’s best to store low-level measurement data in a single, centralized location to ensure everyone in the organization uses consistent numbers.
What are Facts? Facts are business measures, like “unit quantity” or “dollar sales amount” from a sales transaction.
Grain: Each row in a fact table represents a single measurement event at a specific level of detail, called the grain. For example, one row might be “one product sold in one transaction.” All rows in a fact table must be at the same grain to prevent mistakes like double-counting.
Additivity:
- The most useful facts are numeric and additive, meaning you can sum them up across any dimension (like summing sales dollars over different products, markets, or time periods).
- Some facts are semi-additive (can be summed across some dimensions but not others, like account balances across time).
- Non-additive facts (like unit prices) can never be summed.
Continuously Valued: Facts are usually numbers that can take on many different values within a range (e.g., a sales amount).
Textual Facts: Textual data generally belongs in dimension tables, not fact tables. Only truly unique text for every fact row (which is rare and hard to analyze) should be considered a text fact. Storing repetitive text in fact tables wastes space and makes analysis difficult.
Sparsity and Size: Fact tables only contain rows for actual activity (no zeros for “no sales”). They are typically very large (90% or more of the dimensional model’s space) but have few columns (deep but narrow).
Types of Fact Tables:
- Transaction: Most common, records individual events (e.g., each sale).
- Periodic Snapshot: Summarizes activity over a fixed period (e.g., monthly sales totals).
- Accumulating Snapshot: Tracks the progress of a process with a clear start and end (e.g., order fulfillment steps).
Keys and Relationships:
- Fact tables have foreign keys (FKs) that link them to the primary keys in dimension tables.
- When all FKs correctly match, it ensures referential integrity.
- The fact table usually has its own composite primary key, made up of some or all of its foreign keys.
- Fact tables represent many-to-many relationships between dimensions.

Here’s an example of how business process measurement events is translated into fact tables:

Furthermore, note that the idea that a measurement event in the physical world has a one-to-one relationship to a single row in the corresponding fact table is a bedrock principle for dimensional modeling. Everything else builds from this foundation.

Dimension Tables for Descriptive Context

Dimension tables are essential partners to fact tables in a dimensional model. They provide the descriptive information (the “who, what, where, when, how, and why”) that gives context to the numerical measurements stored in the fact table.

Structure:

Dimension tables often have many columns (attributes), sometimes 50 to 100, but can have fewer.
They typically have fewer rows than fact tables, but they can be “wide” because of many text columns.
Each dimension table has a single primary key (PK) that links it to the fact table.

Here’s an example of dimension table:

Purpose of Attributes:

Dimension attributes are the main way users can filter, group, and label reports. When a user wants to see “sales by brand,” “brand” must be a dimension attribute.
They make the DW/BI system usable and understandable.

Clarity and Usability:

Attributes should use real words, not confusing abbreviations or codes.
If operational systems use codes, provide their full, user-friendly descriptions as separate dimension attributes. This ensures consistent labeling across all reports and applications. Don’t hide these translations in reporting tools.
If operational codes have embedded meanings (e.g., first two digits mean “line of business”), break out these meanings into separate, easily filterable dimension attributes.
The quality and detail of dimension attributes are crucial for the DW/BI system’s analytical power. The more effort spent on detailed, business-friendly descriptions and high-quality data in these attributes, the better the system will be for analysis.

Fact vs. Dimension Attribute (Numeric Data):

Sometimes it’s hard to decide if a number is a “fact” or a “dimension attribute.”
Facts are typically measurements that change often, take on many different values, and are used in calculations (e.g., dollar sales).
Dimension attributes are usually relatively constant descriptions, have a limited set of values, and are used for filtering or labeling (e.g., product color code).
Rule of Thumb: Continuously changing numeric observations are almost always facts; discrete (limited list) numeric observations are almost always dimension attributes.

Hierarchies and Denormalization:

Dimension tables often include hierarchical relationships (e.g., products roll up into brands, which roll up into categories). For example:
This hierarchical information (like brand and category descriptions) should be stored directly within the dimension table itself, even if it means some repetition. This is called denormalization.
You should avoid “snowflaking”, which is normalizing dimension tables into separate lookup tables (e.g., a separate table just for brands). This would make queries more complex and slower without significant storage savings, as dimension tables are much smaller than fact tables.
The goal is to trade off a little storage space for simplicity and easy access for users and faster query performance.

Finally, the concepts of “fact” and “dimension” have been around for a long time, dating back to the 1960s, and are not the invention of Ralph Kimball. They emerged naturally from the need to make analytical data easy to understand and performant.

Facts and Dimensions Joined in a Star Schema

When fact tables and dimension tables are put together, they form a dimensional model, often called a star schema because of its star-like shape. In this structure:

Each business process (like sales or inventory) has its own dimensional model.
The fact table, at the center, holds the numerical measurements from that business event.
It is surrounded by a “halo” of dimension tables, which provide the descriptive information (like “who,” “what,” “where,” “when,” “how,” and “why”) that was true when the event happened.

Here’s an example:

Key Benefits of Star Schemas:

Simplicity and Understandability:
- The design is simple, clear, and easy for business users to understand and navigate.
- Users often immediately recognize how the model represents their business.
- The reduced number of tables and use of meaningful names make it less likely for users to make mistakes.
Fast Query Performance:
- Database systems can process these simple structures very efficiently, especially because there are fewer table joins compared to more complex models.
- Database optimizers can quickly filter the smaller, highly indexed dimension tables first, and then efficiently access the large fact table using the filtered keys. This allows for very fast querying, even with many joins.(TODO: try to understand what this part means)
Graceful Extensibility (Handles Change Well):
- Dimensional models are designed to easily adapt to new or unexpected changes in user questions or business needs.
- You can add new dimensions as long as a single value for that new dimension exists for each existing fact row.
- You can add new facts to the fact table, as long as the level of detail matches the existing facts.
- You can add new attributes to existing dimension tables.
- In all these cases, existing data often doesn’t need to be reloaded, and existing reports or applications continue to work without changes.

Let’s understand visually the complementary nature of fact and dimension tables by looking at the following figure, where dimension attributes supply the report filters and labeling, whereas the fact tables supply the report’s numeric values:

Atomic Data is Key: It’s important to build fact tables on the most detailed, unaggregated (atomic) data. This gives you the most flexibility to answer unexpected “ad hoc” (on-the-fly) questions from business users.

How it looks in a Report (and SQL):

SELECT
	store.district_name,
	product.brand,
	sum(sales_facts.sales_dollars) AS "Sales Dollars"
FROM
	store,
	product,
	date,
	sales_facts
WHERE
	date.month_name="January" AND
	date.year=2013 AND
	store.store_key = sales_facts.store_key AND
	product.product_key = sales_facts.product_key AND
	date.date_key = sales_facts.date_key
GROUP BY
	store.district_name,
	product.brand

or in a more “modern” version:

SELECT
    s.district_name,
    p.brand,
    SUM(f.sales_dollars) AS "Sales Dollars"
FROM
    sales_facts f
JOIN store s ON s.store_key = f.store_key
JOIN product p ON p.product_key = f.product_key
JOIN date d ON d.date_key = f.date_key
WHERE
    d.month_name = 'January'
    AND d.year = 2013
GROUP BY
    s.district_name,
    p.brand;

In a report, dimension attributes provide the filters and the labels (e.g., “District Name,” “Brand”).
Fact table data provides the numerical values (e.g., “Sales Dollars”).
The SQL query to create such a report shows how dimension tables are joined to the fact table to filter, group, and sum the data. The SELECT statement pulls dimension attributes and aggregated facts, the FROM clause lists tables, WHERE applies filters and joins, and GROUP BY organizes the results.

Kimball’s DW/BI Architecture

The Kimball DW/BI architecture describes the different parts of a data warehouse and business intelligence system. It has four main components:

Operational Source Systems
ETL System (Extract, Transformation, and Load)
Data Presentation Area
Business Intelligence Applications

Understanding each component’s role is crucial to building an effective DW/BI system.

Operational Source Systems

These are the systems that run the daily business operations and capture all the transactions.

Outside the Data Warehouse: You usually have little control over the data’s content or format in these systems.
Main Focus: Their priority is processing transactions quickly and being available for daily operations.
Query Patterns: Queries against these systems are typically small, focused on one record at a time, and part of the normal transaction flow. They are not designed for broad, complex analytical queries like a DW/BI system.
Historical Data: They generally keep little historical data, as they focus on the most current state. A good data warehouse helps by taking over the responsibility of storing historical information.
Data Sharing: Source systems often don’t share common data (like product or customer details) with other systems, unless there’s a big system like an Enterprise Resource Planning (ERP) system or a master data management (MDM) system in place.

Extract, Transformation, and Load System (ETL)

The ETL system is the bridge between the operational source systems and the data warehouse’s presentation area. It includes the work area, data structures, and the processes to move and prepare data.

Let’s explore each step:

Extraction: This is the first step. It means reading and copying the necessary data from the source systems into the ETL environment. At this point, the data belongs to the data warehouse.
Transformation: This is where the data is improved. It involves:
- Cleansing data: Fixing errors (like misspellings), resolving conflicting values, handling missing information, and making data follow standard formats.
- Combining data: Bringing data together from multiple source systems.
- De-duplicating data: Removing duplicate records.
- These tasks add value to the data and can also help identify issues in the source systems that need fixing over time.
Loading: This is the final step where the prepared data is physically loaded into the dimensional models in the presentation area.
- The ETL system is crucial for creating and updating dimension and fact tables.
- For dimensions, it handles tasks like assigning surrogate keys, looking up codes for descriptions, and combining/splitting columns. It also often “flattens” normalized source data into denormalized dimension tables.
- Fact tables are usually large and take time to load, but their preparation is often simpler.
- Once the dimension and fact tables are updated, indexed, and aggregated (summarized), and quality checked, business users are informed that new data is available.

Creating a separate, normalized database during the ETL process is a good idea?

The author is addressing a common debate in data warehousing: After you’ve extracted data from source systems, should you build a “perfect,” highly structured normalized database (often in Third Normal Form or 3NF) before you then transform that data into the final dimensional model (e.g., a star schema) that business users will query?

It’s Often Pointless. The ETL process is mostly about sorting, filtering, and validating data. You can often do this with simple files without the massive effort of building a full-fledged relational database, only to immediately tear it down and rebuild it in a different (dimensional) format for the presentation area. It’s like building a perfect brick wall just to knock it down to build something else with the same bricks.
It’s Inefficient. Even if your developers are more comfortable working with normalized structures, this two-step process is wasteful:
- More Work: You have to load the data twice—once into the normalized database and then again into the dimensional model.
- More Time: This adds significant time to both the initial development and the ongoing, periodic data updates.
- More Cost: It requires more storage for multiple copies of the data and bigger budgets for hardware, development, and support.
It’s Risky. This is the most critical warning. Many data warehouse projects have failed because the team spent all their time, money, and energy perfecting the “backstage” normalized database. They get so focused on this technical step that they never finish the final dimensional model—the part that actually delivers value to the business.

It is acceptable to create a normalized database to help organize the ETL process, but it should be treated as a temporary, backstage staging area, not the final product. The author’s final, non-negotiable rule is: This intermediate normalized database must be off-limits to business users. Allowing users to query it would be slow and confusing, defeating the primary goals of a data warehouse: understandability and performance. Users should only ever have access to the final dimensional model.

Presentation Area to Support Business Intelligence

The DW/BI presentation area is the part of the data warehouse that business users directly interact with. It’s where data is organized, stored, and made ready for queries, reports, and other analytical tools. For business users, this is the DW/BI environment.

The book has strong beliefs about the presentation area:

Dimensional Schemas are a Must: Data must be presented and stored in dimensional schemas, either relational star schemas or OLAP cubes. This is now widely accepted as the best way to deliver data to DW/BI users.
Detailed, Atomic Data is Essential: The presentation area must contain detailed, granular (atomic) data.
- This detailed data is crucial for answering unexpected, specific questions from users.
- While summary data (aggregates) can also be included for better performance, it’s unacceptable to only store summaries and keep the detailed data hidden in complex, normalized models.
- Users need to be able to “drill down” to the most granular level of detail to ask precise questions. Since user needs are always changing and unpredictable, access to this fine-grained data is vital.
Business Process-Centric Structure: The presentation area should be organized around business process measurement events.
- This means creating a single fact table for core metrics (like sales) that serves all departments (sales, marketing, finance) rather than building separate, slightly different databases for each department.
- This approach ensures consistency across the enterprise.
Conformed Dimensions (Enterprise Data Warehouse Bus Architecture): All dimensional structures must use common, shared (conformed) dimensions.
- This is a fundamental principle of the enterprise data warehouse bus architecture.
- Without shared dimensions, different parts of the data warehouse become isolated (“stovepipe” systems), leading to inconsistent views of the business.
- Conformed dimensions allow different dimensional models to be easily combined and used together, which is key for building a truly integrated and robust DW/BI environment.
- This “bus architecture” also enables building the data warehouse in a flexible, step-by-step (agile and iterative) way.

In summary, data in the presentation area must be:

Dimensional (star schemas or OLAP cubes).
Atomic (most detailed level), with aggregates for performance.
Business process-centric (organized around core business events).
Built using the enterprise data warehouse bus architecture (with conformed dimensions).
Not structured based on individual department interpretations.

Business Intelligence Applications

These are the tools and capabilities that allow business users to use the data in the presentation area for analysis and decision-making.

Querying is the Goal: The main purpose of BI applications is to allow users to query the data in the DW/BI presentation area to make better decisions.
Range of Tools: BI applications can be simple or complex:
- Ad hoc query tools: These allow users to create their own custom queries, but only a small percentage of users might be skilled enough to use them effectively.
- Prebuilt applications and templates: Most business users will likely access the data through simpler, pre-designed reports and applications that don’t require them to write queries directly.
- Sophisticated applications: These include tools for data mining, forecasting, or advanced modeling. Sometimes, the results from these tools might even be uploaded back into operational systems, the ETL system, or the presentation area itself.

Restaurant Metaphor for the Kimball Architecture

This metaphor compares the different parts of a DW/BI environment to a restaurant to highlight why each component needs to be separate and managed effectively.

ETL in the Back Room Kitchen

The ETL (Extract, Transformation, Load) system is like the restaurant’s kitchen.

Behind the Scenes: Just as diners don’t go into the kitchen, the ETL system is off-limits to business users and BI application developers. It’s not safe, and it would distract the professionals working there. Things happen in the kitchen (or ETL) that customers shouldn’t see directly, like raw ingredients being processed.
Planning and Design: A lot of planning goes into designing an efficient kitchen layout, similar to how the ETL system needs careful architecture before data extraction.
Key Goals (Efficiency, Quality, Integrity):
- Efficiency/Throughput: The kitchen must be highly efficient to produce many meals quickly, just as the ETL system must efficiently transform raw data.
- Consistent Quality: Chefs make special sauces consistently in the kitchen, rather than letting individual diners mix their own. Similarly, the ETL system applies business rules and cleans data once to ensure consistent, high-quality information is delivered to users, rather than relying on each user to do it themselves. This puts more work on the ETL team but ensures a better, more consistent product.
- Integrity/Safety: Kitchens are designed to prevent food contamination. Likewise, the ETL system ensures data quality and integrity, checking incoming data and monitoring conditions. You don’t want users “dipping their fingers” into unfinished, potentially unclean data.
Skilled Professionals: The kitchen has skilled chefs with their tools. The ETL system is staffed by professionals who use their tools to “magically transform” source data into usable information.
Done Once in the Back Room: A key idea is that work done in the back room (ETL) by professionals is done once for consistency. This reduces repeated work that would otherwise have to be done over and over by business users in the “front room” BI applications.
Delivery: Once data is ready and quality-checked, it’s brought through a “doorway” into the DW/BI presentation area, just like finished meals are brought to the dining room.

Data Presentation and BI in the Front Dining Room

The Data Presentation Area and Business Intelligence (BI) applications are like the restaurant’s dining room. This is where the “customers” (business users) experience the “meal” (data).

Primary Deliverable - The “Food” (Data):
- Just as a restaurant’s primary deliverable is good food, the DW/BI system’s main deliverable is the data in the presentation area.
- Users expect consistency and high quality. The data must be well-prepared and “safe to consume.”
- The DW/BI system provides “menus” (metadata, published reports) to show what data is available.
“Decor” (Organization and Design):
- The dining room’s decor is designed for customer comfort. Similarly, the presentation area’s organization and design (the dimensional models) must be based on the preferences of the BI users, not just the developers. It needs to be appealing and easy to navigate for the users.
“Service” (Delivery and Support):
- Good restaurant service means prompt and accurate food delivery. In DW/BI, data must be delivered quickly, as requested, and in a format that is appealing and understandable to business users or BI application developers.
“Cost”:
- A restaurant must offer meals at a price customers are willing to pay to survive. The DW/BI system also has a “cost” factor, meaning the investment in building and maintaining it must be justified by the value it provides.
User Satisfaction is Key to Success:
- If diners are happy, the restaurant thrives (busy dining room, good revenue, expansion).
- If DW/BI users are happy, the system is used, decisions are improved, and the project is successful.
- If users are unhappy, they will stop using the system, just as unhappy diners will find another restaurant. This wastes the investment in the DW/BI system.
- DW/BI managers must proactively monitor user satisfaction and address issues immediately, similar to how restaurant managers check on diners. You can’t wait for complaints, as users might just quietly leave.

In essence, the metaphor stresses that the DW/BI system needs a well-managed “back room” ETL to prepare data, and a user-friendly “front room” presentation area and BI applications to ensure business users are satisfied and actually use the data for decision-making.

Alternative DW/BI Architectures

Now that we’ve covered the Kimball architecture, let’s look at some other ways DW/BI systems can be built. We’ll explore two main alternatives and then a hybrid approach that combines them. It’s good to know that over time, the differences between these approaches have become less strict, and dimensional modeling still plays a role, no matter which architecture you choose.

The most important thing, regardless of the architecture, is making sure the DW/BI system helps the business make better, more informed decisions. The architecture itself is just a tool to reach that goal.

Independent Data Mart Architecture

This approach involves creating analytical data solutions on a department-by-department basis, without much thought for sharing or integrating information across the entire organization. Here’s an illustration:

Typically, a single department will identify its data needs from an operational system. Then, working with IT or consultants, they build a database (often called a “data mart”) that specifically meets their departmental requirements, reflecting their own business rules and preferred ways of labeling data. This data mart works in isolation to help that department with its analysis.

The problem arises when another department needs similar data from the same source. Since they can’t access the first department’s data mart, they go through the same process, building their own solution with similar but slightly different data. When these two departments then compare performance numbers using their separate reports, the figures rarely match. This happens because they used different business rules or labels.

This creation of standalone analytical “silos” is essentially an “un-architected” approach, even though it’s quite common, especially in large organizations. It often reflects how many companies fund IT projects, and it avoids the need for cross-organizational data governance and coordination. In the short term, it can seem like the easiest and cheapest way to develop solutions quickly. However, over time, having multiple uncoordinated data extractions from the same sources and redundant storage of analytical data becomes very inefficient and wasteful. Without an overall enterprise view, this approach leads to many isolated solutions that provide conflicting views of the organization’s performance, causing unnecessary debates and efforts to reconcile numbers.

While we strongly advise against this independent data mart approach, it’s worth noting that these data marts often do use dimensional modeling. This is because departments want data that’s easy to understand and quick to query. However, these independent dimensional models usually ignore some key Kimball principles, such as focusing on atomic-level detail, designing around business processes instead of just departments, and using conformed dimensions for enterprise consistency.

Hub-and-Spoke Corporate Information Factory (Inmon) Architecture

This approach, championed by Bill Inmon, focuses on creating a central, highly normalized Enterprise Data Warehouse (EDW). Here’s an illustration:

In this architecture, data is extracted from operational source systems and goes through an ETL process (sometimes called “data acquisition”). The atomic, raw data then lands in a Third Normal Form (3NF) database. This normalized, atomic repository is the central Enterprise Data Warehouse (EDW) within the CIF architecture. Unlike the Kimball approach, where normalization for ETL is optional, a normalized EDW is a required central component in the CIF. Both Kimball and CIF aim for enterprise data coordination and integration, but the CIF believes the normalized EDW fulfills this role, whereas Kimball emphasizes an “enterprise bus” built with conformed dimensions.

It’s important to understand that normalization itself doesn’t guarantee integration. Normalization simply structures tables based on many-to-one relationships. Integration, on the other hand, means resolving inconsistencies when data comes from different sources. You could normalize incompatible data sources completely without actually integrating them. The Kimball architecture, with its focus on conformed dimensions, addresses data inconsistencies directly without strictly requiring normalization.

In organizations that adopt the CIF approach, business users sometimes directly query the central EDW because it contains detailed data or because of its data availability. However, there are also additional ETL processes that load data from the EDW into “downstream” reporting and analytical environments for business users. While these analytical databases are often dimensionally structured, they frequently differ from the Kimball architecture’s presentation area. They are often department-specific (instead of process-centric) and populated with aggregated data (instead of detailed atomic data). If these downstream ETL processes apply specific business rules, like renaming columns for a department or using alternative calculations, it can become difficult to link these departmental analytical databases back to the detailed, atomic data in the central EDW.

In our opinion, the most extreme version of a pure CIF architecture is unworkable as a data warehouse. Such an architecture can lock the most detailed data in complex, normalized structures that are hard to query, while delivering department-specific and incompatible data marts to different groups of users. However, it’s worth noting that the next architectural approach offers a more positive outlook.

Hybrid Hub-and-Spoke and Kimball Architecture

This final architecture combines elements of both the Kimball and Inmon CIF approaches. Here’s an illustration:

In this hybrid model, a CIF-style EDW is created, but it’s strictly off-limits for direct analysis and reporting by business users. Its sole purpose is to serve as the source for populating a Kimball-style presentation area. This presentation area contains data that is dimensional, atomic (with aggregates for performance), organized by business process, and adheres to the enterprise data warehouse bus architecture using conformed dimensions.

Some supporters of this blended approach believe it offers the best of both worlds. It can leverage existing investments in an integrated, normalized repository while solving the performance and usability problems often associated with direct querying of a 3NF EDW by offloading user queries to the user-friendly dimensional presentation area. And since the final data delivered to business users and BI applications is built according to Kimball’s principles, it’s a very defensible approach.

If your organization has already invested in a 3NF EDW but finds that it’s not meeting user expectations for fast and flexible reporting and analysis, this hybrid approach could be a good fit. However, if you’re starting from scratch, the hybrid approach will likely cost more time and money, both during development and in ongoing operations. This is due to the multiple movements of data and the redundant storage of atomic details. If your organization has the desire, recognizes the need, and, most importantly, has the budget and patience to fully normalize and prepare your data before loading it into well-designed dimensional structures using Kimball methods, then this hybrid approach is certainly an option.

Dimensional Modeling Myths

Even though dimensional modeling is widely accepted, some mistaken ideas about it still exist in the industry. These misunderstandings can be a distraction and make it harder to agree on the best practices for building a DW/BI system. This section aims to clear up these common misconceptions.

Myth 1: Dimensional Models are Only for Summary Data

This frequent misconception often leads to poor designs. Since all user questions can’t be predicted, providing access to the most detailed, atomic data is crucial for flexibility, allowing users to aggregate it as needed. Summary data should only supplement, not replace, these granular details for performance.

Similarly, there’s no inherent limit on the historical data a dimensional model can store; the amount should solely depend on business requirements.

Myth 2: Dimensional Models are Departmental, Not Enterprise

Dimensional models should be organized around core business processes (like orders or invoices), not departmental structures. Multiple departments often need to analyze metrics from the same business process. Building separate, inconsistent analytical databases from the same source data for each department leads to data inconsistencies and should be avoided.

Myth 3: Dimensional Models are Not Scalable

This is false; dimensional models are highly scalable. Fact tables commonly handle billions, even trillions, of rows. Database vendors continuously enhance their products to optimize dimensional models for scalability and performance. Both normalized and dimensional models contain the same information and relationships and can answer identical questions, though with varying ease.

Myth 4: Dimensional Models are Only for Predictable Usage

Dimensional models should be designed around stable measurement processes, not specific, predefined reports, as these analyses constantly evolve. The key to flexibility lies in building fact tables at the most granular (atomic) level. Models relying only on summary data will hit “analytic brick walls,” limiting drill-down capabilities and making it hard to accommodate new dimensions or facts. Prioritizing the lowest level of detail from the start ensures maximum flexibility and extensibility.

Myth 5: Dimensional Models Can’t Be Integrated

Dimensional models are indeed integrable if they adhere to the enterprise data warehouse bus architecture. This requires conformed dimensions, which are centrally built and maintained in the ETL system, then reused across various dimensional models. This practice ensures data integration and consistent definitions. Achieving this consistency demands effort, regardless of whether normalized or dimensional models are used. Failing to use conformed dimensions results in isolated data solutions, a problem inherent in neglecting this fundamental tenet, not in dimensional modeling itself.

Quartz 4

Explorer

01. Data Warehousing, Business Intelligence, and Dimensional Modeling Primer