Building Real-Time Dashboards with Materialized Views and Change Data Capture

In today’s fast-paced digital landscape, the ability to make data-driven decisions instantly is no longer a luxury—it’s a necessity. Businesses across every sector, from e-commerce to IoT, demand immediate insights into their operations. This drives the critical need for real-time dashboards: visual representations of data that update milliseconds after changes occur in the underlying systems. Imagine seeing sales figures update as transactions happen, monitoring server health as events unfold, or tracking user behavior the moment it occurs.

However, achieving true real-time analytics is fraught with challenges. Traditional data warehousing approaches often rely on batch processing, leading to data staleness. Directly querying operational databases for dashboards can overwhelm them, impacting transactional performance. This article delves into a powerful, scalable, and efficient architecture for building real-time dashboards using two fundamental technologies: Materialized Views (MVs) and Change Data Capture (CDC).

We’ll explore why these technologies are a perfect match, walk through a practical architectural blueprint, provide code examples, discuss real-world scenarios, and address the common challenges encountered in implementing such a system. By the end, you’ll have a clear understanding of how to empower your organization with live, actionable insights.

The Challenge of Real-Time Analytics

Before we dive into solutions, let’s understand why real-time analytics is inherently complex. The core problem lies in the conflict between operational and analytical workloads:

Operational Databases (OLTP): Optimized for high-volume, low-latency transactions (inserts, updates, deletes). They are typically denormalized for speed and consistency but are not designed for complex analytical queries that scan large datasets. Running heavy reporting queries directly on OLTP databases can lead to performance degradation, locking issues, and a poor user experience for transactional applications.
Analytical Databases (OLAP) / Data Warehouses: Optimized for complex queries, aggregations, and reporting over vast historical datasets. They are typically column-oriented, denormalized for analytical performance, and designed for batch loading processes (ETL/ELT). While excellent for historical analysis, their batch nature means data is often hours or even days old.

Bridging this gap to provide real-time insights presents several hurdles:

Data Latency: How quickly can changes in the source system be reflected in the dashboard? Traditional ETL pipelines introduce significant delays.
Query Performance: Real-time dashboards often require fast aggregations over constantly changing data. Running these queries repeatedly on raw data can be computationally expensive and slow.
Resource Contention: Balancing the needs of transactional systems with demanding analytical queries on the same database is a recipe for performance bottlenecks.
Scalability: As data volume and velocity grow, the system must scale horizontally to handle the load without compromising performance or latency.
Complexity: Building and maintaining real-time data pipelines involves managing multiple components, ensuring data consistency, and handling failures gracefully.

Simply put, continuously running complex SQL queries against a live production database for every dashboard refresh is unsustainable. We need a way to pre-process, aggregate, and serve data specifically for analytical consumption, and to do so with minimal delay from the source system. This is where Materialized Views and Change Data Capture come into play.

Core Technologies Explained

Materialized Views (MVs)

A Materialized View is essentially a database object that stores the pre-computed results of a query. Unlike a regular view, which is a virtual table defined by a query and executed every time it’s accessed, an MV stores the actual data. Think of it as a cache for complex query results.

How Materialized Views Work

When you create a Materialized View, the database executes the underlying query and stores the result set as a table. Subsequent queries against the MV are incredibly fast because they are simply reading from this pre-computed table, rather than re-executing the complex join, aggregation, or filtering operations defined in the MV’s query.

-- Example: Create a Materialized View for daily sales summary
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
    DATE(order_timestamp) AS sale_date,
    product_id,
    SUM(quantity) AS total_quantity_sold,
    SUM(price * quantity) AS total_revenue
FROM
    orders
GROUP BY
    DATE(order_timestamp),
    product_id;

-- Query the MV - much faster than querying the base table directly
SELECT sale_date, SUM(total_revenue) FROM daily_sales_summary GROUP BY sale_date ORDER BY sale_date DESC;

Benefits of Materialized Views

Improved Query Performance: By pre-calculating and storing results, MVs drastically reduce query execution times, making dashboards responsive.
Reduced Load on Source Systems: Complex queries are executed less frequently (only when the MV is refreshed), offloading the analytical burden from transactional databases.
Simplified Application Logic: Dashboard applications can query simple MVs instead of constructing complex, performance-intensive SQL.
Data Consistency (at refresh): At the moment of refresh, the MV provides a consistent snapshot of the data.

Drawbacks of Materialized Views

Data Staleness: This is the primary challenge for real-time applications. MVs are only as fresh as their last refresh. If not refreshed frequently, they become stale.
Refresh Overhead: Refreshing an MV (especially a full refresh) can be a resource-intensive operation, potentially locking tables or consuming significant CPU/IO, which can be problematic for very large MVs.
Storage Requirements: MVs consume disk space, potentially duplicating data.

For MVs to be truly useful in a real-time context, we need an efficient mechanism to keep them updated with minimal latency and overhead. This is where Change Data Capture shines.

Change Data Capture (CDC)

Change Data Capture (CDC) is a set of software design patterns used to determine and track the changes in a database. Instead of relying on full table scans or scheduled batch jobs, CDC focuses on identifying exactly what data has changed, when it changed, and what the nature of that change was (insert, update, or delete).

Why CDC is Crucial for Real-Time

CDC is the engine that drives real-time data pipelines. It provides an event-driven approach to data synchronization, which is fundamental for keeping Materialized Views fresh without overwhelming the source database. Instead of having to re-read entire tables to find changes, CDC tells us precisely which rows were affected.

Methods of CDC

There are several ways to implement CDC, each with its pros and cons:

Trigger-Based CDC: Database triggers are set up on source tables. When an insert, update, or delete occurs, the trigger writes the change event to a separate “change log” table.
- Pros: Relatively easy to implement, works with most databases.
- Cons: Adds overhead to OLTP transactions, requires schema modifications, can be complex to manage for schema evolution, potential for infinite loops if not carefully managed.
Timestamp-Based CDC: Tables are designed with a last_modified_timestamp column. CDC processes periodically query for rows where this timestamp is newer than the last processed time.
- Pros: Simple to implement.
- Cons: Misses deletes, requires careful indexing, prone to race conditions if not handled transactionally, can still be inefficient for large tables.
Log-Based CDC (Recommended for Production): This method reads the database’s transaction log (e.g., PostgreSQL’s WAL, MySQL’s Binlog, Oracle’s Redo Logs). These logs contain a detailed, ordered record of every change made to the database.
- Pros:
  - Zero Impact on Source System: It reads from the transaction log, which is already being written by the database, incurring minimal overhead on the active OLTP workload.
  - Comprehensive: Captures all types of changes (inserts, updates, deletes) and often the “before” and “after” state of rows.
  - Transactional Order: Preserves the exact order of events as they occurred in the source database, crucial for data consistency.
  - Non-Intrusive: Requires minimal or no schema changes to source tables.
- Cons: More complex to set up, requires specific database configurations (e.g., logical decoding enabled), dependent on database vendor’s log format.

Common CDC Tools

Several tools facilitate log-based CDC:

Debezium: An open-source distributed platform for CDC. It provides a set of Kafka Connect connectors that read database transaction logs and stream all row-level changes to Kafka topics. Highly recommended for its robustness and integration with the Kafka ecosystem.
Fivetran / Stitch / Airbyte: Managed ELT/CDC services that handle the complexities of data extraction and loading into data warehouses.
Native Database Features: Some databases offer built-in CDC features (e.g., SQL Server CDC, Oracle GoldenGate).

By capturing changes as they happen and streaming them as events, CDC provides the real-time “feed” that we need to keep our Materialized Views perpetually fresh.

Architectural Blueprint: CDC + Materialized Views for Real-Time Dashboards

Combining CDC with Materialized Views forms a powerful pipeline for real-time analytics. Here’s a common architecture:

Real-Time Dashboards: MVs & CDC for Live Analytics — Generated Image

High-Level Overview

+-----------------+       +----------------+       +---------------+       +---------------------+       +-------------------+       +--------------------+
| Source Database | <-----| CDC Connector  | <-----| Message Broker| <-----| Stream Processing   | <-----| Analytical DB     | <-----| Dashboarding Tool  |
|   (e.g., PG)    |       |  (e.g., Debezium)|       |  (e.g., Kafka)|       |   (e.g., Flink/Spark)|       | (e.g., PG/ClickHouse)|       |  (e.g., Grafana)   |
+-----------------+       +----------------+       +---------------+       +---------------------+       +-------------------+       +--------------------+
        ^                                                                                                            ^
        |                                                                                                            |
        |                                                                                                            |
        |                                                                                                            |
        +------------------------------------------------------------------------------------------------------------+
                                                    (Serves Materialized Views)

Detailed Data Flow

Data Changes in Source Database: An OLTP database (e.g., PostgreSQL, MySQL) serves transactional applications. As users interact with the application, data is inserted, updated, or deleted. These changes are recorded in the database’s transaction log (e.g., WAL for PostgreSQL, binlog for MySQL).
CDC Captures Changes: A CDC connector (like Debezium) continuously monitors the transaction log of the source database. It captures every row-level change (insert, update, delete) as an event.
Events to Message Broker: Each captured change event is then published as a message to a distributed message broker (e.g., Apache Kafka). Typically, each source table has its own Kafka topic. The message contains metadata about the change (operation type, timestamp, schema) and the actual data (before and after images for updates).
Stream Processing Engine: A stream processing application (e.g., Apache Flink, Apache Spark Streaming, or a custom consumer written in Python/Java) subscribes to the Kafka topics. Its role is to:
- Consume Events: Read change events from Kafka.
- Transform and Aggregate: Apply necessary transformations, filters, and real-time aggregations (e.g., sum sales by product, count active users). This is where the logic for your “Materialized View” is recalculated incrementally based on the incoming events.
- Upsert into Analytical Database: Based on the processed events and aggregations, the stream processor performs an “upsert” (update or insert) operation into a target table in an analytical database. This target table effectively serves as your incrementally updated Materialized View.
Analytical Database (Serving Layer): A database (e.g., PostgreSQL, ClickHouse, a specialized real-time analytics database like Materialize) houses these continuously updated “Materialized View” tables. This database is optimized for analytical queries and can handle high concurrency for dashboard requests.
Dashboarding Tool: Tools like Grafana, Tableau, or Power BI connect to the analytical database. They query the fast-access Materialized View tables to render real-time dashboards, reflecting changes with minimal latency (often sub-second).

This architecture decouples the analytical workload from the operational database, ensures data freshness, and provides a scalable foundation for real-time insights.

Step-by-Step Implementation Guide (Conceptual & Code Examples)

Let’s walk through a conceptual implementation using PostgreSQL as the source and target database, Debezium for CDC, Kafka as the message broker, and a conceptual stream processor.

1. Setting up the Source Database (PostgreSQL)

For Debezium (or any log-based CDC) to work with PostgreSQL, you need to enable logical decoding and configure the replication level.

-- In postgresql.conf:
-- Set `wal_level` to `logical`
-- wal_level = logical
-- Set `max_replication_slots` and `max_wal_senders` appropriately
-- max_replication_slots = 10
-- max_wal_senders = 10

-- Restart PostgreSQL after changes.

-- Example source table for orders
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    quantity INT NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    order_timestamp TIMESTAMP DEFAULT NOW()
);

-- Ensure the table has a primary key or unique index for Debezium to correctly identify rows.
-- In PostgreSQL, you might also need to set REPLICA IDENTITY FULL for tables you want to capture:
ALTER TABLE orders REPLICA IDENTITY FULL;

-- Sample data
INSERT INTO orders (customer_id, product_id, quantity, price) VALUES
(101, 1, 2, 19.99),
(102, 2, 1, 49.99),
(101, 3, 3, 9.50);

2. Implementing CDC (Debezium with Kafka)

Debezium runs as a Kafka Connect connector. You’d typically deploy Kafka Connect and then configure a connector. Here’s a simplified conceptual configuration for a Debezium PostgreSQL connector:

-- Debezium PostgreSQL Connector Configuration (e.g., `debezium-postgres-connector.json`)
{
  "name": "postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "your_postgres_host",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "debezium_password",
    "database.dbname": "your_database_name",
    "database.server.name": "your_logical_replication_server_name",
    "schema.include.list": "public",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "topic.prefix": "dbserver",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false"
  }
}

You would then submit this configuration to your Kafka Connect REST API. Once running, Debezium will start streaming changes from the orders table to a Kafka topic, typically named something like `dbserver.public.orders`.

Example CDC Message (Kafka)

An update event from Debezium might look something like this (simplified):

-- Kafka message for an update on the 'orders' table
{
  "before": {
    "order_id": 1,
    "customer_id": 101,
    "product_id": 1,
    "quantity": 2,
    "price": 19.99,
    "order_timestamp": "2023-10-27T10:00:00Z"
  },
  "after": {
    "order_id": 1,
    "customer_id": 101,
    "product_id": 1,
    "quantity": 3,  -- quantity changed
    "price": 19.99,
    "order_timestamp": "2023-10-27T10:05:00Z"
  },
  "source": {
    "db": "your_database_name",
    "table": "orders",
    "sequence": "...",
    "ts_ms": 1678886400000
  },
  "op": "u", -- 'u' for update, 'c' for create/insert, 'd' for delete
  "ts_ms": 1678886405000,
  "transaction": null
}

3. Processing Streamed Data and Updating Materialized Views

This is the core of real-time aggregation. A stream processing engine consumes these Kafka messages, performs real-time aggregations, and then upserts the results into a dedicated “Materialized View” table in our analytical database.

Instead of a full Materialized View in the traditional database sense (which often requires full refreshes), we’re creating and maintaining a regular table that *serves the purpose* of a real-time MV, incrementally updated by the stream processor. This allows for fine-grained control over updates.

Analytical Database (Target Table)

First, create the target table in your analytical database. This table will hold the aggregated data, similar to what our conceptual `daily_sales_summary` MV would hold.

-- Analytical database: Target table for real-time daily sales summary
CREATE TABLE real_time_daily_sales_summary (
    sale_date DATE NOT NULL,
    product_id INT NOT NULL,
    total_quantity_sold BIGINT NOT NULL,
    total_revenue DECIMAL(18, 2) NOT NULL,
    PRIMARY KEY (sale_date, product_id) -- Composite primary key for efficient upserts
);

Stream Processing Logic (Conceptual Example)

A stream processor (e.g., using Python with Kafka-Python and a database client, or a full-fledged system like Flink/Spark Streaming) would perform the following steps:

Consume Kafka Messages: Continuously read messages from the `dbserver.public.orders` topic.
Parse CDC Event: Extract the `op` (operation type), `before`, and `after` fields from each message.
Calculate Delta/Aggregation:
- For ‘c’ (insert) operations: Add the `quantity` and `price * quantity` from the `after` image to the corresponding `sale_date` and `product_id` in the `real_time_daily_sales_summary` table.
- For ‘u’ (update) operations: Calculate the difference. Subtract the `quantity` and `price * quantity` from the `before` image, then add the `quantity` and `price * quantity` from the `after` image. This handles changes in quantity or price.
- For ‘d’ (delete) operations: Subtract the `quantity` and `price * quantity` from the `before` image.
Upsert to Target Table: Perform an upsert operation into `real_time_daily_sales_summary` based on the calculated delta.

Here’s a conceptual Python-like pseudocode snippet illustrating the aggregation logic for an update:

# Conceptual stream processor logic (simplified)

from kafka import KafkaConsumer

import json

import psycopg2 # or other DB client
consumer = KafkaConsumer(

    'dbserver.public.orders',

    bootstrap_servers=['localhost:9092'],

    value_deserializer=lambda m: json.loads(m.decode('utf-8'))

)
def update_summary_table(conn, sale_date, product_id, quantity_delta, revenue_delta):

    cursor = conn.cursor()

    cursor.execute(

        """

        INSERT INTO real_time_daily_sales_summary (sale_date, product_id, total_quantity_sold, total_revenue)

        VALUES (%s, %s, %s, %s)

        ON CONFLICT (sale_date, product_id) DO UPDATE

        SET

            total_quantity_sold = real_time_daily_sales_summary.total_quantity_sold + EXCLUDED.total_quantity_sold,

            total_revenue = real_time_daily_sales_summary.total_revenue + EXCLUDED.total_revenue;

        """,

        (sale_date, product_id, quantity_delta, revenue_delta)

    )

    conn.commit()
# Connect to your analytical database

conn = psycopg2.connect(database="analytics_db", user="user", password="password", host="localhost", port="5432")
for message in consumer:

    event = message.value

    op = event['op']

    before = event['before']

    after = event['after']
    quantity_delta = 0

    revenue_delta = 0
    current_sale_date = None

    current_product_id = None
    if op == 'c': # Insert

        current_sale_date = after['order_timestamp'].split('T')[0]

        current_product_id = after['product_id']

        quantity_delta = after['quantity']

        revenue_delta = after['price'] * after['quantity']
    elif op == 'u': # Update

        # Handle potential change in date or product_id (more complex in real life, might need two upserts)

        # For simplicity, assume sale_date and product_id don't change in the MV key

        current_sale_date = after['order_timestamp'].split('T')[0]

        current_product_id = after['product_id']
        old_quantity = before['quantity']

        old_revenue = before['price'] * before['quantity']

        new_quantity = after['quantity']

        new_revenue = after['price'] * after['quantity']
        quantity_delta = new_quantity - old_quantity

        revenue_delta = new_revenue - old_revenue
    elif op == 'd': # Delete

        current_



            
                        
                Tags:
                analyticsCDCchange data capturedata engineeringdata streamingDebeziumKafkakhadervalimaterialized viewsreal-time dashboards            
            
            
            
                
                                    
                
                    Written by
                    Khader Vali
                    Senior Software Engineer specializing in cloud architecture, real-time systems, and enterprise-scale applications.
                
            

            
            
                Share this article

Real-Time Dashboards: MVs & CDC for Live Analytics