Database Architecture

Distributed data storage with Citus and PostgreSQL

Philosophy

Traditional monolithic databases eventually hit scalability limits—either they run out of storage or can’t handle concurrent query volume. Scaling vertically (bigger servers) has practical limits and creates single points of failure.

DineHub adopts horizontal database scaling: distribute data across multiple servers, with each server handling a subset of the data. This provides both capacity and performance scaling.

Why Citus?

The Problem with Single Databases

As data grows, a single PostgreSQL server faces challenges:

  • Storage limits: Hardware can only hold so much data
  • Query performance: Large tables become slow to scan
  • Concurrent load: Limited CPU/memory for parallel queries
  • Availability: Single server failure means downtime

Citus Solution

Citus extends PostgreSQL to distribute tables across multiple servers:

  • Horizontal scaling: Add servers as data grows
  • Query parallelization: Complex queries execute across workers
  • High availability: Replicas provide fault tolerance
  • PostgreSQL compatible: Standard SQL, tools, and drivers work

Architecture

The Coordinator

The coordinator is the entry point for all database queries:

  • Receives queries: Applications connect here like normal PostgreSQL
  • Plans execution: Determines which workers hold relevant data
  • Routes requests: Sends sub-queries to appropriate workers
  • Aggregates results: Combines worker responses into final result

From the application perspective, the coordinator looks like a standard PostgreSQL server.

The Workers

Workers store actual data and execute queries:

  • Hold shards: Each worker contains portions of distributed tables
  • Process queries: Execute SQL against local data
  • Return results: Send partial results back to coordinator

Workers are standard PostgreSQL servers with Citus extension installed.

Data Distribution

                    ┌─────────────────────┐
                    │     Coordinator     │
                    │  (Query planner &   │
                    │   result aggregator)│
                    └──────────┬──────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
              ▼                ▼                ▼
    ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
    │    Worker 1     │ │    Worker 2     │ │    Worker 3     │
    │                 │ │                 │ │                 │
    │  Restaurants    │ │  Restaurants    │ │  Restaurants    │
    │  ID: 1-1000     │ │  ID: 1001-2000  │ │  ID: 2001-3000  │
    │                 │ │                 │ │                 │
    │  Orders         │ │  Orders         │ │  Orders         │
    │  (same shard)   │ │  (same shard)   │ │  (same shard)   │
    └─────────────────┘ └─────────────────┘ └─────────────────┘

Sharding Strategy

Distribution Column

Tables are distributed by a distribution column:

  • Restaurants: Distributed by restaurant_id
  • Orders: Also distributed by restaurant_id (co-located with restaurant)
  • Users: Distributed by user_id

This “co-location” means a restaurant’s orders reside on the same worker as the restaurant itself, making join queries efficient.
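As a sketch, this layout is typically declared with Citus's create_distributed_table function; the table and column names below are assumed from the list above:

```sql
-- Distribute restaurants by restaurant_id
SELECT create_distributed_table('restaurants', 'restaurant_id');

-- Co-locate orders with restaurants by sharing the distribution column
SELECT create_distributed_table('orders', 'restaurant_id',
                                colocate_with => 'restaurants');

-- Users are distributed independently by user_id
SELECT create_distributed_table('users', 'user_id');
```

The colocate_with argument guarantees that rows with the same restaurant_id land in shards placed on the same worker.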

Shard Assignment

Citus uses hash partitioning to assign rows to shards:

  • The hash of the distribution column value maps each row to one of a fixed number of shards (32 by default)
  • Each shard is assigned to one primary worker
  • Replicas may exist on other workers for availability
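Shard placement can be inspected through Citus metadata tables; the hash range boundaries will vary by cluster:

```sql
-- Each row maps a hash range to one shard of the orders table
SELECT shardid, shardminvalue, shardmaxvalue
FROM pg_dist_shard
WHERE logicalrelid = 'orders'::regclass;

-- Which worker currently holds each shard placement
SELECT p.shardid, n.nodename, n.nodeport
FROM pg_dist_placement p
JOIN pg_dist_node n ON p.groupid = n.groupid;
```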

When to Distribute

Not all tables should be distributed:

Distribute (large tables):

  • Restaurants (millions of rows expected)
  • Orders (billions of rows expected)
  • Order items (billions of rows expected)

Reference tables (replicated to all workers):

  • Cuisine types (small, lookup data)
  • Configuration (rarely changes)

Reference tables are replicated to every worker, making joins fast but updates expensive.
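A reference table is declared with create_reference_table, which copies it in full to every worker (table name assumed from the list above):

```sql
-- Replicated to all workers; joins against it stay local to each shard
SELECT create_reference_table('cuisine_types');
```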

Query Execution

Simple Queries

Single-row lookups by distribution column are fast:

```sql
SELECT * FROM orders WHERE restaurant_id = 42;
```

The coordinator hashes 42, determines which worker holds that shard, and routes the query directly to it.
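Running EXPLAIN on the coordinator makes this routing visible. The exact plan text depends on the Citus version, but a router query typically shows up as a Citus custom scan with a single task:

```sql
EXPLAIN SELECT * FROM orders WHERE restaurant_id = 42;
-- Custom Scan (Citus Adaptive)
--   Task Count: 1    <- only one worker is contacted
```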

Complex Queries

Aggregations and joins may involve multiple workers:

```sql
SELECT region, COUNT(*) FROM restaurants GROUP BY region;
```

Execution:

  1. Coordinator sends query to all workers
  2. Each worker counts local restaurants
  3. Workers return partial counts
  4. Coordinator sums the partial counts per region and returns the final result

This parallel execution provides near-linear speedup with added workers.
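Conceptually, the coordinator rewrites the aggregate into per-worker partial aggregates and a final merge step. The shard name below is purely illustrative:

```sql
-- Sent to each worker, once per local shard (shard name illustrative):
SELECT region, count(*) AS partial_count
FROM restaurants_shard_102008
GROUP BY region;

-- On the coordinator, the partial results are merged, conceptually:
-- SELECT region, sum(partial_count) FROM <worker results> GROUP BY region;
```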

Cross-Shard Joins

Joins between distributed tables require care:

Efficient (co-located join):

```sql
SELECT * FROM restaurants r
JOIN orders o ON r.id = o.restaurant_id
WHERE r.id = 42;
```

Both tables share the same distribution column, so the matching rows live on the same worker.

Less efficient (repartition join):

```sql
SELECT * FROM orders o
JOIN users u ON o.customer_id = u.id;
```

Different distribution columns require data movement between workers.
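Repartition joins are disabled by default in Citus; when such a query is genuinely needed, it can be allowed per session, at the cost of shuffling data between workers:

```sql
-- Allow Citus to repartition data on the fly for this session
SET citus.enable_repartition_joins TO on;

SELECT * FROM orders o
JOIN users u ON o.customer_id = u.id;
```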

High Availability

Replication Strategy

Each shard has replicas on different workers:

  • Primary: Handles reads and writes
  • Standby: Receives streaming replication, takes over if primary fails
  • Cross-region: Replicas in other regions for disaster recovery

Failover Process

If a worker fails:

  1. Detection: Health checks notice unresponsive worker
  2. Promotion: Standby replica promoted to primary
  3. Reconfiguration: Coordinator routes queries to new primary
  4. Recovery: Failed worker repaired, rejoins as replica

This process is automatic—applications don’t need to change connection strings.

Split-Brain Prevention

Citus uses consensus mechanisms to prevent split-brain scenarios:

  • Only one primary per shard at a time
  • Writes blocked until consensus achieved
  • Clients may see brief unavailability during failover

Performance Optimization

Query Planning

The coordinator analyzes queries to optimize distribution:

  • Pushdown: Move filters and aggregations to workers
  • Pruning: Skip workers that can’t have relevant data
  • Parallelization: Split work across multiple workers

Index Strategy

Index recommendations change with distribution:

  • Distribution column: Always indexed (used for routing)
  • Join columns: Index if frequently joined
  • Filter columns: Index if selective filters common
  • Coordinator: May need indexes for final aggregation
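Index DDL on a distributed table is propagated by Citus to every shard on every worker; the column name here is an assumption for illustration:

```sql
-- Created on each shard of orders across all workers
CREATE INDEX idx_orders_created_at ON orders (created_at);
```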

Monitoring

Key metrics for distributed databases:

  • Shard imbalance: Are workers evenly loaded?
  • Query latency: Coordinator vs worker time breakdown
  • Replication lag: Standby replicas behind primary?
  • Connection pooling: Managing thousands of connections

Operational Considerations

Adding Workers

Scale out by adding workers:

  1. Provision new worker nodes
  2. Run citus_add_node() to add to cluster
  3. Existing data doesn’t automatically redistribute
  4. New data uses new workers
  5. Optional: Rebalance shards for even distribution
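Steps 2 and 5 map directly to Citus functions (Citus 10+); the hostname and port below are placeholders:

```sql
-- Step 2: register the new worker with the coordinator
SELECT citus_add_node('worker-4.internal', 5432);

-- Step 5 (optional): move shards so all workers are evenly loaded
SELECT rebalance_table_shards();
```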

Schema Changes

Schema modifications propagate to all workers:

```sql
-- This runs on coordinator and all workers
ALTER TABLE restaurants ADD COLUMN rating FLOAT;
```

Citus propagates supported DDL to all workers automatically; for most schema changes, the standard commands just work.

Backup and Recovery

Backup strategies for distributed data:

  • Logical backups: pg_dump on coordinator captures distributed schema
  • Per-worker backups: Physical backups of each worker’s data
  • Point-in-time recovery: WAL archiving for granular recovery
  • Cross-region replicas: Live replicas for disaster recovery

Trade-offs

Benefits

  • Scalability: Add capacity by adding servers
  • Performance: Parallel query execution
  • Availability: Replicas provide fault tolerance
  • PostgreSQL compatible: Familiar SQL and tooling

Complexity

  • Query planning: Must consider distribution in query design
  • Operational overhead: More servers to monitor and maintain
  • Transaction limitations: Cross-shard transactions have overhead
  • Migration: Existing applications may need modification

When Not to Use

Citus may be overkill for:

  • Small datasets (< 100GB)
  • Simple workloads (mostly single-row lookups)
  • Strong consistency requirements across shards (use single PostgreSQL)

DineHub’s expected scale (thousands of restaurants, millions of orders) justifies the complexity.

Future Enhancements

  • Citus MX: Multi-coordinator for higher availability
  • Columnar storage: Analytics queries on compressed columnar data
  • Automatic rebalancing: Dynamic shard redistribution
  • Read replicas: Offload read traffic to standbys
  • Global indexes: Cross-shard indexes for unique constraints