Databricks Tackles Downtime

Planned maintenance can be more disruptive than unexpected outages for many databases. Databricks is tackling this head-on with its Lakebase architecture, aiming to make version updates and security patches entirely unnoticeable. The primary challenge with traditional database restarts is the loss of in-memory caches, leading to significant performance degradation as data reloads from storage. This can escalate from a speed issue to a critical availability problem under heavy loads.

The core innovation lies in 'prewarming.' Before a scheduled restart, a new compute node is spun up in the background. This new node pre-caches data using the current primary's page list and WAL stream. Once its cache is warm, it is promoted to primary and takes over the workload, with no additional cost to the customer and no standing replica to maintain. This method keeps databases available and performant throughout the patching process.

This preemptive caching strategy is enabled by Lakebase's architecture, which combines stateless, elastic compute nodes with disaggregated, shared storage. Unlike traditional systems where cache misses cripple performance post-restart, Lakebase leverages its flexible compute to prepare nodes in advance.

The Cold Cache Problem

PostgreSQL restarts typically wipe out essential caches, like the buffer cache and local file cache. While the database itself may come back online quickly, user workloads can experience a sharp drop in throughput (up to 70% in some tests) as the cache slowly repopulates from disk. This performance hit isn't just an annoyance; under load, slower queries pile up and can lead to timeouts and availability issues.
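To see why a cold cache hurts so much, here is a rough back-of-the-envelope model in Python. It assumes a single reader whose throughput is the inverse of average page-access latency; the latency figures and the 99% warm hit ratio are illustrative assumptions, not measurements from Databricks or from the tests mentioned above.

```python
def relative_throughput(hit_ratio, cache_latency_us=1.0,
                        storage_latency_us=100.0, warm_hit_ratio=0.99):
    """Throughput at `hit_ratio` relative to a warm cache.

    All default numbers are hypothetical and chosen only for illustration.
    """
    def avg_latency(h):
        # Weighted average of cache hits and storage reads.
        return h * cache_latency_us + (1.0 - h) * storage_latency_us

    return avg_latency(warm_hit_ratio) / avg_latency(hit_ratio)


if __name__ == "__main__":
    # Walk the hit ratio from fully cold to warm.
    for h in (0.0, 0.5, 0.9, 0.99):
        print(f"hit ratio {h:.2f} -> {relative_throughput(h):.1%} of warm throughput")
```

Even in this simplified model, throughput stays far below the warm baseline until the hit ratio is close to its steady-state value, which is why the recovery period after a restart matters more than the restart itself.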

Existing solutions like `pg_prewarm` run only after the restart, so the workload is already hitting a cold cache while the extension reloads pages. Streaming replication can keep a replica's cache warm ahead of a switchover, but it demands a full replica setup and complex orchestration, adding overhead.
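For context, this is roughly what the post-restart approach looks like. The Python helper below only builds the SQL an operator might run once the server is back up. `pg_prewarm` and its `prefetch`, `read` and `buffer` modes are part of the standard PostgreSQL extension; the helper name and the relation names are hypothetical.

```python
def prewarm_statements(relations, mode="buffer"):
    """Build SQL that loads the given relations via pg_prewarm.

    `relations` is a list of table or index names (hypothetical here).
    """
    if mode not in {"prefetch", "read", "buffer"}:
        raise ValueError(f"unsupported pg_prewarm mode: {mode}")

    statements = ["CREATE EXTENSION IF NOT EXISTS pg_prewarm;"]
    for rel in relations:
        quoted = rel.replace("'", "''")  # escape single quotes in the literal
        statements.append(f"SELECT pg_prewarm('{quoted}', '{mode}');")
    return statements


if __name__ == "__main__":
    # These can only run after the restarted server accepts connections.
    for stmt in prewarm_statements(["orders", "orders_pkey"]):
        print(stmt)
```

The limitation is in the timing rather than the SQL: these statements compete with live traffic on a server whose cache is already empty.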

Lakebase's Prewarming Mechanism

Databricks Postgres restarts are now handled differently. Shortly before a scheduled update, a new compute instance is provisioned invisibly. This instance receives the current primary's cache page list and begins loading data from shared storage without impacting the live workload. It also subscribes to the write-ahead log (WAL) stream from Safekeepers, efficiently updating its cache without burdening the primary.

Once prewarming is complete, the old primary is shut down, and the new compute node is promoted. This process uses standard PostgreSQL promotion mechanisms, avoiding another server restart.
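The sequence described above can be sketched as a toy simulation. This is not Databricks code: the class names, the dictionary-based "storage", and the WAL record shape are all assumptions made for illustration, and real Safekeepers, page versions and promotion are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStorage:
    pages: dict  # page_id -> contents as persisted in shared storage


@dataclass
class WalRecord:
    lsn: int       # log sequence number
    page_id: int
    contents: str


@dataclass
class ComputeNode:
    name: str
    cache: dict = field(default_factory=dict)
    applied_lsn: int = 0
    is_primary: bool = False

    def cached_page_ids(self):
        return sorted(self.cache)

    def prewarm(self, page_ids, storage):
        # Load the primary's hot pages from shared storage, not from the primary.
        for pid in page_ids:
            self.cache[pid] = storage.pages[pid]

    def apply_wal(self, records):
        # Replay newer changes so cached pages do not go stale.
        for rec in records:
            if rec.lsn <= self.applied_lsn:
                continue
            if rec.page_id in self.cache:
                self.cache[rec.page_id] = rec.contents
            self.applied_lsn = rec.lsn


def switchover(old, new, storage, wal):
    """Prewarm `new`, catch it up from the WAL, then promote it."""
    new.prewarm(old.cached_page_ids(), storage)
    new.apply_wal(wal)
    old.is_primary = False   # old primary is shut down
    new.is_primary = True    # new node is promoted
    return new


storage = SharedStorage(pages={1: "a0", 2: "b0", 3: "c0"})
old = ComputeNode("old", cache={1: "a1", 2: "b0"}, applied_lsn=1, is_primary=True)
wal = [WalRecord(lsn=1, page_id=1, contents="a1")]
new = switchover(old, ComputeNode("new"), storage, wal)
print(new.cache == old.cache)  # the new primary starts with the same hot pages
```

The point of the sketch is the ordering: the cache is filled and kept current before promotion, so the first query on the new primary already finds its pages in memory.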

According to Databricks, this means customers experience no noticeable performance degradation during Databricks Postgres restarts, a significant improvement over conventional methods. The company is rolling this out for read/write endpoints immediately, with read-only endpoints to follow soon.

In tests, Databricks demonstrated that with prewarming, throughput recovery is nearly instantaneous for both read-only and read-write workloads. Without it, performance lags significantly as the cache warms up. This difference is particularly pronounced in read-only scenarios where a healthy cache hit ratio provides a substantial boost.

AI Daily Digest