Quick notes on database storage internals. Watched this video and it clicked.

The Core Problem

Databases can’t keep everything in RAM—it’s volatile and expensive. Data lives on disk, which is 1000x+ slower than RAM.

  • RAM: ~100 nanoseconds access time
  • SSD: ~10-20 microseconds
  • HDD: ~5-10 milliseconds

The real killer? Random access. HDD seeks cost 5-10ms each. Reading 1000 random records = 10 seconds just in seek time. Sequential reads are way faster.

So databases obsess over locality—keeping related data close together on disk.

Two Main Storage Strategies

B+ Trees (Read-Optimized)

Used by: PostgreSQL, MySQL, SQLite

How it works:

  • Data organized in a sorted tree structure
  • Each lookup takes 3-4 disk reads (even with billions of records)
  • Tree height is predictable: log(records)

The catch: Writes are expensive. Inserting can trigger page splits, rebalancing, and multiple writes for a single logical insert. This is write amplification.

Good for: OLTP, transactions, mixed workloads with frequent lookups

LSM Trees (Write-Optimized)

Used by: Cassandra, ScyllaDB, RocksDB, ClickHouse

How it works:

  • Writes go to in-memory structure first (memtable)
  • When full, flush to disk as sorted, immutable files (SSTables)
  • All writes are sequential—no random I/O
  • Background compaction merges old SSTables

The catch: Reads check multiple files. Uses bloom filters (probabilistic data structure) to skip files that definitely don’t contain a key.

Good for: Time-series, logs, metrics, write-heavy workloads

Key Differences

AspectB+ TreesLSM Trees
Read speedFast, consistentVariable (check multiple files)
Write speedModerateVery high
Write patternRandom I/OSequential I/O only
Best use caseTransactions, OLTPLogs, metrics, append-heavy

Real Example

Discord migrated from MongoDB (B-Tree) to Cassandra (LSM) because random reads/writes became too slow as data grew. Message history is mostly append-only—perfect for LSM’s sequential write advantage. They later moved to ScyllaDB (faster LSM) for performance.

When to Pick What

  • Need strong consistency + ACID? → B+ Tree (PostgreSQL, MySQL)
  • Need high write throughput? → LSM (Cassandra, RocksDB)
  • Banking/e-commerce? → B+ Tree
  • Logs/metrics/events? → LSM

No universal winner—just tradeoffs based on workload.

References