Quick notes on database storage internals. Watched this video and it clicked.
The Core Problem
Databases can’t keep everything in RAM—it’s volatile and expensive. Data lives on disk, which is 1000x+ slower than RAM.
- RAM: ~100 nanoseconds access time
- SSD: ~10-20 microseconds
- HDD: ~5-10 milliseconds
The real killer? Random access. HDD seeks cost 5-10ms each. Reading 1000 random records = 10 seconds just in seek time. Sequential reads are way faster.
So databases obsess over locality—keeping related data close together on disk.
Two Main Storage Strategies
B+ Trees (Read-Optimized)
Used by: PostgreSQL, MySQL, SQLite
How it works:
- Data organized in a sorted tree structure
- Each lookup takes 3-4 disk reads (even with billions of records)
- Tree height is predictable:
log(records)
The catch: Writes are expensive. Inserting can trigger page splits, rebalancing, and multiple writes for a single logical insert. This is write amplification.
Good for: OLTP, transactions, mixed workloads with frequent lookups
LSM Trees (Write-Optimized)
Used by: Cassandra, ScyllaDB, RocksDB, ClickHouse
How it works:
- Writes go to in-memory structure first (memtable)
- When full, flush to disk as sorted, immutable files (SSTables)
- All writes are sequential—no random I/O
- Background compaction merges old SSTables
The catch: Reads check multiple files. Uses bloom filters (probabilistic data structure) to skip files that definitely don’t contain a key.
Good for: Time-series, logs, metrics, write-heavy workloads
Key Differences
| Aspect | B+ Trees | LSM Trees |
|---|---|---|
| Read speed | Fast, consistent | Variable (check multiple files) |
| Write speed | Moderate | Very high |
| Write pattern | Random I/O | Sequential I/O only |
| Best use case | Transactions, OLTP | Logs, metrics, append-heavy |
Real Example
Discord migrated from MongoDB (B-Tree) to Cassandra (LSM) because random reads/writes became too slow as data grew. Message history is mostly append-only—perfect for LSM’s sequential write advantage. They later moved to ScyllaDB (faster LSM) for performance.
When to Pick What
- Need strong consistency + ACID? → B+ Tree (PostgreSQL, MySQL)
- Need high write throughput? → LSM (Cassandra, RocksDB)
- Banking/e-commerce? → B+ Tree
- Logs/metrics/events? → LSM
No universal winner—just tradeoffs based on workload.
References
- PostgreSQL Docs - B+ Tree internals
- RocksDB Wiki - LSM implementation details
- Use The Index, Luke - database indexing fundamentals explained simply
- SQLite B-Tree — how SQLite actually stores data