The Problem¶
Some workloads need to read state, compute on it, and write it back -- thousands of times per second -- while other threads query the same data concurrently. Blockchain nodes, financial ledgers, game state servers, and event sourcing systems all share this pattern: every state transition is a read-modify-write transaction against a large, actively-queried dataset.
No existing storage engine handles this well. B-trees copy entire pages on each mutation. LSM trees defer write cost into compaction stalls that worsen at scale. Neither provides instant, zero-cost snapshots that don't degrade write throughput.
Every persistent key-value store makes the same set of tradeoffs:
- B-trees (LMDB, BoltDB, SQLite, InnoDB) operate on fixed-size pages (4KB-16KB). COW-based B-trees (LMDB, BoltDB) copy entire pages on mutation for snapshot isolation and crash safety -- changing a single byte means copying 4KB-16KB. WAL-based B-trees (InnoDB, SQLite) avoid COW but add write-ahead logging overhead. In both cases, write amplification is proportional to page size, not data size.
- LSM trees (RocksDB, LevelDB, Cassandra) batch writes for throughput but pay for it with read amplification, compaction stalls, and space amplification from tombstones.
- Adaptive radix trees (ART) achieve optimal depth and cache-line efficiency, but only in memory. No persistent ART implementation exists that matches B-tree durability.
PsiTri eliminates these tradeoffs by combining four novel subsystems:
- A radix/B-tree hybrid with sorted leaf nodes and node-level copy-on-write
- An ART-buffered write-ahead log (DWAL) that decouples write latency from COW cost -- writes land in a lock-free in-memory ART buffer, background merge drains to the COW trie in batch
- A relocatable object allocator with lock-free O(1) compaction moves
- A self-tuning physical data layout that sorts objects by access frequency
Comparison with Alternatives¶
| System | Insert/sec (persistent) | Depth (30M keys) | Persistent | Copy-on-Write | Node Size |
|---|---|---|---|---|---|
| PsiTri | 1.1-3.7M | 5 | Yes | Yes | Per-node (64 B multiples) |
| LMDB | 0.3-0.8M | 3-4 | Yes | Yes | 4KB pages |
| RocksDB | 0.5-1.5M | N/A (LSM) | Yes | No | variable |
| ART (in-memory) | 5-10M | 5-8 | No | No | 52-2048B |
| SQLite | 0.1-0.5M | 3-4 | Yes | No | 4KB pages |
Where Existing Approaches Fall Short¶
B-Trees: Page-Level Write Amplification¶
B-trees are the workhorse of persistent storage. But their unit of work is a page -- typically 4KB. When you change one key, copy-on-write forces copying 4KB at every level of the tree. For a 4-level B-tree, that's 16KB of writes for a single byte change.
This is a fundamental consequence of how B-trees organize data: keys are stored in sorted order within pages, and pages are the unit of I/O, locking, and copy-on-write. You can't copy-on-write part of a page -- the whole thing must be duplicated.
LSM Trees: Compaction Tax¶
LSM trees (RocksDB, LevelDB, Cassandra) solve the write amplification problem by batching writes into sorted runs. But they pay for it elsewhere:
- Read amplification: Lookups must check the memtable, immutable memtables, and multiple SSTable levels
- Compaction stalls: Background compaction can cause latency spikes when it falls behind write throughput
- Space amplification: Tombstones and duplicate keys across levels consume extra storage
- Tuning complexity: Level sizes, compaction triggers, and bloom filter configurations require careful tuning per workload
In-Memory Tries: Not Persistent¶
Adaptive radix tries (ART) achieve excellent cache-line efficiency and O(key_length) lookups. But existing implementations are memory-only. No production ART supports:
- Crash-safe persistence
- Copy-on-write for MVCC
- Background compaction for space reclamation
- Concurrent readers with snapshot isolation
PsiTri's Design Point¶
PsiTri occupies a region of the design space that was previously empty:
| Capability | B-tree | LSM-tree | In-memory ART | PsiTri |
|---|---|---|---|---|
| Persistent + crash-safe | Yes | Yes | No | Yes |
| Copy-on-write granularity | 4KB page | N/A | N/A | Per-node (64 B multiples) |
| Online compaction | No (offline copy) | Background (file-level) | N/A | Background (object-level) |
| Object relocation cost | Cannot | File rewrite | N/A | memcpy + 1 atomic CAS |
| Cache management | OS or buffer pool | OS or buffer pool | N/A | Self-tuning MFU with physical relocation |
| Range count | O(k) | O(k) | O(k) | O(log n) |
| Range delete | O(k) | O(k) tombstones | O(k) | O(log n) |
| Write amplification | High (page-level) | High (compaction) | N/A | ~2.3KB/mutation (node-level COW) |
| Composable hierarchical data | Flattened + joins | Embedded blobs | N/A | Native subtrees with O(1) operations |