Recovery¶
PsiTri's core COW engine has no write-ahead log -- recovery is built into the data layout itself. When using the DWAL layer (the recommended write path), an additional WAL-based recovery step replays any committed transactions that hadn't yet been merged into the COW trie. See the DWAL recovery section for details on WAL replay.
This page covers the base-layer recovery of the COW trie and segment allocator, which applies regardless of whether DWAL is used.
Why Control Blocks Can't Survive a Crash¶
Control blocks track two things: where each object lives (location) and how many references point to it (ref count). Both become untrustworthy after a crash:
- Reference counts include in-flight stack references. Every
smart_ptron the stack holds a reference. When the process crashes, those destructors never run -- ref counts are left permanently inflated. - Control blocks are pinned to RAM to avoid SSD wear. They change on every retain, release, and relocation -- far too frequently to flush to disk. They are volatile by design.
On clean shutdown, a clean_shutdown flag is set so control blocks can be reloaded as-is. On crash, they must be rebuilt from segments.
Recovery Modes¶
The caller tells the database what kind of failure occurred via recovery_mode:
| Mode | When to use | What it does |
|---|---|---|
none |
Clean shutdown detected | No recovery needed |
deferred_cleanup |
App crash, fast restart | Mark ref counts stale, defer leak reclamation |
app_crash |
App crash, full cleanup | Reset ref counts, reclaim leaked memory |
power_loss |
OS or hardware crash | Validate segments, rebuild control blocks and roots |
full_verify |
Suspected corruption | Deep checksum verification of all objects |
If no mode is specified and the clean_shutdown flag is unset, the database defaults to deferred_cleanup -- the tree is consistent, so the only cost is leaked objects occupying space until the user explicitly runs recovery.
How Recovery Works¶
The segments -- append-only, ordered by allocation sequence -- are the durable source of truth.
Root Pointer Sources¶
- The roots file -- memory-mapped, synced at each commit. Primary source of root pointers.
- Sync headers -- 64-byte records at segment sync boundaries with backward chain, timestamp, XXH3 checksum, and optional root info. Fallback when the roots file is corrupt.
App-Crash Recovery (Common Case)¶
Phase 1: Rebuild locations. Clear all control block metadata. Sort segments newest-to-oldest by _provider_sequence. Scan each segment's objects sequentially. For each object ID:
- If unseen: record its location, set ref count to 1
- If already mapped from a newer segment: skip (newer copy wins)
- If in the same segment: update (later offset = newer)
Phase 2: Retain reachable nodes. Starting from all valid roots (from roots file or sync headers), recursively traverse using visit_branches(). Each reachable node gets retain(), bumping ref count to 2+.
Phase 3: Release unreachable. Decrement every ref count by 1. Objects that drop to 0 are freed. Reachable objects settle to their correct ref counts.
Power-Loss Recovery¶
Adds segment validation before the three phases above: validate sync header checksums to identify the last trustworthy sync boundary. Data beyond the last valid sync header may be torn; data before it is guaranteed consistent.
Design Considerations¶
These are design notes, not fully implemented features
Determining Most Recent Copy¶
Given two segments with a copy of a node with the same ID, recovery must correctly identify which one is most recent:
- Within a segment, later offset = newer
- Each segment tracks the session that allocated it
- Segment timestamps provide ordering between sessions
- A 16-bit allocation sequence number in the node header (
ptr_address_seq::sequence) disambiguates across overlapping segments
Worst Case Scenario¶
- Thread A uses ID1 in Segment 1
- Compactor moves it to Segment 2
- Thread B re-allocs ID1 after A releases it in Segment 3
Three copies of the same ID in three segments. The allocation sequence number provides a total order.
Segment Header¶
Each segment stores:
- Time the segment was popped from the allocation queue
- Time the segment was finalized after being filled
- Size-weighted average time of the data within
- Session number that allocated it
- Source segment info for compacted data