Concurrent Writers¶
PsiTri supports multiple writer sessions operating simultaneously on independent trees with near-zero contention. This page explains why writers on different roots never block each other, and does a deep dive into the zone-striped ID allocator that makes it possible.
Why Writers Don't Conflict¶
Every resource a writer touches is partitioned so that independent writers never compete for the same thing:
Per-Root Locking¶
The database has 512 independent roots at the PsiTri layer, each protected by its own mutex. The underlying SAL allocator has 1,024 root object slots with their own mutexes:
std::mutex _modify_lock[512]; // PsiTri layer: one per top-level root
std::array<std::mutex, 1024> _write_mutex; // SAL layer: one per root object
Writer on root 0 locks _modify_lock[0]. Writer on root 1 locks _modify_lock[1]. They never touch the same mutex.
Per-Session Segments¶
Each write session gets its own 32 MB segment. Data allocation is a bump pointer -- _alloc_pos += size -- with no atomics, no locks, no CAS:
Writer A: [segment 7 ........] <-- bump allocate here
Writer B: [segment 12 ......] <-- bump allocate here, independently
No two sessions ever write to the same segment. When a session fills its segment, a background provider hands it a fresh one from a pre-allocated pool -- writers never block on file I/O or mmap.
Per-Session Release Queues¶
When a writer frees objects (e.g., the old version of a node after copy-on-write), it pushes them into its own 32K-entry circular buffer. No contention. The compactor drains all queues asynchronously in the background.
Atomic Root Swap¶
Committing a transaction is a single std::atomic::exchange() on _root_objects[root_index], followed by releasing the per-root mutex. Other roots are completely unaffected.
The Full Picture¶
Two concurrent writers on different roots:
| Step | Writer A (Root 0) | Writer B (Root 1) |
|---|---|---|
| Lock | _modify_lock[0] |
_modify_lock[1] |
| Allocate data | Bump in segment 7 | Bump in segment 12 |
| Allocate IDs | Random CAS in zone bitmap | Random CAS in zone bitmap |
| Free old nodes | Push to _release_queue[0] |
Push to _release_queue[1] |
| Commit | Atomic swap _root_objects[0] |
Atomic swap _root_objects[1] |
The only shared resource is the zone-striped ID allocator -- and as described below, it's designed so that contention is vanishingly rare even under heavy concurrent allocation.
| Operation | Traditional (B-tree/LSM) | SAL |
|---|---|---|
| Data allocation | Page latch or WAL lock | Per-session bump (zero contention) |
| ID allocation | Global free-list mutex | Zone-striped random CAS |
| Object release | Shared free-list lock | Per-session queue (zero contention) |
| Capacity expansion | Inline (blocks writer) | Background provider (non-blocking) |
Zone-Striped ID Allocation¶
The zone-striped allocator is the most innovative piece of the concurrency design. It manages ~4 billion 32-bit ptr_address IDs across a memory-mapped free bitmap, with concurrent allocation, automatic growth, balanced distribution, and SIMD-accelerated free-space scanning -- all without locks.
Structure¶
The 4-billion-entry address space is divided into zones of 4,194,304 IDs each (2^22). Each zone has:
- A free bitmap: 65,536 atomic
uint64_twords, each tracking 64 IDs (1 = free, 0 = allocated) - An allocation counter: atomic
uint32_ttracking how many IDs are in use in this zone - 32 MB of control block data: the actual 8-byte control blocks, memory-mapped contiguously
Zone size (32 MB) deliberately matches segment size (32 MB), making address-to-zone mapping a single division.
Three memory-mapped files back the allocator:
| File | Contents | Size per zone |
|---|---|---|
zone.bin |
Control block data (8 bytes each) | 32 MB |
free_list.bin |
Free bitmap (1 bit per ID) | 512 KB |
header.bin |
Per-zone counters, global metadata | shared |
The Allocation Algorithm¶
The algorithm has two phases: a speculative scan (non-atomic, SIMD-accelerated) to find the best candidate, then a targeted CAS (atomic) to claim it.
Phase 1: Find the best candidate (speculative)
1. Pick the least-filled zone (min_alloc_zone)
2. Generate a random starting offset in the zone's free bitmap
3. Round down to a 64-byte boundary (8 uint64_t = 512 bits)
4. Load all 512 bits (speculatively, without atomics)
5. SIMD popcount: find the byte with the most set (free) bits
The SIMD popcount (max_pop_cnt8_index64) is the key optimization. It examines 512 free-list bits -- representing 512 potential IDs -- in a single pass, using platform-specific SIMD:
- ARM NEON:
vcntq_u8computes byte popcount,vmaxvq_u8finds the horizontal maximum, comparison +move_mask_neonlocates the winning byte - SSE/SSSE3:
PSHUFBnibble-lookup popcount,_mm_max_epu8for horizontal max,_mm_movemask_epi8to extract the result - Scalar fallback: loop over 64 bytes with
std::popcount
This scan is deliberately non-atomic -- it reads 64 bytes of the free bitmap as raw memory. The result is a heuristic: it tells us which 8-bit region has the most free slots. A stale or torn read just means we pick a slightly suboptimal byte -- the actual claim is always done with an atomic CAS, so correctness is never at risk.
Phase 2: Claim via atomic CAS (targeted)
6. Compute the ptr_address of the best candidate byte
7. Load the atomic uint64_t containing that byte
8. Mask to the 16 bits corresponding to the target cacheline
9. countr_zero to find the first free bit within the mask
10. CAS to flip that bit from 1 (free) to 0 (allocated)
11. On CAS failure: reload and retry within the same 16-bit window
12. If window exhausted: go back to phase 1 with a new random start
The masking step is critical. Each uint64_t in the free bitmap covers 64 IDs. A CPU cacheline in the control block array holds 8 control blocks (8 x 8 bytes = 64 bytes). The allocator masks the uint64_t to a 16-bit window aligned to the target cacheline boundary (0xffffull << base_offset), ensuring that the allocated ID shares a cacheline with the hint address.
Why Contention Is Vanishingly Rare¶
Two threads contend only if they:
- Pick the same zone (likely if one zone is least-filled, but zones hold 4M IDs)
- Pick the same random starting cacheline out of 8,192 cachelines per zone
- Try to CAS the same
uint64_tword
The probability of step 2 alone is 1/8,192 per attempt. And even when two threads do collide on the same word, the CAS loop resolves it instantly -- the loser reloads and picks the next free bit in the same word, which is still a single atomic operation.
At the 50% fill target, each 512-bit scan finds ~256 free bits. The chance of scanning 512 bits and finding zero free is astronomically low at normal fill levels. Even at 99% fill, the probability that all 512 scanned bits are taken is ~0.5%, and the retry just picks a new random starting point.
Growth and Balanced Distribution¶
The allocator grows and rebalances automatically with no configuration:
Growth trigger: When the average allocations per zone exceeds 50%, a new zone is allocated. New zones are initialized with all bitmap bits set to 1 (free):
The 50% target is not arbitrary -- it ensures the SIMD scan almost always finds free bits on the first try (512 bits at 50% fill = ~256 free bits per scan), while avoiding excessive memory waste.
Balanced steering via min_alloc_zone: The allocator tracks which zone has the fewest allocations and steers new allocations there. This is maintained incrementally with two rules:
- On allocation: If we just allocated from the current
min_alloc_zoneand it's now above the per-zone average, rescan all zones to find the new minimum - On deallocation: If the zone we freed from now has fewer allocations than the current minimum, it becomes the new minimum
// On allocation (simplified):
if (this_zone == min_alloc_zone && this_zone_count >= average)
min_alloc_zone = scan_for_minimum();
// On deallocation (simplified):
if (this_zone_count < min_alloc_zone_count)
min_alloc_zone = this_zone;
This creates a self-balancing feedback loop:
- A fresh zone is added with zero allocations -- it becomes
min_alloc_zone - All threads steer toward it, filling it up
- Once it passes the average, the allocator rescans and steers to the next least-filled zone
- Over time, all zones converge toward equal fill levels
The rescan (update_min_zone) is O(num_zones) but triggers infrequently -- only when the current minimum zone crosses the average. With 1024 max zones, this is a scan of 1024 atomic loads, which is negligible compared to the millions of allocations between triggers.
Deallocation feedback: When objects are freed (via the compactor draining release queues), dec_alloc_count updates the per-zone counter. If the freed zone drops below the current minimum, it becomes the new steering target. This means zones that experience heavy churn naturally attract new allocations, keeping fill levels balanced even under skewed workloads.
Hint-Based Allocation for Cacheline Co-Location¶
The most powerful feature of the allocator is hint-based allocation, which serves a dual purpose: allocating an ID and engineering cache locality.
When a tree node allocates a child, it passes the ptr_address values of existing siblings as hints. The allocator tries each hint in order:
allocation alloc(const alloc_hint& hint) {
// Try each sibling's cacheline first
for (addr : hint) {
if (auto result = try_alloc(addr))
return *result;
}
// Fall back to random allocation in least-filled zone
return alloc();
}
For each hint, try_alloc(ptr_address):
- Rounds the hint down to a 16-element cacheline boundary
- Masks the free bitmap to just the 16 bits on that cacheline
- CAS-claims the first free bit within the mask
If the cacheline is full, it tries the next hint. If all hints fail, it falls back to the random SIMD-guided path. The fallback deliberately avoids the hinted cachelines -- it picks from the least-filled zones, reducing the chance of stealing a slot that a future hint-based allocation might want.
Prior Art Comparison¶
No existing allocator combines all of these properties:
| System | Free tracking | Concurrency | Locality control | Growth |
|---|---|---|---|---|
| glibc malloc | Free lists per size class | Per-arena locks | None | mmap/sbrk |
| jemalloc | Bitmap + free lists | Per-thread caches | Size-class proximity | Extent allocation |
| tcmalloc | Free lists + spans | Per-thread + per-CPU | Size-class only | Page heap |
| Linux buddy allocator | Bitmap per order | Zone locks | Power-of-2 only | Cannot grow |
| PostgreSQL | Free space map (FSM) | Page-level locks | None | Table extension |
| LMDB | B-tree of free pages | Single writer only | None | File growth |
| SAL zone allocator | Atomic bitmap | Lock-free CAS | Hint-based cacheline | Auto-balanced zones |
The closest prior art is jemalloc's bitmap allocator within slabs, which also uses bitmaps for sub-page allocation. But jemalloc's bitmaps are per-size-class and don't support caller-directed placement hints. The combination of SIMD-guided speculative scanning, hint-based cacheline co-location, and automatic zone rebalancing is unique to SAL.
Traditional database allocators (PostgreSQL FSM, LMDB free-page B-tree) operate at page granularity and use locks or single-writer designs. None support the concept of "allocate near this sibling" because pages don't have the same co-location opportunity that a flat control block array provides.
SIMD Popcount: Finding the Freest Region¶
The max_pop_cnt8_index64 function is the engine that drives the speculative scan. Given 64 bytes of free bitmap data (512 potential IDs), it returns the index of the byte with the most set bits -- the densest free region.
ARM NEON implementation:
1. vld1q_u8_x4: Load all 64 bytes into 4 NEON registers
2. vcntq_u8: Byte-level popcount (hardware instruction, 1 cycle)
3. vmaxq_u8: Pairwise max across all 4 registers
4. vmaxvq_u8: Horizontal max across single register -> scalar max value
5. vceqq_u8: Compare all 64 popcounts against max -> match mask
6. move_mask_neon: Extract 1 bit per byte into a uint64_t
7. countr_zero: First set bit = index of winning byte
SSE/SSSE3 implementation:
1. _mm_load_si128 x4: Load 64 bytes into 4 SSE registers
2. PSHUFB lookup: Nibble-based popcount via 16-byte lookup table
3. _mm_max_epu8: Pairwise max, then horizontal reduction via shifts
4. _mm_cmpeq_epi8: Compare all bytes against max
5. _mm_movemask_epi8: Extract 1 bit per byte into uint32_t
6. Combine 4 masks into uint64_t, countr_zero for first match
The PSHUFB trick splits each byte into two 4-bit nibbles, looks up each nibble's popcount in a 16-entry table stored in an SSE register, and adds the results. This computes 16 byte popcounts in 5 instructions.
Both implementations complete in ~10-15 cycles for 512 bits -- fast enough that the speculative scan adds negligible overhead to the allocation path.