libart++

The adaptive radix tree map for modern C++.
std::map's interface, with cost counted in the unit that actually matters: at most one cold cacheline per level, and levels bounded by your key's bytes — never by your data's size.

header-only C++20 zero dependencies allocator-aware · PMR fuzzed vs std::map fault-injection tested

Get started View on GitHub

// the interface you already know — at radix speed
#include <artpp/map.hpp>

artpp::map<std::string, uint64_t> m;   // takes string_view args: no temp strings
m.insert("alpha", 1);
m.try_emplace("beta", 2);                // present → untouched, nothing constructed

if (auto it = m.find(user_input); it != m.end())
    use(it->first, it->second);

for (const auto& [key, value] : m)       // ascending key order, always
    process(key, value);

Count cachelines, not comparisons

A lookup's real cost is its cold cacheline loads — every other instruction hides behind them. Big-O in comparisons flatters comparison trees; measured in cachelines, a B-tree lookup is O(log n) dependent misses. libart++ is engineered so each level of descent costs at most ONE cold line — and the level count is bounded by your key's bytes, not your data's size.

Headerless dispatch

The node's kind rides in the tagged pointer, so descent knows how to route before the node arrives: no header line load before the child slot load. One line per hop is the contract, not the goal.

Prefixes live in the header line

A compressed run is stored inside the cacheline the node already owns; the pointer tag hints when it must be read, and a prefix-free node never touches it. A separate prefix node exists only past 63 bytes — overflow, not the default. Path compression at zero extra lines.

Terminals cost almost nothing

Values ≤5 bytes pack inline in the handle bits — zero lines. A small leaf can live inside the routing setlist's own cacheline (no separate load at all). Anything else sits in a dedicated 16-byte-aligned terminal region — a leaf is pure payload, so it gets exactly what it needs, not a 128-byte line.

Levels collapse adaptively

Sparse byte levels fuse: wide stems consume two key bytes per line, buckets pack whole key tails into one 4-line page, dense spans direct-index. The tree spends lines where your data's entropy actually is.

Measured, not promised

One templated benchmark engine drives every contestant through identical code — same keys, same op sequence, same checksums — so there are no per-library benchmark deviations. Upstream libart is vendored; absl::btree_map and std::map use heterogeneous lookup. Reproduce with -DARTPP_BENCHMARKS=ON.

1.2–2.0×

faster point lookups than the original libart — on all four workloads (plain radix): 2.0× sequential, 1.37× uniform, 1.32× clustered, 1.18× dictionary

1.9–4.3×

faster point lookups than absl::btree_map, across all four workloads

2.1–7.6×

faster point lookups than std::map

1.2–1.6×

faster lower_bound than absl::btree_map on clustered & integer keys — an ordered query libart has no API for

Point lookups (the operation maps exist for)

Point lookup latency, dictionary words: artpp 89ns, buckets 77ns, libart 106ns, absl::btree_map 172ns, std::map 271ns

Point lookup latency, clustered strings: artpp 107ns, libart 142ns, btree 252ns, std::map 398ns

Point lookup latency, uniform uint64: artpp 31ns, libart 43ns, btree 91ns, std::map 173ns

Point lookup latency, sequential uint64: artpp 19ns, libart 39ns, btree 84ns, std::map 156ns

Every lookup costs the cachelines its descent touches — and artpp is built so each level of descent touches at most one cold line: the node kind is in the tagged pointer (no header read to route), a compressed run lives in that line instead of a separate node, and a small leaf can ride in the routing node's own line rather than a line of its own. Against libart that's 2.0× on sequential integers, 1.37× uniform, 1.32× clustered, and 1.18× on dictionary words; against the comparison-based maps, 1.9–4.3× over absl::btree_map and 2.1–7.6× over std::map.

`lower_bound` (the ordered query a hash map can't answer — and libart has no API for)

lower_bound, clustered strings: artpp 155ns, btree 232ns, std::map 261ns, libart unsupported

lower_bound, uniform uint64: artpp 68ns, btree 85ns, std::map 142ns, libart unsupported

Positioning to the next-greater key (predecessor/successor, range starts) is what an ordered map is for — and it's the line where the field thins out: a hash map can't do it at all, and upstream libart exposes no lower_bound either (shown as unsupported above). Among the containers that can, artpp leads: 1.5× faster than absl::btree_map on clustered strings, 1.24× on uniform and 1.6× on sequential integers, and on par on dictionary words — versus std::map, up to 2.1×. The descent is the same shape as a point lookup, so the same one-line-per-level economics apply.

Inserts (build time is allocator-bound — so compare the allocator)

Insert latency, clustered strings: artpp pool 144ns, std::allocator 168ns, libart 168ns, btree 241ns, std::map 345ns

Insert latency, uniform uint64: artpp pool 44ns, std::allocator 61ns, libart 54ns, btree 97ns, std::map 207ns

A bulk build is dominated by where the nodes come from, so the honest insert comparison holds the tree fixed and swaps the allocator. The flagship line_pool — a contiguous reservation handed out by a bump frontier with exact-size free lists, no per-node malloc — builds the same tree 1.1–1.4× faster than std::allocator, the allocator's contribution cleanly isolated (both rows are on every insert chart). Held to the same malloc class, artpp and libart build at similar rates: artpp leads on sequential and runs even on clustered, and libart's flatter node growth leads on the deeply-shared dictionary, where a radix does the most splitting — the up-front structuring that buys the lookup, scan, and ordered-iteration wins above. Net with the pool: builds beat absl::btree_map (1.7–2.8×) and std::map (2.4–7.2×) on every workload.

Ordered scans (bucket mode keeps pace with B-trees)

Ordered scan, dictionary words: bucket mode 2.6 ns/element, 2.2x faster than libart

Ordered scan, clustered strings: bucket mode 6.5 ns/element, faster than both libart and absl::btree_map

Bucket mode scans the packed (suffix, value) pages contiguously: 2.2× faster than libart on dictionary keys (2.6 vs 5.6 ns/element), and on clustered strings it edges absl::btree_map itself (6.5 vs 8.9 ns/element). For the absolute crown on dense integer keys, a B-tree's fully contiguous leaves lead — so pick bucket mode when a workload both looks up and scans, and the default radix when it's lookup-dominated.

Methodology: Apple M5, clang -O3, quiet machine, one process. The artpp::map rows run the library's flagship allocator — line_pool over an anonymous mapping (4-byte u32-index branch handles, bump placement laying children near parents); the std::allocator row shows zero-setup behavior. Workloads: 236k dictionary words, 885k clustered strings, 1M uniform & sequential uint64 (typed APIs; libart receives the identical big-endian bytes). Point-lookup latency is the fastest of five repetitions per contestant. Full data: results.csv · engine: bench/engine.hpp. A separate machine-calibrated ratio gate guards artpp-vs-libart regressions in CI fashion.

Built for the keys real systems use

Real keys are rarely a bare integer — they're strings, 32-byte hashes, UUIDs, composite tuples: bytes that don't fit in a CPU register. A radix descent reads only the few bytes that distinguish a key, touching a bounded number of cache lines however large the map or the key. Here are random 32-byte keys scaled from one million to one hundred million — artpp, upstream libart, unodb (the canonical C++ ART), the comparison trees, and a hash map.

32-byte-key point-lookup latency from 1M to 100M: artpp 52 to 90 ns, std::unordered_map 64 to 100, libart 66 to 118, unodb 210 to 352, absl::btree_map 505 to 1471, std::map 807 to 1966

Grow the map 100× and artpp's lookup barely moves — ~52 → 90 ns on these std::string keys — while absl::btree_map climbs 505 → 1,471 and std::map 807 → 1,966: each level is a full-key compare over cold cache lines, O(log n) of them. The radix reads only the bytes that distinguish a key, so it scales best of the field (1.75× across the growth vs 2.4–2.9× for the trees) and at 100M is 16× over absl::btree_map and 22× over std::map.

Variable-length (string) keys — fastest, ordered or not

Point lookups at 100M 32-byte string keys: artpp 90 ns, std::unordered_map 100, libart 118, unodb 352, absl::btree_map 1471, std::map 1966; art_map does not compile

lower_bound at 100M 32-byte string keys: artpp 159 ns, unodb 638, absl::btree_map 1489, std::map 2169; libart, unordered_map and art_map have no ordered query

Variable-length keys (names, paths, the dictionary words above, or the 32-byte strings here) are the radix's home turf: a hash must consume the whole key, so at 100M artpp (90 ns) edges even std::unordered_map (100 ns) while staying fully ordered — lower_bound in 159 ns vs 1,489 (btree) and 2,169 (std::map), where the hash map and libart answer no ordered query at all. unodb — the canonical C++ ART (art_map derives from it) — handles these via key_view but trails ~4× (352 ns; its iterator reconstructs the key).

Fixed-size inline keys — the fastest ordered map

Point lookups at 100M fixed std::array 32-byte keys: std::unordered_map 55 ns, artpp 67, libart 102, absl::btree_map 1057, std::map 1422; art_map does not compile

lower_bound at 100M fixed std::array 32-byte keys: artpp 165 ns, absl::btree_map 947, std::map 1078; unordered_map, libart, art_map have no ordered query

Store the same 32 bytes as a fixed std::array (a hash, UUID, or POD id — inline, no per-key allocation) and it's honest in the other direction: hashing a cheap inline key is hard to beat, so std::unordered_map wins pure point lookups (55 ns vs artpp 67) — but it answers nothing else. artpp is the fastest ordered map by a wide margin: hit 67 ns vs btree 1,057 and std::map 1,422 (16–21×), lower_bound 165 ns vs 947 / 1,078, and at 100M it even passes the B-tree on ordered scan. art_map would be faster still on a key this size — but its key must be ≤ 16 bytes, so a 32-byte key doesn't compile.

The C++ ART field, by capability

C++ ART	keys it handles	`lower_bound`	build	32-byte string hit @ 100M
artpp::map	any — strings, hashes, doubles, composites (via codec)	✓	header-only, no deps, ARM + x86	90 ns
unodb	fixed-width or `key_view`	✓ (scan / seek)	needs Boost; ARM + x86	352 ns
art_map	≤ 16-byte fixed-width only	✓	needs Boost; x86-only (our PR adds ARM)	— won't compile
libart (C)	byte strings	✗ (scan only)	C, no deps, portable	118 ns

Bottom line, stated straight: a hash map is faster for pure point lookups on fixed-size keys, and a register-specialized ART (art_map) faster still on ≤16-byte integers — but the moment keys are variable-length, or you need lower_bound / ordered scans / ranges, artpp is the fastest option, and the only one that's header-only, dependency-free, portable, and keys anything.

Methodology: Apple M5 (128 GB), clang -O3, idle machine, one process; each container built and measured in isolation. Random insertion; point ops are the fastest of repeated passes. The two regimes differ only in key storage — std::map/btree/unordered_map hold std::string (heap) vs std::array<unsigned char,32> (inline, unsigned ⇒ same byte order); artpp/libart/unodb key on the identical bytes either way. artpp uses line_pool to ~50M 32-byte keys then std::allocator (no 4 GB terminal cap) to 100M — lookups within noise between them. unodb's scan reconstructs the key (not compared on scan). Source: bench_buffer.cpp + bench_buffer_array.cpp / bench_buffer_unodb.cpp.

The "++" is earned: vs the original libart

libart++ is measured head-to-head against armon/libart — the canonical C ART — in every chart on this page, and holds a dedicated, thermally-fair ratio gate against it (contestants interleaved per rep, baseline-tracked). The structural edge: libart loads each node's header before it can route; artpp routes from the tagged pointer itself.

	libart (C)	libart++ (`artpp::map`)
Point lookups	the bar to beat	faster on all four workloads, plain radix: 2.0× sequential, 1.37× uniform, 1.32× clustered, 1.18× dictionary
Small leaves	a malloc'd node per key	a leaf that fits goes inside the routing node's cacheline — the byte that selects it and the value it yields are one line, so the lookup ends without a second load (the 1-byte `inl` header flags which children are inline)
Leaf storage	one `malloc` per leaf	a dedicated 16-byte-granular terminal region — the tag carries the 4 bits a terminal pointer needs, so a leaf is sized to its payload instead of rounded up to a node-sized block
Path compression	per-node prefix read with the header on every hop	prefix fused into the header line, read only when the pointer tag says one exists; prefix-free nodes route without touching CL0 at all
Values	`void*` — you own the lifetime, casts everywhere	real typed `T`: constructed in place, destroyed on erase, moved on restructure; small ones live inline in the pointer
Keys	raw byte pointers + lengths	`string`/`string_view`/integrals/byte vectors built in; doubles & reflected structs via psio — numeric ordering for free
Memory	`malloc`/`free`, fixed	any standard allocator (PMR included); 4-byte indexed handles; mmap file-backed pools; cross-pool moves as one `memcpy`
Iteration	forward callback (`art_iter`)	bidirectional iterators, `lower_bound`/`upper_bound`, reverse ranges, `erase(iterator)`, range-`for`
Safety	C: no exceptions, manual cleanup	basic/strong guarantees, fault-injection tested at every allocation point — zero leaks, no double-destroys
Structure	one fixed node ladder (4/16/48/256)	selectable modes per workload: `buckets` packs key tails into cacheline pages (the fastest dictionary lookups here), `dense_tiers` and `wide_stems` trade footprint vs hop depth — same semantics, same tests

Engineered like a standard library component

The goal is a container you could submit for review: exact contracts, exhaustive tests, honest documentation.

Any key type

Strings, views, byte vectors, all integrals (numeric iteration order) built in. With psio: doubles, reflected structs, optionals, variants — one macro per type.

Structural modes

mode::buckets packs sparse key tails into cacheline pages (best dictionary numbers above); dense_tiers and wide_stems trade footprint vs hop depth. Same semantics, tested identically.

Allocators, all the way down

Every node and value goes through your allocator — PMR works. Index-based pool handles shrink branches to 4 bytes, back the map with an mmap'd file, and make cross-pool moves a single memcpy.

Whole-tree ops use structure

Equality compares node-by-node (memcmp'd edges, no key materialization); cross-allocator moves clone node-by-node — never key-by-key rebuilds.

Exception-safety, proven

Every allocation point is fault-swept with throwing values and failing allocators; leak, corruption, and double-destroy counters assert against a std::map oracle.

Bidirectional, bounded, complete

lower_bound / upper_bound / equal_range, reverse iteration, --end(), erase(iterator), structural ==, max_key_bytes guarded with length_error.

Thirty seconds of code

Integral keys iterate in numeric order

artpp::map<int64_t, std::string> m;
m.insert(-5, "negative");    // big-endian, sign-flipped
m.insert( 7, "positive");    // encoding: byte order ==
                              // numeric order, free
for (const auto& [k, v] : m)  // → -5, then 7
    log(k, v);

Floating-point keys do too — via psio

ARTPP_PSIO_KEY(double)        // order-preserving codec, one macro

artpp::map<double, std::string> m;
m.insert(-3.5, "a");          // IEEE sign-transform: byte
m.insert( 0.0, "b");          // order becomes numeric
m.insert( 2.25, "c");         // order — negatives and all
for (const auto& [k, v] : m) // → -3.5, 0.0, 2.25
    log(k, v);
// reflected structs, optionals, variants key the same way

File-backed, 4-byte branches

artpp::line_pool pool("store.bin");  // mmap'd file
using A = artpp::pool_alloc<uint64_t>;
artpp::map<std::string_view, uint64_t,
           artpp::mode::none, A> t{A{&pool}};
// branches cost 4 bytes; the whole store is one
// relocatable range — moves are a memcpy

Pick the structure per workload

// dictionary-shaped keys? pack the tails:
artpp::map<std::string_view, uint64_t,
           artpp::mode::buckets> dict;
// 77 ns dictionary lookups — 3.5x faster
// than std::map; scans rival absl::btree

Get started in one minute

Header-only. Copy include/artpp, or:

# CMake ≥ 3.20
FetchContent_Declare(artpp GIT_REPOSITORY https://github.com/gofractally/libartpp.git)
FetchContent_MakeAvailable(artpp)
target_link_libraries(your_target PRIVATE artpp::artpp)

# or hack on it directly
git clone https://github.com/gofractally/libartpp && cd libartpp
cmake -B build && cmake --build build -j && ctest --test-dir build
cmake -B build -DARTPP_BENCHMARKS=ON      # the compare suite from this page

Requirements: a C++20 compiler, little-endian target (ARM64/x86-64; NEON and SSE2 paths included). Nothing else. Full API reference and contracts: the README.