libart++
The adaptive radix tree map for modern C++.
std::map's interface, with cost counted in the unit that actually
matters: at most one cold cacheline per level, and levels bounded by your
key's bytes — never by your data's size.
header-only
C++20
zero dependencies
allocator-aware · PMR
fuzzed vs std::map
fault-injection tested
// the interface you already know — at radix speed
#include <artpp/map.hpp>
artpp::map<std::string, uint64_t> m; // takes string_view args: no temp strings
m.insert("alpha", 1);
m.try_emplace("beta", 2); // present → untouched, nothing constructed
if (auto it = m.find(user_input); it != m.end())
use(it->first, it->second);
for (const auto& [key, value] : m) // ascending key order, always
process(key, value);
Count cachelines, not comparisons
A lookup's real cost is its cold cacheline loads — every other
instruction hides behind them. Big-O in comparisons flatters comparison trees;
measured in cachelines, a B-tree lookup is O(log n) dependent misses.
libart++ is engineered so each level of descent costs at most ONE cold line —
and the level count is bounded by your key's bytes, not your data's size.
Headerless dispatch
The node's kind rides in the tagged pointer, so descent knows how to route
before the node arrives: no header line load before the child slot load. One line
per hop is the contract, not the goal.
Prefixes live in the header line
A compressed run is stored inside the cacheline the node already owns; the
pointer tag hints when it must be read, and a prefix-free node never touches it. A
separate prefix node exists only past 63 bytes — overflow, not the default. Path
compression at zero extra lines.
Terminals cost almost nothing
Values ≤5 bytes pack inline in the handle bits — zero lines. A small leaf
can live inside the routing setlist's own cacheline (no separate load at all).
Anything else sits in a dedicated 16-byte-aligned terminal region — a leaf is
pure payload, so it gets exactly what it needs, not a 128-byte line.
Levels collapse adaptively
Sparse byte levels fuse: wide stems consume two key bytes per line, buckets pack
whole key tails into one 4-line page, dense spans direct-index. The tree spends
lines where your data's entropy actually is.
Measured, not promised
One templated benchmark engine drives every contestant through
identical code — same keys, same op sequence, same checksums — so there are no
per-library benchmark deviations. Upstream
libart is vendored;
absl::btree_map and std::map use heterogeneous lookup.
Reproduce with -DARTPP_BENCHMARKS=ON.
1.2–2.0×
faster point lookups than the original libart — on all four workloads (plain radix): 2.0× sequential, 1.37× uniform, 1.32× clustered, 1.18× dictionary
1.9–4.3×
faster point lookups than absl::btree_map, across all four workloads
2.1–7.6×
faster point lookups than std::map
1.2–1.6×
faster lower_bound than absl::btree_map on clustered & integer keys — an ordered query libart has no API for
Point lookups (the operation maps exist for)
Every lookup costs the cachelines its descent touches — and artpp is built so each
level of descent touches at most one cold line: the node kind is in the tagged
pointer (no header read to route), a compressed run lives in that line instead of a
separate node, and a small leaf can ride in the routing node's own line rather than
a line of its own. Against libart that's 2.0× on sequential integers,
1.37× uniform, 1.32× clustered, and 1.18× on dictionary words;
against the comparison-based maps, 1.9–4.3× over absl::btree_map and
2.1–7.6× over std::map.
lower_bound (the ordered query a hash map can't answer — and libart has no API for)
Positioning to the next-greater key (predecessor/successor, range starts) is what
an ordered map is for — and it's the line where the field thins out: a hash
map can't do it at all, and upstream libart exposes no lower_bound
either (shown as unsupported above). Among the containers that can, artpp leads:
1.5× faster than absl::btree_map on clustered strings, 1.24× on
uniform and 1.6× on sequential integers, and on par on dictionary words — versus
std::map, up to 2.1×. The descent is the same shape as a point lookup,
so the same one-line-per-level economics apply.
Inserts (build time is allocator-bound — so compare the allocator)
A bulk build is dominated by where the nodes come from, so the honest insert
comparison holds the tree fixed and swaps the allocator. The flagship
line_pool — a contiguous reservation handed out by a bump frontier with
exact-size free lists, no per-node malloc — builds the same tree
1.1–1.4× faster than std::allocator, the allocator's
contribution cleanly isolated (both rows are on every insert chart).
Held to the same malloc class, artpp and libart build at similar rates:
artpp leads on sequential and runs even on clustered, and libart's flatter node
growth leads on the deeply-shared dictionary, where a radix does the most splitting
— the up-front structuring that buys the lookup, scan, and ordered-iteration wins
above. Net with the pool: builds beat absl::btree_map (1.7–2.8×) and
std::map (2.4–7.2×) on every workload.
Ordered scans (bucket mode keeps pace with B-trees)
Bucket mode scans the packed (suffix, value) pages contiguously: 2.2×
faster than libart on dictionary keys (2.6 vs 5.6 ns/element), and on clustered
strings it edges absl::btree_map itself (6.5 vs 8.9 ns/element). For the
absolute crown on dense integer keys, a B-tree's fully contiguous leaves lead — so
pick bucket mode when a workload both looks up and scans, and the default
radix when it's lookup-dominated.
Methodology: Apple M5, clang -O3, quiet machine, one process. The
artpp::map rows run the library's flagship allocator —
line_pool over an anonymous mapping (4-byte u32-index branch
handles, bump placement laying children near parents); the
std::allocator row shows zero-setup behavior. Workloads: 236k
dictionary words, 885k clustered strings, 1M uniform & sequential
uint64 (typed APIs; libart receives the identical big-endian bytes).
Point-lookup latency is the fastest of five repetitions per contestant. Full data:
results.csv · engine:
bench/engine.hpp.
A separate machine-calibrated ratio gate
guards artpp-vs-libart regressions in CI fashion.
Built for the keys real systems use
Real keys are rarely a bare integer — they're strings, 32-byte hashes, UUIDs,
composite tuples: bytes that don't fit in a CPU register. A radix descent reads only the few
bytes that distinguish a key, touching a bounded number of cache lines however
large the map or the key. Here are random 32-byte keys scaled from one million to
one hundred million — artpp, upstream libart, unodb (the canonical C++ ART), the
comparison trees, and a hash map.
Grow the map 100× and artpp's lookup barely moves — ~52 → 90 ns on these
std::string keys — while absl::btree_map climbs 505 → 1,471 and
std::map 807 → 1,966: each level is a full-key compare over cold cache lines,
O(log n) of them. The radix reads only the bytes that distinguish a key, so it scales best of
the field (1.75× across the growth vs 2.4–2.9× for the trees) and at 100M is 16× over
absl::btree_map and 22× over std::map.
Variable-length (string) keys — fastest, ordered or not
Variable-length keys (names, paths, the dictionary words above, or
the 32-byte strings here) are the radix's home turf: a hash must consume the whole key,
so at 100M artpp (90 ns) edges even std::unordered_map (100 ns) while
staying fully ordered — lower_bound in 159 ns vs 1,489
(btree) and 2,169 (std::map), where the hash map and libart answer no
ordered query at all. unodb — the canonical C++ ART
(art_map derives from it) — handles these
via key_view but trails ~4× (352 ns; its iterator reconstructs the key).
Fixed-size inline keys — the fastest ordered map
Store the same 32 bytes as a fixed std::array (a hash, UUID, or POD id — inline,
no per-key allocation) and it's honest in the other direction: hashing a cheap inline key is
hard to beat, so std::unordered_map wins pure point lookups (55 ns vs
artpp 67) — but it answers nothing else. artpp is the fastest ordered map by a wide
margin: hit 67 ns vs btree 1,057 and std::map 1,422 (16–21×),
lower_bound 165 ns vs 947 / 1,078, and at 100M it even passes the B-tree on ordered
scan. art_map would be faster still on a key this size — but its key must be ≤ 16 bytes, so a
32-byte key doesn't compile.
The C++ ART field, by capability
| C++ ART | keys it handles | lower_bound | build | 32-byte string hit @ 100M |
| artpp::map | any — strings, hashes, doubles, composites (via codec) | ✓ | header-only, no deps, ARM + x86 | 90 ns |
| unodb | fixed-width or key_view | ✓ (scan / seek) | needs Boost; ARM + x86 | 352 ns |
| art_map | ≤ 16-byte fixed-width only | ✓ | needs Boost; x86-only (our PR adds ARM) | — won't compile |
| libart (C) | byte strings | ✗ (scan only) | C, no deps, portable | 118 ns |
Bottom line, stated straight: a hash map is faster for pure point lookups on fixed-size
keys, and a register-specialized ART (art_map) faster still on ≤16-byte integers — but the
moment keys are variable-length, or you need lower_bound / ordered scans / ranges,
artpp is the fastest option, and the only one that's header-only, dependency-free, portable, and
keys anything.
Methodology: Apple M5 (128 GB), clang -O3, idle machine, one process; each container built and
measured in isolation. Random insertion; point ops are the fastest of repeated passes. The two
regimes differ only in key storage — std::map/btree/unordered_map
hold std::string (heap) vs std::array<unsigned char,32> (inline,
unsigned ⇒ same byte order); artpp/libart/unodb key on the identical bytes either way.
artpp uses line_pool to ~50M 32-byte keys then std::allocator
(no 4 GB terminal cap) to 100M — lookups within noise between them. unodb's scan
reconstructs the key (not compared on scan). Source:
bench_buffer.cpp
+ bench_buffer_array.cpp / bench_buffer_unodb.cpp.
The "++" is earned: vs the original libart
libart++ is measured head-to-head against
armon/libart — the canonical C ART —
in every chart on this page, and holds a dedicated, thermally-fair
ratio gate
against it (contestants interleaved per rep, baseline-tracked). The structural edge:
libart loads each node's header before it can route; artpp routes from the tagged
pointer itself.
| libart (C) | libart++ (artpp::map) |
| Point lookups | the bar to beat |
faster on all four workloads, plain radix: 2.0× sequential,
1.37× uniform, 1.32× clustered, 1.18× dictionary |
| Small leaves | a malloc'd node per key |
a leaf that fits goes inside the routing node's cacheline — the byte
that selects it and the value it yields are one line, so the lookup ends without
a second load (the 1-byte inl header flags which children are inline) |
| Leaf storage | one malloc per leaf |
a dedicated 16-byte-granular terminal region — the tag carries the
4 bits a terminal pointer needs, so a leaf is sized to its payload instead of
rounded up to a node-sized block |
| Path compression | per-node prefix read with the header on
every hop |
prefix fused into the header line, read only when the pointer tag says
one exists; prefix-free nodes route without touching CL0 at all |
| Values | void* — you own the lifetime, casts everywhere |
real typed T: constructed in place, destroyed on erase,
moved on restructure; small ones live inline in the pointer |
| Keys | raw byte pointers + lengths |
string/string_view/integrals/byte vectors built in;
doubles & reflected structs via psio — numeric ordering for free |
| Memory | malloc/free, fixed |
any standard allocator (PMR included); 4-byte indexed handles; mmap
file-backed pools; cross-pool moves as one memcpy |
| Iteration | forward callback (art_iter) |
bidirectional iterators, lower_bound/upper_bound,
reverse ranges, erase(iterator), range-for |
| Safety | C: no exceptions, manual cleanup |
basic/strong guarantees, fault-injection tested at every allocation
point — zero leaks, no double-destroys |
| Structure | one fixed node ladder (4/16/48/256) |
selectable modes per workload: buckets packs key tails into
cacheline pages (the fastest dictionary lookups here), dense_tiers
and wide_stems trade footprint vs hop depth — same semantics, same tests |
Engineered like a standard library component
The goal is a container you could submit for review: exact contracts,
exhaustive tests, honest documentation.
Any key type
Strings, views, byte vectors, all integrals (numeric iteration order) built in.
With psio: doubles, reflected
structs, optionals, variants — one macro per type.
Structural modes
mode::buckets packs sparse key tails into cacheline pages (best
dictionary numbers above); dense_tiers and wide_stems
trade footprint vs hop depth. Same semantics, tested identically.
Allocators, all the way down
Every node and value goes through your allocator — PMR works. Index-based
pool handles shrink branches to 4 bytes, back the map with an
mmap'd file, and make cross-pool moves a single memcpy.
Whole-tree ops use structure
Equality compares node-by-node (memcmp'd edges, no key materialization);
cross-allocator moves clone node-by-node — never key-by-key rebuilds.
Exception-safety, proven
Every allocation point is fault-swept with throwing values and failing
allocators; leak, corruption, and double-destroy counters assert against a
std::map oracle.
Bidirectional, bounded, complete
lower_bound / upper_bound / equal_range,
reverse iteration, --end(), erase(iterator), structural
==, max_key_bytes guarded with length_error.
Thirty seconds of code
Integral keys iterate in numeric order
artpp::map<int64_t, std::string> m;
m.insert(-5, "negative"); // big-endian, sign-flipped
m.insert( 7, "positive"); // encoding: byte order ==
// numeric order, free
for (const auto& [k, v] : m) // → -5, then 7
log(k, v);
Floating-point keys do too — via psio
ARTPP_PSIO_KEY(double) // order-preserving codec, one macro
artpp::map<double, std::string> m;
m.insert(-3.5, "a"); // IEEE sign-transform: byte
m.insert( 0.0, "b"); // order becomes numeric
m.insert( 2.25, "c"); // order — negatives and all
for (const auto& [k, v] : m) // → -3.5, 0.0, 2.25
log(k, v);
// reflected structs, optionals, variants key the same way
File-backed, 4-byte branches
artpp::line_pool pool("store.bin"); // mmap'd file
using A = artpp::pool_alloc<uint64_t>;
artpp::map<std::string_view, uint64_t,
artpp::mode::none, A> t{A{&pool}};
// branches cost 4 bytes; the whole store is one
// relocatable range — moves are a memcpy
Pick the structure per workload
// dictionary-shaped keys? pack the tails:
artpp::map<std::string_view, uint64_t,
artpp::mode::buckets> dict;
// 77 ns dictionary lookups — 3.5x faster
// than std::map; scans rival absl::btree
Get started in one minute
Header-only. Copy include/artpp, or:
# CMake ≥ 3.20
FetchContent_Declare(artpp GIT_REPOSITORY https://github.com/gofractally/libartpp.git)
FetchContent_MakeAvailable(artpp)
target_link_libraries(your_target PRIVATE artpp::artpp)
# or hack on it directly
git clone https://github.com/gofractally/libartpp && cd libartpp
cmake -B build && cmake --build build -j && ctest --test-dir build
cmake -B build -DARTPP_BENCHMARKS=ON # the compare suite from this page
Requirements: a C++20 compiler, little-endian
target (ARM64/x86-64; NEON and SSE2 paths included). Nothing else.
Full API reference and contracts:
the README.