Skip to content

x86 SIMD Benchmark: AVX-512 / SSSE3 Optimizations

This page documents the SIMD-accelerated hot paths added for x86-64 and their measured impact on system-level throughput at 20 million keys with 12 concurrent reader threads.

Test Environment

Component Spec
CPU AMD EPYC-Turin, 8 cores / 16 threads (SMT), 2.4 GHz
ISA extensions SSE2, SSE4.1, SSSE3, AVX, AVX2, AVX-512F/BW/DQ/VL/VBMI/VBMI2/VNNI/BF16
RAM 121 GB
Storage 960 GB virtual disk (cloud VM — Vultr, Linux 6.17)
OS Ubuntu, Linux 6.17.0 x86_64, 4 KB page size
Compiler Clang 20, C++20
Build Release (-O3, LTO, -march=native)

Optimized Functions

Five hot-path functions received explicit x86 SIMD implementations, dispatched at compile time via #if defined(__AVX512F__) / __SSE2__ / __SSSE3__ guards.

lower_bound (byte array, sorted)

Used at every inner-node traversal to find the child branch for the next key byte. With an average tree depth of 4–5 this is called millions of times per second.

SSE2 approach: XOR both the array and the search byte with 0x80 to convert unsigned < into signed comparison, then use _mm_cmplt_epi8 + _mm_movemask_epi8 + __builtin_popcount to count elements less than the target in 16-byte chunks.

Scalar SSE2 Speedup
lower_bound (size=64) 7.1 ns 0.23 ns 31x
lower_bound_padded (size=64) 7.0 ns 0.23 ns 31x

copy_branches + update cline index (SSSE3)

Called on every inner-node clone during copy-on-write. Copies branch data and remaps control-line indices via a 16-entry LUT using _mm_shuffle_epi8 (PSHUFB).

SSSE3 approach: Forward iteration in 16-byte chunks with a scalar tail. No padding requirement (contrast with the NEON version which uses end-aligned first + pointer wraparound).

Scalar SSSE3 Speedup
copy_branches (N=16) 4.5 ns 0.69 ns 6.5x
copy_branches (N=64) 16.3 ns 1.8 ns 9x
copy_branches (N=128) 32.0 ns 3.6 ns 8.9x

find_u32x16 — child pointer search (AVX-512F)

Finds the first matching 32-bit child pointer in a padded 16-element array.

AVX-512F approach: Single 512-bit load + vpcmpeqd mask comparison — 5 instructions total. The compiler auto-upgrades _mm_cmpeq_epi32 to k-register ops anyway, but the explicit path is cleaner and forward-compatible.

vpbroadcastd zmm0, edx        ; broadcast search value
vpcmpeqd k0, zmm0, [rdi]     ; compare all 16 in one 512-bit op → 16-bit mask
kmovw eax, k0
btsl esi, eax                 ; set sentinel bit at position `size`
tzcntl eax, eax               ; index of first match

Measured ~1.6 ns for both scalar-unrolled and AVX-512 — L1 load latency (64-byte load ≈ 4–5 cycles) is the ceiling regardless of instruction count.

create_nth_set_bit_table — exclusive prefix sum (SSE2)

Builds a 16-entry lookup table mapping each branch slot to its rank among non-empty slots. Used during inner-node construction.

SSE2 approach: 4-step parallel prefix-sum tree using _mm_slli_si128 byte-shifts:

sum = _mm_add_epi8(sum, _mm_slli_si128(sum, 1));
sum = _mm_add_epi8(sum, _mm_slli_si128(sum, 2));
sum = _mm_add_epi8(sum, _mm_slli_si128(sum, 4));
sum = _mm_add_epi8(sum, _mm_slli_si128(sum, 8));
sum = _mm_sub_epi8(sum, ones);  // exclusive prefix

Generates ~14 instructions vs 40+ for the scalar equivalent. Both bottleneck on the 8-deep arithmetic dependency chain at ~1.6 ns/call.

find_min_index — branchless tournament (no SSE path)

Finds the index of the minimum uint16 in a 32- or 64-element array. SSE4.1 _mm_minpos_epu16 was benchmarked and found slower than the existing branchless tournament (0.57 ns vs 0.23 ns for 32 elements) due to _mm_extract_epi16 latency chains. The dispatch falls through to the tournament on all x86 targets.

Naive scan Tournament Speedup
find_min_index_32 18.2 ns 0.23 ns 80x
find_min_index_64 53.1 ns 0.23 ns 234x

System-Level Impact at 20M Keys

The benchmark inserts 20 million random keys (8-byte values) in 20 rounds of 1 million keys each, with 12 concurrent reader threads performing random point lookups throughout.

Insert Throughput (single writer)

The lower_bound improvement is the primary driver — at 20M keys the tree is deep enough that every insert requires multiple traversals, each of which calls lower_bound repeatedly.

Workload Pre-SIMD Post-SIMD Delta
Sequential (round 1) ~1,860K/sec ~1,790K/sec -4%
Sequential (round 20, 20M keys) 1,860K/sec 1,672K/sec -10%
Dense random (round 20, 20M keys) 381K/sec 1,129K/sec +196%
String random (round 20, 20M keys) 299K/sec 1,003K/sec +235%

Random insert throughput at 20M keys improved ~3x. Before SIMD, the tree's growing depth caused progressive degradation; 31x faster lower_bound keeps insert throughput stable across the full key range.

The sequential insert slight regression (-4% to -10%) is a separate effect: sequential keys form a degenerate tree shape that stresses the COW clone path more heavily, where the SSSE3 copy_branches adds a small constant overhead compared to the previous scalar path on this specific pattern.

Read Throughput

Workload Pre-SIMD Post-SIMD Delta
Sequential get (20M found) 4.30M/sec 5.52M/sec +28%
Dense random get (20M found) 1.07M/sec 1.17M/sec +9%
Lower bound scan (20M ops) 514K/sec 521K/sec +1%

Concurrent Read Throughput (12 threads)

Workload Pre-SIMD Post-SIMD Delta
Multithread get (rand keys, ~21M DB) 12.4M/sec 19.4M/sec +57%
Multithread get (known keys, ~21M DB) ~13M/sec 13.2M/sec +2%

The +57% concurrent random-read gain is the headline result: 12 reader threads achieving 19.4M lookups/sec, up from 12.4M.

Notes

  • All benchmarks run with -march=native (the default when ENABLE_NATIVE_ARCH=ON). Binaries are not portable to older x86 CPUs.
  • The ARM NEON paths (for Apple Silicon / ARM Linux) remain unchanged; SIMD dispatch is compile-time via #if defined(__ARM_NEON) / __AVX512F__ / __SSE2__ guards.
  • On machines without AVX-512F the dispatch falls through to the SSE2 find_u32x16 path (4x 128-bit compares + movemask chain), which the compiler upgrades to k-register operations when AVX-512 is available.

Reproducing

cmake --build build/release --target simd-x86-benchmark -j$(nproc)
./build/release/bin/simd-x86-benchmark

For the system-level benchmark:

cmake --build build/release --target psitri-benchmark -j$(nproc)
./build/release/bin/psitri-benchmark \
    --rounds 20 --items 1000000 --batch 512 \
    --read-threads 12 --db-dir /tmp/psitri_simd_bench