Cache Coherency: Packed vs Padded Thread Counters

Snippet A

Packed Counters

struct Counters {
    std::atomic<long long> a{0};
    std::atomic<long long> b{0};
};

// Runs on each thread; c refers to that thread's own counter (a or b).
c.fetch_add(1, std::memory_order_relaxed);

Snippet B

Padded Counters

struct alignas(kCacheLineBytes) PaddedCounter {
    std::atomic<long long> value{0};
};
struct Counters {
    PaddedCounter a;
    PaddedCounter b;
};

// Runs on each thread; c refers to that thread's own counter (a.value or b.value).
c.fetch_add(1, std::memory_order_relaxed);

Shared test data

static constexpr std::size_t kCacheLineBytes = 64;

Which snippet is faster?

Snippet B is faster. In Snippet A, a and b are adjacent atomics (typically 8 bytes each), so they normally land in the same 64-byte cache line. When two threads on different cores repeatedly write to variables in the same line, the cache coherency protocol has to move exclusive ownership of that line back and forth between the cores, because a core must own a line before it can modify it. This is called false sharing: the threads never touch each other's counter, but the hardware tracks coherency per cache line, not per variable. In Snippet B, alignas(kCacheLineBytes) makes the compiler align each PaddedCounter to a 64-byte boundary and pad it to 64 bytes, so each counter occupies its own line. That removes the inter-core ping-pong, and each core can keep its line in its own cache while it increments.
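The layout claim can be checked at compile time. The sketch below is not part of the quiz's source: the struct names PackedCounters and PaddedCounters and the helper same_line are hypothetical, and the 64-byte line size is the same assumption the shared setup makes.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache line size, as in the shared setup.
static constexpr std::size_t kCacheLineBytes = 64;

// Snippet A layout (renamed here so both layouts can coexist).
struct PackedCounters {
    std::atomic<long long> a{0};
    std::atomic<long long> b{0};
};

// Snippet B layout.
struct alignas(kCacheLineBytes) PaddedCounter {
    std::atomic<long long> value{0};
};
struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
};

// Hypothetical helper: true when two member offsets fall in the same line,
// assuming the enclosing object itself starts on a line boundary.
constexpr bool same_line(std::size_t off_a, std::size_t off_b) {
    return off_a / kCacheLineBytes == off_b / kCacheLineBytes;
}

// alignas pads each counter out to a full line and aligns it to one.
static_assert(sizeof(PaddedCounter) == kCacheLineBytes);
static_assert(alignof(PaddedCounter) == kCacheLineBytes);
static_assert(!same_line(offsetof(PaddedCounters, a), offsetof(PaddedCounters, b)));

// The packed members sit next to each other, so they share a line
// whenever the struct does not straddle a line boundary.
static_assert(same_line(offsetof(PackedCounters, a), offsetof(PackedCounters, b)));
```

The padded layout trades memory for independence: two counters take 128 bytes instead of 16 on typical targets. C++17 also defines std::hardware_destructive_interference_size in <new> as a portable stand-in for the hard-coded 64, though standard library support for it varies.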

Benchmark results

clang · C++17 · -O3 -march=native

Snippet	CPU time / iteration	Speedup
Packed Counters	8.34 ns	1.0×
Padded Counters	1.29 ns	6.5×
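
The harness that produced these numbers is not shown here. A minimal sketch of how such a comparison might be set up is given below, assuming two threads that each increment their own counter. The names ns_per_increment, first, and second are hypothetical, and the sketch reports wall-clock time per increment, not the CPU time per iteration in the table, so its figures are not directly comparable.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

static constexpr std::size_t kCacheLineBytes = 64;  // assumed line size

// Snippet A layout, with hypothetical accessors for the harness.
struct PackedCounters {
    std::atomic<long long> a{0};
    std::atomic<long long> b{0};
    std::atomic<long long>& first() { return a; }
    std::atomic<long long>& second() { return b; }
};

// Snippet B layout, with the same hypothetical accessors.
struct alignas(kCacheLineBytes) PaddedCounter {
    std::atomic<long long> value{0};
};
struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
    std::atomic<long long>& first() { return a.value; }
    std::atomic<long long>& second() { return b.value; }
};

// Two threads each increment their own counter `iterations` times.
// Returns the average wall-clock nanoseconds per increment, including
// thread start-up and join overhead.
template <typename Counters>
double ns_per_increment(long long iterations) {
    Counters counters;
    auto work = [iterations](std::atomic<long long>& c) {
        for (long long i = 0; i < iterations; ++i) {
            c.fetch_add(1, std::memory_order_relaxed);
        }
    };

    const auto start = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(counters.first()));
    std::thread t2(work, std::ref(counters.second()));
    t1.join();
    t2.join();
    const auto stop = std::chrono::steady_clock::now();

    // Each thread owns one counter, so no increments can be lost.
    assert(counters.first().load() == iterations);
    assert(counters.second().load() == iterations);

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling ns_per_increment<PackedCounters>(n) and ns_per_increment<PaddedCounters>(n) with the same large n gives a rough comparison. The gap depends on the CPU, on whether the two threads land on different physical cores, and on how the packed struct happens to be placed in memory, so results may differ from the 6.5× shown above.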
