Loop-Carried Dependency: Single vs Split Accumulator

Question 5 / 17 • Correct so far: 0 (0 answered)

Snippet A

Single Sum

double sumArray(const std::vector<double>& data) {
    double sum = 0.0;
    for (double x : data)
        sum += x;
    return sum;
}

double result = sumArray(DATA);

Snippet B

Split Sum

double sumArray(const std::vector<double>& data) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = data.size();
    std::size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) s0 += data[i];
    return s0 + s1 + s2 + s3;
}

double result = sumArray(DATA);

Shared test data (shared-setup)

constexpr int kDataSize = 65536;

static std::vector<double> makeData() {
    std::vector<double> v(kDataSize);
    for (int i = 0; i < kDataSize; ++i)
        v[i] = 1.0 + static_cast<double>(i) * 0.000001;
    return v;
}

static const std::vector<double> DATA = makeData();

Which snippet is faster?

Snippet B is faster. Floating-point addition is not associative, so the compiler cannot legally reorder the additions in snippet A — each iteration must wait for the previous sum to be ready before it can add the next element. A double-precision add typically has 4–5 cycles of latency while CPUs can sustain more than one add per cycle via pipelining, so a single accumulator limits throughput to at most 1/latency adds per cycle. Snippet B uses four independent partial sums, letting the processor keep multiple additions in flight simultaneously across its execution units. The four partial sums are combined at the end. The result differs only by floating-point rounding, which is acceptable in most numerical code.

Benchmark results

clang · C++17 · -O3 -march=native

Snippet	CPU time / iteration	Speedup
Single Sum	37.3 us	1.0×
Split Sum	9.11 us	4.1×

Explore the source

Open in Compiler Explorer

Loop-Carried Dependency: Single vs Split Accumulator

Benchmark results

Explore the source

Per-question summary

Tracking settings