Loop-Carried Dependency: Single vs Split Accumulator
Question 5 / 17 • Correct so far: 0 (0 answered)
Single Sum
double sumArray(const std::vector<double>& data) {
double sum = 0.0;
for (double x : data)
sum += x;
return sum;
}
double result = sumArray(DATA); Split Sum
double sumArray(const std::vector<double>& data) {
double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
const std::size_t n = data.size();
std::size_t i = 0;
for (; i + 3 < n; i += 4) {
s0 += data[i];
s1 += data[i + 1];
s2 += data[i + 2];
s3 += data[i + 3];
}
for (; i < n; ++i) s0 += data[i];
return s0 + s1 + s2 + s3;
}
double result = sumArray(DATA); Shared test data (shared-setup)
constexpr int kDataSize = 65536;
static std::vector<double> makeData() {
std::vector<double> v(kDataSize);
for (int i = 0; i < kDataSize; ++i)
v[i] = 1.0 + static_cast<double>(i) * 0.000001;
return v;
}
static const std::vector<double> DATA = makeData(); Which snippet is faster?
Snippet B is faster. Floating-point addition is not associative, so the compiler cannot legally reorder the additions in snippet A — each iteration must wait for the previous sum to be ready before it can add the next element. A double-precision add typically has 4–5 cycles of latency while CPUs can sustain more than one add per cycle via pipelining, so a single accumulator limits throughput to at most 1/latency adds per cycle. Snippet B uses four independent partial sums, letting the processor keep multiple additions in flight simultaneously across its execution units. The four partial sums are combined at the end. The result differs only by floating-point rounding, which is acceptable in most numerical code.
Benchmark results
| Snippet | CPU time / iteration | Speedup |
|---|---|---|
| Single Sum | 37.3 us | 1.0× |
| Split Sum | 9.11 us | 4.1× |
Explore the source
Open in Compiler ExplorerQuiz complete. You can return to the question list to restart and compare.