Accurate summation algorithm

x86 architecture has essentially 2 parallel sets of floating point functionality:

  • the x87 instructions (e.g. FMUL), which descend from the original 8087 coprocessor. These have 80-bit registers, and the original intention was that all computations would be done at higher precision, truncating only when spilling the registers. The problem is that the results were dependent on register allocation (unless you spill after every operation), so an option was later added to reduce the operational precision to double or single precision. This behaviour was later proscribed by the C99 specification.

  • the SSE2 and later instructions which defined specific instructions for single and double operations (e.g. MULSS, MULSD), along with SIMD variants (e.g. MULPS, MULPD). These are typically much faster, and also support nice features like pipelining, and in later additions, FMAs.

SSE2+ instructions are basically used by all modern compilers unless you explicitly use long double types (though some recent-ish versions of gcc will still use x87 by default on 32-bit machines).

x87 also defined instructions for certain special functions (e.g. FSIN), though it turns out to be faster (and in some cases, more accurate) to compute these in software. The only exception is the remainder functions (e.g. fmod) which often still use x87 underneath.

3 Likes