Optimizing sums of products (dot products)

s-baumann · September 17, 2024, 4:40pm

Is there a way to avoid the intermediate allocation for a sumproduct scenario?
In the following case I can redo it with a for loop which improves the memory allocation but causes it to take longer to execute (the difference seems small here but I have run it a few times and it seems robust).

vec_1 = rand(100)
vec_2 = rand(100)

function sumproduct_one(x,y)
    return sum(x .* y)
end

function sumproduct_two(x,y)
    res = 0.0
    for i in eachindex(x)
        res += x[i] * y[i]
    end
    return res
end

@btime sumproduct_one(vec_1,vec_2)
#  94.501 ns (2 allocations: 912 bytes)

@btime sumproduct_two(vec_1,vec_2)
#  100.227 ns (1 allocation: 16 bytes)

mikmoore · September 17, 2024, 5:31pm

Note that this question is slightly different than the original, since you want the sum rather than cumsum.

I would suggest you use LinearAlgebra.dot(vec_1, vec_2)(just “dot” after using LinearAlgebra) or vec_1' * vec_2. Also, you can use $ with @btime to avoid the effects of benchmarking with global variables.

julia> @btime sumproduct_one($vec_1, $vec_2)
  75.082 ns (1 allocation: 896 bytes)
26.780547839185267

julia> @btime sumproduct_two($vec_1, $vec_2)
  37.042 ns (0 allocations: 0 bytes)
26.780547839185274

julia> @btime LinearAlgebra.dot($vec_1, $vec_2)
  10.744 ns (0 allocations: 0 bytes)
26.780547839185264

julia> @btime $vec_1' * $vec_2
  10.028 ns (0 allocations: 0 bytes)
26.780547839185264

Notice that these give very slightly different answers due to reassociating the sum in different ways. But it’s not that one is significantly more right/wrong than another ~~(although “on average” the dot or * versions should be very slightly more accurate due to their use of SIMD and FMA)~~. Don’t get hung up on the <1ns difference in speed on the last two.

stevengj · September 17, 2024, 6:54pm

If you just put @simd in front of the loop in sumproduct_two, then it matches or exceeds the speed of dot on my machine. For such a small length (100), it is actually faster than dot on my machine because the BLAS call in dot imposes some additional overhead.

Compilers can’t do @simd automatically for this sum because it changes the answers slightly (by re-ordering the additions).

For long vectors, the sum version is almost certainly more accurate, because it uses pairwise summation whereas the BLAS call in dot probably does a “naive” sum. This won’t affect you for length-100 vectors, however, since the pairwise algorithm only turns on for length > 1024 (for performance reasons).

However, note that you are talking about a different function than the OP in this old thread (sum rather than cumsum), so I will split it off into a new thread. In general, please be more reluctant to revive ancient threads rather than starting new ones (and cross-referencing/linking older threads as appropriate).

xiaodai · September 17, 2024, 11:25pm

I have seen this multiple times. But I wonder what others think, when we right loops to sum over floats, does anyone actually rely on the order in which they are summed? Or is there an implicit assumption that floats are approximations anyway, so there is likely to be some small residuals of unmatched. Case in point, I was taught in numerical compute to compare floats like this abs(a-b) < machine_epsilon or something like.

So why the fascination with changing the order and hence it will be off in some decimal place to the far right? I think @smind should be the default and @nosimd needs to be explicit.

stevengj · September 18, 2024, 12:53am

In some algorithms you can re-order things without any big problem (this is why many functions like reduce and sum in Julia leave their associativity/ordering/algorithm unspecified). But in others it would lead to catastrophe, or at least undesirable behavior.

The compiler doesn’t know which is which, so it defaults to doing what computer languages are usually supposed to do:

“The good news about computers is that they do what you tell them to do. The bad news is that they do what you tell them to do.” — Ted Nelson

The compiler is not allowed to change the results of your code, even if it thinks the change is small (it isn’t always!). You don’t want a compiler that avoids disastrously rewriting your code only 99% of the time.

That’s why usually languages require some flag like @simd or @fastmath to give them explicit permission to re-order/re-associate floating-point operations in ways that might change the result. This is not unique to Julia, e.g. gcc has -ffast-math to enable “unsafe” floating-point optimizations.

Note that this is wrong — the machine epsilon is a relative tolerance, not an absolute tolerance. This is a common confusion. See also (e.g.) History of `isapprox` in languages - #14 by stevengj

(Nor will different algorithms generally agree to a relative tolerance of the machine epsilon either … you often lose more digits than that. And sometimes you are more accurate than that. Precision ≠ accuracy.)

xiaodai · September 18, 2024, 1:22am

Ok. TIL.

I was not a very student in numerical computing tbh.

I still think fastmath should be the default and needs to be opt out.

mbauman · September 18, 2024, 1:27am

In the extreme, you can get nearly any floating point value just by choosing different orderings when summing this one list of 2046 floats.

These “ffast” or “funsafe” flags just make programs impossible to reason about. It frequently means you get different answers for the same operation, causing all sorts of surprises and broken invariants. x might equal y in one context but be less than it in another.

@simonbyrne has a good writeup here:

https://simonbyrne.github.io/notes/fastmath/

xiaodai · September 18, 2024, 2:36am

hmmm, if guarding the extremes are so important, why not just default to using the biggest floats? ppl obviously don’t do that so i think the extremes should be handled by pros who opt out of fastmath and for mortals like me just default to fastmath. imho

simonbyrne · September 22, 2024, 9:40pm

hmmm, if guarding the extremes are so important, why not just default to using the biggest floats?

Two reasons:

On modern CPUs, Float64 hit a fairly good tradeoff of performance vs accuracy (GPUs are a different matter)
This is what most other languages do (and so what people tend to expect). The only exceptions I’m aware of tend to be fairly niche languages like Racket.

ppl obviously don’t do that so i think the extremes should be handled by pros who opt out of fastmath and for mortals like me just default to fastmath. imho

Again, it’s a matter of tradeoffs and expectations. The performance benefits of fastmath are pretty small outside of these sorts of vectorization cases (and in these cases can be typically achieved by safer means, like @simd). It’s also what most other languages do.

An interesting exception is Fortran: most Fortran compilers allow limited re-association unless parentheses are used to explicitly denote evaluation order (I don’t know if re-associations across a loop are allowed).

You can obviously do what you want with your code, but I would advise against using it by default. I think a good mindset is to think of it as “YOLO math”: try it out, see if it makes your code faster. If not, then don’t use it. If it does, then check its giving the correct results, or (even better) see if there are slightly less dangerous ways to get the same boost.

mikmoore · September 23, 2024, 4:49pm

@fastmath doesn’t do magic. For the most part, it rewrites your code in ways that you could hand-rewrite with some effort. Here is the list of fast-math flags available to LLVM (and hence Julia).

I’ll use this list (as of Summer 2024) and provide some commentary:

ninf “no infs” - This one can be devastating when your code actually does have infs. If the compiler can get rid of a branch that only happens when the value is ±Inf, it will. If an Inf makes it to that point, the result may be very wrong. Removing these branches can allow SIMD to happen in a loop where it otherwise couldn’t, but otherwise is unlikely to improve performance.
nnan “no nans” - Basically the same as ninf but applies to NaN instead.
nsz “no signed zeros” - This allows the compiler to pretend that +0.0 and -0.0 are the same and use them interchangeably. It’s usually a pretty safe optimization since +0.0 == -0.0. One of the main benefits is that it can replace stuff like x + 0.0 with x (note -0.0 + (+0.0) === +0.0, so normally this is not allowed). Normally, this is not possible because -0.0 is the additive identity for IEEEFloat types (i.e., the compiler can already replace x - 0.0 with x). It also allows the compiler to replace x * 0.0 with 0.0 (when x is known to be not inf/nan by context or the ninf/nnan flags), whereas normally the resulting zero would need to inherit the sign of x (1.0 * 0.0 === +0.0, -1.0 * 0.0 === -0.0. Ultimately, this optimization will mostly remove adds or (with other flags) multiplies with zero (only if the zero is known at compile time).
arcp “allow reciprocal multiplication” - This allows the compiler to replace x / y with x * (1/y). This can be faster when the same denominator is used “many” times (more than 4-8 times, usually). This one is usually easy to rewrite yourself. Note that this can change results - for example (e.g., 49e0/49e0 == 1.0 but 49e0*(1/49e0) < 1.0). For denominators that have exact floating point inverses known at compile time, this is already done (e.g., x / 4 will use multiplication rather than division).
contract - this enables expressions like a*b+c to be replaced by fma(a,b,c). Doing these operations together is more accurate (but sometimes this is not desirable) and is usually faster than doing them separately (if the hardware has builtin fma instructions, which most nowadays do). This flag can be enabled manually via muladd(a,b,c).
afn “allow approximate functions” - This allows some functions like sin, cos, log, exp, to be replaced with less accurate versions that are faster. However, Julia does not use the LLVM versions of most of these functions so I don’t believe this compiler flag has any effect in Julia (as of v1.10, at least, and probably for a while yet).
- EDIT (thanks to @danielw below) - there are some functions with specialized fastmath implementations. However, these are transformed by Julia’s @fastmath directly, not the compiler.
reassoc “allow reassociation” - This allows the compiler to pretend that basic arithmetic is associative (true for real numbers but not for floats). This one is big for enabling loops to be parallelized via SIMD. It can provide massive speedups to large sums and products, often 4-16x. This (and also contract) can be enabled within a loop by using the @simd macro.

nnan and ninf are very dangerous (example here, from a time when they were accidentally applied by @simd) unless you have extremely tight control over your inputs, but occasionally can be hugely profitable when the removal of branches allows SIMD (useless if the code becomes wrong, however). nsz can be substantial in a few cases but usually isn’t. arcp can be substantial, maybe up to 4x on the division itself, but in most cases can be written manually without much hassle. contract can be up to 2x in many linear algebra applications, but is usually much less for “general math.” contract can be enabled on a single operation via muladd. afn is irrelevant in Julia, as far as I know. reassoc is potentially massive and can be enabled within a loop via @simd.

Currently, Julia can only opt into all of these flags via @fastmath or else none of them (@simd and muladd aside). I hope that we eventually get more control, as some of these can be useful. In particular, reassoc is extremely difficult to do manually and is harmless to most “normal math” (though is deadly to some algorithms and should still not be enabled by default). Although the situations where it gets its biggest gains are already covered by @simd, I have sometimes wanted to apply it to a non-loop expression for smaller gains.

mbauman · September 23, 2024, 5:50pm

I’m always on the hunt for good “mere mortal” code that goes awry.

One of the fundamentals is that fast math allows the compiler to sometimes give you different answers for basic maths depending on the context and surrounding code and available optimizations. That makes debugging very challenging, and especially so in Julia where small functions often inline and constant propagate and then subsequently optimize in different ways.

So a huge practical consequence of this — yes even in “mere mortal” higher-level less-numericsy code — is that you can’t assume that the maths you did in an if conditional will match the results of the same maths elsewhere. Just because f() < g() in one place in your code doesn’t necessarily mean that f() != g() with ffast/funsafe-math. In fact, f() might be bigger than g() elsewhere. So even if you try to “defend” against fast-math with branches, your adversary is already there in your defense, mucking around in your conditionals. That most infamously happens with isnan and isinf checks (and @mikmoore has some great examples above), but it also applies to finite math, too — subtractions can get re-arranged to be additions, things can swap sides of the equality, etc., etc.

My favorite examples are in the contract category because it feels like converting a mul-add to the higher-precision fused multiply and add should be a no-brainer — better accuracy and better performance, what’s not to love?

The expression a*b + c*d now has three possible answers, and the one that’s most accurate is value-dependent. This is particularly bad for complex multiplication, because x*conj(x) definitely needs to be real — and it is, but only if you don’t contract. It’s actually quite relevant to the original conversation here, because that’s precisely what’s on the inside of the dot-product x'x. With FMAs, the imaginary parts would be sometimes positive, sometimes negative. Whatever, Matt, you might say, I’m not implementing complex arithmetic…
Taking the square root of a negative float is an error, so you might have a check to ensure that your argument is positive in that branch. What if you’re solving the quadratic equation? The operand to sqrt there is b*b - 4*a*c — look familiar? Just like a*b + c*d, there are multiple possible ways to compute this with FMAs in the mix. I’d love to find time to build a realistic example that demonstrates different values in an if than the sqrt, but it’s the same idea as this example. What would make it even more fun is that — upon getting a spurious domain error — you might add logging and then never see it again, because just looking at the intermediate values would break the fma.

danielwe · September 23, 2024, 6:43pm

@fastmath doesn’t just set compiler flags; the ninf, nnan, and afn flags are (mostly?) implemented in Julia, replacing Julia-native functions like sin and exp with Julia-native equivalents like sin_fast and exp_fast. See the full list at julia/base/fastmath.jl at 0fade450a183470b01c656a9001512ef2f1aae47 · JuliaLang/julia · GitHub.

For the most part, the implementations only differ for complex numbers, but you also have things like hypot_fast(x, y) = sqrt(x * x + y * y), i.e., throwing caution about over/underflow to the wind, which I guess makes sense next to isinf_fast(x) = false and issubnormal_fast(x) = false.

danielwe · September 23, 2024, 6:49pm

Regarding this, can anyone enlighten me as to whether @fastmath sets compiler flags at all? As far as I can see it doesn’t insert any :meta expressions, and besides, functions are the unit of compilation but @fastmath applies to arbitrary expressions/blocks.

mbauman · September 23, 2024, 7:04pm

It kinda depends on what “setting flags” means to you. You’re right that on the Julia side it’s just a function substitution. But then those functions often find their way to intrinsics that emit LLVM operations with fast flags. For example, getting an fma and fast sqrt from @fastmath can look like this:

julia> f(a,b,c) = @fastmath a*b+sqrt(c)

julia> @code_typed f(.1,.2,.3)
CodeInfo(
1 ─ %1 = Base.FastMath.mul_float_fast(a, b)::Float64
│   %2 = Base.FastMath.sqrt_llvm_fast(c)::Float64
│   %3 = Base.FastMath.add_float_fast(%1, %2)::Float64
└──      return %3
) => Float64

julia> Base.FastMath.mul_float_fast
mul_float_fast (intrinsic function #21)

julia> @code_llvm debuginfo=:none f(.1,.2,.3)
define double @julia_f_3343(double %0, double %1, double %2) #0 {
top:
  %3 = fmul fast double %1, %0
  %4 = call fast double @llvm.sqrt.f64(double %2)
  %5 = fadd fast double %4, %3
  ret double %5
}

julia> @code_native debuginfo=:none f(.1,.2,.3)
; ...
	fsqrt	d2, d2
	fmadd	d0, d1, d0, d2
	ret
; ...

mikmoore · September 23, 2024, 7:16pm

Just afn, I would say. The @simd issue linked in my post above never would have occurred if @fastmath merely affected the isinf/isnan/isfinite functions (the example only checks x == Inf).

~~The redefinition isinf_fast(x) = false is consistent with ninf, but I suspect unnecessary as the compiler is clearly capable of more.~~ EDIT: As pointed out below, the compiler would not be allowed to eliminate the check because @fastmath context is not otherwise inherited by child functions in a call (even when inlined). So mechanically, @fastmath works by substituting many Julia functions. But the effect most of these substitutions have is to emit the same LLVM intrinsic as the normal function but with the fast LLVM flag.

danielwe · September 23, 2024, 7:28pm

I think it’s necessary because @fastmath only works by substituting function calls and operators, so a function that isn’t substituted will keep doing what it’s always done.

In other words, if your check is x == Inf then ninf is implemented by the substitution (==) => :eq_fast, which dispatches to the intrinsic eq_float_fast, which works on the LLVM level as @mbauman explained. But if your check is isinf(x), you wouldn’t get ninf behavior without this definition of isinf_fast.

I suppose it’s also necessary for non-IEEEFloat types.

danielwe · September 23, 2024, 7:41pm

I think as pointed out in the first comment on that issue, @simd does not invoke @fastmath, so that issue was only related to what @simd does to the LLVM on its own, not what @fastmath does or doesn’t do.

mbauman · September 24, 2024, 7:40pm

OK, this was just too juicy to resist. Here’s a barely-contrived example (Julia v1.10.5, Apple M1):

julia> discriminant(a,b,c) = @fastmath b^2 - 4*a*c
discriminant (generic function with 1 method)

julia> function roots(a,b,c)
           if discriminant(a,b,c) < 0
               "no real roots"
           else
               roots = [(-b ± sqrt(discriminant(a,b,c)))/(2*a) for (±) in (+, -)]
               "root at $(unique(roots))"
           end
       end
roots (generic function with 1 method)

julia> roots(.2, .1, .0125)
"no real roots"

That works just fine. Sure, you might miss an exact root here or there, but those are the edge of numerical stability anyhow. What you probably wouldn’t expect is how easy it is for the above answer to change (and error!) depending on what you happen to do outside that function!

julia> f1(a,b,c) = roots(a,b,c), b^2.0
f1 (generic function with 1 method)

julia> f1(.2, .1, .0125)
("no real roots", 0.010000000000000002)

julia> f2(a,b,c) = roots(a,b,c), b*b
f2 (generic function with 1 method)

julia> f2(.2, .1, .0125)
ERROR: DomainError with -8.326672684688674e-19:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

Is that scary enough? That said, clang and gcc do this sort of thing by default and you don’t see many folks complaining of issues like this.

Topic		Replies	Views
@fastmath macro accuracy General Usage numerics , fast-math	26	11245	May 12, 2020
Julia equivalent of Python's "fsum" for floating point summation General Usage python	54	7487	August 16, 2019
How to tell VS Code to use the fast math option? VS Code fast-math	30	4032	December 7, 2018
@fastmath switched off in Julia 0.7-Dev, deliberately? Performance fast-math	7	1779	June 24, 2018
How to improve performance of sum() Performance question , fast-math , tullio , loopvectorization , sum	19	4947	July 19, 2021

Optimizing sums of products (dot products)

Related topics