Dividing array by float: is multiplying with precalculated inverse worth it

What is best,

bigvec ./ some_float()

or

bigvec .* (1 / some_float())

?

I looked at @code_native to see if the first gets compiled to the second, but — from the little I understand of the output — it doesn’t seem so?

Benchmarking says both are equally fast.
Isn’t division much more expensive than multiplication?

1 Like

How did you benchmark?

I did the following.

x = randn(10000);
y::typeof(x) = x;
fm() = y .* (1/7); fm();
fd() = y ./ 7; fd();
using BenchmarkTools
@benchmark fm()
@benchmark fd()

Both methods had the same mean and median runtimes (differing between runs of @benchmark, but on average the same).

Output:
julia> @benchmark fm()
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):   6.075 μs …  1.215 ms  ┊ GC (min … max):  0.00% … 97.99%
 Time  (median):     23.925 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   23.652 μs ± 54.240 μs  ┊ GC (mean ± σ):  16.96% ±  7.81%

  ▂▇█▅▄▂▁           ▁▁▂▂▄▇▇▅▄▄▂▁▁                             ▂
  ████████▆▅▅▄▄▆▅▄▅▃████████████████▆▆▆▆▆▅▄▄▄▅▄▅▃▄▃▄▄▅▄▅▂▅▆▅▇ █
  6.08 μs      Histogram: log(frequency) by time      62.1 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

julia> @benchmark fd()
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   6.967 μs …  1.635 ms  ┊ GC (min … max):  0.00% … 96.76%
 Time  (median):     21.950 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   24.433 μs ± 64.416 μs  ┊ GC (mean ± σ):  17.00% ±  6.88%

  ▅█▆▄▄▂▂           ▁▂▂▂▄▆▆▅▅▄▄▃▂▂▁ ▁                         ▂
  █████████▇▇▇▅▆▄▅▅▅███████████████████▇▅▆▅▆▅▅▁▆▃▅▄▅▅▅▄▄▄▄▅▇▇ █
  6.97 μs      Histogram: log(frequency) by time      62.2 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

Trying a slightly different benchmark, they are different (by a bit):

fd(x,y) = x ./ y
fm(x,y) = x .* (1/y)

x = rand(100000);
y = 7

fd(x,y); fm(x,y);  # compile

using BenchmarkTools
julia> @btime fm(x,y);
  73.800 μs (2 allocations: 781.30 KiB)

julia> @btime fd(x,y);
  89.400 μs (2 allocations: 781.30 KiB)

(these runtimes are consistent between repeated @btime calls)

Yeah, the problem in the first approach is that you are using non-constant global variables in your benchmarking. You are also using the same x and y which looks strange to me

julia> a = randn(2);

julia> b::typeof(a) = a
2-element Vector{Float64}:
 -0.10486666637026627
  0.25520465464656167

julia> a === b
true

I get this

using BenchmarkTools
x = randn(10000);
y = rand(eltype(x));

fd(x, y) = x ./ y;
fm(x, y) = x .* (1/y);

@btime fd($x, $y);
@btime fm($x, $y);

julia> @btime fd($x, $y);

  2.715 μs (2 allocations: 78.17 KiB)

julia> @btime fm($x, $y);
  2.873 μs (2 allocations: 78.17 KiB)
1 Like

Yes, division is more “expensive” in a couple of ways, neither of which matter in this case.

First, division has more latency than multiplication (so the CPU has to wait longer for the result). The compiler can get around that by scheduling other work in the meantime unless the result is needed immediately for the next computation.

Second, the CPU (depending on type) might be able to do more multiplication operations in parallel, because more silicon is allocated to this operation (which is used more often than division). However, this only matters if enough work can be supplied in parallel.

The operation that has the lowest throughput in this particular case is probably the memory read/writes, so the difference between multiplication and division “cost” is not important.

6 Likes

Yeah that first approach was confusing, especially cause x and y mean something different in the second approach.
I did the ::typeof thing cause I didn’t want a const, but not have the performance problem of non-typed globals.

On

a === b indeed, but

julia> code_typed() do
           a,b
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = Main.a::Any
│   %2 = Main.b::Vector{Float64}
│   %3 = Core.tuple(%1, %2)::Tuple{Any, Vector{Float64}}
└──      return %3
) => Tuple{Any, Vector{Float64}}

What I meant by

julia> a === b
true

is that it will lead to misleading benchmarking results if both are used in a ./ b since you only have to fetch half as much memory to perform the computation, and thus might be able to do it faster than if a !== b

This will affect also the second approach where you pass the arrays as variables, and is independent on non-constant global problems.

Ah gotcha; I never grokked why people used ‘$’ with BenchmarkTools, now I see. Thanks

1 Like

In my experience doing the exact same thing, inv(y) is faster and more accurate than using 1/y

I don’t think inv(x) and 1/x could be any different in their result (for built-in Real types except maybe Irrational), due to the rules of IEEE754 arithmetic. I might believe inv could have a small performance benefit in some circumstance when the compiler can’t recognize the constant-ness of the 1 in 1/x.

When you are willing to sacrifice a bit or two of accuracy in the result, there can be performance benefits to these transformations. In many (most?) applications this is acceptable. But if full accuracy is a concern, you should absolutely not use x*inv(y) as a substitute for x/y. The compiler doesn’t do this transformation for you (except when inv(y) has an exact representation, such as for powers of 2) because they are different. For example,

julia> 49.0*inv(49.0)
0.9999999999999999
1 Like

The error is usually less than a bit which would be eps() ie the rounding error. You can see that will the following code. The first set mean and max shown are the error of one method vs the other. The next two sets are the errors relative to a bigfloat. In all the cases the errors are much smaller than epsilon

using Statistics
Tbetter=BigFloat
T=Float64
x=rand(T,100_000);
x= Tbetter.(x)
divisor = Tbetter(T(pi))
accuracy_error(v) = (T(T(v) * inv(T(divisor))) - T(T(v) / T(divisor)))/(T(T(v) / T(divisor)));
inv_accuracy(v) = (T(T(v) * inv(T(divisor))) - (v/divisor))/v
div_accuracy(v) = (T(T(v) / T(divisor)) - (v/divisor))/v;
##
@show T
@show eps(T);
##
@show mean(abs(accuracy_error(v)) for v in x);
@show maximum(abs(accuracy_error(v)) for v in x);
##
@show mean(abs(inv_accuracy(v)) for v in x);
@show maximum(abs(inv_accuracy(v)) for v in x);
##
@show mean(abs(div_accuracy(v)) for v in x);
@show maximum(abs(div_accuracy(v)) for v in x);
##

T = Float64
eps(T) = 2.220446049250313e-16

mean((abs(accuracy_error(v)) for v = x)) = 2.2765442951283494e-17
maximum((abs(accuracy_error(v)) for v = x)) = 2.2203276142503066e-16

mean((abs(inv_accuracy(v)) for v = x)) = 1.402660799251385796744785423641788904003187204601942703209097263778712947617089e-17
maximum((abs(inv_accuracy(v)) for v = x)) = 4.246814547038967479720962926321405463399871806338832126916190959044469092421897e-17

mean((abs(div_accuracy(v)) for v = x)) = 1.296835669171136569002146415223936518200939361869659888104630040244513160688326e-17
maximum((abs(div_accuracy(v)) for v = x)) = 3.518393627612048269235808282963237680727758091593581919586693839064292872920742e-17

Use smaller arrays if you want to see a performance difference.

julia> x = rand(256); y = similar(x);

julia> @benchmark @. $y = $x / 0.99
BenchmarkTools.Trial: 10000 samples with 905 evaluations.
 Range (min … max):  121.924 ns … 133.351 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     121.927 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   122.108 ns ±   0.544 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                  ▁  ▁          ▁            ▁
  █▁▁▅▆▅▆▆▆▇▄▄▄▄▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆████▇█████▇██████▇▇▇▇▆▅▅▅▆ █
  122 ns        Histogram: log(frequency) by time        124 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark @. $y = $x * inv(0.99)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  21.591 ns … 38.935 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.951 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.000 ns ±  0.469 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▄▇▇▇█▇▄▃▁▁▁               ▁                           ▂
  ▃▁▃▃▄█████████████▇▆▇▅▆▅▁▃▃▃▃▁▃▇█▅▁▃▁▃▁▁▁▃▅▄▆▅▅▅▅▄▅▆▆▇▇▇▇▆▇ █
  21.6 ns      Histogram: log(frequency) by time      23.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Make the arrays bigger, and memory bandwidth will be the only thing that matters.

So if your question is about bigvec ./ scalar vs bigvec .* inv(scalar), the answer is, no, it doesn’t matter, because memory bandwidth is the limiting factor, not CPU performance.
But smallvec ./ scalar is much slower than smallvec .* inv(scalar).

4 Likes