I have the following code, where the nice and compact array comprehension is 34x slower (880 ms vs 26 ms) than the hand-written kernel, and it allocates memory, which I would like to prevent. Is there a way to make the second timed assignment as fast as the first, and to avoid the allocation?
using BenchmarkTools
using LoopVectorization
function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end
        end
    end
end
itot = 384
dx = 1. / itot
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1)
u = zeros(itot+8, itot+8, itot+8)
uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1]
@btime assign!(uv, x, y, z)
@btime uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ]
No, not as in your example/MWE, because you are comparing a function whose allocation happens before the benchmark with one that includes the allocation in the benchmark.
In other words: the LoopVectorization version needs allocated memory too, but you allocate it outside the benchmark.
But in general I think that, even with a fair benchmark, comprehensions can't beat LoopVectorization-optimized loops.
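To see the difference directly, `@allocated` shows the temporary that the comprehension builds before copying into the destination. A minimal Base-only sketch with a smaller grid so it runs quickly (the names `fill_loop!`, `bytes_comp`, and `bytes_loop` are just illustrative):

```julia
# The comprehension allocates a fresh n^3 array and then copies it into u;
# a loop that writes into u directly allocates nothing.
n = 64
u = zeros(n, n, n)
x = collect(range(0, 1; length=n))

function fill_loop!(u, x)
    for k in eachindex(x), j in eachindex(x), i in eachindex(x)
        u[i, j, k] = sin(x[i]) + sin(x[j]) + sin(x[k])
    end
    return u
end

fill_loop!(u, x)  # warm up so compilation is not measured
bytes_loop = @allocated fill_loop!(u, x)
bytes_comp = @allocated (u[:, :, :] = [sin(a) + sin(b) + sin(c) for a in x, b in x, c in x])

println((bytes_comp, bytes_loop))  # the comprehension pays at least n^3 * 8 bytes
```

The same pattern explains the 432 MiB reported by `@btime` in the MWE: 384^3 Float64 values is exactly that much temporary storage.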
Is there another way to do it fast, without having to write out the kernel? I am mainly missing numpy's None indexing (np.newaxis) that made this so easy in Python, but most likely it is my lack of experience in Julia.
using BenchmarkTools
using LoopVectorization
function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end
        end
    end
end
itot = 384
dx = 1. / itot
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1)
u = zeros(itot+8, itot+8, itot+8)
uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1]
xx, yy, zz = reshape(x, (:, 1, 1)), reshape(y, (1, :, 1)), reshape(z, (1, 1, :))
assign!(uv, x, y, z)
a = deepcopy(uv)
uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ]
b = deepcopy(uv)
@. uv = sin(xx) + sin(yy) + sin(zz)
c = deepcopy(uv)
@show a ≈ b ≈ c
print("LoopVectorization.@turbo:")
@btime assign!($uv, $x, $y, $z)
print("comprehension: ")
@btime $uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ]
print("broadcast: ")
@btime @. $uv = sin($xx) + sin($yy) + sin($zz);
Output:
a ≈ b ≈ c = true
LoopVectorization.@turbo: 39.741 ms (0 allocations: 0 bytes)
comprehension: 782.135 ms (2 allocations: 432.00 MiB)
broadcast: 707.889 ms (0 allocations: 0 bytes)
Code:
using Pkg
Pkg.status("SLEEFPirates")
Pkg.status("LoopVectorization")
Output:
Status `D:\.julia\environments\v1.6\Project.toml`
[476501e8] SLEEFPirates v0.6.23 `https://github.com/JuliaSIMD/SLEEFPirates.jl.git#master`
Status `D:\.julia\environments\v1.6\Project.toml`
[bdcacae8] LoopVectorization v0.12.56 `https://github.com/JuliaSIMD/LoopVectorization.jl.git#master`
Thanks! This helps. It was me who filed that issue while figuring out this problem, but it has been solved. I am a little disappointed, though, at how far the broadcasting solution lags behind the optimized loop, as this is a rather trivial function.
I’m not sure if LoopVectorization is extra smart here, but native Julia would perform many redundant evaluations of sin. If you are free to mutate x, y, and z, you could precompute their elementwise sin in place and avoid the redundant computations.
Even without mutating, this has a huge effect: 3N instead of N^3 sin evaluations (and likewise temporaries of 3N elements instead of N^3 for the comprehension).
julia> function assign2!(u, x, y, z)
           xs = sin.(x)
           ys = sin.(y)
           zs = sin.(z)
           @inbounds for k in 1:length(z)
               for j in 1:length(y)
                   for i in 1:length(x)
                       u[i, j, k] = xs[i] + ys[j] + zs[k]
                   end
               end
           end
       end
assign2! (generic function with 1 method)
julia> @btime assign2!($uv, $x, $y, $z)
8.122 ms (3 allocations: 9.38 KiB)
julia> @btime [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ];
488.327 ms (2 allocations: 432.00 MiB)
julia> @btime @. $uv = sin($xx) + sin($yy) + sin($zz);
467.296 ms (0 allocations: 0 bytes)
julia> @btime $uv .= identity(sin.($xx)) .+ identity(sin.($yy)) .+ identity(sin.($zz)); # un-fused broadcasts, should be like assign2!
julia> @btime assign!($uv, $x, $y, $z)
ERROR: MethodError: no method matching to_vecunrollscalar(::VectorizationBase.VecUnroll{1, 2, Float64, VectorizationBase.Vec{2, Float64}}, ::Static.StaticInt{3})
Stacktrace:
[1] sin_fast
@ ~/.julia/packages/SLEEFPirates/XGrnS/src/SLEEFPirates.jl:147 [inlined]
The LV version isn’t working for me (even with #master as noted); I think it can pull some computations out of loops, but IIRC it won’t allocate the temporary vectors needed to do all of them ahead of time.
Just so people don’t get too excited here: LoopVectorization is that much faster because (as far as I remember) it uses different sin and cos implementations that are easier to SIMD, at the cost of some precision. Making those the default for every regular Julia sin call is not necessarily the best choice, and having a “safe”, precise version as the default makes sense from a language point of view.
If you want to have the succinctness of the array comprehension combined with the speed of LoopVectorization, check out Tullio.
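A sketch of what that looks like (assuming Tullio.jl is installed; when LoopVectorization is also loaded, @tullio uses @turbo under the hood):

```julia
using Tullio, LoopVectorization

n = 32
x = y = z = collect(range(0, 1; length=n))
u = zeros(n, n, n)

# Index notation close to the comprehension; Tullio generates the
# nested loops (and multithreads large problems) for you.
@tullio u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
```

This keeps the one-line readability of the comprehension while writing in place, with no N^3 temporary.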
It’s not quite fair to compare a loop with @turbo to a comprehension or broadcast without it. This is less about loop vs broadcast than about LoopVectorization.jl.
And you can use @turbo on broadcast expressions as well. Just try it.
Maybe you can also try @tturbo (two ts) if you have started Julia with multiple threads.
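A sketch on a plain same-shape broadcast (for broadcasts over reshaped size-1 dimensions, the LowDimArray approach shown later in the thread is the reliable route):

```julia
using LoopVectorization

x = collect(range(0, 1; length=10_000))
y = similar(x)

# @turbo compiles the broadcast with its SIMD-friendly sin/cos:
@turbo @. y = sin(x) + cos(x)

# Threaded variant (uncomment when Julia is started with threads, e.g. julia -t auto):
# @tturbo @. y = sin(x) + cos(x)
```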
Oh, that makes sense! In this case, the following assign_broadcast!(uv, X, Y, Z, xx, yy, zz) is as fast as the LoopVectorization.@turbo version.
function assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
    @. X = sin(xx)
    @. Y = sin(yy)
    @. Z = sin(zz)
    @. uv = X + Y + Z
end
Code
using BenchmarkTools
using LoopVectorization
function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end
        end
    end
end
function assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
    @. X = sin(xx)
    @. Y = sin(yy)
    @. Z = sin(zz)
    @. uv = X + Y + Z
end
itot = 384
dx = 1. / itot
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1)
u = zeros(itot+8, itot+8, itot+8)
uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1]
xx, yy, zz = reshape(x, (:, 1, 1)), reshape(y, (1, :, 1)), reshape(z, (1, 1, :))
X, Y, Z = similar(xx), similar(yy), similar(zz)
assign!(uv, x, y, z)
a = deepcopy(uv)
uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ]
b = deepcopy(uv)
@. uv = sin(xx) + sin(yy) + sin(zz)
c = deepcopy(uv)
assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
d = deepcopy(uv)
@show a ≈ b ≈ c ≈ d
print("LoopVectorization.@turbo:")
@btime assign!($uv, $x, $y, $z)
print("comprehension: ")
@btime $uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ]
print("simple broadcast: ")
@btime @. $uv = sin($xx) + sin($yy) + sin($zz)
print("assign_broadcast!: ")
@btime assign_broadcast!($uv, $X, $Y, $Z, $xx, $yy, $zz);
Result
LoopVectorization.@turbo: 39.716 ms (0 allocations: 0 bytes)
comprehension: 647.147 ms (2 allocations: 432.00 MiB)
simple broadcast: 600.589 ms (0 allocations: 0 bytes)
assign_broadcast!: 44.035 ms (0 allocations: 0 bytes)
For reference, maximum error in terms of units in last place of the functions LoopVectorization uses by default on a grid of about 600_000 points:
tan_fast : max 2.603572045148083 at x = 6.5094506e6 : mean 0.42450363181437606
Test Summary: | Pass Total
accuracy tan_fast | 9 9
sin_fast : max 2.003643767211625 at x = 4.7222 : mean 0.27851722375147403
Test Summary: | Pass Total
accuracy sin_fast | 9 9
cos_fast : max 2.184374798535446 at x = 620307.5 : mean 0.28243872204358456
vs the base functions:
sin : max 0.7734797606293216 at x = 2536.0506696103625 : mean 0.21277513105878326
Test Summary: | Pass Total
accuracy sin | 9 9
cos : max 0.779843895920064 at x = 2.3705822e6 : mean 0.2133223162511702
Test Summary: | Pass Total
accuracy cos | 9 9
tan : max 0.8251311667300948 at x = 8.6743325e6 : mean 0.25113740154668796
Test Summary: | Pass Total
accuracy tan | 9 9
Maximum error is still under 3 ULP, meaning the returned value lies between prevfloat(x, 3) and nextfloat(x, 3), where x is the correct answer. Mean error is still under 0.5 ULP, meaning results are usually rounded to the correct answer.
If you want accuracy like the base functions, it’s possible via
using SLEEFPirates
@inline accurate_sin(x) = SLEEFPirates.sin(x)
and then call accurate_sin in the loop.
This will have an accuracy of:
julia> test_acc(T, fun_table, xx, tol)
sin : max 0.7989949289913338 at x = 1.5724 : mean 0.208193240943644
Test Summary: | Pass Total
accuracy sin | 9 9
cos : max 0.7448464753656046 at x = 9.4218 : mean 0.20862328031282823
Test Summary: | Pass Total
accuracy cos | 9 9
tan : max 0.6892110945895238 at x = 6337.377780453992 : mean 0.25091196874489863
which is comparable to the base functions.
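ULP figures like these can be checked mechanically by comparing against a BigFloat reference; a minimal Base-only sketch (the helper name `ulp_error` is just illustrative):

```julia
# Error of a Float64 result in units in the last place, measured against
# a high-precision BigFloat reference.
function ulp_error(computed::Float64, reference::BigFloat)
    r = Float64(reference)   # correctly rounded reference value
    abs(computed - reference) / eps(r)
end

# Base.sin stays well under 1 ULP on this grid:
xs = range(0.1, 10.0; length=1001)
maxerr = maximum(Float64(ulp_error(sin(x), sin(big(x)))) for x in xs)
```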
Anyway, this benchmark is memory bound for me: simply filling uv with a constant value takes as long as the @turbo versions that calculate sin.
function assign_broadcast_turbo!(uv, X, Y, Z, xx, yy, zz)
    @turbo @. X = sin(xx)
    @turbo @. Y = sin(yy)
    @turbo @. Z = sin(zz)
    @turbo @. uv = X + Y + Z
end
should be much faster.
The latest release of LoopVectorization (0.12.57) should also optimize this by hoisting the xt[i] = sin(x[i]) and yt[j] = sin(y[j]) assignments into separate loops.
function assign!(u, xt, yt, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                xt[i] = sin(x[i]); yt[j] = sin(y[j])
                u[i, j, k] = xt[i] + yt[j] + sin(z[k])
            end
        end
    end
end
Full code
using LoopVectorization, SLEEFPirates, BenchmarkTools
@inline accurate_sin(x) = SLEEFPirates.sin(x)
function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end
        end
    end
end
function assign!(u, xt, yt, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                xt[i] = sin(x[i]); yt[j] = sin(y[j])
                u[i, j, k] = xt[i] + yt[j] + sin(z[k])
            end
        end
    end
end
function assign_accurate!(u, xt, yt, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                xt[i] = accurate_sin(x[i]); yt[j] = accurate_sin(y[j])
                u[i, j, k] = xt[i] + yt[j] + accurate_sin(z[k])
            end
        end
    end
end
function assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
    @. X = sin(xx)
    @. Y = sin(yy)
    @. Z = sin(zz)
    @. uv = X + Y + Z
end
function assign_broadcast_turbo!(uv, X, Y, Z, x, y, z)
    @turbo @. X = sin(x)
    @turbo @. Y = sin(y)
    @turbo @. Z = sin(z)
    zr = LowDimArray{(false, false, true)}(reshape(Z, (1, 1, length(Z))))
    @turbo @. uv = X + Y' + zr
end
function assign_broadcast_accurate_turbo!(uv, X, Y, Z, x, y, z)
    @turbo @. X = accurate_sin(x)
    @turbo @. Y = accurate_sin(y)
    @turbo @. Z = accurate_sin(z)
    zr = LowDimArray{(false, false, true)}(reshape(Z, (1, 1, length(Z))))
    @turbo @. uv = X + Y' + zr
end
itot = 384;
dx = 1. / itot;
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1);
u = zeros(itot+8, itot+8, itot+8);
uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1];
xx, yy, zz = reshape(x, (:, 1, 1)), reshape(y, (1, :, 1)), reshape(z, (1, 1, :));
X, Y, Z = similar(xx), similar(yy), similar(zz);
xt, yt, zt = similar(x), similar(y), similar(z);
assign!(uv, x, y, z);
a = deepcopy(uv); uv .= NaN;
uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ];
b = deepcopy(uv); uv .= NaN;
@. uv = sin(xx) + sin(yy) + sin(zz);
c = deepcopy(uv); uv .= NaN;
assign_broadcast!(uv, X, Y, Z, xx, yy, zz);
d = deepcopy(uv); uv .= NaN;
assign_broadcast_turbo!(uv, xt, yt, zt, x, y, z);
e = deepcopy(uv); uv .= NaN;
assign_broadcast_accurate_turbo!(uv, xt, yt, zt, x, y, z);
f = deepcopy(uv); uv .= NaN;
assign!(uv, xt, yt, x, y, z);
g = deepcopy(uv); uv .= NaN;
assign_accurate!(uv, xt, yt, x, y, z);
@show a ≈ b ≈ c ≈ d ≈ e ≈ f ≈ g ≈ uv
@btime assign!($uv, $x, $y, $z);
@btime [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ];
@btime @. $uv = sin($xx) + sin($yy) + sin($zz);
@btime assign_broadcast!($uv, $X, $Y, $Z, $xx, $yy, $zz);
@btime assign_broadcast_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
@btime assign_broadcast_accurate_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
@btime assign!($uv, $xt, $yt, $x, $y, $z);
@btime assign_accurate!($uv, $xt, $yt, $x, $y, $z);
@btime $uv .= 0;
@btime fill!($uv, 0);
@btime @turbo $uv .= 0;
Benchmark results on a 10980XE (Intel CPU with AVX512):
On both computers, fill!(uv, 0) or uv .= 0 took about twice as long as the LoopVectorization sin methods. The Apple M1 was much faster; this often seems to be the case when a problem is memory bound.