Fast performance of array comprehension without allocations

I have the following code, where the nice and compact array comprehension code is 34x slower (880 ms vs 26 ms) than the written kernel and allocates memory where I would like to prevent this. Is there a way to make the second timed assignment as fast as the first, and prevent the allocation?

using BenchmarkTools
using LoopVectorization

function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end 
        end 
    end 
end

itot = 384 
dx = 1. / itot
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1)
u = zeros(itot+8, itot+8, itot+8)

uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1]
@btime assign!(uv, x, y, z)

@btime uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ]

No, not as in your examples/MWE, because you compare a function, where the allocations are done before the benchmark, with another one with allocation included in the benchmark.
In other words: the LoopVectorization does need to allocate too, but you do it outside of the benchmark.

But in general I think, even if you do the benchmarking fair, that comprehensions can’t beat LoopVectorization optimized loops.

1 Like

Is there another way to do it fast, without having to write out the kernel? I am mainly missing the None of numpy that made this so easy in Python, but most likely it is my lack of experience in Julia.

I can’t think of anything, so I would say no, but I am usually wrong on questions like this. Let’s wait, what others say.

LoopVectorization.jl does a lot of great things. It is very hard to beat it.

If you want to set the memory allocation to zero only, you can do the following, but you will lose to LoopVectorization.jl.

xx, yy, zz = reshape(x, (:, 1, 1)), reshape(y, (1, :, 1)), reshape(z, (1, 1, :))
@btime @. $uv = sin($xx) + sin($yy) + sin($zz);
Code and Output

Warning: At the moment, we need to install the master branches of SLEEFPirates.jl and LoopVectorization.jl. See @turbo on loop with sin functions gives error · Issue #307 · JuliaSIMD/LoopVectorization.jl · GitHub

Code:

using BenchmarkTools
using LoopVectorization

function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end 
        end 
    end 
end

itot = 384 
dx = 1. / itot
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1)
u = zeros(itot+8, itot+8, itot+8)

uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1]
xx, yy, zz = reshape(x, (:, 1, 1)), reshape(y, (1, :, 1)), reshape(z, (1, 1, :))

assign!(uv, x, y, z)
a = deepcopy(uv)
uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ]
b = deepcopy(uv)
@. uv = sin(xx) + sin(yy) + sin(zz)
c = deepcopy(uv)
@show a ≈ b ≈ c

print("LoopVectorization.@turbo:")
@btime assign!($uv, $x, $y, $z)
print("comprehension:           ")
@btime $uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ]
print("broadcast:               ")
@btime @. $uv = sin($xx) + sin($yy) + sin($zz);

Output:

a ≈ b ≈ c = true
LoopVectorization.@turbo:  39.741 ms (0 allocations: 0 bytes)
comprehension:             782.135 ms (2 allocations: 432.00 MiB)
broadcast:                 707.889 ms (0 allocations: 0 bytes)

Code:

using Pkg
Pkg.status("SLEEFPirates")
Pkg.status("LoopVectorization")

Output:

      Status `D:\.julia\environments\v1.6\Project.toml`
  [476501e8] SLEEFPirates v0.6.23 `https://github.com/JuliaSIMD/SLEEFPirates.jl.git#master`
      Status `D:\.julia\environments\v1.6\Project.toml`
  [bdcacae8] LoopVectorization v0.12.56 `https://github.com/JuliaSIMD/LoopVectorization.jl.git#master`
2 Likes

Thanks! This helps. It was me who filed that issue, while figuring out this problem, but it has been solved. I am a little disappointed though how far the broadcasting solution is behind the optimized loop as this is a rather trivial function.

I’m not sure if LoopVectorization is extra smart here, but native Julia would cause many redundant calculations of sin. If you are free to mutate x, y, z you could precompute the elementwise sin of these arrays inplace and avoid the redundant computations.

4 Likes

The difference is astounding. LoopVectorization is almost 20 times faster! How is that even possible?? :astonished:

I mean, judging by this one would think LoopVectorization should be a mandatory part of the Julia 2.0 compiler. How is the compile-time overhead?

 14.914019 seconds (14.84 M allocations: 860.267 MiB, 1.89% gc time, 99.77% compilation time)
  1.052584 seconds (913.64 k allocations: 49.270 MiB, 21.40% compilation time)

The compile time difference is large. From 0.2 sec to 14.9

1 Like

Even without mutating, this is a huge effect – 3N not N^3 evaluations (and likewise 3N allocations instead of N^3 for the comprehension).

julia> function assign2!(u, x, y, z)
           xs = sin.(x)
           ys = sin.(y)
           zs = sin.(z)
           @inbounds for k in 1:length(z)
               for j in 1:length(y)
                   for i in 1:length(x)
                       u[i, j, k] = (xs[i]) + (ys[j]) + (zs[k])
                   end 
               end 
           end 
       end
assign2! (generic function with 1 method)

julia> @btime assign2!($uv, $x, $y, $z)
  8.122 ms (3 allocations: 9.38 KiB)

julia> @btime [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ];
  488.327 ms (2 allocations: 432.00 MiB)

julia> @btime @. $uv = sin($xx) + sin($yy) + sin($zz);
  467.296 ms (0 allocations: 0 bytes)

julia> @btime $uv .= identity(sin.($xx)) .+ identity(sin.($yy)) .+ identity(sin.($zz));  # un-fused broadcasts, should be like assign2!

julia> @btime assign!($uv, $x, $y, $z)
ERROR: MethodError: no method matching to_vecunrollscalar(::VectorizationBase.VecUnroll{1, 2, Float64, VectorizationBase.Vec{2, Float64}}, ::Static.StaticInt{3})
Stacktrace:
  [1] sin_fast
    @ ~/.julia/packages/SLEEFPirates/XGrnS/src/SLEEFPirates.jl:147 [inlined]

The LV version isn’t working for me (even with #master as noted); I think it can pull some computations out of loops but IIRC won’t allocate the temporary vectors needed to do all of them ahead of time.

1 Like

Just so people don’t get too excited here, LoopVectorization is that much faster because (as far as I remember) it’ll use a different sin and cos implementation that’s easier to SIMD at the cost of precision. Making those sin part of all regular julia calls is not necessarily the best choice and having a “safe” precise version as the default makes sense from a language point of view.

If you want to have the succinctness of the array comprehension combined with the speed of LoopVectorization, check out Tullio.

5 Likes

It’s not quite fair to compare a loop with @turbo to a comprehension or broadcast without it. This is less about loop vs broadcast than about LoopVectorization.jl.

And you can use @turbo on broadcast expressions as well. Just try it.

Maybe you can also try @tturbo (two ts) if you have started Julia with multiple threads.

3 Likes

Oh, that makes sense! In this case, the following assign_broadcast!(uv, X, Y, Z, xx, yy, zz) is as fast as the LoopVectorization.@turbo version.

function assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
    @. X = sin(xx)
    @. Y = sin(yy)
    @. Z = sin(zz)
    @. uv = X + Y + Z
end
Code
using BenchmarkTools
using LoopVectorization

function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end 
        end 
    end 
end

function assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
    @. X = sin(xx)
    @. Y = sin(yy)
    @. Z = sin(zz)
    @. uv = X + Y + Z
end

itot = 384 
dx = 1. / itot
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1)
u = zeros(itot+8, itot+8, itot+8)

uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1]
xx, yy, zz = reshape(x, (:, 1, 1)), reshape(y, (1, :, 1)), reshape(z, (1, 1, :))
X, Y, Z = similar(xx), similar(yy), similar(zz)

assign!(uv, x, y, z)
a = deepcopy(uv)
uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ]
b = deepcopy(uv)
@. uv = sin(xx) + sin(yy) + sin(zz)
c = deepcopy(uv)
assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
d = deepcopy(uv)
@show a ≈ b ≈ c ≈ d

print("LoopVectorization.@turbo:")
@btime assign!($uv, $x, $y, $z)
print("comprehension:           ")
@btime $uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ]
print("simple broadcast:        ")
@btime @. $uv = sin($xx) + sin($yy) + sin($zz)
print("assign_broadcast!:       ")
@btime assign_broadcast!($uv, $X, $Y, $Z, $xx, $yy, $zz);

Result

LoopVectorization.@turbo:  39.716 ms (0 allocations: 0 bytes)
comprehension:             647.147 ms (2 allocations: 432.00 MiB)
simple broadcast:          600.589 ms (0 allocations: 0 bytes)
assign_broadcast!:         44.035 ms (0 allocations: 0 bytes)
1 Like

For reference, maximum error in terms of units in last place of the functions LoopVectorization uses by default on a grid of about 600_000 points:

tan_fast          : max 2.603572045148083 at x = 6.5094506e6                     : mean 0.42450363181437606
Test Summary:     | Pass  Total
accuracy tan_fast |    9      9
sin_fast          : max 2.003643767211625 at x = 4.7222                          : mean 0.27851722375147403
Test Summary:     | Pass  Total
accuracy sin_fast |    9      9
cos_fast          : max 2.184374798535446 at x = 620307.5                        : mean 0.28243872204358456

vs the base functions:

sin               : max 0.7734797606293216 at x = 2536.0506696103625              : mean 0.21277513105878326
Test Summary: | Pass  Total
accuracy sin  |    9      9
cos               : max 0.779843895920064 at x = 2.3705822e6                     : mean 0.2133223162511702
Test Summary: | Pass  Total
accuracy cos  |    9      9
tan               : max 0.8251311667300948 at x = 8.6743325e6                     : mean 0.25113740154668796
Test Summary: | Pass  Total
accuracy tan  |    9      9

Maximum error is still under 3 ULP, meaning whatever it returns will be within prevfloat(x,3) and nextfloat(x,3) of the correct answer. Mean error is still under 0.5 ULP, meaning it is still normally rounded to the correct answer.

If you want accuracy like in bases functions, it’s possible via

using SLEEFPirates
@inline accurate_sin(x) = SLEEFPirates.sin(x)

and then call accurate_sin in the loop.
This will have an accuracy of:

julia> test_acc(T, fun_table, xx, tol)
sin               : max 0.7989949289913338 at x = 1.5724                          : mean 0.208193240943644
Test Summary: | Pass  Total
accuracy sin  |    9      9
cos               : max 0.7448464753656046 at x = 9.4218                          : mean 0.20862328031282823
Test Summary: | Pass  Total
accuracy cos  |    9      9
tan               : max 0.6892110945895238 at x = 6337.377780453992               : mean 0.25091196874489863

which is comparable to the base functions.

Anyway, this benchmark is memory bound for me, simply filling uv with a constant value takes as long as the @turbo versions calculating sin.

function assign_broadcast_turbo!(uv, X, Y, Z, xx, yy, zz)
    @turbo @. X = sin(xx)
    @turbo @. Y = sin(yy)
    @turbo @. Z = sin(zz)
    @turbo @. uv = X + Y + Z
end

should be much faster.
The latest release of LoopVectorization (0.12.57) should also optimize this by hoisting the xt[i] = sin(x[i]) and yt[j] = sin(y[j]) into separate loops.

function assign!(u, xt, yt, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                xt[i] = sin(x[i]); yt[j] = sin(y[j]);
                u[i, j, k] = xt[i] + yt[j] + sin(z[k])
            end 
        end 
    end
end
Full code
using LoopVectorization, SLEEFPirates, BenchmarkTools

@inline accurate_sin(x) = SLEEFPirates.sin(x)

function assign!(u, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                u[i, j, k] = sin(x[i]) + sin(y[j]) + sin(z[k])
            end 
        end 
    end 
end
function assign!(u, xt, yt, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                xt[i] = sin(x[i]); yt[j] = sin(y[j]);
                u[i, j, k] = xt[i] + yt[j] + sin(z[k])
            end 
        end 
    end
end
function assign_accurate!(u, xt, yt, x, y, z)
    @turbo for k in 1:length(z)
        for j in 1:length(y)
            for i in 1:length(x)
                xt[i] = accurate_sin(x[i]); yt[j] = accurate_sin(y[j]);
                u[i, j, k] = xt[i] + yt[j] + accurate_sin(z[k])
            end 
        end 
    end
end

function assign_broadcast!(uv, X, Y, Z, xx, yy, zz)
    @. X = sin(xx)
    @. Y = sin(yy)
    @. Z = sin(zz)
    @. uv = X + Y + Z
end

function assign_broadcast_turbo!(uv, X, Y, Z, x, y, z)
    @turbo @. X = sin(x)
    @turbo @. Y = sin(y)
    @turbo @. Z = sin(z)
    zr = LowDimArray{(false,false,true)}(reshape(Z, (1,1,length(Z))))
    @turbo @. uv = X + Y' + zr
end
function assign_broadcast_accurate_turbo!(uv, X, Y, Z, x, y, z)
    @turbo @. X = accurate_sin(x)
    @turbo @. Y = accurate_sin(y)
    @turbo @. Z = accurate_sin(z)
    zr = LowDimArray{(false,false,true)}(reshape(Z, (1,1,length(Z))))
    @turbo @. uv = X + Y' + zr
end

itot = 384;
dx = 1. / itot;
x = dx*collect(0:itot-1); y = dx*collect(0:itot-1); z = dx*collect(0:itot-1);
u = zeros(itot+8, itot+8, itot+8);

uv = @view u[5:5+itot-1, 5:5+itot-1, 5:5+itot-1];
xx, yy, zz = reshape(x, (:, 1, 1)), reshape(y, (1, :, 1)), reshape(z, (1, 1, :));
X, Y, Z = similar(xx), similar(yy), similar(zz);
xt, yt, zt = similar(x), similar(y), similar(z);

assign!(uv, x, y, z);
a = deepcopy(uv); uv .= NaN;
uv[:, :, :] = [ sin(x) + sin(y) + sin(z) for x=x, y=y, z=z ];
b = deepcopy(uv); uv .= NaN;
@. uv = sin(xx) + sin(yy) + sin(zz);
c = deepcopy(uv); uv .= NaN;
assign_broadcast!(uv, X, Y, Z, xx, yy, zz);
d = deepcopy(uv); uv .= NaN;
assign_broadcast_turbo!(uv, xt, yt, zt, x, y, z);
e = deepcopy(uv); uv .= NaN;
assign_broadcast_accurate_turbo!(uv, xt, yt, zt, x, y, z);
f = deepcopy(uv); uv .= NaN;
assign!(uv, xt, yt, x, y, z);
g = deepcopy(uv); uv .= NaN;
assign_accurate!(uv, xt, yt, x, y, z);
@show a ≈ b ≈ c ≈ d ≈ e ≈ f ≈ g ≈ uv


@btime assign!($uv, $x, $y, $z);
@btime [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ];
@btime @. $uv = sin($xx) + sin($yy) + sin($zz);
@btime assign_broadcast!($uv, $X, $Y, $Z, $xx, $yy, $zz);
@btime assign_broadcast_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
@btime assign_broadcast_accurate_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
@btime assign!($uv, $xt, $yt, $x, $y, $z);
@btime assign_accurate!($uv, $xt, $yt, $x, $y, $z);
@btime $uv .= 0;
@btime fill!($uv, 0);
@btime @turbo $uv .= 0;

Benchmark results on a 10980XE (Intel CPU with AVX512):

julia> @btime assign!($uv, $x, $y, $z);
  28.200 ms (0 allocations: 0 bytes)

julia> @btime [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ];
  744.690 ms (2 allocations: 432.00 MiB)

julia> @btime @. $uv = sin($xx) + sin($yy) + sin($zz);
  772.856 ms (0 allocations: 0 bytes)

julia> @btime assign_broadcast!($uv, $X, $Y, $Z, $xx, $yy, $zz);
  41.856 ms (0 allocations: 0 bytes)

julia> @btime assign_broadcast_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
  25.863 ms (2 allocations: 112 bytes)

julia> @btime assign_broadcast_accurate_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
  25.874 ms (2 allocations: 112 bytes)

julia> @btime assign!($uv, $xt, $yt, $x, $y, $z);
  24.977 ms (0 allocations: 0 bytes)

julia> @btime assign_accurate!($uv, $xt, $yt, $x, $y, $z);
  24.710 ms (0 allocations: 0 bytes)

julia> @btime $uv .= 0;
  52.633 ms (0 allocations: 0 bytes)

julia> @btime fill!($uv, 0);
  52.635 ms (0 allocations: 0 bytes)

julia> @btime @turbo $uv .= 0;
  27.263 ms (0 allocations: 0 bytes)

Benchmark results on an Apple M1 Mac:

julia> @btime assign!($uv, $x, $y, $z);
  33.899 ms (0 allocations: 0 bytes)

julia> @btime [ sin(x) + sin(y) + sin(z) for x=$x, y=$y, z=$z ];
  531.773 ms (2 allocations: 432.00 MiB)

julia> @btime @. $uv = sin($xx) + sin($yy) + sin($zz);
  533.149 ms (0 allocations: 0 bytes)

julia> @btime assign_broadcast!($uv, $X, $Y, $Z, $xx, $yy, $zz);
  29.811 ms (0 allocations: 0 bytes)

julia> @btime assign_broadcast_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
  15.307 ms (2 allocations: 96 bytes)

julia> @btime assign_broadcast_accurate_turbo!($uv, $xt, $yt, $zt, $x, $y, $z);
  15.273 ms (2 allocations: 96 bytes)

julia> @btime assign!($uv, $xt, $yt, $x, $y, $z);
  15.112 ms (0 allocations: 0 bytes)

julia> @btime assign_accurate!($uv, $xt, $yt, $x, $y, $z);
  15.099 ms (0 allocations: 0 bytes)

julia> @btime $uv .= 0;
  30.959 ms (0 allocations: 0 bytes)

julia> @btime fill!($uv, 0);
  30.960 ms (0 allocations: 0 bytes)

julia> @btime @turbo $uv .= 0;
  17.485 ms (0 allocations: 0 bytes)

On both computers, fill!(uv, 0) or uv .= 0 took about twice as long as the LoopVectorization sin methods. The Apple M1 was much faster. This often seems to be the case when a problem is memory bound.

7 Likes