@turbo on sets of operations

Hello,
I’m thinking of adding LoopVectorization.jl to my code to speed it up. All the examples I’ve found are very short: simple operations within a matrix or vector. My code is not as neat, and I often have to perform 3 or 4 identical operations in a row on different things. Unfortunately, I’ve found that for performance it is better not to put them in vectors but to perform them on scalars, like so:

a = 1.0
b = 2.0
c = 3.0
a2 = a * 2
b2 = b * 2
c2 = c * 2
a3 = fun(a2)
b3 = fun(b2)
c3 = fun(c2)

(It should go without saying, but this is not my actual code, just a dumb example. Also, I do know that fun can’t have if statements.)
Is there a simple way to apply @turbo to such operations?
Ideally, it would also be nice if my code could run both on CPUs with AVX and on GPUs. I’m not sure there’s a good way to use @turbo so that it can be cleanly skipped when running on GPUs without too many rewrites; if some code duplication is unavoidable, I’d at least like the CPU and GPU versions to be nearly identical.
Thanks a lot!

I don’t know if @turbo has much utility for such small operations, but for your current example you could use tuples:

a = (1.0, 2.0, 3.0)
a2 = a .* 2
a3 = fun.(a2)

Tuples should be about as fast as writing out the scalar operations. If you need more functionality than tuples can offer, then you can take a look at StaticArrays.jl.
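For example, here is a minimal sketch of the StaticArrays variant, with exp standing in for your real fun (the names and values are just illustrative):

using StaticArrays

fun(x) = exp(x)              # hypothetical stand-in for the real function
a  = SVector(1.0, 2.0, 3.0)
a2 = a .* 2                  # broadcasting over an SVector stays stack-allocated
a3 = fun.(a2)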


Yep, I used to use StaticArrays for these and it turned out the scalars were faster.
Thanks!

If you can use Tullio.jl’s syntax for your operations, it generates both @turbo and GPU-compatible versions of a given kernel without needing to change anything but the argument type.
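For instance, a rough sketch of the Tullio route could look like this (the kernel and array are just illustrative, with fun standing in for your real branch-free function):

using Tullio, LoopVectorization   # Tullio picks up LoopVectorization's vectorized path when it is loaded

fun(x) = log(2x)                  # hypothetical example kernel
a = rand(1000)
@tullio out[i] := fun(a[i])       # vectorized CPU loop

Per Tullio's docs, the same @tullio expression should also work on GPU arrays (e.g. CuArray) once the GPU backends it documents are loaded, without rewriting the kernel.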

You could use VectorizationBase.Vec:

fun = log # example
using VectorizationBase
a, b, c = 1.0, 2.0, 3.0
v = Vec(a, b, c)
vres = fun(2 * v)
res = ntuple(vres, Val(3))

This should result in SIMD evaluation of fun, but with the same limitations as LoopVectorization (e.g., fun can’t have branches).

Thanks. I tried your code, but got ERROR: MethodError: no method matching log(::Vec{4, Float64}). I also tried trig functions and got the same error. Am I missing a step?
Also, this seems like a nice way of doing things, but it also changes the code quite a bit (though it maintains legibility, which is nice). If I’m writing this for CPU and GPU usage, then I’d need to write each function for both, right? I guess this syntax wouldn’t work on the GPU?
Thanks again!

Thanks. I’m reading Tullio’s docs and trying to figure it out.

Oops,

using VectorizationBase, SLEEFPirates

Thanks, @Elrod. That works and on a simple log(x) and atan(x,y) over 4 elements I do get over 2x speedup over the same operation broadcasted over a StaticArray (4x on atan(x)).
Do I understand correctly that this approach forces me to write my functions twice if I want to be able to run on CPU and GPU?
Thanks again!

On second examination, maybe it’s not so fast.

using VectorizationBase, SLEEFPirates, StaticArrays, BenchmarkTools

function tmpV(a, b, c, d)
    v = Vec(a, b, c, d)
    vres = log(v)
    return vres(1), vres(2), vres(3), vres(4)
end

function tmpSA(a, b, c, d)
    v = SA_F64[a, b, c, d]
    vres = log.(v)
    return vres[1], vres[2], vres[3], vres[4]
end

@btime tmpV(1, 2, 3, 4)  # 17.8 ns
@btime tmpSA(1, 2, 3, 4) # 1.5 ns

I’m guessing there’s a big overhead which makes this example not worth the vectorization?

Be careful to not be tricked by caching of values:

julia> @btime tmpSA(a,b,c,d) setup=(a=rand();b=rand();c=rand();d=rand()) evals=1
  58.000 ns (0 allocations: 0 bytes)
(-0.09608092242402255, -0.39723403405144936, -1.1578856002927018, -0.7984489154375171)

julia> @btime tmpSA(1.0,2.0,3.0,4.0)
  1.452 ns (0 allocations: 0 bytes)
(0.0, 0.6931471805599453, 1.0986122886681098, 1.3862943611198906)

julia> @btime tmpV(a,b,c,d) setup=(a=rand();b=rand();c=rand();d=rand()) evals=1
  52.000 ns (0 allocations: 0 bytes)
(-0.45226032644549463, -0.7661123251956726, -1.579712645771708, -3.281483090946502)



Well, that kind of ruins @btime for me =P. Evaluating it only once also kind of defeats its purpose, right? Might as well use @time.
I did this instead:

@btime begin
    x = 0.0
    for i in 1:10000
        a = 1.1*i
        b = 2.2*i
        c = 3.3*i
        d = 4.4*i
        q, w, e, r = tmp(a, b, c, d)  # tmp is either tmpV or tmpSA
        x += q + w + e + r
    end
end

And the results were that the Vec version was 2.3x faster, which is about the same as I got when testing log by itself, so that seems more consistent.
Thanks!

If anyone can comment on clever ways to write this so I can have a GPU version also working, that’d be great!

Actually, it is evaluating once per sample. BenchmarkTools runs many samples, and each sample may contain several evaluations of the function with the same parameters. Here we just specify that each sample will run the function only once, but there will still be many samples. We can see that with:

julia> @benchmark tmpSA(a,b,c,d) setup=(a=rand();b=rand();c=rand();d=rand()) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  58.000 ns … 486.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   68.355 ns ±   9.819 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▆ █▇ ▄                                                 
  ▂▂▁▃▆▁██▁██▁██▁▅▄▁▄▄▁▄▁▄▄▁▄▄▁▄▄▁▄▃▁▃▃▁▃▂▁▂▁▂▂▁▂▂▁▂▂▁▂▂▁▂▂▁▂▂ ▃
  58 ns           Histogram: frequency by time           97 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.


Thus, the function was run 10000 times here.
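For completeness, a small sketch of controlling samples and evaluations explicitly (the numbers here are arbitrary):

using BenchmarkTools
# Up to 500 samples; each sample runs the setup once and then evaluates the
# call exactly once with those fresh inputs:
@benchmark tmpSA(a, b, c, d) setup=(a=rand(); b=rand(); c=rand(); d=rand()) evals=1 samples=500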

PS: Always be suspicious of benchmarks reporting less than a few nanoseconds. They are usually wrong.

Huh. Then why did you not get any speedup, while I seem to get one when using the for loop?

The problem might be that the computer (where exactly, I don’t know) is caching values (the result, intermediate values, I don’t know). Then you are not really measuring the time the function actually takes to run. That is why re-running the function at every sample with a new set of values is safer.

I am not sure exactly when this kind of problem may arise in benchmarking, but it can arise even if you benchmark independent runs of compiled binaries on a computer, one after the other. A lot of the work of putting things into memory, etc., may be saved by the OS.

This is very common, actually:

julia> @btime sin(5.0) # this is wrong
  1.537 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> x = 5.0

julia> @btime sin($x) # this is correct, I think
  7.776 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> @btime sin(x) setup=(x=rand()) evals=1  # this will vary the value of the input
  31.000 ns (0 allocations: 0 bytes)
0.6563352918810222

Also, here, there is the fact that computing the sin of one number has a different cost than computing the sin of another, so one needs to know exactly what one wants to benchmark, considering the inputs the function will actually take.
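For instance, one way to make the intended input explicit is to draw it in setup from the range you actually care about (the ranges below are just illustrative):

using BenchmarkTools
@btime sin(x) setup=(x = rand())              # small arguments in [0, 1)
@btime sin(x) setup=(x = 1e6 * (1 + rand()))  # large arguments need extra range reduction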

edit:

Actually:

julia> x = 5.0
5.0

julia> @btime sin($x)
  7.777 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> @btime sin($x) evals=1
  38.000 ns (0 allocations: 0 bytes)
-0.9589242746631385

I don’t know. Maybe we should just file an issue. Probably evals=1 should be the default.

I share the concerns… I started a new thread here: How to benchmark properly? Should defaults change?.

@ribeiro, so I was wrong there: apparently there is a ~30 ns overhead (on my machine) when one does a single evaluation, which gets diluted when many evaluations are performed.

In this case, the correct benchmarks are probably these:

julia> a = rand(); b= rand(); c = rand(); d = rand();

julia> @btime tmpSA($a,$b,$c,$d)
  20.965 ns (0 allocations: 0 bytes)
(-0.8734155439518515, -0.5969015730781967, -0.46053090003180003, -0.6237616099727648)

julia> @btime tmpV($a,$b,$c,$d)
  10.864 ns (0 allocations: 0 bytes)
(-0.8734155439518515, -0.5969015730781967, -0.4605309000318, -0.6237616099727648)


In more recent (nightly) Julia it seems that you need to do something like:

julia> @btime tmpV($(Ref(a))[],$(Ref(b))[],$(Ref(c))[],$(Ref(d))[])
  11.132 ns (0 allocations: 0 bytes)
(-0.8734155439518515, -0.5969015730781967, -0.4605309000318, -0.6237616099727648)


to avoid artifacts from constant propagation. But on 1.6.2 that does not seem to be necessary for this benchmark.