Hello,
I’m thinking of adding LoopVectorization.jl to my code to speed it up. All the examples I’ve found are very short, performing simple operations within a single matrix or vector. My code is not as neat: I often have to perform 3 or 4 identical operations in a row on different things. Unfortunately, I’ve found that, for performance, it is better not to put them in vectors but to perform them on scalars, like so:
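Something like this (a minimal sketch of the pattern I mean; fun stands in for my real, more complicated function):

a = 1.0
b = 2.0
c = 3.0
a2 = 2 * a    # the same operation repeated on separate scalars...
b2 = 2 * b
c2 = 2 * c
a3 = fun(a2)  # ...and then the same function applied to each result
b3 = fun(b2)
c3 = fun(c2)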
(It should go without saying, but this is not my actual code, just a dumb example. Also, I do know that fun can’t have if statements.)
Is there a simple way to apply @turbo to such operations?
Ideally, it would also be nice if my code could run both on CPUs with AVX and on GPUs. I’m not sure there’s a good way to use @turbo that can be neatly skipped when running on GPUs without too many rewrites, so if some code duplication is unavoidable, I’d at least like the CPU and GPU code paths to be nearly identical.
Thanks a lot!
I don’t know if @turbo has much utility for such small operations, but for your current example you could use tuples:
a = (1.0, 2.0, 3.0)  # a tuple instead of separate scalars
a2 = a .* 2          # broadcasting works on tuples
a3 = fun.(a2)
Tuples should be about as fast as writing out the scalar operations. If you need more functionality than tuples can offer, take a look at StaticArrays.jl.
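For example, something like this (a rough sketch; fun is the element-wise function from your example):

using StaticArrays

a = SVector(1.0, 2.0, 3.0)  # fixed-size, stack-allocated vector
a2 = a .* 2                 # broadcasting returns another SVector
a3 = fun.(a2)               # no heap allocations, comparable speed to scalar code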
If you can use Tullio.jl’s syntax for your operations, it generates both @turbo and GPU-compatible versions of a given kernel without needing to change anything but the argument type.
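Roughly like this (a sketch based on my reading of Tullio’s README, not tested; fun and the array names are placeholders):

using Tullio, LoopVectorization  # Tullio emits @turbo-style loops when LoopVectorization is loaded

a = rand(100)
@tullio b[i] := fun(2 * a[i])    # pass a CuArray instead (with CUDA and KernelAbstractions
                                 # loaded) and the same expression becomes a GPU kernel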
Thanks. I tried your code, but got ERROR: MethodError: no method matching log(::Vec{4, Float64}). I also tried trig functions and got the same error. Am I missing a step?
Also, this seems like a nice way of doing things, but it also changes the code quite a bit (though it maintains legibility, which is nice). If I’m writing this for both CPU and GPU usage, then I’d need to write each function twice, right? I guess this syntax wouldn’t work on the GPU?
Thanks again!
Thanks, @Elrod. That works, and on a simple log(x) and atan(x, y) over 4 elements I get over a 2x speedup compared with the same operation broadcast over a StaticArray (4x on atan(x)).
Do I understand correctly that this approach forces me to write my functions twice if I want to be able to run on CPU and GPU?
Thanks again!
Well, that kind of ruins @btime for me =P. Evaluating it only once also kind of defeats its purpose, right? Might as well use @time.
I did this instead:
@btime begin
    x = 0.0
    for i in 1:10000
        a = 1.1 * i
        b = 2.2 * i
        c = 3.3 * i
        d = 4.4 * i
        q, w, e, r = tmp(a, b, c, d)
        x += q + w + e + r
    end
end
And the results were that the Vec version was 2.3x faster, which is about the same as I got when testing log by itself, so this seems more consistent.
Thanks!
If anyone can comment on clever ways to write this so I can have a GPU version also working, that’d be great!
Actually, it is evaluating once per sample. BenchmarkTools runs many samples, and each sample may contain several evaluations of the function with the same parameters. Here we just specify that each sample will evaluate the function only once, but there will still be many samples. We can see that with:
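For instance (illustrative; the exact sample count will vary from run to run):

using BenchmarkTools

t = @benchmark sin(x) setup=(x = rand()) evals=1  # one evaluation per sample
length(t.times)  # number of samples collected (usually in the thousands)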
The problem might be that the computer is caching values somewhere (the result, intermediate values, I don’t know exactly what or where). Then you are not really measuring the time the function actually takes to run. That is why restarting the function at every sample with a new set of values is safer.
I am not sure exactly when this kind of problem may arise in benchmarking, but it can arise even if you benchmark independent runs of compiled binaries on a computer, one after the other: a lot of the work of putting things into memory, etc., may be saved by the OS.
This is very common, actually:
julia> @btime sin(5.0) # this is wrong
  1.537 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> x = 5.0
5.0

julia> @btime sin($x) # this is correct, I think
  7.776 ns (0 allocations: 0 bytes)
-0.9589242746631385

julia> @btime sin(x) setup=(x=rand()) evals=1 # this will vary the value of the input
  31.000 ns (0 allocations: 0 bytes)
0.6563352918810222
Also, there is the fact that computing the sin of one number can have a different cost than computing the sin of another, so one needs to know exactly what one wants to benchmark, considering the inputs the function will actually take.
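For example (a sketch; I’m not quoting timings since they vary by machine, and the large value is just a guess at something that triggers the slower argument-reduction path):

using BenchmarkTools

x_small = 0.1
x_huge = 1.0e10  # very large arguments need more expensive argument reduction
@btime sin($x_small)
@btime sin($x_huge)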
@ribeiro, so I was wrong there: apparently there is a ~30 ns delay (on my machine) when one does a single evaluation, which gets diluted when many evaluations are performed.
In this case, the correct benchmarks are probably these:
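Something along these lines (a sketch; tmp is the scalar version from above, and tmp_vec is a hypothetical name for the Vec-based version):

using BenchmarkTools

@btime tmp(a, b, c, d) setup=(a = rand(); b = rand(); c = rand(); d = rand()) evals=1
@btime tmp_vec(a, b, c, d) setup=(a = rand(); b = rand(); c = rand(); d = rand()) evals=1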