Adding constant size vectors

I need to add two vectors of size 256. What would be the fastest way to do this?
It needs to be as optimized as possible, since it will run inside an alpha-beta search.
Sub-questions: would it be much faster to use Int8 vectors instead of Float32? Would it be faster to use tuples instead of vectors?
Any hint toward a solution, or code I could take inspiration from, would be appreciated.
Thanks in advance.

Int8 should be much faster, since each AVX instruction processes a fixed-width register and four times as many Int8 elements as Float32 elements fit in it.
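
To make the "more per instruction" argument concrete, here is a minimal sketch of the arithmetic. The helper name `lanes` is made up for this illustration, and the register widths listed are the usual x86 ones. Whether your CPU supports them is a separate question.

```julia
# Number of elements of type T that fit in one SIMD register of `bits` bits.
# `lanes` is a hypothetical helper name used only for this illustration.
lanes(::Type{T}, bits::Integer) where {T} = bits ÷ (8 * sizeof(T))

for bits in (128, 256, 512)   # SSE, AVX/AVX2, AVX-512 register widths
    println(bits, "-bit register: ", lanes(Int8, bits), " Int8 lanes vs ",
            lanes(Float32, bits), " Float32 lanes")
end

# Caveat: fixed-width integer addition wraps around on overflow.
@assert typemax(Int8) + Int8(1) == typemin(Int8)
```

The last line is worth keeping in mind: Int8 sums silently wrap, so the narrower type only helps if the values in the search stay within range.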

Is it something the compiler will do automatically?
That is, can I get something faster than
a .+= b? Usually, as long as I don't allocate, I never find a way to write code faster than pure Julia. And yet there are things like VectorizationBase…
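
One way to check whether the compiler vectorizes automatically is to look at the generated LLVM IR for vector types. The following is a rough sketch, not a definitive test: `add_inplace!` is a hypothetical helper written for this example, and the result depends on the CPU and Julia version.

```julia
using InteractiveUtils  # provides code_llvm (loaded automatically in the REPL)

# Hypothetical helper: in-place elementwise addition, a[i] += b[i].
function add_inplace!(a, b)
    @inbounds @simd for i in eachindex(a, b)
        a[i] += b[i]
    end
    return a
end

# Capture the optimized LLVM IR for UInt8 vectors as a string.
io = IOBuffer()
code_llvm(io, add_inplace!, Tuple{Vector{UInt8}, Vector{UInt8}}; debuginfo=:none)
ir = String(take!(io))

# Vector types appear in LLVM IR as, e.g., "<32 x i8>".
has_simd = occursin(r"<\d+ x i8>", ir)
println(has_simd ? "IR contains i8 vector types" : "no i8 vector types found")
```

In the REPL, `@code_llvm add_inplace!(a, b)` prints the same IR directly for given arguments.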

julia> using BenchmarkTools
julia> a=rand(UInt8, 256);
julia> b=rand(UInt8, 256);

julia> @btime a.+=b;
  363.153 ns (2 allocations: 64 bytes)

julia> using LoopVectorization
julia> @btime @turbo a.+=b;
  270.824 ns (2 allocations: 64 bytes)

I think the difference is AVX-512, but I’m not 100% sure.

Reminder that global variables should be interpolated (with $) when benchmarking, otherwise the timing includes the overhead of accessing untyped globals:

julia> ai = rand(UInt8, 256);
julia> bi = rand(UInt8, 256);
julia> af = rand(Float32, 256);
julia> bf = rand(Float32, 256);


julia> using BenchmarkTools, LoopVectorization
julia> @btime $ai .+= $bi;
  7.600 ns (0 allocations: 0 bytes)

julia> @btime @turbo $ai .+= $bi;
  6.800 ns (0 allocations: 0 bytes)

julia> @btime $af .+= $bf;
  15.431 ns (0 allocations: 0 bytes)

julia> @btime @turbo $af .+= $bf;
  14.329 ns (0 allocations: 0 bytes)

256 elements sounds slightly too large for StaticArrays, but maybe it is worth a shot?
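
Since an SVector from StaticArrays stores its elements in a tuple, the tuple sub-question can be tried in plain Julia first. This is only a sketch for experimenting: `add_tuples` is a hypothetical helper, and no speed claim is implied. With 256 elements, compile time may be noticeable, so benchmark it against the in-place vector version before adopting it.

```julia
# Hypothetical helper: add two equal-length tuples elementwise.
# Val(N) makes the length known at compile time.
add_tuples(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T} =
    ntuple(i -> a[i] + b[i], Val(N))

ta = ntuple(i -> UInt8(i % 256), Val(256))  # 1, 2, ..., 255, 0
tb = ntuple(i -> UInt8(1), Val(256))        # all ones
tc = add_tuples(ta, tb)   # a new tuple; ta and tb are unchanged
```

Note that tuples are immutable, so every addition produces a new value rather than updating in place as a .+= b does.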

Just for the record, @turbo doesn't seem to do much better than @simd here (and it increases initial latency, so it may not be worthwhile):

julia> function sumab(a,b) 
           @inbounds @simd for i in eachindex(a)
               a[i] += b[i] 
           end
           return a
       end
sumab (generic function with 2 methods)

julia> @btime sumab($a,$b);
  5.556 ns (0 allocations: 0 bytes)

julia> @btime @turbo $a .+= $b;
  5.842 ns (0 allocations: 0 bytes)

