This requires the master branch of both VectorizationBase and LoopVectorization.
They should be released within the next few hours.
using VectorizationBase, LoopVectorization, IfElse
collatz_SIMD(x) =
IfElse.ifelse(VectorizationBase.iseven(x), x ÷ 2, 3x + 1)
function collatz_sequencia_SIMD(x)
n = zero(x)
m = x ≠ 0
while VectorizationBase.vany(VectorizationBase.collapse_or(m))
n += m
x = collatz_SIMD(x)
m &= x ≠ 1
end
return n
end
vmap(collatz_sequencia_SIMD, 1:100) == map(collatz_sequencia_SIMD, 1:100)
Performance seems better for large ranges, but worse for small ones:
julia> dest = Vector{Int}(undef, 100);
julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.552 μs … 8.688 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.563 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.570 μs ± 77.390 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▆██▆▂ ▁ ▂
▆██████▇▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▅▆▇█████ █
4.55 μs Histogram: log(frequency) by time 4.72 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.461 μs … 4.313 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.499 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.512 μs ± 51.718 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅▇█▅▄▂▁▁▁
▁▁▂▄▇███████████▇▇▅▅▄▄▃▃▃▃▃▃▄▄▄▅▅▅▅▄▄▃▃▃▂▂▂▂▁▁▂▁▁▂▁▁▁▁▁▁▁▁ ▃
2.46 μs Histogram: frequency by time 2.63 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> dest = Vector{Int}(undef, 1000);
julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 44.796 μs … 76.785 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 45.153 μs ┊ GC (median): 0.00%
Time (mean ± σ): 45.152 μs ± 573.617 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▆▆▅▄▂ ▂▆██▆▁ ▁▂ ▁▁▁▁ ▂
▅████████▇▅▄██████▆███▃▁▁▁▁▄███▅▄▁▁▁▁▁▁▁▁▃▁▄▆▆▇█▇▇█▇▅▆▆▇████ █
44.8 μs Histogram: log(frequency) by time 46.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 87.214 μs … 105.938 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 88.666 μs ┊ GC (median): 0.00%
Time (mean ± σ): 88.719 μs ± 614.135 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▂▃▅▅▆▆▆▇█▇▇▆▇▅▄▄▂▂▁
▁▁▁▁▁▁▁▂▂▂▂▃▃▄▅▆▆▇█████████████████████▇▇▆▆▄▄▅▃▄▃▃▃▃▃▂▂▂▂▂▂▂ ▄
87.2 μs Histogram: frequency by time 90.2 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
This is with a computer that has AVX512. Performance will probably be a lot worse without it, because (aside from 512-bit vectors), AVX512 is needed for SIMD Int64 multiplication.
Using Int32 gives a roughly 2x performance boost for vmap, while making map slower:
julia> dest = Vector{Int32}(undef, 1000);
julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, Int32(1):Int32(1_000))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 22.573 μs … 60.918 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 22.658 μs ┊ GC (median): 0.00%
Time (mean ± σ): 22.721 μs ± 571.101 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁█▇
▂▃███▆▃▂▃▅█▆▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
22.6 μs Histogram: frequency by time 23.8 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(collatz_sequencia_SIMD, $dest, Int32(1):Int32(1_000))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
Range (min … max): 133.721 μs … 151.036 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 135.518 μs ┊ GC (median): 0.00%
Time (mean ± σ): 135.627 μs ± 668.463 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅▅▆█▇▆▅▄▃▃
▁▁▁▁▁▁▁▁▁▂▂▁▂▂▂▂▃▂▃▄▅▆██████████████▇▆▆▅▅▅▄▄▄▄▄▄▄▃▃▂▃▂▂▂▂▁▂▁▁ ▃
134 μs Histogram: frequency by time 137 μs <
Memory estimate: 0 bytes, allocs estimate: 0.