LoopVectorization.jl vmap gives an error ::VectorizationBase.Vec{4, Int64}

This requires the master branch of both VectorizationBase and LoopVectorization.
They should be released within the next few hours.

using VectorizationBase, LoopVectorization, IfElse

collatz_SIMD(x) =
    IfElse.ifelse(VectorizationBase.iseven(x), x ÷ 2, 3x + 1)

function collatz_sequencia_SIMD(x)
    n = zero(x)
    m = x ≠ 0
    while  VectorizationBase.vany(VectorizationBase.collapse_or(m))
        n += m
        x = collatz_SIMD(x)
        m &= x ≠ 1
    end
    return n
end

vmap(collatz_sequencia_SIMD, 1:100) == map(collatz_sequencia_SIMD, 1:100)

Performance seems better for large ranges, but worse for small ones:

julia> dest = Vector{Int}(undef, 100);

julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.552 μs …  8.688 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.563 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.570 μs ± 77.390 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▆██▆▂                                                ▁   ▂
  ▆██████▇▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▅▆▇█████ █
  4.55 μs      Histogram: log(frequency) by time     4.72 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.461 μs …  4.313 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.499 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.512 μs ± 51.718 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▅▄▂▁▁▁
  ▁▁▂▄▇███████████▇▇▅▅▄▄▃▃▃▃▃▃▄▄▄▅▅▅▅▄▄▃▃▃▂▂▂▂▁▁▂▁▁▂▁▁▁▁▁▁▁▁ ▃
  2.46 μs        Histogram: frequency by time        2.63 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> dest = Vector{Int}(undef, 1000);

julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  44.796 μs …  76.785 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.152 μs ± 573.617 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▅▆▆▅▄▂   ▂▆██▆▁ ▁▂                                   ▁▁▁▁ ▂
  ▅████████▇▅▄██████▆███▃▁▁▁▁▄███▅▄▁▁▁▁▁▁▁▁▃▁▄▆▆▇█▇▇█▇▅▆▆▇████ █
  44.8 μs       Histogram: log(frequency) by time      46.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  87.214 μs … 105.938 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.666 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.719 μs ± 614.135 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▁▁▂▃▅▅▆▆▆▇█▇▇▆▇▅▄▄▂▂▁
  ▁▁▁▁▁▁▁▂▂▂▂▃▃▄▅▆▆▇█████████████████████▇▇▆▆▄▄▅▃▄▃▃▃▃▃▂▂▂▂▂▂▂ ▄
  87.2 μs         Histogram: frequency by time         90.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is with a computer that has AVX512. Performance will probably be a lot worse without it, because (aside from 512-bit vectors), AVX512 is needed for SIMD Int64 multiplication.

Using Int32 gives a roughly 2x performance boost for vmap, while making map slower:

julia> dest = Vector{Int32}(undef, 1000);

julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, Int32(1):Int32(1_000))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  22.573 μs …  60.918 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.658 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.721 μs ± 571.101 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁█▇
  ▂▃███▆▃▂▃▅█▆▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  22.6 μs         Histogram: frequency by time         23.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(collatz_sequencia_SIMD, $dest, Int32(1):Int32(1_000))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  133.721 μs … 151.036 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     135.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.627 μs ± 668.463 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▂▄▅▅▆█▇▆▅▄▃▃
  ▁▁▁▁▁▁▁▁▁▂▂▁▂▂▂▂▃▂▃▄▅▆██████████████▇▆▆▅▅▅▄▄▄▄▄▄▄▃▃▂▃▂▂▂▂▁▂▁▁ ▃
  134 μs           Histogram: frequency by time          137 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
4 Likes