LoopVectorization.jl vmap gives an error ::VectorizationBase.Vec{4, Int64}

I am trying to do a vmap on a function. map and ThreadsX.map both work.

collatz(x::Int64) =
    if iseven(x)
        x ÷ 2
    else
        3x + 1
    end

function collatz_sequencia(x::Int64)
    n = 0
    while true
        x == 1 && return n
        n += 1
        x = collatz(x)
    end
    return n
end

Then:

vmap(collatz_sequencia, 1:10)

MethodError: no method matching zero_offsets(::VectorizationBase.FastRange{Int64, Static.StaticInt{0}, Static.StaticInt{1}, Int64})
Closest candidates are:
zero_offsets(!Matched::Static.StaticInt{N}) where N at /home/storopoli/.julia/packages/VectorizationBase/geEQH/src/static.jl:130
zero_offsets(!Matched::VectorizationBase.StridedPointer{T, N, C, B, R, X, O} where {X, O}) where {T, N, C, B, R} at /home/storopoli/.julia/packages/VectorizationBase/geEQH/src/strided_pointers/stridedpointers.jl:115

I tried to remove the range, but to no avail:

vmap(collatz_sequencia, collect(1:10))

MethodError: no method matching collatz_sequencia(::VectorizationBase.Vec{4, Int64})
Closest candidates are:
collatz_sequencia(!Matched::Int64) at /home/storopoli/Documents/Julia/Computacao-Cientifica/notebooks/3_Parallel.jl#==#a7be2174-a7dd-4259-aab9-64cdcc749fb0:1

At least two issues here.

  1. Add vmap support for ranges. This is a bug that should be easy to fix.
  2. collatz_sequencia is restricted to ::Int64.

The simplest fix for “2.” would be to redefine it to work correctly with VectorizationBase.Vec{4,Int64} inputs. This would require loosening the signatures, but also adjusting the while loop.
You can see, for example, how gcd is defined for AbstractSIMD types and compare to the definition in Base.
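To illustrate only the signature part with nothing but Base Julia, here is a minimal sketch. The names strict_step and generic_step are made up for this example, and Int32 merely stands in for "some argument type other than Int64" (such as a SIMD vector type); it is not what vmap actually passes.

```julia
# Hypothetical names; Int32 is only a stand-in for "a type that is not Int64".
strict_step(x::Int64) = 3x + 1   # a method exists for Int64 only
generic_step(x) = 3x + 1         # a method exists for anything supporting * and +

# `applicable` reports whether a matching method exists for the given arguments.
println(applicable(strict_step, Int32(5)))   # false: no method for Int32
println(applicable(generic_step, Int32(5)))  # true: the generic method matches
println(generic_step(Int32(5)))              # 16
```

Loosening the signature only gets the call to dispatch; the branch and the while loop still need to be adapted separately.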

Some day, it’d be cool to work on an SPMD-style program transformer for Julia that can automate this.


Am I going in the right direction?

using VectorizationBase: AbstractSIMDVector, vany

collatz_SIMD(x) =
    if x % 2 == 0
        x ÷ 2
    else
        3x + 1
    end

function collatz_sequencia_SIMD(x::AbstractSIMDVector{W,I}) where {W,I<:Base.HWReal}
    n = 0
    while vany(x ≠ 1)
        n += 1
        x = collatz_SIMD(x)
    end
    return n
end

vmap(collatz_sequencia_SIMD, [1,2]) # just a test
TypeError: non-boolean (VectorizationBase.Mask{2, UInt8}) used in boolean context

collatz_SIMD@Other: 1[inlined]
collatz_sequencia_SIMD(::VectorizationBase.Vec{2, Int64})@Other: 5
vmap_singlethread!(::typeof(Main.workspace59.collatz_sequencia_SIMD), ::VectorizationBase.StridedPointer{Union{}, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}, ::Static.StaticInt{0}, ::Int64, ::Val{false}, ::Tuple{VectorizationBase.StridedPointer{Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}})@map.jl:91
vmap_singlethread!@map.jl:58[inlined]
macro expansion@map.jl:224[inlined]
gc_preserve_vmap!@map.jl:224[inlined]
vmap!@map.jl:273[inlined]
vmap_call@map.jl:375[inlined]
vmap(::typeof(Main.workspace59.collatz_sequencia_SIMD), ::Vector{Int64})@map.jl:384
top-level scope@Local: 1[inlined]

That is in the right direction, but a few more changes:

  1. To avoid the “non-boolean” error, use IfElse.ifelse.
  2. collatz_sequencia will be called with multiple inputs (lanes) at once, and will also have to return multiple outputs, one per lane. I’d initialize n with n = zero(x). Then you’ll have to manage a mask m indicating which of these lanes are still unfinished, and update n += m. You can control the loop with while vany(m), and update the mask with m &= x ≠ 1. You can’t decide when to break out of the loop with vany(x ≠ 1) alone, because collatz(1) == 4: a lane that has reached 1 moves away from it again while the other lanes keep going.
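The mask bookkeeping described above can be tried out without any SIMD package. The following is only a sketch: plain tuples stand in for SIMD vectors and masks, the names collatz_step and collatz_lengths are invented for the example, and it assumes every input is a positive integer.

```julia
# Tuples emulate a W-lane vector (of Int) and a W-lane mask (of Bool); Base only.
collatz_step(x::Integer) = ifelse(iseven(x), x ÷ 2, 3x + 1)

function collatz_lengths(xs::NTuple{W,Int}) where {W}
    n = ntuple(_ -> 0, Val(W))                    # per-lane step counters
    m = map(x -> x != 1, xs)                      # true = lane still unfinished
    while any(m)                                  # plays the role of vany(m)
        n = map((ni, mi) -> ni + mi, n, m)        # only unfinished lanes count a step
        xs = map(collatz_step, xs)                # every lane takes the step regardless
        m = map((mi, x) -> mi & (x != 1), m, xs)  # retire lanes that just reached 1
    end
    return n
end

println(collatz_lengths((1, 2, 3, 6)))  # (0, 1, 7, 8)
```

Finished lanes keep cycling through 1, 4, 2, 1 in the background, which is why the mask, not the current value of x, decides when a lane stops counting.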

Sorry, I am still having a hard time. I cannot find documentation on masks in VectorizationBase.jl. How do I define a mask?

function collatz_sequencia_SIMD(x::AbstractSIMDVector{W,I}) where {W,I<:Base.HWReal}
    n = zero(x)
    m = ifelse(collatz_SIMD(x) ≠ 1, true, false)
    while vany(m)
        n += 1
        x = collatz_SIMD(x)
        m &= x ≠ 1
    end
    return n
end

vmap(collatz_sequencia_SIMD, [1,2])
TypeError: non-boolean (VectorizationBase.Mask{2, UInt8}) used in boolean context

collatz_SIMD@Other: 1[inlined]
collatz_sequencia_SIMD(::VectorizationBase.Vec{2, Int64})@Other: 3
vmap_singlethread!(::typeof(Main.workspace77.collatz_sequencia_SIMD), ::VectorizationBase.StridedPointer{Union{}, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}, ::Static.StaticInt{0}, ::Int64, ::Val{false}, ::Tuple{VectorizationBase.StridedPointer{Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}})@map.jl:91
vmap_singlethread!@map.jl:58[inlined]
macro expansion@map.jl:224[inlined]
gc_preserve_vmap!@map.jl:224[inlined]
vmap!@map.jl:273[inlined]
vmap_call@map.jl:375[inlined]
vmap(::typeof(Main.workspace77.collatz_sequencia_SIMD), ::Vector{Int64})@map.jl:384
top-level scope@Local: 1[inlined]

Any comparison with AbstractSIMDs will result in a mask.

Use IfElse.ifelse to avoid the non-boolean errors.
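One thing to keep in mind when swapping a branch for ifelse: it is an ordinary function call, so both value arguments are computed before one of them is selected. A small scalar sketch, using Base's ifelse as a stand-in for IfElse.ifelse and helper names invented for the example:

```julia
# Hypothetical helper names; Base.ifelse is used as a stand-in for IfElse.ifelse.
div_branch(x) = x == 0 ? 0 : 10 ÷ x        # only the chosen branch runs
div_select(x) = ifelse(x == 0, 0, 10 ÷ x)  # both arguments are evaluated first

println(div_branch(0))                     # 0: the division is skipped
threw = try
    div_select(0)                          # 10 ÷ 0 is evaluated eagerly and throws
    false
catch err
    err isa DivideError
end
println(threw)                             # true
```

For the Collatz step, both x ÷ 2 and 3x + 1 are cheap and cannot throw, so computing both sides is harmless there.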


Ok got that part:

function collatz_sequencia_SIMD(x::AbstractSIMDVector{W,I}) where {W,I<:Base.HWReal}
    n = zero(x)
    m = IfElse.ifelse(collatz_SIMD(x) ≠ one(x), one(x), zero(x))
    while vany(m)
        n += m
        x = collatz_SIMD(x)
        m &= IfElse.ifelse(x ≠ one(x), one(x), zero(x))
    end
    return n
end

vmap(collatz_sequencia_SIMD, [1, 2, 3, 4])

Somehow it complains with non-boolean errors:

TypeError: non-boolean (VectorizationBase.Mask{2, UInt8}) used in boolean context

collatz_SIMD@Other: 1[inlined]
collatz_sequencia_SIMD(::VectorizationBase.Vec{2, Int64})@Other: 3
vmap_singlethread!(::typeof(Main.workspace168.collatz_sequencia_SIMD), ::VectorizationBase.StridedPointer{Union{}, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}, ::Static.StaticInt{0}, ::Int64, ::Val{false}, ::Tuple{VectorizationBase.StridedPointer{Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}})@map.jl:91
vmap_singlethread!@map.jl:58[inlined]
macro expansion@map.jl:224[inlined]
gc_preserve_vmap!@map.jl:224[inlined]
vmap!@map.jl:273[inlined]
vmap_call@map.jl:375[inlined]
vmap(::typeof(Main.workspace168.collatz_sequencia_SIMD), ::Vector{Int64})@map.jl:384
top-level scope@Local: 1[inlined]
  1. The mask should be collatz_SIMD(x) ≠ one(x) or, if you upgrade to VectorizationBase 0.20.23 (just released a few minutes ago), VectorizationBase.max_mask(x).

  2. Use IfElse.ifelse in collatz_SIMD.

Also, is your CPU an M1 Mac, or does it not have AVX2?
Currently, VectorizationBase decides to use a vector width of 2 for Int64 on CPUs without AVX2.
I have an M1, but I don’t have a CPU with AVX but not AVX2, so I can’t test what works best on the latter.


Also, on LoopVectorization master, I added support and tests for vmap with ranges. I’ll tag a new release in a few hours.


This is a Pluto notebook for a graduate course on scientific computing using Julia (Ciência de Dados e Computação Científica com Julia). I will run it on Linux, but I am creating the content on a mix of a Mac M1 and Linux with AVX2.

I saw the new release, and I also saw that you defined the iseven function, so I’ve updated VectorizationBase.jl to 0.20.23.

I am still getting errors:

collatz_SIMD(x) =
    if IfElse.ifelse(VectorizationBase.iseven(x), one(x), zero(x))
        x ÷ 2
    else
        3x + 1
    end

function collatz_sequencia_SIMD(x::VectorizationBase.AbstractSIMDVector{W,I}) where {W,I<:Base.HWReal}
    n = zero(x)
    m = VectorizationBase.max_mask(x)
    while VectorizationBase.vany(m)
        n += m
        x = collatz_SIMD(x)
        m &= x ≠ 1
    end
    return n
end
vmap(collatz_sequencia_SIMD, [1, 2, 3, 4])

TypeError: non-boolean (VectorizationBase.Vec{2, Int64}) used in boolean context

collatz_SIMD(::VectorizationBase.Vec{2, Int64})@Other: 1
collatz_sequencia_SIMD(::VectorizationBase.Vec{2, Int64})@Other: 6
vmap_singlethread!(::typeof(Main.workspace16.collatz_sequencia_SIMD), ::VectorizationBase.StridedPointer{Union{}, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}, ::Static.StaticInt{0}, ::Int64, ::Val{false}, ::Tuple{VectorizationBase.StridedPointer{Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}})@map.jl:91
vmap_singlethread!@map.jl:58[inlined]
macro expansion@map.jl:224[inlined]
gc_preserve_vmap!@map.jl:224[inlined]
vmap!@map.jl:273[inlined]
vmap_call@map.jl:375[inlined]
vmap(::typeof(Main.workspace16.collatz_sequencia_SIMD), ::Vector{Int64})@map.jl:384
top-level scope@Local: 1[inlined]

Thank you!

collatz_SIMD(x) =
    IfElse.ifelse(iseven(x), x ÷ 2, 3x + 1)

Of course! Makes total sense. Thanks.

OK, but now I think I need to define some sort of conversion:

collatz_SIMD(x) =
    IfElse.ifelse(VectorizationBase.iseven(x), x ÷ 2, 3x + 1)

function collatz_sequencia_SIMD(x::VectorizationBase.AbstractSIMDVector{W,I}) where {W,I<:Base.HWReal}
    n = zero(x)
    m = VectorizationBase.max_mask(x)
    while VectorizationBase.vany(m)
        n += m
        x = collatz_SIMD(x)
        m &= x ≠ 1
    end
    return n
end
vmap(collatz_sequencia_SIMD, [1,2,3,4])

MethodError: vconvert(::Type{VectorizationBase.Vec{2, Union{}}}, ::VectorizationBase.Vec{2, Int64}) is ambiguous. Candidates:

vconvert(::Type{VectorizationBase.Vec{W, F}}, v::VectorizationBase.Vec{W, T}) where {W, F<:Union{Float32, Float64}, T<:Union{Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}} in VectorizationBase at /Users/storopoli/.julia/packages/VectorizationBase/kTRxL/src/llvm_intrin/conversion.jl:29

vconvert(::Type{VectorizationBase.Vec{W, T1}}, v::VectorizationBase.Vec{W, T2}) where {W, T1<:Union{Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}, T2<:Union{Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}} in VectorizationBase at /Users/storopoli/.julia/packages/VectorizationBase/kTRxL/src/llvm_intrin/conversion.jl:36

Possible fix, define

vconvert(::Type{VectorizationBase.Vec{W, Union{}}}, ::VectorizationBase.Vec{W, T2}) where {W, T2<:Union{Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}}

convert@base_defs.jl:152[inlined]
macro expansion@memory_addr.jl:0[inlined]
__vstore!@memory_addr.jl:810[inlined]
_vstore!@stridedpointers.jl:229[inlined]
vmap_singlethread!(::typeof(Main.workspace17.collatz_sequencia_SIMD), ::VectorizationBase.StridedPointer{Union{}, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}, ::Static.StaticInt{0}, ::Int64, ::Val{false}, ::Tuple{VectorizationBase.StridedPointer{Int64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}})@map.jl:95
vmap_singlethread!@map.jl:58[inlined]
macro expansion@map.jl:224[inlined]
gc_preserve_vmap!@map.jl:224[inlined]
vmap!@map.jl:273[inlined]
vmap_call@map.jl:375[inlined]
vmap(::typeof(Main.workspace17.collatz_sequencia_SIMD), ::Vector{Int64})@map.jl:384
top-level scope@Local: 1[inlined]

This requires the master branch of both VectorizationBase and LoopVectorization.
They should be released within the next few hours.

using VectorizationBase, LoopVectorization, IfElse

collatz_SIMD(x) =
    IfElse.ifelse(VectorizationBase.iseven(x), x ÷ 2, 3x + 1)

function collatz_sequencia_SIMD(x)
    n = zero(x)
    m = x ≠ 0
    while VectorizationBase.vany(VectorizationBase.collapse_or(m))
        n += m
        x = collatz_SIMD(x)
        m &= x ≠ 1
    end
    return n
end

vmap(collatz_sequencia_SIMD, 1:100) == map(collatz_sequencia_SIMD, 1:100)

Performance seems better for large ranges, but worse for small ones:

julia> dest = Vector{Int}(undef, 100);

julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.552 μs …  8.688 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.563 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.570 μs ± 77.390 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  4.55 μs      Histogram: log(frequency) by time     4.72 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.461 μs …  4.313 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.499 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.512 μs ± 51.718 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  2.46 μs        Histogram: frequency by time        2.63 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> dest = Vector{Int}(undef, 1000);

julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  44.796 μs …  76.785 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.152 μs ± 573.617 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  44.8 μs       Histogram: log(frequency) by time      46.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(collatz_sequencia_SIMD, $dest, axes($dest,1))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  87.214 μs … 105.938 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.666 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.719 μs ± 614.135 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  87.2 μs         Histogram: frequency by time         90.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is with a computer that has AVX512. Performance will probably be a lot worse without it, because (aside from 512-bit vectors), AVX512 is needed for SIMD Int64 multiplication.

Using Int32 gives a roughly 2x performance boost for vmap, while making map slower:

julia> dest = Vector{Int32}(undef, 1000);

julia> @benchmark vmap!(collatz_sequencia_SIMD, $dest, Int32(1):Int32(1_000))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  22.573 μs …  60.918 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.658 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.721 μs ± 571.101 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  22.6 μs         Histogram: frequency by time         23.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(collatz_sequencia_SIMD, $dest, Int32(1):Int32(1_000))
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  133.721 μs … 151.036 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     135.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.627 μs ± 668.463 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  134 μs           Histogram: frequency by time          137 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

A couple more comments:

  1. vmap also requires that the function be defined for scalars, because that is how it determines the element type of the returned array. vmap! doesn’t have that limitation (since it mutates an existing vector, instead of returning a new one).
  2. I’m using m = x ≠ 0 because vmap(!) will often call the function with 0 as an input, even if 0 isn’t in your vector. The result of this isn’t used for anything; it’s just padding for when the vector length isn’t divisible by the chunk sizes vmap! uses. Thus, it has to not error or get caught in an infinite loop.
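A scalar-only sketch of the same loop makes the padding point easy to check. It uses Base.ifelse and a plain Bool in place of IfElse.ifelse and a Mask, and the function names are invented here, so treat it as an illustration of the control flow rather than the code vmap actually runs.

```julia
# Scalar stand-in for the SIMD loop; a Bool plays the role of a one-lane mask.
step_scalar(x) = ifelse(iseven(x), x ÷ 2, 3x + 1)

function steps_from_nonzero(x)
    n = zero(x)
    m = x ≠ 0                  # the padding value 0 starts out inactive
    while m
        n += m
        x = step_scalar(x)
        m &= x ≠ 1
    end
    return n
end

# Variant (an assumption, not from the thread) that also starts an input of 1 as inactive.
function steps_from_unfinished(x)
    n = zero(x)
    m = (x ≠ 0) & (x ≠ 1)
    while m
        n += m
        x = step_scalar(x)
        m &= x ≠ 1
    end
    return n
end

println(steps_from_nonzero.((0, 1, 2, 3)))     # (0, 3, 1, 7)
println(steps_from_unfinished.((0, 1, 2, 3)))  # (0, 0, 1, 7)
```

With m = x ≠ 0, an input of 0 returns immediately, but an input of 1 goes once around the 1, 4, 2, 1 cycle and reports 3 steps, whereas the original collatz_sequencia returns 0 for 1. If that difference matters, starting the mask as (x ≠ 0) & (x ≠ 1) is one possible adjustment; this sketch only checks it on scalars.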

Thank you! I will update in a few hours…

No problem.

Also, because it wasn’t clear and Mask isn’t documented (I/someone else should add some documentation…):
Mask acts like a bunch of booleans.

using VectorizationBase

julia> vxi = Vec(ntuple(Int, Val(4))...)
Vec{4, Int64}<1, 2, 3, 4>

julia> vyi = Vec{4}(2)
Vec{4, Int64}<2, 2, 2, 2>

julia> vxi > vyi
Mask{4,Bit}<0, 0, 1, 1>

julia> vxi ≥ vyi
Mask{4,Bit}<0, 1, 1, 1>

julia> vxi == vyi
Mask{4,Bit}<0, 1, 0, 0>

When ordinary code deals with Bools, you need to replace branches with IfElse.ifelse so that it works with masks.
I.e.,

res = if cmp # cmp is a bool
   iftruebranch
else
   iffalsebranch
end

becomes

# cmp can be a `Bool` or a `Mask`
res = IfElse.ifelse(cmp, iftruebranch, iffalsebranch)

because with AbstractSIMD inputs, cmp will be a Mask instead of a Bool.
If one side of the branch is much more likely than the other, so that it’s still fairly probable that even with many inputs every single one of them will go to only one side of the branch (and the other side of the branch is also very expensive), you could do something like

# cmp is almost always true
res = iftruebranch
if !vall(collapse_and(cmp))
    res = IfElse.ifelse(cmp, res, iffalsebranch)
end

You can think of calling a function with an AbstractSIMD input as calling it a bunch of times, but that each call has to follow through the same sequence of instructions, and take the same path through branches. (And because the compiler isn’t handling it, you have to do that manually.)
You can use masks to control/combine results from different paths/conditions.
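The fast-path pattern can be tried out with plain tuples standing in for SIMD vectors and masks. This is only a sketch: blend, cheap, costly, and guarded are made-up names, all(c) plays the role of vall(collapse_and(cmp)), and nothing here uses VectorizationBase.

```julia
# Tuples stand in for SIMD vectors and masks; Base only.
blend(c, a, b) = map(ifelse, c, a, b)   # per-lane select, like IfElse.ifelse with a Mask

cheap(x) = map(v -> 2v, x)              # the common path
costly(x) = map(v -> v^2 + 1, x)        # the rare path (pretend it is expensive)

function guarded(x, c)
    res = cheap(x)                      # computed for every lane
    if !all(c)                          # skip the rare path when every lane agrees
        res = blend(c, res, costly(x))  # lanes with c == false take the rare result
    end
    return res
end

lanes = (1, 2, 3, 4)
cmp = map(v -> v > 2, lanes)            # (false, false, true, true)

println(guarded(lanes, cmp))                        # (2, 5, 6, 8)
println(guarded(lanes, (true, true, true, true)))   # (2, 4, 6, 8)
```

The branch on all(c) is an ordinary scalar branch shared by all lanes, so it is allowed; only the per-lane choice has to go through the blend.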


A crash course in LoopVectorization SIMD stuff… Thanks!

Now it all makes sense: you call the function with an AbstractSIMD, and you should expect it to behave as Single Instruction, Multiple Data…


Works like a charm. Just an FYI:

@benchmark map(collatz_sequencia, 1:100_000)

BenchmarkTools.Trial: 223 samples with 1 evaluation.
 Range (min … max):  21.896 ms …  24.059 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.450 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.513 ms ± 311.141 μs  ┊ GC (mean ± σ):  0.06% ± 0.47%

  21.9 ms         Histogram: frequency by time         23.8 ms <

 Memory estimate: 781.33 KiB, allocs estimate: 2.

@benchmark ThreadsX.map(collatz_sequencia, 1:100_000)

BenchmarkTools.Trial: 1310 samples with 1 evaluation.
 Range (min … max):  3.221 ms …   7.911 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.596 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.804 ms ± 687.383 μs  ┊ GC (mean ± σ):  2.62% ± 7.84%

  3.22 ms         Histogram: frequency by time        6.76 ms <

 Memory estimate: 9.12 MiB, allocs estimate: 2310.

@benchmark vmapntt(collatz_sequencia_SIMD, 1:100_000)

BenchmarkTools.Trial: 1505 samples with 1 evaluation.
 Range (min … max):  2.200 ms … 45.955 ms  ┊ GC (min … max): 0.00% … 2.91%
 Time  (median):     2.590 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.316 ms ±  4.827 ms  ┊ GC (mean ± σ):  0.50% ± 0.32%

  2.2 ms       Histogram: log(frequency) by time     44.5 ms <

 Memory estimate: 781.33 KiB, allocs estimate: 2.