LoopVectorization.jl's @avx does not store results

Hi,

I was trying to write a SIMD version of a to_upper function using LoopVectorization.jl package.

Here’s the branchless version:

function to_upper_branchless(s)
    b = Vector{UInt8}(s)
    for i ∈ eachindex(b)
        is_lower = (b[i] >= convert(UInt8, 'a')) & (b[i] <= convert(UInt8, 'z'))
        b[i] -= is_lower * 0x20
    end
    String(b)
end

It works:

> to_upper_branchless("asAS123")
"ASAS123"

If I now use @avx I have to extract the convert(UInt8, 'a') and convert(UInt8, 'z') from the loop and I have to convert the is_lower to a UInt8.
I end up with the following code:

function to_upper_avx(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    @avx for i ∈ eachindex(b)
        is_lower = convert(UInt8, (b[i] >= a) & (b[i] <= z))
        b[i] -= is_lower * 0x20
    end
    String(b)
end

Unfortunately now it no longer works:

> to_upper_avx("asAS123")
"asAS123"

I can replace the @avx by a @simd it works - but then it’s not any faster - it doesn’t seem to use any SIMD instructions.

Am I missing something here?
Could this be a bug in the LoopVectorization.jl package?

Thanks a lot in advance for any help!

Best,
ambiso

@Elrod

Fixed in VectorizationBase 0.19.24.
However, unfortunately, convert(UInt8,...) being passed to LoopVectorization isn’t type stable; it seems to be triggering some no-specialize heuristic that I need to find out how to avoid.

These options are type stable:

function to_upper_avx_anon(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    convertUInt8 = x -> convert(UInt8, x)
    @avx for i ∈ eachindex(b)
        is_lower = convertUInt8((b[i] >= a) & (b[i] <= z))
        b[i] -= is_lower * 0x20
    end
    String(b)
end
function to_upper_avx_nocvt(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    @avx for i ∈ eachindex(b)
        is_lower = (b[i] >= a) & (b[i] <= z)
        b[i] -= is_lower * 0x20
    end
    String(b)
end
function to_upper_avx_ternary(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    @avx for i ∈ eachindex(b)
        is_lower = (b[i] >= a) & (b[i] <= z)
        b[i] = is_lower ? b[i]  - 0x20 : b[i]
    end
    String(b)
end

This yields:

julia> to_upper_avx_anon("asAS123")
"ASAS123"

julia> to_upper_avx_nocvt("asAS123")
"ASAS123"

julia> to_upper_avx_ternary("asAS123")
"ASAS123"

Note that you’ll need much longer strings before the SIMD starts to pay off.
Most of the time is likely to be spent in Vector{UInt8}(s) and String(b).

1 Like

Thank you for taking the time! I was just about to try to find a more minimal example, but it seems you’ve already identified the issue.

Is this the patch that was required for the example to run?

Is there an “easy” way I could have debugged this and found out the underlying issue?

Thanks again for your time!

Oh and I realize that Vector{UInt8}(s) and String(b) will be the limiting factors here, I just didn’t get around to replacing them.

I’m assuming codeunit(s::AbstractString, i::Integer) is the method I should be using.

Yes.
If you want something more minimal, this method alone is a fix for the original example:

@inline vsub_fast(v::Vec{<:Any,<:UnsignedHW}) = vsub(zero(v), v)

The first thing I did after reproducing the example:

julia> @time using LoopVectorization
  1.025361 seconds (2.80 M allocations: 163.858 MiB, 0.93% gc time)

julia> ls = let s = "asAS123", b = Vector{UInt8}(s), a = convert(UInt8, 'a'), z = convert(UInt8, 'z')
           LoopVectorization.@avx_debug for i ∈ eachindex(b)
               is_lower = convert(UInt8, (b[i] >= a) & (b[i] <= z))
               b[i] -= is_lower * 0x20
           end
       end

I looked at the resulting expression, and everything looked correct:

var"####op#273__1" = LoopVectorization._vload(var"##vptr##_b", (LoopVectorization.MM(var"##Wvecwidth##", i, LoopVectorization.Static{1}()),), var"##mask##", LoopVectorization.False(), LoopVectorization.Static{16}())
var"####op#275__1" = var"####op#273__1" >= var"####op#274__1"
var"####op#277__1" = var"####op#273__1" <= var"####op#276__1"
var"####op#278__1" = var"####op#275__1" & var"####op#277__1"
var"####op#279__1" = LoopVectorization.convert(var"####op#272__1", var"####op#278__1")
var"####op#273__1" = LoopVectorization.vfnmadd_fast(var"####op#279__1", var"####op#280__1", var"####op#273__1")
LoopVectorization._vstore!(var"##vptr##_b", var"####op#273__1", (LoopVectorization.MM(var"##Wvecwidth##", i, LoopVectorization.Static{1}()),), var"##mask##", LoopVectorization.False(), LoopVectorization.True(), LoopVectorization.False(), LoopVectorization.Static{16}())
i = LoopVectorization.vadd_fast(var"##Wvecwidth##", i)

so I figured the bug was in VectorizationBase.
It translated b[i] -= is_lower * 0x20 into LoopVectorization.vfnmadd_fast(var"####op#279__1", var"####op#280__1", var"####op#273__1").
Trying something similar:

julia> using VectorizationBase

julia> v = convert(UInt8, Mask{Int(pick_vector_width(UInt8))}(rand(UInt)))
Vec{64, UInt8}<0x01, 0x01, 0x00, 0x01, 0x01, 0x00, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x01, 0x01, 0x00, 0x01, 0x01, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, 0x01, 0x01, 0x00, 0x01, 0x01, 0x00, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x00, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x01, 0x00, 0x01, 0x01>

julia> b = Vec(ntuple(_ -> rand(UInt8), pick_vector_width(UInt8))...);

julia> VectorizationBase.vfnmadd_fast(v, 0x20, b) === b
true

So, basically, even though v was a mix of 0s and 1s, vfnmadd_fast wasn’t actually updating v. Looking at the definition, it was

vfnmadd_fast(a,b,c) = vfmadd_fast(Base.FastMath.sub_fast(a),b,c))

a test confirmed that vfmadd_fast worked correctly, while…

julia> Base.FastMath.sub_fast(v)
Vec{64, UInt8}<0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00>

julia> Base.FastMath.sub_fast(b)
Vec{64, UInt8}<0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00>

sub_fast of Vecs of unsigned numbers was just returning 0s.
Turns out that’s because I made sub_fast for unsigned promise no unsigned wrapping.
Because the non-fast version worked correctly:

julia> -v
Vec{64, UInt8}<0xff, 0xff, 0x00, 0xff, 0xff, 0x00, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0x00, 0xff, 0xff, 0x00, 0xff, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0x00, 0xff, 0xff, 0x00, 0x00, 0x00, 0xff, 0xff, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0xff, 0x00, 0xff, 0xff, 0x00, 0xff, 0xff>

I just special cased the unsigned ones to use this flag.

That won’t work with LoopVectorization at the moment, as I never added support for strings.
You’re welcome to make a PR to VectorizationBase and LoopVectorization, or file an issue outlining what you need and I’ll get around to it sometime.

I think the best approach would be to basically treat strings as AbstractVector{UInt8}, and define methods for stridedpointer/grouped_strided_pointer to work.

3 Likes

Would using codeunits("example") work? Aside from not being able to write to it, it should behave just like a Vector{UInt8}.

I’ve implemented the following abomination:

This yields a Vector of codeunits that should not be free’d by the GC (because unsafe_wrap’s own parameter defaults to false).

I’m not sure if this is something that LoopVectorization could/want to add support for directly?

Thanks a lot for the detailed writeup!

Do you think adding support for codeunit would help in this case - Strings are meant to be immutable in Julia, if I understand correctly.

Is there a way to forgo ownership of a value (something like std::move in C++)? I suppose mutating a String could be okay if you are the sole owner of the String?

Seems like it should, but a few methods are missing at the moment.

julia> stridedpointer(codeunits("example"))
ERROR: MethodError: no method matching asvalint(::Nothing)
Closest candidates are:
  asvalint(::T) where T<:Tuple{Vararg{Static.StaticInt}} at /home/chriselrod/.julia/dev/VectorizationBase/src/VectorizationBase.jl:17
Stacktrace:
 [1] val_stride_rank
   @ ~/.julia/dev/VectorizationBase/src/VectorizationBase.jl:31 [inlined]
 [2] stridedpointer(A::Base.CodeUnits{UInt8, String})
   @ VectorizationBase ~/.julia/dev/VectorizationBase/src/strided_pointers/stridedpointers.jl:87
 [3] top-level scope
   @ REPL[35]:1

We could probably add them (ArrayInterface.stride_rank, ArrayInterface.dense_dims) to ArrayInterface.

This yields a Vector of codeunits that should not be free’d by the GC (because unsafe_wrap 's own parameter defaults to false).

I’m not sure if this is something that LoopVectorization could/want to add support for directly?

The codeunits themselves could be freed.
LoopVectorization will GC.@preserve everything it uses, so probably adding support for the codeunits themselves would be best. Yes, I think it makes sense to support that. It should only require a few methods.

Do you think adding support for codeunit would help in this case - Strings are meant to be immutable in Julia, if I understand correctly.

Is there a way to forgo ownership of a value (something like std::move in C++)? I suppose mutating a String could be okay if you are the sole owner of the String?

I don’t know much about how strings are implemented in Julia, but I think so?