I was trying to write a SIMD version of a to_upper function using the LoopVectorization.jl package.
Here’s the branchless version:
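function to_upper_branchless(s)
    b = Vector{UInt8}(s)  # mutable copy of the string's bytes
    for i ∈ eachindex(b)
        is_lower = (b[i] >= convert(UInt8, 'a')) & (b[i] <= convert(UInt8, 'z'))
        # Bool * UInt8 is 0x00 or 0x20, so only lowercase ASCII bytes change
        b[i] -= is_lower * 0x20
    end
    String(b)
end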
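As a sanity check on the branchless trick, the function can be compared against Base's uppercase. This is only a sketch: the sample inputs are made up for illustration, and the comparison is only meaningful for ASCII input, since bytes outside 'a' to 'z' are left untouched while uppercase also handles non-ASCII letters.

```julia
# Branchless ASCII upper-casing, same logic as the function above.
function to_upper_branchless(s)
    b = Vector{UInt8}(s)
    for i in eachindex(b)
        is_lower = (b[i] >= UInt8('a')) & (b[i] <= UInt8('z'))
        b[i] -= is_lower * 0x20   # Bool * UInt8 gives 0x00 or 0x20
    end
    String(b)
end

# Hypothetical test inputs; all ASCII, so Base.uppercase is a valid reference.
samples = ["", "asAS123", "hello, world!", "`az{", String(UInt8.(0:127))]
all_match = all(to_upper_branchless(s) == uppercase(s) for s in samples)
```

The "`az{" sample covers the bytes immediately before 'a' and after 'z', which is where an off-by-one in the range test would show up.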
It works:
> to_upper_branchless("asAS123")
"ASAS123"
If I now use @avx, I have to hoist convert(UInt8, 'a') and convert(UInt8, 'z') out of the loop, and I have to convert is_lower to a UInt8.
I end up with the following code:
function to_upper_avx(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    @avx for i ∈ eachindex(b)
        is_lower = convert(UInt8, (b[i] >= a) & (b[i] <= z))
        b[i] -= is_lower * 0x20
    end
    String(b)
end
Unfortunately now it no longer works:
> to_upper_avx("asAS123")
"asAS123"
If I replace @avx with @simd, it works, but then it’s not any faster; it doesn’t seem to use any SIMD instructions.
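One way to check whether a @simd loop was actually vectorized is to look at the generated LLVM IR. The sketch below is not from the thread: upper_simd! is a hypothetical helper, and adding @inbounds is an assumption on my part, since bounds checks are a common reason for the auto-vectorizer to give up. Whether vector instructions appear depends on the Julia version and the CPU.

```julia
using InteractiveUtils  # provides code_llvm

# Hypothetical helper: same loop body as above, with @inbounds added so that
# bounds checks do not get in the way of the auto-vectorizer.
function upper_simd!(b::Vector{UInt8})
    a, z = UInt8('a'), UInt8('z')
    @inbounds @simd for i in eachindex(b)
        is_lower = (b[i] >= a) & (b[i] <= z)
        b[i] -= is_lower * 0x20
    end
    return b
end

# Capture the optimized LLVM IR as a string instead of printing it.
ir = sprint(io -> code_llvm(io, upper_simd!, (Vector{UInt8},); debuginfo = :none))

# Vector types such as "<16 x i8>" or "<32 x i8>" in the IR indicate SIMD code.
uses_simd = occursin(r"<\d+ x i8>", ir)
```

If uses_simd is false, comparing the IR with and without @inbounds is a reasonable next step.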
Am I missing something here?
Could this be a bug in the LoopVectorization.jl package?
Fixed in VectorizationBase 0.19.24.
However, unfortunately, convert(UInt8,...) being passed to LoopVectorization isn’t type stable; it seems to be triggering some no-specialize heuristic that I need to find out how to avoid.
These options are type stable:
function to_upper_avx_anon(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    convertUInt8 = x -> convert(UInt8, x)
    @avx for i ∈ eachindex(b)
        is_lower = convertUInt8((b[i] >= a) & (b[i] <= z))
        b[i] -= is_lower * 0x20
    end
    String(b)
end
function to_upper_avx_nocvt(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    @avx for i ∈ eachindex(b)
        is_lower = (b[i] >= a) & (b[i] <= z)
        b[i] -= is_lower * 0x20
    end
    String(b)
end
function to_upper_avx_ternary(s)
    b = Vector{UInt8}(s)
    a = convert(UInt8, 'a')
    z = convert(UInt8, 'z')
    @avx for i ∈ eachindex(b)
        is_lower = (b[i] >= a) & (b[i] <= z)
        b[i] = is_lower ? b[i] - 0x20 : b[i]
    end
    String(b)
end
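The variants above differ only in how the mask is applied to each byte. As a rough scalar check that these formulations agree, without involving LoopVectorization at all, the per-byte updates can be compared over every possible byte value. The function names here are hypothetical, and ifelse stands in for the ternary form.

```julia
a, z = UInt8('a'), UInt8('z')

# The per-byte update used by each formulation, written as scalar functions.
via_convert(b) = b - convert(UInt8, (b >= a) & (b <= z)) * 0x20  # explicit conversion
via_bool(b)    = b - ((b >= a) & (b <= z)) * 0x20                # Bool * UInt8
via_select(b)  = ifelse((b >= a) & (b <= z), b - 0x20, b)        # select form

# Compare all three on every byte value from 0x00 to 0xff.
all_bytes = UInt8.(0:255)
agree = all(via_convert(b) == via_bool(b) == via_select(b) for b in all_bytes)
```

This only shows the scalar semantics match; it says nothing about how each form is compiled inside @avx.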
The first thing I did after reproducing the example:
julia> @time using LoopVectorization
1.025361 seconds (2.80 M allocations: 163.858 MiB, 0.93% gc time)
julia> ls = let s = "asAS123", b = Vector{UInt8}(s), a = convert(UInt8, 'a'), z = convert(UInt8, 'z')
           LoopVectorization.@avx_debug for i ∈ eachindex(b)
               is_lower = convert(UInt8, (b[i] >= a) & (b[i] <= z))
               b[i] -= is_lower * 0x20
           end
       end
I looked at the resulting expression, and everything looked correct:
so I figured the bug was in VectorizationBase.
It translated b[i] -= is_lower * 0x20 into LoopVectorization.vfnmadd_fast(var"####op#279__1", var"####op#280__1", var"####op#273__1").
Trying something similar:
sub_fast on Vecs of unsigned numbers was just returning 0s.
It turns out that’s because I made sub_fast for unsigned integers promise that no unsigned wrapping occurs.
Because the non-fast version worked correctly:
I just special cased the unsigned ones to use this flag.
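To illustrate what "unsigned wrapping" refers to here: ordinary UInt8 arithmetic in Julia wraps modulo 2^8. The sketch below only shows that scalar behavior and does not reproduce the VectorizationBase internals. My understanding is that a "fast" operation which promises the compiler that wrapping never happens is only valid when every intermediate value stays in range.

```julia
# Ordinary UInt8 arithmetic in Julia wraps modulo 2^8 instead of throwing.
wrapped  = 0x10 - 0x20      # underflows and wraps around to 0xf0
in_range = 0x61 - 0x20      # 'a' minus 0x20 is 0x41 ('A'), no wrap needed

# Negating an unsigned value also relies on wrapping.
negated = -0x20             # 0xe0

# Wrapping makes "negate, then add" equal to the plain subtraction.
roundtrip = negated + 0x61  # 0x41 again, via wraparound
```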
That won’t work with LoopVectorization at the moment, as I never added support for strings.
You’re welcome to make a PR to VectorizationBase and LoopVectorization, or file an issue outlining what you need and I’ll get around to it sometime.
I think the best approach would be to basically treat strings as AbstractVector{UInt8}, and define methods for stridedpointer/grouped_strided_pointer to work.
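For context on treating strings as AbstractVector{UInt8}: Base already exposes a string's bytes this way through codeunits. A small sketch using only Base functions follows; as far as I know the wrapper is read-only, so it covers the loading side but not in-place mutation.

```julia
s = "asAS123"
cu = codeunits(s)                      # vector-like wrapper over the string's bytes

is_byte_vector = cu isa AbstractVector{UInt8}
first_byte = cu[1]                     # 0x61, the byte for 'a'

# Reading in a loop works like any other byte vector.
n_lower = count(b -> UInt8('a') <= b <= UInt8('z'), cu)
```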
Do you think adding support for codeunit would help in this case? Strings are meant to be immutable in Julia, if I understand correctly.
Is there a way to forgo ownership of a value (something like std::move in C++)? I suppose mutating a String could be okay if you are the sole owner of the String?
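The closest thing to a move that I am aware of in Base is the String constructor itself: when given a Vector{UInt8}, it takes over the buffer and truncates the vector to zero length, which is what the String(b) calls in the functions above rely on. A minimal sketch:

```julia
b = Vector{UInt8}("asAS123")   # a fresh copy of the bytes; safe to mutate
b[1] -= 0x20                   # 'a' -> 'A'

s = String(b)                  # the string takes over the buffer

# After the call, b has been truncated to zero length, so later changes to b
# cannot affect s. Use String(copy(b)) to keep b intact instead.
b_is_empty = isempty(b)
```

This goes from a mutable buffer to an immutable String; it is not a way to mutate an existing String in place.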
We could probably add them (ArrayInterface.stride_rank, ArrayInterface.dense_dims) to ArrayInterface.
This yields a Vector of codeunits that should not be freed by the GC (because unsafe_wrap's own keyword argument defaults to false).
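The snippet under discussion is not shown here, so the following is only a guess at its shape: a sketch that wraps a string's bytes with unsafe_wrap, assuming pointer(s) and ncodeunits(s) as the pointer and length. It only reads from the wrapped vector, since writing through it would mutate a String that other code may assume is immutable.

```julia
s = "asAS123"

# GC.@preserve keeps s alive while the raw pointer is in use.
n_lower = GC.@preserve s begin
    # own defaults to false, so Julia will not try to free this memory itself.
    v = unsafe_wrap(Vector{UInt8}, pointer(s), ncodeunits(s))
    count(b -> UInt8('a') <= b <= UInt8('z'), v)   # read only, never write
end
```

The wrapped vector must not be used after the GC.@preserve block ends, because nothing ties its lifetime to s.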
I’m not sure whether this is something that LoopVectorization could or would want to support directly.
The codeunits themselves could be freed.
LoopVectorization will GC.@preserve everything it uses, so probably adding support for the codeunits themselves would be best. Yes, I think it makes sense to support that. It should only require a few methods.
Do you think adding support for codeunit would help in this case? Strings are meant to be immutable in Julia, if I understand correctly.
Is there a way to forgo ownership of a value (something like std::move in C++)? I suppose mutating a String could be okay if you are the sole owner of the String?
I don’t know much about how strings are implemented in Julia, but I think so?