# Bit manipulations with llvmcall give strange results

I’m playing around with `llvmcall`. The following function `g` is supposed to convert `x::UInt8` into an `Int8` tuple of length `N` whose first 8 entries are the bits of `x` and whose remaining entries are zero.

``````
unvec(t::Tuple{Vararg{VecElement}}) = map(x -> x.value, t)  # remove the `VecElement` wrapper from each element of t

@generated function g(x::UInt8, ::Val{N}) where N
    ir = """
        %a = zext i8 %0 to i$N                ; zero-extend from 8 bits to N bits
        %b = bitcast i$N %a to <$N x i1>      ; convert to an N x i1 vector
        %c = zext <$N x i1> %b to <$N x i8>   ; zero-extend each lane from i1 to i8
        ret <$N x i8> %c
        """
    quote
        v = Base.llvmcall($ir, NTuple{$N, VecElement{Int8}}, Tuple{UInt8}, x)
        unvec(v)
    end
end
``````

The LLVM documentation for the `zext` instruction says that

> `zext` fills the high order bits of the `value` with zero bits until it reaches the size of the destination type.

However, this is not what I get. For

``````
julia> for N in 9:24
           t = g(0xff, Val(N))
           println("$N: $t")
       end
``````

I get

``````
 9: (1, 1, 1, 1, 1, 1, 1, 1, 1)
10: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
11: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
12: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
13: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
14: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
15: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
16: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
17: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
18: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
19: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
20: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
21: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
22: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
23: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
24: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
``````

on one machine and

``````
 9: (1, 1, 1, 1, 1, 1, 1, 1, 0)
10: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
11: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
12: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
13: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
14: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
15: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)
16: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
17: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
18: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
19: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
20: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
21: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
22: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
23: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
24: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
``````

on another. Am I getting something wrong here, or is this an LLVM bug?

`versioninfo()` for the first output:

``````
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 on 4 virtual cores
``````

I don’t think LLVM has good support for non-power-of-2-sized integers. I suggest casting up to the next power of 2.

It’s easy to get the next power of 2 via bit manipulation:

``````
julia> nextpow2(N) = 1 << (8sizeof(N) - leading_zeros(N - one(N)))
nextpow2 (generic function with 1 method)

julia> hcat(1:8, nextpow2.(1:8))
8×2 Matrix{Int64}:
 1  1
 2  2
 3  4
 4  4
 5  8
 6  8
 7  8
 8  8
``````

You can also get the next power of 2 via `nextpow`.

A function implementing your idea gives the correct result.

``````
@generated function g2(x::UInt8, ::Val{N}) where N
    N2 = nextpow(2, N)
    ir = """
        %a = zext i8 %0 to i$N2
        %b = bitcast i$N2 %a to <$N2 x i1>
        %c = zext <$N2 x i1> %b to <$N2 x i8>
        ret <$N2 x i8> %c
        """
    quote
        v2 = Base.llvmcall($ir, NTuple{$N2, VecElement{Int8}}, Tuple{UInt8}, x)
        v = v2[1:N]
        unvec(v)
    end
end
``````

However, it can be much slower than `g`:

``````
julia> @btime g(0xff, Val(10));
  3.002 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(10));
  2.993 ns (0 allocations: 0 bytes)

julia> @btime g(0xff, Val(11));
  2.997 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(11));
  514.766 ns (14 allocations: 304 bytes)
``````

The problem seems to be the reduction of the tuple length from `N2` to `N`.

EDIT: The same phenomenon occurs for

``````
function g3(x::UInt8, ::Val{N}) where N
    ntuple(i -> Int8(i <= 8 && !iszero(x & 1 << (i-1))), N)
end
``````
``````
julia> @btime g3(0xff, Val(10));
  1.492 ns (0 allocations: 0 bytes)

julia> @btime g3(0xff, Val(11));
  390.735 ns (2 allocations: 96 bytes)
``````
``````
julia> @generated function g4(x::UInt8, ::Val{N}) where N
           N2 = nextpow(2, N)
           ir = """
               %a = zext i8 %0 to i$N2
               %b = bitcast i$N2 %a to <$N2 x i1>
               %c = zext <$N2 x i1> %b to <$N2 x i8>
               ret <$N2 x i8> %c
               """
           quote
               v2 = Base.llvmcall($ir, NTuple{$N2, VecElement{Int8}}, Tuple{UInt8}, x)
               v = Base.Cartesian.@ntuple $N n -> @inbounds(v2[n])
               unvec(v)
           end
       end
g4 (generic function with 1 method)
``````

works fairly well.

``````
julia> @btime g(0xff, Val(10));
  2.850 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(10));
  2.614 ns (0 allocations: 0 bytes)

julia> @btime g4(0xff, Val(10));
  3.017 ns (0 allocations: 0 bytes)

julia> @btime g(0xff, Val(11));
  3.257 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(11));
  684.893 ns (14 allocations: 304 bytes)

julia> @btime g4(0xff, Val(11));
  3.254 ns (0 allocations: 0 bytes)
``````

Note that I do not trust your `unvec` implementation.
I have often seen it generate incorrect code and cause bugs.
The safer way is to use `llvmcall` with `extractelement`.

Great, that works!

How would you do that? Extract all `N2` elements separately? But how would you then pass them back to Julia? Can one return an LLVM array (like `[8 x i8]`) instead of a vector (`<8 x i8>`)?

Just replace `getproperty(getindex(tup, i), :value)` with `extractelement(tup, i)`.

It seems you are talking about some `extractelement` in Julia, not in LLVM. How does one access it? I can’t even find it in the source tree.

Define a Julia function using `extractelement` via `llvmcall`.
VectorizationBase has such an implementation, but it isn’t copy/pastable due to some workaround for supporting both the new and old `llvmcall` syntax.

I have come upon a related issue, which is really annoying: `bitcast <$N x i1> to i$N`.

Context: one of my favorite instruction sequences is to compare vectors, compress the results into an integer (`vpmovmskb` / `vmovmskps` / `vmovmskpd`), then loop over the indices where the comparison succeeded via `trailing_zeros` / `Base._blsr`. Since Julia represents the result of a vector comparison as `<$N x i8>`, you need to truncate before bitcasting.
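For readers unfamiliar with that idiom, the loop over matching indices can be sketched as follows (a hypothetical helper, not from this thread; `Base._blsr(x)` is just `x & (x - 1)`, which clears the lowest set bit):

``````
# Hypothetical sketch: visit the (1-based) indices of the set bits of a mask,
# lowest to highest, as in the movemask + trailing_zeros idiom.
function foreach_set_bit(f, mask::Unsigned)
    while mask != 0
        f(trailing_zeros(mask) + 1)  # index of the next set bit
        mask &= mask - one(mask)     # clear it (what Base._blsr does)
    end
end

# foreach_set_bit(println, 0b1010) visits indices 2 and 4
``````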

I am somewhat miffed that SIMD.jl doesn’t provide that very natural operation (turning a vector of `Bool` into an integer), so I wanted to add it, and discovered that it gives incorrect results for “weird” values of `N` (you of course need to `zext` the weird-size integer to `i64` before returning it to Julia).

But for non-weird vector-lengths, this is a super nice platform-independent idiom for my favorite instruction that generates whatever hand-rolled presumably-optimized-to-death workaround on ARM (which lacks pmovmskb for some unfathomable reason).
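As a concrete instance of that idiom, here is a minimal sketch of a movemask via `llvmcall`, hard-coded to the safe power-of-two width 8 (the name `movemask8` is my own; the truncate-before-bitcast step is the one described above):

``````
# Hypothetical sketch: compress an <8 x i8> comparison mask (lanes 0x00/0xff)
# into a single UInt8, one bit per lane. Only power-of-two widths are safe
# with this bitcast, per the discussion of "weird" N.
@inline function movemask8(v::NTuple{8,Core.VecElement{Int8}})
    Base.llvmcall("""
        %t = trunc <8 x i8> %0 to <8 x i1>   ; keep the low bit of each lane
        %m = bitcast <8 x i1> %t to i8       ; pack the 8 lane bits into an i8
        ret i8 %m
        """, UInt8, Tuple{NTuple{8,Core.VecElement{Int8}}}, v)
end
``````

On x86 this is the shape one would hope gets lowered to a single `vpmovmskb`-style instruction; I have not verified the generated assembly.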

Per Stack Overflow this was UB from the very start. To quote the langref:

> The conversion is done as if the `value` had been stored to memory and read back as type `ty2`.

But `i1` has no memory representation; it is a purely logical type invented by LLVM. So bitcasting `i1` vectors shouldn’t work anyway!

So this stumped me and prevented me from opening a PR on SIMD.jl. I am happy to use that nice idiom in personal projects until the next LLVM update blows up in my face, but I wouldn’t impose that on serious people who sit far downstream.

So, @Elrod , since you’re more knowledgeable on these matters… any helpful comments?

Should I just pad the vector to a “good” size before bitcasting it into a good-size integer? I didn’t think of that, but this may work.


Example:

``````
julia> @generated function extractelement(v::NTuple{N,Core.VecElement{Int8}}, i::Int) where {N}
           ir = """
               %a = extractelement <$N x i8> %0, i$(8sizeof(Int)) %1
               ret i8 %a
               """
           quote
               $(Expr(:meta, :inline)) # important when using llvmcall
               Base.llvmcall($ir, Int8, Tuple{NTuple{$N, VecElement{Int8}}, Int}, v, i)
           end
       end
extractelement (generic function with 1 method)

julia> @inline unvec(t::Tuple{Vararg{VecElement}}, ::Val{N}) where {N} = ntuple(Base.Fix1(extractelement, t), Val(N))
unvec (generic function with 2 methods)

julia> @generated function g5(x::UInt8, ::Val{N}) where N
           N2 = nextpow(2, N)
           ir = """
               %a = zext i8 %0 to i$N2
               %b = bitcast i$N2 %a to <$N2 x i1>
               %c = zext <$N2 x i1> %b to <$N2 x i8>
               ret <$N2 x i8> %c
               """
           quote
               $(Expr(:meta, :inline))
               v = Base.llvmcall($ir, NTuple{$N2, VecElement{Int8}}, Tuple{UInt8}, x)
               unvec(v, Val(N))
           end
       end
g5 (generic function with 1 method)

julia> @btime g5(0xff, Val(10));
  3.276 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(10));
  16.289 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(10));
  16.288 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.538 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.538 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.535 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.539 ns (0 allocations: 0 bytes)

julia> @code_native syntax=:intel debuginfo=:none g5(0xff, Val(11))
``````
``````
julia_g5_1335:                          # @julia_g5_1335
# %bb.0:                                # %top
push	rbp
mov	rbp, rsp
mov	rax, rdi
kmovd	k1, esi
movabs	rcx, offset .LCPI0_0
vmovdqu8	xmm0 {k1} {z}, xmmword ptr [rcx]
vpsrldq	xmm1, xmm0, 1                   # xmm1 = xmm0[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero
vmovlpd	qword ptr [rdi], xmm1
vpextrb	byte ptr [rdi + 8], xmm0, 9
vpextrb	byte ptr [rdi + 9], xmm0, 10
vpextrb	byte ptr [rdi + 10], xmm0, 11
pop	rbp
ret
``````

Benchmarks appear to be noisy, but the assembly is good.

As a word of caution, `llvmcall` is considered to be expensive by Julia’s inliner. Thus, if you use more than a couple, it won’t inline your code unless you add `@inline`.
This forces you to spam `@inline` everywhere (or `$(Expr(:meta,:inline))` in `@generated` functions).


Is it really considered undefined behavior at power-of-2 sizes?
I have always been padding the vectors. VectorizationBase won’t even allow you to construct non-power-of-2 sizes, nor will it let you construct overly long vectors:

``````
julia> using VectorizationBase

julia> VectorizationBase.Vec(1,2,3)
Vec{4, Int64}<1, 2, 3, 0>

julia> VectorizationBase.Vec(ntuple(identity, Val(9))...) # probably shouldn't downcast the eltype size...
Vec{16, Int32}<1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0>

julia> VectorizationBase.Vec(ntuple(float, Val(9))...)
2 x Vec{8, Float64}
Vec{8, Float64}<1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0>
Vec{8, Float64}<9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0>
``````

These things don’t have a clear hardware representation, so I think it’s generally better to force dealing with the reality directly.
Especially because LLVM’s behavior is sometimes buggy when dealing with them.

(However, under the hood, it does sometimes generate overly long vectors, as this is an LLVM idiom used in generating shuffle sequences, for example).

VB is, like LV, deprecated, so I wouldn’t recommend switching to it.
But I’d suggest following that approach, except for the truncation to `Int32`. Definitely not one of my better ideas.

Alternatively, you could PR LLVM to fix this upstream by explicitly defining `bitcast` to do the right thing, but fixing non-power-of-2 integers in general could be involved.


> As a word of caution, `llvmcall` is considered to be expensive by Julia’s inliner. Thus, if you use more than a couple, it won’t inline your code unless you add `@inline`.
> This forces you to spam `@inline` everywhere (or `$(Expr(:meta,:inline))` in `@generated` functions).

I did not know that. Thanks for that off-hand comment!

> Is it really considered undefined behavior at power-of-2 sizes?

I am not a language-lawyer. How should I tell?

(I hate English-language specs. C/C++ language lawyering is even worse. The IETF is 1000× better at writing specs: many RFCs are a joy to read and interpret, and they include relevant examples. If I want to know what an x86 instruction does, the prose is supplemented by pseudo-code. LLVM docs should be ISA-style, not PL-style.)

> I have always been padding the vectors.

I think I’ll copy you on that.

> Alternatively, you could PR LLVM to fix this upstream by explicitly defining `bitcast` to do the right thing, but fixing non-power-of-2 integers in general could be involved.

My admittedly limited experience with submitting llvm PRs felt like a mix of being treated as a petitioner and shouting into the void (especially compared to the extremely welcoming julia community). Maybe I just got unlucky or bounced off for some other reason, but unless my dayjob demands it, I’m unlikely to try that again and will stick to kvetching for now.

If enough people depend on that `pmovmskb` idiom, somebody might fix it.
