Bit manipulations with llvmcall give strange results

I’m playing around with llvmcall. The following function g is supposed to convert x::UInt8 into an Int8 tuple of length N whose first 8 entries are the bits of x and whose remaining entries are zero.

unvec(t::Tuple{Vararg{VecElement}}) = map(x -> x.value, t)  # remove the `VecElement` from each element of t

@generated function g(x::UInt8, ::Val{N}) where N
    ir = """
        %a = zext i8 %0 to i$N                ; zero extend from 8 bits to N bits
        %b = bitcast i$N %a to <$N x i1>      ; convert to N x i1 vector
        %c = zext <$N x i1> %b to <$N x i8>   ; zero extend from N x i1 to N x i8
        ret <$N x i8> %c
    """
    quote
        v = Base.llvmcall($ir, NTuple{$N, VecElement{Int8}}, Tuple{UInt8}, x)
        unvec(v)
    end
end

The LLVM documentation for the zext instruction says that

zext fills the high order bits of the value with zero bits until it reaches the size of the destination type

Hence, for x = 0xff, I would expect the first 8 entries to be 1 and all remaining entries to be 0. However, this is not what I get. For

julia> for N in 9:24
           t = g(0xff, Val(N))
           println("$N: $t")
       end

I get

 9: (1, 1, 1, 1, 1, 1, 1, 1, 1)
10: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
11: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
12: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
13: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
14: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
15: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
16: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
17: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
18: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
19: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
20: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
21: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
22: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
23: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
24: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

on one machine and

 9: (1, 1, 1, 1, 1, 1, 1, 1, 0)
10: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
11: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
12: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
13: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
14: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
15: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)
16: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
17: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
18: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
19: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
20: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
21: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
22: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
23: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
24: (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

on another. Am I getting something wrong here, or is this an LLVM bug?

versioninfo() for the first output:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores

I don’t think LLVM has good support for non-power-of-2-sized integers. I suggest casting up to the next power of 2.

It’s easy to get the next power of 2 via bit manipulation: count the bits needed to represent N-1 and shift 1 that far to the left:

julia> nextpow2(N) = 1<<((8sizeof(N)) - leading_zeros(N-one(N)))
nextpow2 (generic function with 1 method)

julia> hcat(1:8,nextpow2.(1:8))
8×2 Matrix{Int64}:
 1  1
 2  2
 3  4
 4  4
 5  8
 6  8
 7  8
 8  8

You can also get the next power of 2 via nextpow.
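
For example:

julia> nextpow(2, 11)
16

julia> nextpow(2, 16)
16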

A function implementing your idea gives the correct result.

@generated function g2(x::UInt8, ::Val{N}) where N
    N2 = nextpow(2, N)
    ir = """
        %a = zext i8 %0 to i$N2
        %b = bitcast i$N2 %a to <$N2 x i1>
        %c = zext <$N2 x i1> %b to <$N2 x i8>
        ret <$N2 x i8> %c
    """
    quote
        v2 = Base.llvmcall($ir, NTuple{$N2, VecElement{Int8}}, Tuple{UInt8}, x)
        v = v2[1:N]
        unvec(v)
    end
end

However, it can be much slower than g:

julia> @btime g(0xff, Val(10));
  3.002 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(10));
  2.993 ns (0 allocations: 0 bytes)

julia> @btime g(0xff, Val(11));
  2.997 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(11));
  514.766 ns (14 allocations: 304 bytes)

The problem seems to be the reduction of the tuple length from N2 to N: presumably the result of indexing a tuple with a range is not inferred to have a fixed length, which makes the slice type-unstable.

EDIT: The same phenomenon occurs for

function g3(x::UInt8, ::Val{N}) where N
    ntuple(i -> Int8(i <= 8 && !iszero(x & 1 << (i-1))), N)  # entry i is bit i-1 of x, or 0 for i > 8
end

julia> @btime g3(0xff, Val(10));
  1.492 ns (0 allocations: 0 bytes)

julia> @btime g3(0xff, Val(11));
  390.735 ns (2 allocations: 96 bytes)
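
Presumably this is because ntuple with a plain integer length is only unrolled for small lengths (Base special-cases lengths up to 10). A variant that passes the length through as a Val should stay allocation-free; here is an untested sketch (g3v is my name for it):

function g3v(x::UInt8, ::Val{N}) where N
    # Val(N) keeps the length in the type domain, so ntuple fully unrolls
    ntuple(i -> Int8(i <= 8 && !iszero(x & 1 << (i-1))), Val(N))
end
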
julia> @generated function g4(x::UInt8, ::Val{N}) where N
           N2 = nextpow(2, N)
           ir = """
               %a = zext i8 %0 to i$N2
               %b = bitcast i$N2 %a to <$N2 x i1>
               %c = zext <$N2 x i1> %b to <$N2 x i8>
               ret <$N2 x i8> %c
           """
           quote
               v2 = Base.llvmcall($ir, NTuple{$N2, VecElement{Int8}}, Tuple{UInt8}, x)
               v = Base.Cartesian.@ntuple $N n -> @inbounds(v2[n])
               unvec(v)
           end
       end
g4 (generic function with 1 method)

works fairly well.

julia> @btime g(0xff, Val(10));
  2.850 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(10));
  2.614 ns (0 allocations: 0 bytes)

julia> @btime g4(0xff, Val(10));
  3.017 ns (0 allocations: 0 bytes)

julia> @btime g(0xff, Val(11));
  3.257 ns (0 allocations: 0 bytes)

julia> @btime g2(0xff, Val(11));
  684.893 ns (14 allocations: 304 bytes)

julia> @btime g4(0xff, Val(11));
  3.254 ns (0 allocations: 0 bytes)

Note that I do not trust your unvec implementation.
I have often seen it generate incorrect code and cause bugs.
The safer way is to use llvmcall with extractelement.

Great, that works!

How would you do that? Extract all N2 elements separately? But how would you then pass them back to Julia? Can one return an LLVM array (like [8 x i8]) instead of a vector (<8 x i8>)?

Just replace getproperty(getindex(tup, i), :value) with extractelement(tup, i).

It seems you are talking about some extractelement in Julia, not in LLVM. How does one access it? I can’t even find it in the source tree.

Define a Julia function using extractelement via llvmcall.
VectorizationBase has such an implementation, but it isn’t copy/pastable due to some workaround for supporting both the new and old llvmcall syntax.

I have come upon a related issue, which is really annoying: bitcast <$N x i1> to i$N.

Context is one of my favorite instruction sequences: compare vectors, compress the results into an integer (vpmovmskb / vmovmskps / vmovmskpd), then loop over the indices where the comparison succeeded via trailing_zeros / Base._blsr. Since Julia represents the result of a vector comparison as <$N x i8>, you need to truncate before bitcasting.

I am somewhat miffed that SIMD.jl doesn’t provide that very natural operation (turning a vector of Bool into an integer), so I wanted to add it, and discovered that this gives incorrect results for “weird” values of N (you of course need to zext the weird-size integer to i64 before returning it to Julia).

But for non-weird vector lengths, this is a super nice platform-independent idiom for my favorite instruction, one that generates whatever hand-rolled, presumably optimized-to-death workaround on ARM (which lacks pmovmskb for some unfathomable reason).
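
Concretely, the idiom I mean looks like the following sketch (the name movemask and the <N x i8> input convention are mine, and it is only trustworthy for power-of-2 N):

@generated function movemask(v::NTuple{N,Core.VecElement{Int8}}) where {N}
    ir = """
        %b = trunc <$N x i8> %0 to <$N x i1>   ; keep the low bit of each lane
        %m = bitcast <$N x i1> %b to i$N       ; compress the mask into an integer
        %z = zext i$N %m to i64                ; widen to a Julia-friendly size
        ret i64 %z
    """
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall($ir, Int64, Tuple{NTuple{$N, Core.VecElement{Int8}}}, v)
    end
end

# looping over the lanes where a comparison succeeded:
m = movemask(v)
while m != 0
    i = trailing_zeros(m) + 1  # 1-based lane index
    # ... process lane i ...
    m = Base._blsr(m)          # clear the lowest set bit
end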

Per Stack Overflow, this was UB from the very start. To quote the LangRef:

The conversion is done as if the value had been stored to memory and read back as type ty2.

But i1 has no memory representation; it is a purely logical type invented by LLVM. So bitcasting i1 vectors shouldn’t work anyway!

So this stumped me and prevented me from opening a PR on SIMD.jl. I am happy to use that nice idiom in personal projects until the next LLVM update blows up in my face, but I wouldn’t impose that on serious people who sit far downstream.

So, @Elrod, since you’re more knowledgeable on these matters… any helpful comments?

Should I just pad the vector to a “good” size before bitcasting it into a good-size integer? I didn’t think of that, but this may work.
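
Something like this sketch, perhaps (untested; movemask_padded is my name, and masking off the bits from the undef padding lanes is my guess at how to keep the result well defined):

@generated function movemask_padded(v::NTuple{N,Core.VecElement{Int8}}) where {N}
    N2 = nextpow(2, N)
    # lanes 0..N-1 come from v; the padding lanes are undef
    idx = join((i < N ? "i32 $i" : "i32 undef" for i in 0:N2-1), ", ")
    ir = """
        %p = shufflevector <$N x i8> %0, <$N x i8> undef, <$N2 x i32> <$idx>
        %b = trunc <$N2 x i8> %p to <$N2 x i1>
        %m = bitcast <$N2 x i1> %b to i$N2
        %z = zext i$N2 %m to i64
        ret i64 %z
    """
    quote
        $(Expr(:meta, :inline))
        m = Base.llvmcall($ir, Int64, Tuple{NTuple{$N, Core.VecElement{Int8}}}, v)
        m & $((Int64(1) << N) - 1)  # clear bits coming from the undef padding lanes
    end
end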

Example:

julia> @generated function extractelement(v::NTuple{N,Core.VecElement{Int8}}, i::Int) where {N}
           ir = """
               %a = extractelement <$N x i8> %0, i$(8sizeof(Int)) %1   ; note: %1 is a 0-based lane index
               ret i8 %a
           """
           quote
               $(Expr(:meta,:inline)) # important when using llvmcall
               Base.llvmcall($ir, Int8, Tuple{NTuple{$N, VecElement{Int8}}, Int}, v, i)
           end
       end
extractelement (generic function with 1 method)

julia> @inline unvec(t::Tuple{Vararg{VecElement}}, ::Val{N}) where {N} = ntuple(i -> extractelement(t, i - 1), Val(N))  # i - 1: extractelement expects 0-based indices
unvec (generic function with 2 methods)

julia> @generated function g5(x::UInt8, ::Val{N}) where N
           N2 = nextpow(2, N)
           ir = """
               %a = zext i8 %0 to i$N2
               %b = bitcast i$N2 %a to <$N2 x i1>
               %c = zext <$N2 x i1> %b to <$N2 x i8>
               ret <$N2 x i8> %c
           """
           quote
               $(Expr(:meta, :inline))
               v = Base.llvmcall($ir, NTuple{$N2, VecElement{Int8}}, Tuple{UInt8}, x)
               unvec(v, Val(N))
           end
       end
g5 (generic function with 1 method)

julia> @btime g5(0xff, Val(10));
  3.276 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(10));
  16.289 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(10));
  16.288 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.538 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.538 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.535 ns (0 allocations: 0 bytes)

julia> @btime g5(0xff, Val(11));
  15.539 ns (0 allocations: 0 bytes)

julia> @code_native syntax=:intel debuginfo=:none g5(0xff, Val(11))
julia_g5_1335:                          # @julia_g5_1335
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
	kmovd	k1, esi
	movabs	rcx, offset .LCPI0_0
	vmovdqu8	xmm0 {k1} {z}, xmmword ptr [rcx]
	vpsrldq	xmm1, xmm0, 1                   # xmm1 = xmm0[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero
	vmovlpd	qword ptr [rdi], xmm1
	vpextrb	byte ptr [rdi + 8], xmm0, 9
	vpextrb	byte ptr [rdi + 9], xmm0, 10
	vpextrb	byte ptr [rdi + 10], xmm0, 11
	pop	rbp
	ret

Benchmarks appear to be noisy, but the assembly is good.

As a word of caution, llvmcall is considered to be expensive by Julia’s inliner. Thus, if you use more than a couple of them, it won’t inline your code unless you add @inline.
This forces you to spam @inline everywhere (or $(Expr(:meta,:inline)) in @generated functions).

Is it really considered undefined behavior at power-of-2 sizes?
I have always been padding the vectors. VectorizationBase won’t even allow you to construct non-power-of-2 sizes, nor will it let you construct overly long vectors:

julia> using VectorizationBase

julia> VectorizationBase.Vec(1,2,3)
Vec{4, Int64}<1, 2, 3, 0>

julia> VectorizationBase.Vec(ntuple(identity, Val(9))...) # probably shouldn't downcast the eltype size...
Vec{16, Int32}<1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0>

julia> VectorizationBase.Vec(ntuple(float, Val(9))...)
2 x Vec{8, Float64}
Vec{8, Float64}<1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0>
Vec{8, Float64}<9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0>

These things don’t have a clear hardware representation, so I think it’s generally better to force dealing with the reality directly, especially because LLVM’s behavior is sometimes buggy when dealing with them.

(However, under the hood, it does sometimes generate overly long vectors, as this is an LLVM idiom used in generating shuffle sequences, for example).

VectorizationBase is, like LoopVectorization, deprecated, so I wouldn’t recommend switching to it. But I’d suggest following that approach, except for the truncation to Int32. That was definitely not one of my better ideas.

Alternatively, you could PR LLVM to fix this upstream by explicitly defining bitcast to do the right thing, but fixing non-power-of-2 integers in general could be involved.

As a word of caution, llvmcall is considered to be expensive by Julia’s inliner. Thus, if you use more than a couple of them, it won’t inline your code unless you add @inline.
This forces you to spam @inline everywhere (or $(Expr(:meta,:inline)) in @generated functions).

I did not know that. Thanks for that off-hand comment!

Is it really considered undefined behavior at power-of-2 sizes?

I am not a language-lawyer. How should I tell?

(I hate English-language specs. C/C++ language lawyering is even worse. The IETF is 1000x better at writing specs; many RFCs are a joy to read and interpret, and they involve relevant examples. If I want to know what an x86 instruction does, the prose is supplemented by pseudo-code. LLVM docs should be ISA-style, not PL-style.)

I have always been padding the vectors.

I think I’ll copy you on that.

Alternatively, you could PR LLVM to fix this upstream by explicitly defining bitcast to do the right thing, but fixing non-power-of-2 integers in general could be involved.

My admittedly limited experience with submitting LLVM PRs felt like a mix of being treated as a petitioner and shouting into the void (especially compared to the extremely welcoming Julia community). Maybe I just got unlucky or bounced off for some other reason, but unless my day job demands it, I’m unlikely to try that again and will stick to kvetching for now.

If enough people depend on that pmovmskb idiom, somebody might fix it :wink:
