Generated LLVM code causes register spills on x86_64

This question concerns circshift as implemented in the PR linked later in this thread; the code is reproduced below.

function circshift(x::Tuple, shift::Integer)
    @inline
    j = mod1(shift, length(x))
    ntuple(k -> __safe_getindex(x, k-j+ifelse(k>j,0,length(x))), Val(length(x)))
end

Note: Here, length(x) is a compile-time constant and __safe_getindex is just getindex without the bounds check. When trying this in the REPL, use Base.__safe_getindex in the definition above.
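
For reference, here is a variant that runs directly in the REPL (a sketch; circshift_repl is a hypothetical name, and @inbounds indexing stands in for the internal __safe_getindex):

function circshift_repl(x::Tuple, shift::Integer)
    @inline
    j = mod1(shift, length(x))
    # @inbounds getindex plays the role of __safe_getindex here
    ntuple(k -> @inbounds(x[k - j + ifelse(k > j, 0, length(x))]), Val(length(x)))
end

circshift_repl((1, 2, 3, 4), 1)  # (4, 1, 2, 3)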

For each element of the tuple, the following operations (LLVM) are necessary (example below shows the operations for the first element):

; ifelse + ...
%8 = icmp sgt i64 %7, 0
%9 = select i1 %8, i64 10, i64 0
%10 = sub i64 %9, %7
%memcpy_refined_src = getelementptr inbounds [10 x i64], [10 x i64]* %"x::Tuple", i64 0, i64 %10

; load
%46 = load i64, i64* %memcpy_refined_src, align 8

; new tuple element / store
%"new::Tuple.sroa.0.0..sroa_idx" = getelementptr inbounds [10 x i64], [10 x i64]* %sret_return, i64 0, i64 0
store i64 %46, i64* %"new::Tuple.sroa.0.0..sroa_idx", align 8

Problem: In the generated (full) LLVM code for length(x) = 50, the groups of operations (separated by comments and spaces above) are performed together for all indices. That is, first the ifelse part is computed for all indices, followed by loads for all indices, and lastly the new::Tuple stores for all indices. This causes register spills and reloads in the native code, as explained below.

On my PC (x86_64-linux-gnu, i7-1165G7), the %memcpy_refined_src addresses are computed and stored on the stack (spilled, 8 bytes each, due to register shortage) for all indices and only retrieved (reloaded, 8 bytes each) just before the loads. These spills and reloads (and the stack usage) get worse as length(x) increases.
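
For anyone wanting to reproduce this, something along these lines should work (assuming the definition above has been entered in the REPL; the exact instruction mix depends on the CPU and the Julia/LLVM version):

z = ntuple(identity, Val(50))
@code_llvm debuginfo=:none circshift(z, 1)    # selects, then loads, then stores, grouped per phase
@code_native debuginfo=:none circshift(z, 1)  # look for rsp-relative spill/reload moves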

Questions:

  1. What dependency in the Julia code prevents the ifelse, load, and store combination for each index from being placed together as one neat block? Note that the input and output addresses do not overlap (the output is newly allocated).
  2. How can the Julia code be modified so that such neat blocks can be obtained (i.e. the spills and reloads are avoided)?

Edit: see the comments below. On newer versions of Julia, I think using Base.setindex to set each individual field is enough; there's no need to use pointers anymore. On older versions of Julia this doesn't quite work.

I can come up with three solutions, but none of them is fully satisfying…
First, this is caused by code generation for long tuples: Julia's compiler generates the code straightforwardly, so there are a lot of repeated instructions.

This is my versioninfo:

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 Ă— Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores

The first, safe solution is to use MVector to emulate an on-stack buffer. The machine code is fully vectorized. Unfortunately, it adds some strange instructions, and it needs to copy the stack buffer back to the output (because the stack frame is removed when we return from the function).

import StaticArrays
const MVector{T} = StaticArrays.StaticArraysCore.MArray{Tuple{T}, Int64, 1, T}
function circshift_mvector(x::NTuple{T, Int64}, shift::Integer) where T
    @inline
    p = MVector{T}(undef)  # mutable on-stack buffer
    j = mod1(shift, length(x))
    for k in eachindex(p)
        # Core.getfield with boundscheck = false, like __safe_getindex above
        @inbounds p[k] = Core.getfield(x, k - j + ifelse(k > j, 0, length(x)), false)
    end
    c = p.data  # copy the buffer back into an immutable tuple
    return c
end
z = ntuple(k->k, Val(50))
circshift_mvector(z, 1)

The second, unsafe solution uses a pointer. Compared to the first solution, it has no strange instructions anymore, but it still needs a copy from the Ref to the output.

function circshift_ref(x::NTuple{T, Int64}, shift::Integer) where T
    ref = Ref{NTuple{T, Int64}}()
    ptr = Ptr{Int64}(Base.pointer_from_objref(ref))
    Base.GC.@preserve ref begin  # preserve the Ref itself; preserving the (bitstype) Ptr does nothing
        circshift_ptr!(ptr, x, shift)
    end
    return ref[]
end

function circshift_ptr!(ptr::Ptr{Int64}, x::NTuple{T, Int64}, shift::Integer) where T
    @inline
    j = mod1(shift, length(x))
    for k in eachindex(x)
        Base.unsafe_store!(ptr, Core.getfield(x, k - j + ifelse(k > j, 0, length(x)), false), k)
    end
    return nothing  # was `return Nothing`, which returns the type rather than the value
end
z = ntuple(k->k, Val(50))
circshift_ref(z, 1)

The third solution is to use a pointer and move the return value into a parameter. There is no copy anymore! The problem is that you need to pass a Ref. Another problem is that the passed Ref is considered escaped, so I think it will incur a heap allocation… If there were a way to mark the input Ref as non-escaping, this might be a perfect solution.

@noinline function circshift_ref_in(ref::Ref{NTuple{T, Int64}}, x::NTuple{T, Int64}, shift::Integer) where T
    ptr = Ptr{Int64}(Base.pointer_from_objref(ref))
    Base.GC.@preserve ref begin  # same fix as above: root the Ref, not the Ptr
        circshift_ptr!(ptr, x, shift)
    end
    return nothing
end
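
A hypothetical call site, where the caller allocates and owns the Ref:

z = ntuple(k -> k, Val(50))
out = Ref{NTuple{50, Int64}}()
circshift_ref_in(out, z, 1)
out[]  # the shifted tuple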

I would note that your last two solutions only work for NTuples of an isbits type.

Somewhat related: Specialize `setindex` for isbits `Tuple`s and `_totuple` for isbits `NTuple`s by Zentrik · Pull Request #51748 · JuliaLang/julia · GitHub, and GitHub - Zentrik/julia at setfield-immutable.


Yeah, I also noticed that there's a setindex method, but it doesn't exist in my Julia.
Maybe on newer versions of Julia this problem can be solved by first creating an uninitialized tuple and directly using setindex to set the individual fields? Something like the sketch below.
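
For concreteness, a minimal sketch of that idea (circshift_setindex is a hypothetical name, and this assumes a Julia version where Base.setindex on a long isbits tuple compiles to in-place stores, which, per the reply below, may not hold):

function circshift_setindex(x::NTuple{N, Int64}, shift::Integer) where N
    j = mod1(shift, N)
    out = x  # any tuple of the right shape works as the initial value
    for k in 1:N
        out = Base.setindex(out, x[k - j + ifelse(k > j, 0, N)], k)
    end
    return out
end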

I don't think so: setindex struggles on large tuples and falls back to using vectors. I linked the PR because it was also doing the second solution, and there was a desire not to rely on the C compatibility of tuples.

For context, the question concerns this PR to Julia: Implement `circshift(::Tuple, ::Int)` by aravindh-krishnamoorthy · Pull Request #52438 · JuliaLang/julia · GitHub

It seems that we have to use this unsafe trick to implement an on-stack buffer, which is really annoying…

Thank you very much for your responses @anon56330260 @Zentrik and @nsajko.

@anon56330260, the methods are very interesting and creative. However, as you already recognised, they sadly may not be applicable to the specific scenario of this problem: this implementation is for Base, which cannot have external package dependencies. Furthermore, for a generic solution, the pointer version would also have to handle boxed types, which is not portable. Moreover, since tuples are immutable, setindex is very inefficient :frowning:

I feel that the Julia-to-LLVM translation is working well. In fact, I also understand why the dependencies are introduced, i.e., why all the indices are evaluated before the tuple allocation at this stage.

However, what I don't understand is the LLVM-to-x86 translation: why is the translator unable to reorganise instructions, and why does it instead keep the same linear flow as the LLVM code? As you can see, this introduces unnecessary dependencies that force the compiler to use the stack for intermediate values (spills and reloads). Reorganised code could avoid these spills and reloads, IMHO.

Also, I'm not sure why these dependencies are carried over into the x86 code.

  1. Perhaps the code is not idiomatic, so the compiler is unable to recognise the patterns already at the LLVM stage?
  2. Perhaps there are unintended dependencies in the Julia code, which I have missed, that must be carried forward? (I cannot see any.)

Furthermore, I'm also not sure whether other backends, such as LLVM to Apple silicon or ARM code, also introduce these redundant loads and stores. Finding out would be interesting, but unfortunately I only have Intel hardware.

Any thoughts on the above would be very helpful :slight_smile:

Here is a simple guess.

  1. LLVM fails to reorganise the instructions. Maybe adding an additional loop-vectorization pass could solve this problem?
  2. During the LLVM bitcode → machine code translation, it's difficult to reorder memory operations, because you need alias analysis to ensure that the result is unchanged (if you move a store before a load, you need to prove that you won't load the newly stored values); see the small sketch after this list.
  3. So it directly translates the LLVM IR in this linear form.
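
To illustrate point 2 with a small, hypothetical Julia example: if the destination and source may alias, interleaving stores with loads changes the result, so the backend has to keep the original order unless it can prove the buffers do not overlap:

function shift_into!(dst::Vector{Int}, src::Vector{Int})
    for k in eachindex(src)
        dst[k] = src[mod1(k + 1, length(src))]
    end
    return dst
end

a = collect(1:4)
shift_into!(similar(a), a)  # [2, 3, 4, 1]: distinct buffers, reordering is safe
shift_into!(a, a)           # [2, 3, 4, 2]: aliased, an earlier store clobbers a later load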

The real problem is why LLVM can't vectorize this automatically. It's not your fault: you use an immutable tuple, and that's why all the loads and stores get clustered together. All three of my solutions use a mutable array, and LLVM correctly vectorizes the code.

I did some simple experiments running the SLP vectorizer from opt, but it fails. Maybe the pattern generated by Julia's compiler is too complicated?
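
For the record, the experiment can be reproduced roughly like this (a sketch; it assumes an opt binary matching Julia's LLVM version, 14 here, is on the PATH, and the file names are arbitrary):

using InteractiveUtils
# dump the full module IR for the thread's circshift definition
open("circshift.ll", "w") do io
    code_llvm(io, circshift, Tuple{NTuple{50, Int64}, Int};
              raw = true, dump_module = true)
end
# then, in a shell:
#   opt -passes=slp-vectorizer -S circshift.ll -o circshift_slp.ll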

Anyway, I think you should not rely on such fragile and platform-dependent compiler optimizations in Base. If someone needs a high-performance implementation (generally with bits types), they can implement one themselves. This can serve as a good reference implementation…


I think this one is probably worth investigating in LLVM. There may be some extra aliasing annotations we can add.


In the absence of further discussion on this post, I'll probably file an issue on the Julia GitHub repository and cross-link it here.

Regarding the original problem of poor performance: luckily, it only applies in a narrow scope, i.e., when shift is not known at compile time. Hence, I will continue with the PR (linked above) as is, in the hope that any resulting update/fix in LLVM will improve the performance of the above code.

Link: Generated LLVM code causes register spills · Issue #52933 · JuliaLang/julia (github.com)