This question concerns the `circshift` for tuples implemented here, which is reproduced below.
```julia
function circshift(x::Tuple, shift::Integer)
    @inline
    j = mod1(shift, length(x))
    ntuple(k -> __safe_getindex(x, k - j + ifelse(k > j, 0, length(x))), Val(length(x)))
end
```
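For reference, a quick sanity check of the semantics. This sketch substitutes a hypothetical `__safe_getindex` (plain `getindex`, minus the elided bounds check) so it runs standalone, and renames the function to avoid clashing with `Base.circshift`:

```julia
# Hypothetical stand-in for Base.__safe_getindex, for a self-contained demo:
# ordinary getindex, without the bounds-check elision of the real helper.
__safe_getindex(x::Tuple, i::Integer) = getindex(x, i)

function circshift_demo(x::Tuple, shift::Integer)
    j = mod1(shift, length(x))
    ntuple(k -> __safe_getindex(x, k - j + ifelse(k > j, 0, length(x))), Val(length(x)))
end

circshift_demo((1, 2, 3, 4, 5), 2)  # → (4, 5, 1, 2, 3)
```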
Note: here `length(x)` is a compile-time constant, and `__safe_getindex` is just `getindex` without the bounds check. Use `Base.__safe_getindex` in the above definition when trying it in the REPL.
For each element of the tuple, the following LLVM operations are needed (the example below shows them for the first element):
```llvm
; ifelse + ...
%8 = icmp sgt i64 %7, 0
%9 = select i1 %8, i64 10, i64 0
%10 = sub i64 %9, %7
%memcpy_refined_src = getelementptr inbounds [10 x i64], [10 x i64]* %"x::Tuple", i64 0, i64 %10

; load
%46 = load i64, i64* %memcpy_refined_src, align 8

; new tuple element / store
%"new::Tuple.sroa.0.0..sroa_idx" = getelementptr inbounds [10 x i64], [10 x i64]* %sret_return, i64 0, i64 0
store i64 %46, i64* %"new::Tuple.sroa.0.0..sroa_idx", align 8
```
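Listings like the above can be reproduced as follows (a sketch; it assumes the `circshift` definition above has been evaluated in the REPL with `Base.__safe_getindex`, and the `[10 x i64]` shapes suggest a 10-element tuple was used):

```julia
using InteractiveUtils  # for @code_llvm; loaded automatically in the REPL

x = ntuple(identity, 10)          # (1, 2, ..., 10), an NTuple{10, Int64}
@code_llvm debuginfo=:none circshift(x, 3)  # full LLVM IR, without line info
```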
Problem: in the full generated LLVM code for `length(x) == 50`, the groups of operations (separated by comments and blank lines above) are performed together across all indices. That is, the `ifelse` part is computed for all indices first, followed by the `load`s for all indices, and lastly the `new::Tuple` stores for all indices. This causes register spills and reloads in the native code, as explained below.
On my PC (x86_64-linux-gnu/i7-1165G7), the `%memcpy_refined_src` addresses are computed and stored on the stack (spilled, 8 bytes each, due to register shortage) for all indices, then retrieved (reloaded, 8 bytes each) just before the `load`s. These spills and reloads (and the stack usage) get worse as `length(x)` increases.
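The spills and reloads can be observed in the native code with something like the sketch below (again assuming the `circshift` definition above is in scope; the exact assembly depends on CPU and Julia version):

```julia
using InteractiveUtils  # for @code_native

x = ntuple(identity, 50)
# Look for stack traffic (e.g. mov ... , N(%rsp) followed by later reloads)
# interleaved between the address computations and the loads:
@code_native debuginfo=:none circshift(x, 3)
```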
Questions:
- What dependency in the Julia code prevents the `ifelse`, `load`, and `store` for each index from being grouped into one neat block? Note that the input and output addresses do not overlap (the output is newly allocated).
- How can the Julia code be modified so that such neat blocks are obtained (i.e. the spills and reloads are avoided)?