Performance with intermittent singleton dimensions

Hello performance aficionados. I was testing a bit the copy performance, to see if there may be caching issues with multidimensional array copies and am generally quite impressed. However, I found one case that may lead to further improvements:

using BenchmarkTools
sz = (123,1,130,4)
a = Float32.((100.0-50.0) .* 2pi .* rand(sz...));
@btime @views $a[:, :, :,1:2] .= $a[:, :,:,3:4]; # 7.5 µs
@btime @views $a[:, 1,:,1:2] .= $a[:, 1,:,3:4]; # 13.6 µs
@btime @views $a[:, 1:1,:,1:2] .= $a[:, 1:1,:,3:4]; # 13.8 µs

If there is already a mechanism fusing adjacent colons in copy operations reducing cache misses, maybe one can extend such a mechanism to include singleton dimensions and/or full ranges?
Trying this with larger arrays indicates that this seems not just a pre-processing overhead, but the relative speedup decreases (less cache misses?).

Any ideas what is causing this effect, @stevengj, @mkitti? Can it be fixed somehow? A factor of almost two is not negligible, I guess.

With Julia 1.7.3 I’m getting (14.9us, 16.0us, 16.3us), with Julia 1.8.0 it’s (7.4us, 15.9us, 16.2us), and with Julia HEAD it’s (7.45us, 39.8us, 54.2us)

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 3700X 8-Core Processor
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 16 on 16 virtual cores

I once had an issue that’s maybe similar to this (ldiv!(::Vector, ::LDLt, ::Vector) is taking the slow path through copyto! because of mismatching IndexStyle · Issue #39497 · JuliaLang/julia · GitHub), so what about

f5(a) = selectdim(a,4,1:2) .= selectdim(a,4,3:4)
# or
f6(a) = a[:,:,:,1:2] .= selectdim(a,4,3:4)

It gives me 14.9us on 1.7.3, but 7.5us on HEAD and on 1.8.0. I have no idea what’s going on with this, especially the times I get above 40us, that seems unreasonably slow.

Edit: I just noticed that using selectdim misses the point of trying to get [:,1,:,...] to work fast, so that’s not actually useful.

Realize that colon slices are known at compile time, but whether a range is “full” or a dimension is singleton (except for StaticArrays) is only known at runtime (unless you get lucky with constant propagation).

Yes, I understand. But even at run-time some speedup decisions are possible. I imagine, that there may be algorithms which flatten consecutive dimensions to avoid some index calculations and also insure cache coherency. Such a decision is also possible at runtime. This is what orgininally got me onto this example. I implemented a method of fast calculation of rotationally symmetric function onto a Cartesian Grid and realized that a speedup can be obtained by doing a 1D array flip copy rather than ND indexed copying and fixing the wrongly assigned indices lateron. The code is here:

If you have any comments on the ``SeparableFunctions.jl`, which I plan to Release soonish that would be great.

Interesting. My results were for Julia 1.7.1.

Would SizedArray from StaticArrays.jl be helpful at all here?

Did you mean for the SeparableFunctions.jl toolbox or the above question of singleton dimension access?
The original question was concerning standard arrays. If the problem also exists for SizedArray then it may also need to be addressed there, I guess.

Part of the issue is as stevengj stated above:

One facility from StaticArrays to make size information available at compile time rather than run time is to wrap a standard array as a SizedArray. The size of the array then becomes part of the type. Thus we can know the size at compile time.

julia> using StaticArrays

julia> A = zeros(Int, 1024, 512);

julia> SA = SizedArray{Tuple{1024, 512}}(A);

julia> typeof(A)
Matrix{Int64} (alias for Array{Int64, 2})

julia> typeof(SA)
SizedMatrix{1024, 512, Int64, 2, Matrix{Int64}} (alias for SizedArray{Tuple{1024, 512}, Int64, 2, 2, Array{Int64, 2}})

julia> @code_warntype Size(A)
MethodInstance for Size(::Matrix{Int64})
  from Size(a::T) where T<:AbstractArray in StaticArrays at ~/.julia/packages/StaticArrays/68nRv/src/traits.jl:88
Static Parameters
  T = Matrix{Int64}
Body::Size{(StaticArrays.Dynamic(), StaticArrays.Dynamic())}
1 ─ %1 = StaticArrays.Size($(Expr(:static_parameter, 1)))::Core.Const(Size(StaticArrays.Dynamic(), StaticArrays.Dynamic()))
└──      return %1

julia> @code_warntype Size(SA)
MethodInstance for Size(::SizedMatrix{1024, 512, Int64, 2, Matrix{Int64}})
  from Size(a::T) where T<:AbstractArray in StaticArrays at ~/.julia/packages/StaticArrays/68nRv/src/traits.jl:88
Static Parameters
  T = SizedMatrix{1024, 512, Int64, 2, Matrix{Int64}}
  a::SizedMatrix{1024, 512, Int64, 2, Matrix{Int64}}
Body::Size{(1024, 512)}
1 ─ %1 = StaticArrays.Size($(Expr(:static_parameter, 1)))::Core.Const(Size(1024, 512))
└──      return %1

Above we see that we know the size statically at compile time from the line that states Body::Size{(1024, 512)}.

Actually for your case, we need to know the strides at compile time. For the SizedArray I created above we actually do know this at compile time because we know the parent is just an Array. However, Base.strides fails us here. Fortunately we have ArrayInterface.strides:

julia> using ArrayInterface

julia> ArrayInterface.strides(A)
(static(1), 1024)

julia> ArrayInterface.strides(SA)
(static(1), static(1024))

Applied to your specific problem using your example above, we can now see

julia> sa = SizedArray{Tuple{sz...}}(a);

julia> ArrayInterface.strides(sa)
(static(1), static(123), static(123), static(15990))

We need more machinery to take advantage of this information though. I believe StrideArrays.jl likely helps here, but the documentation is a bit too sparse for me to be sure. Perhaps @Elrod would have more insight here.

Thanks. That’s interesting with the ArrayInterface and this (or StrideArrays.jl) may be a way to construct the necessary machinery, which should also be possible for non-static arrays. Another option could be to allow singleton types like Val(1) or Val(1:1) in array access, such that the user has to chance to help the compiler with stride optimization. But maybe tackling this on the stride-level is anyway the best option.

Val is not necessary due to constant propagation.

Consider how the following compiles:

julia> f() = (A = (2,3); A[2])
f (generic function with 1 method)

julia> @code_llvm f()
;  @ REPL[3]:1 within `f`
define i64 @julia_f_159() #0 {
  ret i64 3

julia> g() = (A = (2,3); A[1:1])
g (generic function with 1 method)

julia> @code_llvm g()
;  @ REPL[5]:1 within `g`
define [1 x i64] @julia_g_186() #0 {
  ret [1 x i64] [i64 2]

Here we did not use Val. Yet LLVM was able to recognize that the index given was constant in order to streamline the compiled code. Thus the constant was propagated through Base.getindex to return a constant value.

Interesting. I know that we did migrate some of our code use use Val when referring to the length of a tuple, such that we got type-stable code. Anyway: good news that the standard interface would not need to be changed, at least when the user just uses a constant such as 1.