When should a package move to the standard library or system image? And what about StaticArrays?

As an example:

julia> using StaticArrays, BenchmarkTools

julia> @inline function AmulB!(C,A,B)
           @fastmath @inbounds for n ∈ axes(C,2), m ∈ axes(C,1)
               Cmn = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cmn += A[m,k]*B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> @inline function AmulB(A::StaticMatrix{M,K,T1}, B::StaticMatrix{K,N,T2}) where {M,K,N,T1,T2}
           C = MMatrix{M,N,promote_type(T1,T2)}(undef)
           AmulB!(C, A, B)
           return SMatrix(C)
       end
AmulB (generic function with 1 method)

julia> A = @SMatrix rand(7,7); B = @SMatrix rand(7,7);

julia> @btime $(Ref(A))[] * $(Ref(B))[];
  58.635 ns (0 allocations: 0 bytes)

julia> @btime AmulB($(Ref(A))[], $(Ref(B))[]);
  34.616 ns (0 allocations: 0 bytes)

julia> AmulB(A, B) ≈ A * B
true

julia> @inline function AmulBalloc(A::StaticMatrix{M,K,T1}, B::StaticMatrix{K,N,T2}) where {M,K,N,T1,T2}
           C = MMatrix{M,N,promote_type(T1,T2)}(undef)
           AmulB!(C, A, B)
           return C
       end
AmulBalloc (generic function with 1 method)

julia> @btime AmulBalloc($(Ref(A))[], $(Ref(B))[]);
  29.751 ns (1 allocation: 400 bytes)

As you can see here, heap-allocating the MArray (30 ns) is actually faster than stack-allocating it and then copying it into a stack-allocated SMatrix (35 ns), when what the compiler should be doing is just writing the result directly into the eventual stack-allocated matrix.
Both are faster than the unrolled version (59 ns).

The compiler does a really bad job with more complicated types:

julia> using ForwardDiff

julia> Ad = ForwardDiff.Dual.(A, A, A, A);

julia> @btime AmulB($(Ref(Ad))[], $(Ref(B))[]);
  2.366 μs (0 allocations: 0 bytes)

julia> @btime $(Ref(Ad))[] * $(Ref(B))[];
  105.213 ns (0 allocations: 0 bytes)

(it’s using gathers and scatters in the loopy version), but that’s a fixable compiler problem.
In principle, custom types are where the loop-based approach should really be much better, because the degree of unrolling in generated code isn’t going to scale appropriately with the size of the element types.

By unrolling the duals, the loop vectorizer (which SIMDs across loop iterations) is basically forced into the gather/scatter here.
I suspect that if we also re-rolled Dual numbers, so that they are just loops there too, we might see better runtime performance. Matmul with just a single dual matrix can be reinterpreted into regular Float64 matmul, which would vectorize fine without gather/scatter.
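
To make that last point concrete, here is a rough sketch (the helper name is mine, not an existing ForwardDiff API, and it takes plain Matrix inputs rather than the static matrices benchmarked above). Because a Dual{Tag,Float64,N} is just N+1 contiguous Float64s and multiplying it by a Float64 scales all components uniformly, the dual matrix can be reinterpreted so the kernel is one ordinary Float64 matmul:

using ForwardDiff

# Ad is M×K with eltype Dual{Tag,Float64,N}; B is a plain K×P Float64 matrix.
function dual_times_plain(Ad::Matrix{ForwardDiff.Dual{Tag,Float64,N}},
                          B::Matrix{Float64}) where {Tag,N}
    # The M×K dual matrix reinterprets to a ((N+1)*M) × K Float64 matrix.
    Af = Matrix(reinterpret(Float64, Ad))
    Cf = Af * B                       # ordinary Float64 matmul, no gather/scatter
    # Reinterpret the ((N+1)*M) × P result back to an M×P matrix of Duals.
    return reinterpret(ForwardDiff.Dual{Tag,Float64,N}, Cf)
end

# dual_times_plain(Ad, B) ≈ Ad * B for Matrix-typed Ad and B.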

The extreme unrolling of both large (nested) duals and static arrays can lead to 70% of compile time being spent in LLVM, as happens in certain Pumas model fits.
I’m sure it slows down the Julia side of compilation a lot, too.

15 Likes

Reading this discussion left me a bit confused. I believe that is because it really concerns two different questions.

  1. Should packages, such as StaticArrays.jl, that have very many dependents, be included in the standard library to reduce compilation times?
  2. Should the Julia standard library include functionality to create and use statically-sized arrays?

The consensus on the answer to the first, I believe, is a clear “No”. Other ways to reduce compilation times seem generally preferred and could be developed (although I am really no expert).

Regarding the second question, I personally believe the answer is yes. To me, the ability to use arrays with a static size is a very important part of a scientific programming language, not just for performance reasons, but also because a StaticArray is easier to reason about in many cases. I believe they are in some sense more vital to scientific computing than many of the array types included in LinearAlgebra, such as a UnitUpperTriangular array.

Note that I have intentionally avoided the question of whether such a StaticArray should be stack-allocated. To me, that is again completely separate and should by default be decided by the compiler instead of me. I would like to be able to create a 10^8-element StaticArray and do computations with it like with a normal Array. Also, if I decide to make an 8-element StaticArray, I think the compiler should definitely stack-allocate it and unroll/SIMD-vectorize my loops. (I am greedy.) I believe these things are possible and can go hand in hand with the array optimizations that are already planned for base Julia (automatic stack allocation of small Arrays that the compiler can prove are not mutated; am I correct that this is the plan?). In any case, I agree with @Elrod that the current version of StaticArrays.jl should not be included in the standard library without a rewrite, for performance reasons.

Whether we need both an immutable and a mutable statically-sized array in the standard library, I am unsure. Maybe others have clear ideas about this.

Please tell me where I am wrong, because I am a real layman here.

EDIT: by the way, I don’t find the same behaviour:

julia> using StaticArrays, BenchmarkTools

julia> @inline function AmulB!(C,A,B)
                  @fastmath @inbounds for n ∈ axes(C,2), m ∈ axes(C,1)
                      Cmn = zero(eltype(C))
                      for k ∈ axes(B,1)
                          Cmn += A[m,k]*B[k,n]
                      end
                      C[m,n] = Cmn
                  end
              end
AmulB! (generic function with 1 method)

julia> @inline function AmulB(A::StaticMatrix{M,K,T1}, B::StaticMatrix{K,N,T2}) where {M,K,N,T1,T2}
                  C = MMatrix{M,N,promote_type(T1,T2)}(undef)
                  AmulB!(C, A, B)
                  return SMatrix(C)
              end
AmulB (generic function with 1 method)

julia> A = @SMatrix rand(7,7); B = @SMatrix rand(7,7);

julia> @btime $(Ref(A))[] * $(Ref(B))[];
  55.927 ns (0 allocations: 0 bytes)

julia> @btime AmulB($(Ref(A))[], $(Ref(B))[]);
  289.416 ns (0 allocations: 0 bytes)

julia> AmulB(A, B) ≈ A * B
true

julia> @inline function AmulBalloc(A::StaticMatrix{M,K,T1}, B::StaticMatrix{K,N,T2}) where {M,K,N,T1,T2}
                  C = MMatrix{M,N,promote_type(T1,T2)}(undef)
                  AmulB!(C, A, B)
                  return C
              end
AmulBalloc (generic function with 1 method)

julia> @btime AmulBalloc($(Ref(A))[], $(Ref(B))[]);
  290.706 ns (1 allocation: 400 bytes)

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 7 3700X 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

Removing some of the @inline and @inbounds narrows the gap, but the first still clearly wins. Maybe updating my Julia/StaticArrays versions will fix this. EDIT2: It did not.

5 Likes

Well, that’s because you missed the central question: whether those should be two separate questions at all. If the Julia standard library should include static array support, and the entire Julia package ecosystem has effectively adopted StaticArrays.jl as a “standard dependency” with a high using time, should it be transformed into an actual standard library? And if it shouldn’t be, this is still a real problem, so what should be done?

If the magical better static arrays described above actually get built, I can guarantee you they will be in a package, and we will soon have StaticArrays2.jl (or whatever the name is) with thousands of dependents. Maybe it won’t have the startup-time problems, but maybe those will start to creep up with libraries adding conditional (Requires.jl-style) dependencies around it, etc., and we’ll be right back to this same question: is this a standard library now?

But even more interestingly, a better static array library would have less of an incentive to be put into the standard library, because doing so would not fix any startup times, and thus it might as well stay a package altogether. Does this make sense as a principle?

3 Likes

My computer has AVX512, and LLVM ends up doing a good job optimizing the loop:

.LBB0_1:                                # %L3
                                        # =>This Inner Loop Header: Depth=1
        vbroadcastsd    zmm3, qword ptr [rdx + rcx]
        vmulpd  zmm3, zmm3, zmmword ptr [rsi]
        vbroadcastsd    zmm4, qword ptr [rdx + rcx + 8]
        vfmadd132pd     zmm4, zmm3, zmmword ptr [rsi + 56] # zmm4 = (zmm4 * mem) + zmm3
        vbroadcastsd    zmm3, qword ptr [rdx + rcx + 16]
        vfmadd132pd     zmm3, zmm4, zmmword ptr [rsi + 112] # zmm3 = (zmm3 * mem) + zmm4
        vfmadd231pd     zmm3, zmm0, qword ptr [rdx + rcx + 24]{1to8} # zmm3 = (zmm0 * mem) + zmm3
        vfmadd231pd     zmm3, zmm1, qword ptr [rdx + rcx + 32]{1to8} # zmm3 = (zmm1 * mem) + zmm3
        vfmadd231pd     zmm3, zmm2, qword ptr [rdx + rcx + 40]{1to8} # zmm3 = (zmm2 * mem) + zmm3
        kmovd   k1, edi
        vmovupd zmm4 {k1} {z}, zmmword ptr [rsi + 336]
        vfmadd132pd     zmm4, zmm3, qword ptr [rdx + rcx + 48]{1to8} # zmm4 = (zmm4 * mem) + zmm3
        vmovupd zmmword ptr [rsp + rcx - 128] {k1}, zmm4
        add     rcx, 56
        cmp     rcx, 392
        jne     .LBB0_1

The above loop executes 7 times.
This is not the ideal pattern, but it’s pretty good. Hopefully out-of-order execution allows the loop iterations to be performed in parallel.

I’m guessing that LLVM doesn’t do a good job here without AVX512.

1 Like

Well, I am not sure the primary purpose of the standard library should be to reduce using times. As you stated above:

I agree with you that I’m probably being naive and non-pragmatic, but I think what you are essentially suggesting is to include it in the stdlib not because it belongs there, but as a (however valuable) fix to reduce using times. This will probably greatly impede any attempts either to improve some of the minor flaws of the package mentioned above by others, or to implement some of my “magical” ideas :wink:.

Improve the package? Remove @generated functions where the generic ones work well. Identify what is causing the excessive load time, reduce the problem, and open targeted issues. Have a bot check the load time on each PR and don’t merge PRs that increase it too much. Etc. I can say with some confidence that StaticArrays will not be made into a stdlib, so it is better to focus on other possible improvements.
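
For reference, the number such a bot would track is easy to measure by hand; a minimal sketch, assuming a fresh Julia session so the package is not already loaded:

# In a fresh session (recent Julia versions also break out compilation time):
@time using StaticArrays

# Or from the shell, so every measurement gets its own process:
# julia -e '@time using StaticArrays'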

15 Likes

Do you think this is true in general or just regarding the current implementation?

True for moving such a big package (which duplicates so much logic in Base and LinearAlgebra) wholesale into the default Julia distribution. As I said earlier, having a small static array type that just defines the type, the array interface, broadcasting, and some constructors, and then works with the already-defined generic functions, is probably a good idea in the long run.
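
To make that concrete, here is a minimal sketch of such a type (all names hypothetical); only the type, the AbstractArray interface, and a constructor are defined, and everything else falls back to the generic machinery in Base and LinearAlgebra:

using LinearAlgebra

# A statically sized, immutable matrix backed by a tuple (column-major, like Array).
struct TinySMatrix{M,K,T,L} <: AbstractMatrix{T}
    data::NTuple{L,T}
end

TinySMatrix{M,K}(data::NTuple{L,T}) where {M,K,T,L} = TinySMatrix{M,K,T,L}(data)

Base.size(::TinySMatrix{M,K}) where {M,K} = (M, K)
Base.getindex(A::TinySMatrix, i::Int) = A.data[i]
Base.IndexStyle(::Type{<:TinySMatrix}) = IndexLinear()

A = TinySMatrix{2,2}((1.0, 2.0, 3.0, 4.0))
A * A     # LinearAlgebra's generic matmul; the result is a plain Matrix
A .+ 1    # default AbstractArray broadcasting; also a plain Matrix
sum(A)    # generic reduction

The obvious trade-off is that without specialized methods the results come back as ordinary Arrays and nothing is unrolled, which is exactly where the compiler-heuristics work discussed elsewhere in this thread would have to pick up the slack.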

17 Likes

And as a general policy, we want anything that can be implemented as a package to be a package. Base and the stdlibs should only ship what Julia needs to implement itself.

14 Likes

This is correct: https://github.com/JuliaArrays/StaticArrays.jl/pull/825. Somehow, most of the loading time is spent on adding all of these multiplication methods with complicated signatures.

4 Likes

I think all the StaticArrays developers would agree that we’d prefer a different implementation approach :slight_smile: I’m kind of surprised that unrolling all the things works as well as it does. It worked all the way back in Julia 0.5 and it’s had a lot of staying power as the compiler has changed.

We should definitely remove the unrolling once the compiler has good heuristics for small loops. Note, however, that this introduces its own problems for a generic (not just numeric) static array library. Implementing a type-stable map(f, v::StaticVector) which interacts well with inference is very simple with unrolling in the frontend. But with a loop and mutation it’s not so simple to deal with the element type of the output array. (I guess the widening techniques used in Base’s map could be used to deal with this; it’s just an unpleasant and pervasive complication for any operation which fills a mutable output array.)
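
A small sketch of the contrast (helper names are mine): the tuple-based version gets its output element type from inference for free, while the loop-and-mutate version has to commit to an element type before the loop runs.

using StaticArrays

# Frontend "unrolling": build the result as a tuple; the output eltype simply
# falls out of the tuple, with no widening machinery needed.
unrolled_map(f, v::SVector{N}) where {N} = SVector(ntuple(i -> f(v[i]), Val(N)))

# Loop + mutation: the output eltype must be chosen before the loop, here
# naively from the first element, which is where the widening complications
# (as in Base's map) come in.
function looped_map(f, v::SVector{N}) where {N}
    y1 = f(v[1])
    out = MVector{N,typeof(y1)}(undef)
    out[1] = y1
    for i in 2:N
        out[i] = f(v[i])   # errors (or converts) if f returns a different type
    end
    return SVector(out)
end

unrolled_map(sqrt, SVector(1.0, 4.0, 9.0))  # SVector(1.0, 2.0, 3.0)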


As to the original question here, I don’t like the idea of making StaticArrays a stdlib for several reasons. Primarily, the API is not what we want for the long term, but putting APIs in the stdlib is a statement about compatibility and support which goes far beyond fixing load time issues.

If it’s just about long load times, we should put some concerted effort into understanding the causes of the load time and do what we can to reduce those in the short term.

Here’s the list of problems with the API, as I see them:

  1. The inability to write the same array algorithms for mutating and non-mutating cases is pretty bad. (Could fixing this be done as a non-breaking extension? There’s some progress on very related issues in https://github.com/JuliaLang/julia/pull/42465)
  2. The extension API with Tuple and similar_type for custom static arrays, and using similar_type to preserve output types, is not really what you’d like. In particular, similar_type forces too much computation into the type domain, which can be very awkward. An improvement might need some kind of progress on freeze/thaw-like APIs. See the previous point.
  3. Concrete type names like SArray{Tuple{2, 3, 4}, Float64, 3, 24} are awful. I guess we could improve this situation with macros (see the sketch after this list) and deprecate the full type name as the official way to write the concrete type. The only other way forward seems to be language changes, for example Proposal: Defer calculation of field types until type parameters are known · Issue #18466 · JuliaLang/julia · GitHub.
  4. We should have a uniform and consistent way to express mixed static and dynamically sized dimensions (using a StaticNumbers-like representation for the static dimensions), in principle storing the array size as a field (which would be zero-size for a fully static array where all dimensions are static numbers). If we had a Julia-native buffer type like in Introduce Buffer type and make Array an abstraction on top of it · Issue #12447 · JuliaLang/julia · GitHub, this would help a lot.
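
As a rough illustration of the macro idea in point 3 (the macro name and design are hypothetical), something that expands a readable spelling into the full concrete type:

using StaticArrays

# Expands `@SArrayType(Float64, 2, 3, 4)` into SArray{Tuple{2,3,4}, Float64, 3, 24},
# hiding the redundant N and L parameters. Handles literal dimensions only.
macro SArrayType(T, dims...)
    N = length(dims)
    L = prod(dims)
    return :(SArray{Tuple{$(dims...)}, $(esc(T)), $N, $L})
end

@SArrayType(Float64, 2, 3, 4) === SArray{Tuple{2,3,4}, Float64, 3, 24}  # true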

Having written all that, I think we could unify static and dynamically sized mutable arrays in the future under a single array type. We’d probably need both of the issues linked above (#18466 and #12447).

The idea is like this:

mutable struct StaticOrDynamicArray{T, Dimensions}
    size::Dimensions   # sizeof this field == 0 in fully static case
    data::data_of(Dimensions)  # Need #18466 for dynamism, #12447 for efficiency
end

To demonstrate how we’d name existing types in this scheme:

MVector{N, T}    = StaticOrDynamicArray{T, Tuple{StaticInt{N}}}
MMatrix{N, M, T} = StaticOrDynamicArray{T, Tuple{StaticInt{N}, StaticInt{M}}}

# HybridArray{Tuple{2,2,StaticArrays.Dynamic()}}
Hybrid2x2xN{T} = StaticOrDynamicArray{T, Tuple{StaticInt{2}, StaticInt{2}, Int}}

It’s also possible to express the following if we wanted. IIUC this has the same data layout as Array, though with a caveat for small dynamically sized arrays which would need to be solved by #12447 in any case:

Array{T,N} = StaticOrDynamicArray{T, NTuple{N,Int}}
Vector{T} = StaticOrDynamicArray{T, Tuple{Int}}

29 Likes

Is your idea something like this?

struct StaticOrDynamicArray{T, ND} <: AbstractArray{T,ND}
    size::ND
    data::data_of(ND)
end

Yes, that’s more or less what I wrote? Though I’ve since cleaned it up a bit (post edited), because I was missing a mutable, and I added some definitions showing how the current MVector and MMatrix can be defined in terms of StaticOrDynamicArray.

By the way, I’m very aware that the syntax for the complete concrete type of StaticOrDynamicArray, with all the wrapping in StaticInt, etc., is awful. We’d need a way to write the type names for higher-dimensional hybrid and static arrays conveniently; perhaps just a macro, in the same way that @NamedTuple needs a macro to make writing the concrete type of named tuples bearable.

1 Like

Is it worthwhile to have

struct StaticArray{T,N} <: AbstractArray{T,N}
   # ...
end
struct DynamicArray{T,N} <: AbstractArray{T,N}
   # ...
end

StaticOrDynamicArray{T,N} =
    Union{StaticArray{T,N},
          DynamicArray{T,N}}

(presuming T, N are available to be constant propagated)

This doesn’t allow for hybrids of static and dynamic dimensions (as in HybridArrays.jl).

1 Like

Another reason to include something in the core language instead of a package could be to provide a stronger guarantee that a piece of code will still run 30 years from now. Isn’t that one of the advantages of Fortran?

Compiled languages like C++ and Rust have static arrays as a core data structure. Doesn’t this create a feeling of a missing feature in Julia for people and teams that are used to those languages?

5 Likes

The manifest already contains the exact versions of all the dependencies. Since you mentioned Rust, this is the same as the Cargo.lock file.

They basically have NTuple. They don’t have any math defined on them, which is the point of StaticArrays. Julia already has much richer array math built in than Rust, Python, and C++.
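
A quick illustration of the difference:

using StaticArrays

t = (1.0, 2.0, 3.0)          # an NTuple{3,Float64}: fixed size, but no linear algebra
# t + t                      # MethodError: + is not defined for tuples

v = SVector(1.0, 2.0, 3.0)   # same storage, plus the array math
v + v                        # SVector(2.0, 4.0, 6.0)
v' * v                       # 14.0 (dot product via the LinearAlgebra interface)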

17 Likes

There’s been some discussion of trying to get Static.StaticInt into Base. This would certainly help with the composability of arrays that mix static and dynamic dimensions.
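
For context, a tiny sketch assuming Static.jl’s StaticInt(N) constructor: a statically known dimension lives purely in the type domain, while a dynamic one stays an ordinary Int, which is what the StaticOrDynamicArray idea above needs for its size field.

using Static  # provides StaticInt

sz = (StaticInt(2), StaticInt(2), 4)  # 2×2×N: first two dims static, last dynamic
sizeof(StaticInt(2))                  # 0 bytes; the value is carried in the type
Int(StaticInt(2))                     # 2; convert back to a run-time integer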

3 Likes