Small Vectors

Rust has a smallvec crate, which behaves like a vector except that it stores the first n elements inline in the struct. Thus, there’s no allocation as long as you stay below n elements. Is there anything comparable to this for Julia? I’m considering implementing a SmallVectors.jl package, but I wanted to check beforehand that there isn’t already something like this out there.

2 Likes

StaticArrays.jl

6 Likes

That would be great! I don’t believe StaticArrays.jl has this:

These store a certain number of elements inline, and fall back to the heap for larger allocations.

I.e. it’s meant for small, statically sized arrays (unless I missed something, or some types were added later that I overlooked; if so, or if there’s another package, I would really like to know). C++ has something similar to Rust’s, in LLVM, named SmallVector, I believe. It’s mutable, and if you enlarge it past some size limit, all of it goes to the heap; I don’t believe a small prefix stays on the stack or is cached there. For simplicity I would do either/or, though I’m thinking of doing a string type where a prefix is cached… so it might be a nice option. One other thing: arrays are mutable in Julia. Is there a way to have immutable ones? I would like such an array to be mutable, e.g. when on the stack, or when on the heap while you’re the only user, but allowing for marking it immutable/frozen.

But if there is a fallback, an array of such small vectors cannot be stored inline; it has to be an array of pointers. What’s the advantage of that, then, relative to a static array?

Another related point is that standard Julia vectors already have a capacity buffer, so push operations don’t allocate unless the number of elements exceeds the capacity; only then is the array reallocated, with the elements always stored inline in the buffer.
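For instance (a small stdlib-only illustration), `sizehint!` makes that spare capacity explicit: pushes within the reserved capacity never move or reallocate the buffer.

```julia
# A vector given capacity up front via sizehint! never reallocates its buffer
# while push!-ing within that capacity: the data pointer stays the same.
v = Vector{Int}()
sizehint!(v, 1_000)          # reserve capacity for at least 1000 elements
p0 = pointer(v)              # address of the (still empty) buffer
for i in 1:1_000
    push!(v, i)              # fills existing capacity; no reallocation
end
@assert pointer(v) == p0     # same buffer: no push! triggered a reallocation
```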

If you statically know the size, that’s great and you should obviously use static sizes. But sometimes you can’t statically know the size, but you know it’s usually small. That’s when something like a SmallVector is useful.

I was actually playing with something like this a little while ago after @Elrod talked about how much he likes LLVM’s SmallVector. Here’s a little demo:

using StrideArrays
mutable struct SmallStore{SpillSize}
    inline::NTuple{SpillSize, UInt8}        # fixed-size inline buffer
    outline::Union{Vector{UInt8}, Nothing}  # heap fallback when the data spills
end;

function smallarray(::Type{T}, ::StaticInt{SpillSize}, size::Integer...) where {T, SpillSize}
    @assert isbitstype(T)
    N = prod(size) * sizeof(T)
    inline = Ref{NTuple{SpillSize, UInt8}}()[]  # uninitialized inline buffer
    if N <= SpillSize
        # data fits inline: point into the SmallStore object itself
        outline = nothing
        store = SmallStore{SpillSize}(inline, outline)
        ptr = pointer_from_objref(store)
    else
        # data spills: allocate a heap buffer and point into that instead
        outline = Vector{UInt8}(undef, N)
        store = SmallStore{SpillSize}(inline, outline)
        ptr = pointer(outline)
    end
    ptrA = PtrArray(Ptr{T}(ptr), size)
    # keep `store` alive alongside the pointer-backed array
    strA = StrideArray(ptrA, store)
end;

Now here’s a benchmark function comparing this smallarray to an array without a static lower size bound:

function foo_sa()
    SpillSize = static(10*8)
    if rand() < 0.95
        # 95% of the time our vector fits inline
        L = 10
    else
        # 5% of the time it'll be too big
        L = 10000
    end
    A = smallarray(Float64, SpillSize, L)
    A .= (1:L)
    sum(A)
end;


function foo()
    if rand() < 0.95
        # 95% of the time our vector fits inline
        L = 10
    else
        # 5% of the time it'll be too big
        L = 10000
    end
    A = StrideArray{Float64}(undef, L)
    A .= (1:L)
    sum(A)
end;
julia> @benchmark foo_sa()
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  138.866 ns …   3.697 μs  ┊ GC (min … max):  0.00% … 78.22%
 Time  (median):     273.836 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   321.531 ns ± 176.710 ns  ┊ GC (mean ± σ):  10.40% ± 15.50%

        ▃▆██▇▄▂                                                  
  ▁▁▂▃▅█████████▇▅▄▃▃▂▂▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  139 ns           Histogram: frequency by time          932 ns <

 Memory estimate: 2.05 KiB, allocs estimate: 1.

julia> @benchmark foo()
BenchmarkTools.Trial: 9078 samples with 994 evaluations.
 Range (min … max):  269.175 ns …   2.844 μs  ┊ GC (min … max):  0.00% … 80.24%
 Time  (median):     475.920 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   551.678 ns ± 220.005 ns  ┊ GC (mean ± σ):  11.35% ± 15.79%

        ▁▆█▇▅                                                    
  ▁▁▁▁▃▅██████▆▅▃▂▂▂▂▂▂▂▃▃▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  269 ns           Histogram: frequency by time         1.53 μs <

 Memory estimate: 4.77 KiB, allocs estimate: 1.
7 Likes

It’s also worth mentioning that StrideArrays.jl is able to deal with statically sized, stack allocated arrays which are much, much bigger than those StaticArrays.jl is able to handle. So we could just make a statically sized stride array and then take a view of it that cuts off at the size we care about, but as you can see from this benchmark, that is really wasteful and slow if we only need that huge size 5% of the time:

julia> function foo_fullystatic()
           if rand() < 0.95
               L = 10
           else
               L = 10000
           end
           _A = StrideArray{Float64}(undef, static(10000),)
           A = view(_A, 1:L)
           A .= (1:L)
           sum(A)
       end;

julia> @benchmark foo_fullystatic()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  248.157 ns … 651.682 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     427.573 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   427.282 ns ±  54.169 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▁▁ ▄▁ █ ▃▆ ▆▆ ▇▄ ▆▂ ▄                   
  ▁▁▁▁▁▁▁▁▁▂▂▁▂▂▂▃▂▃▅▂▆▇▃██▄██▆████▆██▅██▅██▆█▆▆█▄▆▆▃▅▄▂▄▃▂▃▂▂▂ ▄
  248 ns           Histogram: frequency by time          557 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
1 Like

Note,

Stack allocated arrays are great, as are mutable arrays.

StrideArrays.jl tries its hardest to provide you with both.

You can ask for a non-static size in each dimension, but I can’t see that working for the stack. Then all dimensions likely need to be static AND small; even if non-small were possible, you wouldn’t want to blow the stack.

It seems you might also rather want to use:

where it’s actually implemented. I do see there’s also make_dynamic, but it’s not implemented in *Core. I would want all of this automated: starting on the stack, then moving to the heap automatically, and otherwise keeping the same semantics as regular Arrays, which bounds-check; these packages do not by default.

@olynch if something is missing there, I suppose you could make your package/type SmallVectors.jl by reusing much from that package.

1 Like

Actually, Julia is typically pretty smart about this stuff.

using StrideArrays
function f(N)
    v = StrideArray{Float64}(undef, N) .= 1
    sum(v)
end
julia> for i ∈ 0:7
           N = static(10^i)
           print("N = 10^$i "); @btime f($N)
       end
N = 10^0   2.236 ns (0 allocations: 0 bytes)
N = 10^1   13.217 ns (0 allocations: 0 bytes)
N = 10^2   19.827 ns (0 allocations: 0 bytes)
N = 10^3   129.276 ns (0 allocations: 0 bytes)
N = 10^4   1.760 μs (0 allocations: 0 bytes)
N = 10^5   29.928 μs (0 allocations: 0 bytes)
N = 10^6   975.464 μs (1 allocation: 7.63 MiB)
N = 10^7   14.469 ms (1 allocation: 76.29 MiB)

Here it turned the stack allocation into a heap allocation for us automatically to avoid blowing the stack. The only real advantage of SmallVector is that we don’t need to know the size statically at compile time.

4 Likes

From what I could read, smallvec is inlined and intended to stay that way in most cases, with the fallback intended for rare cases. If it’s actually common for the vector to get longer than its inlined length, then it’s strictly worse than a fully heap-allocated vector. Here’s a comment by one of the maintainers on the pros and cons.

My question is whether smallvec can be stored inline in arrays in Rust, and whether that is impossible in Julia due to different mutability semantics. My hunch is that, as a field or container element in Julia, it has to be allocated on the heap so it can be shared reliably by multiple references. As I (poorly) understand it, Rust doesn’t allow multiple references to share and mutate something.

1 Like

Nah, that’s totally fine in julia. Schematically, if your SmallVec looks like

struct SmallVec{N, T}
    inline::NTuple{N, T}
    outline::Union{Nothing, Vector{T}}
end

then

julia> sizeof(SmallVec{3, Float64})
32

julia> 3 * 8 + 8
32

i.e. the memory layout of this struct is N values of type T followed by a pointer (size 8). This can all be stored inline in arrays:

julia> sizeof([SmallVec(ntuple(i -> i, 3), nothing), SmallVec(ntuple(i -> i, 3), nothing)])
64
2 Likes

Would a struct like

struct SmallVec{N,T}
    inline::SVector{N,T}
    outline::Vector{T}
end

with appropriate getindex and setindex operations be useful in this context?

(Oh, @Mason you just answered this! Thanks)
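One hedged, stdlib-only sketch of what such getindex/setindex operations could look like (the names and layout here are illustrative, not from any package; an `NTuple` stands in for `SVector` so no dependencies are needed, and a `len` field tracks the logical length):

```julia
# Hedged sketch: an immutable small vector with a fixed inline buffer
# that spills to a heap Vector past N elements.
struct SmallVec{N,T}
    len::Int
    inline::NTuple{N,T}                 # first N slots, stored inline
    outline::Union{Nothing,Vector{T}}   # heap fallback when len > N
end

function SmallVec{N,T}(itr) where {N,T}
    v = collect(T, itr)
    L = length(v)
    if L <= N   # fits inline: pad the tuple with zeros, no heap storage
        SmallVec{N,T}(L, ntuple(i -> i <= L ? v[i] : zero(T), N), nothing)
    else        # spills: keep everything in the heap vector
        SmallVec{N,T}(L, ntuple(_ -> zero(T), N), v)
    end
end

Base.length(sv::SmallVec) = sv.len

function Base.getindex(sv::SmallVec, i::Integer)
    @boundscheck 1 <= i <= sv.len || throw(BoundsError(sv, i))
    sv.outline === nothing ? sv.inline[i] : sv.outline[i]
end

# The struct is immutable, so "setindex" returns a new value,
# like Base.setindex does for tuples:
function Base.setindex(sv::SmallVec{N,T}, x, i::Integer) where {N,T}
    @boundscheck 1 <= i <= sv.len || throw(BoundsError(sv, i))
    if sv.outline === nothing
        SmallVec{N,T}(sv.len, Base.setindex(sv.inline, convert(T, x), i), nothing)
    else
        out = copy(sv.outline); out[i] = x
        SmallVec{N,T}(sv.len, sv.inline, out)
    end
end
```

So `SmallVec{4,Int}((1, 2, 3))` stays inline while `SmallVec{4,Int}(1:10)` spills to the heap. Note the zero-padding requires `zero(T)`; a real implementation would leave the inline buffer uninitialized instead, as in the `Ref` trick above.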

But it must be immutable to be copied around and stored inline in arrays, right? I know that StaticArrays.jl mutates internal tuples, but even there they document that the mutable types are (generally) heap-allocated. Mutability would allow something like:

result = let
  sv1 = SmallVec((1,2,3), nothing)           # 1st binding
  sv_duo = [sv1, SmallVec((4,5,6), nothing)] # 2nd binding
  sv_duo[1][2] = 20               # mutate via 2nd binding
  sv1                         # exit scope via 1st binding
end
result[2] == 20 # true

It seems like the small vector must be allocated on the heap and referenced instead of stored inline in the array in that case, and if so it must be done in all cases.

2 Likes

Ah, yes: if you want SmallVec to be mutable, you’d be out of luck for inlining into arrays.

2 Likes

A PR adding that SmallArray implementation would be welcome.
Multivariate polynomial GCDs were an example where a C++ implementation was much faster than a comparable Julia implementation, so it’d be interesting to see if SIMDPolynomials could be made much faster.

Matrix exponentials are another example I’ve been interested in recently where a naive C++ implementation is much faster than Julia at small sizes, due to the large number of allocations involved.

3 Likes

By the way, I should also mention that if we’re okay with Rust-like limitations on aliasing, then it’s pretty easy to have this by just having a mutable and an immutable version of SmallVec that you convert between. If you want your values to be stored inline in an array, you convert to the immutable version and mutate the array itself. If you have them outside the array, you go mutable.
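Schematically (hypothetical names, stdlib only, and dropping the heap-spill field for brevity), that conversion could look like:

```julia
# Hedged sketch of the mutable/immutable pairing: mutate freely through
# MutSmallVec, then `freeze` to an immutable FrozenSmallVec for inline storage
# in an array; to "mutate" a stored element, thaw, edit, refreeze, reassign.
struct FrozenSmallVec{N,T}
    inline::NTuple{N,T}
end
mutable struct MutSmallVec{N,T}
    inline::NTuple{N,T}
end
freeze(v::MutSmallVec{N,T}) where {N,T} = FrozenSmallVec{N,T}(v.inline)
thaw(v::FrozenSmallVec{N,T}) where {N,T} = MutSmallVec{N,T}(v.inline)

# In-place write on the mutable version (rewrites the tuple field):
setat!(v::MutSmallVec{N,T}, x, i) where {N,T} =
    (v.inline = Base.setindex(v.inline, convert(T, x), i); v)

# The frozen form is isbits, so these elements are stored inline:
A = [FrozenSmallVec((1.0, 2.0)), FrozenSmallVec((3.0, 4.0))]
m = thaw(A[1]); setat!(m, 20.0, 2); A[1] = freeze(m)  # replace the array slot
```

Mutating through the array replaces the whole slot, so no reference outside the array can observe (or race with) the write, which is roughly the Rust-like aliasing restriction described above.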

1 Like

Yeah, I think of mutability as more about the aliasing than the mutations. I’ve noticed that many people new to this kind of mutability reach for mutables when they really want reassignments, which Accessors.jl can make mutation syntax do. I wonder, though: can reassignment be optimized so that a field is modified instead of a full new instance being constructed?

1 Like

Yes, if enough information is provided to the compiler, it is usually able to replace a reassignment of an immutable with a stack mutation:

#+begin_src julia
using Accessors, StaticArrays, BenchmarkTools

f(tup) = Base.setindex(tup, 100, 3)
f((1, 1, 1, 1))
#+end_src

#+RESULTS:
: (1, 1, 100, 1)
#+begin_src julia
code_llvm(f, Tuple{NTuple{4, Int}}, debuginfo=:none)
#+end_src

#+RESULTS:
define void @julia_f_5283([4 x i64]* noalias nocapture noundef nonnull sret([4 x i64]) align 8 dereferenceable(32) %0, [4 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %1) #0 {
top:
  %2 = getelementptr inbounds [4 x i64], [4 x i64]* %1, i64 0, i64 3
  %3 = bitcast [4 x i64]* %1 to <2 x i64>*
  %4 = load <2 x i64>, <2 x i64>* %3, align 8
  %5 = load i64, i64* %2, align 8
  %6 = bitcast [4 x i64]* %0 to <2 x i64>*
  store <2 x i64> %4, <2 x i64>* %6, align 8
  %.sroa.3.0..sroa_idx2 = getelementptr inbounds [4 x i64], [4 x i64]* %0, i64 0, i64 2
  store i64 100, i64* %.sroa.3.0..sroa_idx2, align 8
  %.sroa.4.0..sroa_idx3 = getelementptr inbounds [4 x i64], [4 x i64]* %0, i64 0, i64 3
  store i64 %5, i64* %.sroa.4.0..sroa_idx3, align 8
  ret void
}

However, if the immutable contains any mutables, then out of an abundance of caution, the compiler will instead allocate new stack space and ‘repack’ the memory, loading and storing the fields one by one. Note the `.repack` names on the stores here:

#+begin_src julia
# `100` can’t convert to a `Base.RefValue{Int}`, so add a two-argument method:
f(tup, v) = Base.setindex(tup, v, 3)
code_llvm(f, Tuple{SVector{4, Base.RefValue{Int}}, Base.RefValue{Int}}, debuginfo=:none)
#+end_src

#+RESULTS:
define void @julia_f_5838([1 x [4 x {}*]]* noalias nocapture noundef nonnull sret([1 x [4 x {}*]]) align 8 dereferenceable(32) %0, [1 x [4 x {}*]]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %1, {}* noundef nonnull align 8 dereferenceable(8) %2) #0 {
top:
  %3 = getelementptr inbounds [1 x [4 x {}*]], [1 x [4 x {}*]]* %1, i64 0, i64 0, i64 0
  %4 = load atomic {}*, {}** %3 unordered, align 8
  %5 = getelementptr inbounds [1 x [4 x {}*]], [1 x [4 x {}*]]* %1, i64 0, i64 0, i64 1
  %6 = load atomic {}*, {}** %5 unordered, align 8
  %7 = getelementptr inbounds [1 x [4 x {}*]], [1 x [4 x {}*]]* %1, i64 0, i64 0, i64 3
  %8 = load atomic {}*, {}** %7 unordered, align 8
  %.repack = getelementptr inbounds [1 x [4 x {}*]], [1 x [4 x {}*]]* %0, i64 0, i64 0, i64 0
  store {}* %4, {}** %.repack, align 8
  %.repack4 = getelementptr inbounds [1 x [4 x {}*]], [1 x [4 x {}*]]* %0, i64 0, i64 0, i64 1
  store {}* %6, {}** %.repack4, align 8
  %.repack6 = getelementptr inbounds [1 x [4 x {}*]], [1 x [4 x {}*]]* %0, i64 0, i64 0, i64 2
  store {}* %2, {}** %.repack6, align 8
  %.repack8 = getelementptr inbounds [1 x [4 x {}*]], [1 x [4 x {}*]]* %0, i64 0, i64 0, i64 3
  store {}* %8, {}** %.repack8, align 8
  ret void
}
2 Likes

Yeah, this is basically exactly the struct that I plan to implement now! Thanks @Mason.

Also, for context, I’m planning to use this to implement a generic AST. So most of the time, nodes will have <=3 children, and thus saving an extra allocation is a win, and it doesn’t need to be mutable.

Sorry for the delayed response; I need to start visiting the JuliaLang Discourse more.

1 Like

If you use column-based storage for your structs via StructArrays.jl, then you could try using ArraysOfArrays.VectorOfArrays for the “array field(s)”; it stores many small arrays in a single contiguous vector underneath. But the element arrays can’t be changed in size, so I’m not sure if it would work for the use case here.
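The underlying idea (sketched here in plain Julia with hypothetical names, not the actual ArraysOfArrays API) is one flat data vector plus an offsets table:

```julia
# Plain-Julia sketch of contiguous storage for many small arrays: all the
# elements live in one shared data vector, and an offsets table marks where
# each sub-array starts.
struct FlatVectors{T}
    data::Vector{T}
    offsets::Vector{Int}   # sub-array i occupies data[offsets[i]:offsets[i+1]-1]
end
FlatVectors{T}() where {T} = FlatVectors{T}(T[], [1])

function Base.push!(fv::FlatVectors, v::AbstractVector)
    append!(fv.data, v)                   # elements go into the shared buffer
    push!(fv.offsets, length(fv.data) + 1)
    fv
end

Base.length(fv::FlatVectors) = length(fv.offsets) - 1
Base.getindex(fv::FlatVectors, i::Int) =
    view(fv.data, fv.offsets[i]:fv.offsets[i+1]-1)
```

Like VectorOfArrays, this amortizes many small arrays into one growing allocation, at the cost that an element array’s size is fixed once pushed (resizing one would shift every later offset), which is the limitation mentioned above.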