Efficiently creating and updating a struct of SVectors

Tetrakai · September 22, 2024, 3:44pm

Part A

I’m reading some data in from a file then storing it in a struct looking something like this:

# "Big" struct of SVectors
@with_kw struct BigStruct{N}
    A :: SVector{N, Int16}
    B :: SVector{N, Int16}
    C :: SVector{N, Int16}
    D :: SVector{N, Int16}
    E :: SVector{N, Int16}
    F :: SVector{N, Int16}
    G :: SVector{N, Int16}
    H :: SVector{N, Int16}
    I :: SVector{N, Int16}
    J :: SVector{N, Int16}
end

To fill the struct, I tried two methods. One requires repetitive code, but is zero-allocation. The other is nice and short, but is slow and allocates:

# Create Struct (Long, Fast)
function test(dat0)
 @views dat = BigStruct(A = SVector{20}(dat0[:, 1]),
                        B = SVector{20}(dat0[:, 2]),
                        C = SVector{20}(dat0[:, 3]),
                        D = SVector{20}(dat0[:, 4]),
                        E = SVector{20}(dat0[:, 5]),
                        F = SVector{20}(dat0[:, 6]),
                        G = SVector{20}(dat0[:, 7]),
                        H = SVector{20}(dat0[:, 8]),
                        I = SVector{20}(dat0[:, 9]),
                        J = SVector{20}(dat0[:, 10]))
end

# Create Struct (Short, Slow)
function test2(dat0)
    @views vals = [fieldnames(BigStruct)[i] => SVector{20}(dat0[:, i]) for i in 1:10]
    dat  = BigStruct(; vals...)
end

julia> dat0 = fill(Int16(0), 20, 10);

julia> dat1 = @btime test($dat0);
  27.107 ns (0 allocations: 0 bytes)

julia> dat2 = @btime test2($dat0);
  15.395 μs (68 allocations: 6.31 KiB)

julia> dat1 == dat2
true

Part B
Some, but not all, of the values in those SVectors need to be updated. NB: In the real example the SVector(s) to be updated for each loop iteration depends on some conditions.

I found no performance benefit from unpacking the struct first, and there appears to be some overhead that scales with the size of the struct. Presumably because @reset replaces the entire thing:

# Update Values (All Together)
function plusone1(dat1)
    @reset dat1.A .+= Int16(1)
    return dat1
end

# Update Values (unpacked-packed)
function plusone2(dat1)
    @unpack_BigStruct dat1
    @reset A .+= Int16(1)
    @reset dat1.A = A
    return dat1
end

# Update Values (Separately)
function plusone3(A)
    @reset A .+= Int16(1)
    return A
end

Ie, it seems it would be best if I didn’t store the SVectors in the struct at all:

julia> newvals1 = @btime plusone1($dat1);
  9.026 ns (0 allocations: 0 bytes)

julia> newvals2 = @btime plusone2($dat1);
  9.027 ns (0 allocations: 0 bytes)

julia> newvals3 = @btime plusone3($dat1.A);
  3.248 ns (0 allocations: 0 bytes)

julia> newvals1.A == newvals2.A == newvals3
true

The Svectors need to be passed to multiple functions, so it would be convenient to store them in one object I can access by variable name. However, it appears that substantially slows down my loop (the real loop is > 10x faster when I split my “big” struct up into 5 smaller ones, and I can probably get more with further splitting).

But then my code looks like:

dat1       = f1(dat1, dat2, dat3)
dat1       = f2(dat1, dat3)
dat3, dat4 = f3(dat3, dat4, dat5)

Part A

Is there a way to get best of both worlds? Ie, create the struct with zero-allocations without any repetitive code?
Even the struct definition is very repetitive, is there a way to clean that up?

Part B

Can I get the best of both worlds and store these SVectors in the same object without taking a performance hit?

Full MWE

using StaticArrays, Accessors, Parameters, BenchmarkTools

## Part 1 ##
# "Big" struct of SVectors
@with_kw struct BigStruct{N}
    A :: SVector{N, Int16}
    B :: SVector{N, Int16}
    C :: SVector{N, Int16}
    D :: SVector{N, Int16}
    E :: SVector{N, Int16}
    F :: SVector{N, Int16}
    G :: SVector{N, Int16}
    H :: SVector{N, Int16}
    I :: SVector{N, Int16}
    J :: SVector{N, Int16}
end



# Create Struct (Long, Fast)
function test(dat0)
 @views dat = BigStruct(A = SVector{20}(dat0[:, 1]),
                        B = SVector{20}(dat0[:, 2]),
                        C = SVector{20}(dat0[:, 3]),
                        D = SVector{20}(dat0[:, 4]),
                        E = SVector{20}(dat0[:, 5]),
                        F = SVector{20}(dat0[:, 6]),
                        G = SVector{20}(dat0[:, 7]),
                        H = SVector{20}(dat0[:, 8]),
                        I = SVector{20}(dat0[:, 9]),
                        J = SVector{20}(dat0[:, 10]))
end

# Create Struct (Short, Slow)
function test2(dat0)
    @views vals = [fieldnames(BigStruct)[i] => SVector{20}(dat0[:, i]) for i in 1:10]
    dat  = BigStruct(; vals...)
end



dat0 = fill(Int16(0), 20, 10);
dat1 = @btime test($dat0);
dat2 = @btime test2($dat0);

dat1 == dat2




## Part 2 ##
# Update Values (All Together)
function plusone1(dat1)
    @reset dat1.A .+= Int16(1)
    return dat1
end

# Update Values (unpacked-packed)
function plusone2(dat1)
    @unpack_BigStruct dat1
    @reset A .+= Int16(1)
    @reset dat1.A = A
    return dat1
end

# Update Values (Separately)
function plusone3(A)
    @reset A .+= Int16(1)
    return A
end



newvals1 = @btime plusone1($dat1);
newvals2 = @btime plusone2($dat1);
newvals3 = @btime plusone3($dat1.A);

newvals1.A == newvals2.A == newvals3

eldee · September 22, 2024, 6:15pm

This isn’t quite as fast, but it’s relatively close:

julia> function test3(dat0)
           vals = (SVector{20}(@view dat0[:, i]) for i in 1:10)
           dat = BigStruct(vals...)
       end;

julia> dat1 = @btime test($dat0);
  23.671 ns (0 allocations: 0 bytes)

julia> dat2 = @btime test2($dat0);
  11.200 μs (68 allocations: 6.31 KiB)

julia> dat3 = @btime test3($dat0);
  56.751 ns (0 allocations: 0 bytes)

julia> dat1 == dat2 == dat3
true

You could use metaprogramming (also in test), though you probably shouldn’t. At least in the MWE you could replace BigStruct by something like an SVector{10, SVector{20, Int16}} (though with the same note on mutability as below).

julia> function test4(dat0)
           return @SVector [SVector{20}(@view dat0[:, i]) for i = 1:10]
       end;

julia> sdat = @btime test4($dat0);
  56.738 ns (0 allocations: 0 bytes)

julia> typeof(sdat)
SVector{10, SVector{20, Int16}} (alias for SArray{Tuple{10}, SArray{Tuple{20}, Int16, 1, 20}, 1, 10})

Note that you’re want to alter an immutable struct BigStruct (containing immutable SVectors). While this might be (sort of) possible via @reset, I’m not too fond of this approach. Instead you could just make BigStruct mutable.

julia> @with_kw mutable struct MBigStruct{N}
           A :: SVector{N, Int16}
           B :: SVector{N, Int16}
           C :: SVector{N, Int16}
           D :: SVector{N, Int16}
           E :: SVector{N, Int16}
           F :: SVector{N, Int16}
           G :: SVector{N, Int16}
           H :: SVector{N, Int16}
           I :: SVector{N, Int16}
           J :: SVector{N, Int16}
       end;

julia> function mtest(dat0)  # Like test3
           vals = (SVector{20}(@view dat0[:, i]) for i in 1:10)
           dat  = MBigStruct(vals...)
       end;

julia> mdat = @btime mtest($dat0);  # This will allocate, but do you really care?
  60.408 ns (1 allocation: 448 bytes)

julia> function mplusone!(mdat)  # Will change mdat inplace (and return it for good measure)
           mdat.A = mdat.A .+ Int16(1)
           return mdat
        end;

julia> @btime mplusone!($mdat);
  2.200 ns (0 allocations: 0 bytes)

julia> @btime plusone1($dat1);
  5.706 ns (0 allocations: 0 bytes)

julia> @btime plusone2($dat1);
  5.706 ns (0 allocations: 0 bytes)

julia> @btime plusone3($dat1.A);
  2.000 ns (0 allocations: 0 bytes)

julia> mdat = mtest(dat0); mplusone!(mdat); mdat.A == newvals1.A == newvals2.A == newvals3
true

Tetrakai · September 22, 2024, 6:47pm

Thanks. One allocation doesn’t seem that bad at first, but I didn’t mention this is all also being run in parallel. That one allocation will become at least 20 and maybe 600+.

I’ve noticed garbage collection is a real performance killer in parallel. It really seems zero allocations is preferable to even somewhat slower serial benchmarks.

But Ill definately try it and benchmark.

Edit:
Also, the SVector of SVectors is more concise but loses the variable names, which would be its own hit on readability. So that might not be worth it.

mikmoore · September 23, 2024, 3:32pm

Obviously you have a MWE here which may not catch every salient aspect of your problem. I’ll comment on it, although my comments may miss the mark in your actual situation.

Every field of this struct appears to be a matching (same-length) AbstractVector. This is a “struct of arrays” pattern. You might consider instead defining a struct with scalar fields (or simply a Tuple or NamedTuple), then making an “array of structs” by storing those scalar structs in an AbstractVector.

Then you would write your functions in scalar fashion as

function plusone(dat::ScalarStruct)
  @reset dat.A += Int16(1) # I don't know where @reset is from so this line may be slightly wrong
  return dat
end

alldata = SVector{20, ScalarStruct}(#= fill data here =#)
alldata = plusone.(alldata) # use broadcasting to update all ScalarStructs

Personally, I find these scalar functions to be a lot simpler to write and reason about, and somewhat more flexible. StructArrays.jl can be a useful package that lets your write code as if you have an array-of-structs but actually store it as a struct-of-arrays (which, depending on how you use the data, may be more or less efficient).

For 20 items, when the items are this big, you may or may not see a big benefit to using SVector over Vector. Experiment with both or use what is most ergonomic.

lmiq · September 23, 2024, 3:57pm

Since that whole struct mutable, it might be worth using just a Dict and use keys instead of field names for convenience.

Albert_de_montserrat · September 23, 2024, 4:05pm

Regarding part A, you could do this:


function test3(dat0)
    vals = ntuple(Val(10)) do i
        SVector{20}(@views dat0[:, i])
    end
    dat  = BigStruct(vals...)
end

julia> @b test($dat0)
16.099 ns

julia> @b test3($dat0)
16.010 ns

eldee · September 23, 2024, 4:44pm

Interestingly, on my machine this performs the same as my generator-based test3 above.

julia> @btime test($dat0);
  19.739 ns (0 allocations: 0 bytes)

julia> @btime test3_gen($dat0);
  48.133 ns (0 allocations: 0 bytes)

julia> @btime test3_ntuple($dat0);
  48.129 ns (0 allocations: 0 bytes)

(with similar results using Chairmarks.jl).

versioninfo

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core™ i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_NUM_THREADS = auto

eldee · September 23, 2024, 4:53pm

A Dict does have some overhead for indexing, so it would be somewhat slower. I do agree that it would be easier to code with than (M)BigStruct, though.

lmiq · September 23, 2024, 5:46pm

I have been surprised by this before, and it is worth testing depending on the use case:

julia> using StaticArrays

julia> struct Test{T}
           a::T
           b::T
           c::T
           d::T
           e::T
           f::T
           g::T
           h::T
           i::T
           j::T
       end

julia> t = Test((SVector{3,Float64}(1,1,1) for i in 1:10)...)
Test{SVector{3, Float64}}([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])

julia> @btime getfield($t, $(:f))
  8.785 ns (0 allocations: 0 bytes)
3-element SVector{3, Float64} with indices SOneTo(3):
 1.0
 1.0
 1.0

julia> d = Dict{Symbol,SVector{3,Float64}}(
           (s => SVector{3,Float64}(1,1,1) for s in (:a,:b,:c,:d,:e,:f,:g,:h,:i,:j))...
       )
Dict{Symbol, SVector{3, Float64}} with 10 entries:
  :a => [1.0, 1.0, 1.0]
  :b => [1.0, 1.0, 1.0]
  :f => [1.0, 1.0, 1.0]
  :d => [1.0, 1.0, 1.0]
  :j => [1.0, 1.0, 1.0]
  :e => [1.0, 1.0, 1.0]
  :c => [1.0, 1.0, 1.0]
  :h => [1.0, 1.0, 1.0]
  :g => [1.0, 1.0, 1.0]
  :i => [1.0, 1.0, 1.0]

julia> @btime $d[$(:f)]
  7.532 ns (0 allocations: 0 bytes)
3-element SVector{3, Float64} with indices SOneTo(3):
 1.0
 1.0
 1.0

Tetrakai · September 23, 2024, 7:52pm

Here is some simplified pseudo-code:

Pseudo-Julia

for w in 1:nweeks
    games = schedule[w]
    @threads for g in eachindex(games)
        teamnames = games[g]
        teamdat   = readfromdisk(teamnames)
        teams = (preproster(teamdat[1]), preproster(teamdat[2]))
        for m in 1:nmin
            teams = updateplayers(teams)
            for i in 1:2
                i2 = ifelse(i == 1, 2, 1)
                if randtryshot(teams)
                    if randtackle(teams)
                        idx = pick_tackler(teams[i2])
                        teams[i2].tackles[idx] += 1
                    else
                        idx = pick_shooter(teams[i])
                        teams[i].shots[idx] += 1
                        if randgoal(teams)
                            teams[i].goals[idx] += 1
                        else
                            teams[i2].saves[idx_gk[i2]] += 1
                        end
                    end
                end
            end # End team loop
        end # End game loop
        writetodisk(teams)
    end # End week loop
end # End season loop

Right now, I’m using structs of SVectors and Accessors.@reset to update the values (my understanding is that it mimics mutation by rebinding the entire immutable parent object “on the stack”). The inner two loops are very fast with zero heap allocations, but this has come at the price of code readability. I’m now working outwards to remove as many of those heap allocations as possible.

Essentially I would like to store lots of SVectors in some object for code-readability purposes but have the compiler act as if I am passing them as separate arguments to each function. Is there a technical reason this can’t exist?

Using @unpack kind of achieves this (at the cost of losing some indexing), and using a mutable struct kind of achieves this (at the cost of an extra allocation).

Tetrakai · September 23, 2024, 7:58pm

I see the same as you on my laptop (Intel i5-6300u running julia 1.10.4 with default optimization), but on pc (AMD 2990wx running julia 1.11.0-rc1 with -O3 flag) all three of those are ~ equally fast.

I saw this same thing earlier here:

Topic		Replies	Views
SVectors + @reset from Accessors: Strange Benchmarks General Usage benchmarktools , staticarrays , setfield , accessors	31	289	September 13, 2024
Usage of arrays of static arrays New to Julia performance , staticarrays	16	1104	February 22, 2023
Struct variables explicit setting New to Julia	30	1338	March 19, 2021
How To Avoiding Allocations in Static Structs Performance array , memory-allocation , struct	4	417	December 5, 2022
Julia style? Arrays/matrix vs structs General Usage	19	3882	December 18, 2019

Efficiently creating and updating a struct of SVectors

Related topics