How to optimize struct for speed?

I am defining a new struct for a project, which I need to be fast. This struct only contains one parameter, an MVector (from the StaticArrays.jl package). However, the operations on the struct are 4x slower than those on the MVector directly.

Here is the definition of the struct:

struct Foo{N}
    Foo(v::A) where A <: Union{Vector, SVector, MVector} = new{length(v)}(MVector{length(v)}(v))

Base.copy(foo::Foo) = Foo(

function Base.:+(a::Foo, b::Foo)
    return Foo( +

And the benchmarks:

julia> mvec = MVector{100}(rand(100))
julia> x = Foo(rand(100))
julia> @benchmark copy($mvec)
BenchmarkTools.Trial: 10000 samples with 984 evaluations.
 Range (min … max):  43.874 ns … 323.259 ns  β”Š GC (min … max): 0.00% … 71.34%
 Time  (median):     62.593 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   67.954 ns Β±  24.152 ns  β”Š GC (mean Β± Οƒ):  3.85% Β±  8.97%

      β–β–‡β–ˆβ–„β–†β–†β–ƒ                                                  β–‚
  β–‡β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–…β–β–β–β–β–β–β–β–β–ƒβ–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–„β–†β–†β–‡β–†β–‡β–‡β–†β–†β–„β–…β–† β–ˆ
  43.9 ns       Histogram: log(frequency) by time       221 ns <

 Memory estimate: 816 bytes, allocs estimate: 1.

julia> @benchmark copy($x)
BenchmarkTools.Trial: 10000 samples with 822 evaluations.
 Range (min … max):  149.393 ns … 527.189 ns  β”Š GC (min … max): 0.00% … 65.09%
 Time  (median):     161.423 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   165.675 ns Β±  33.197 ns  β”Š GC (mean Β± Οƒ):  2.35% Β±  7.32%

   β–…β–ˆβ–‡β–ƒβ–                                                        β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–„β–β–β–β–β–β–β–β–β–β–β–β–„β–…β–…β–…β–†β–‡ β–ˆ
  149 ns        Histogram: log(frequency) by time        417 ns <

 Memory estimate: 832 bytes, allocs estimate: 2.

julia> @benchmark +($mvec, $mvec)
BenchmarkTools.Trial: 10000 samples with 987 evaluations.
 Range (min … max):  37.670 ns … 386.473 ns  β”Š GC (min … max): 0.00% … 46.21%
 Time  (median):     60.061 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   62.614 ns Β±  24.071 ns  β”Š GC (mean Β± Οƒ):  4.30% Β±  9.31%

      β–„β–‡β–†β–ˆβ–‡β–ƒβ–                                                  β–‚
  β–ƒβ–…β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–ƒβ–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–ƒβ–†β–†β–†β–†β–†β–…β–†β–‡β–‡β–‡β–‡ β–ˆ
  37.7 ns       Histogram: log(frequency) by time       217 ns <

 Memory estimate: 816 bytes, allocs estimate: 1.

julia> @benchmark +($x, $x)
BenchmarkTools.Trial: 10000 samples with 492 evaluations.
 Range (min … max):  223.604 ns … 915.809 ns  β”Š GC (min … max): 0.00% … 48.79%
 Time  (median):     240.069 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   248.639 ns Β±  56.365 ns  β”Š GC (mean Β± Οƒ):  2.50% Β±  7.61%

  β–‚β–‡β–ˆβ–ˆβ–‚β–                                                        β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–„β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–„β–…β–…β–†β–‡β–‡β–ˆβ–‡β–‡ β–ˆ
  224 ns        Histogram: log(frequency) by time        607 ns <

 Memory estimate: 1.61 KiB, allocs estimate: 3.

Is there something I am doing wrong? How can I fix this?


It is likely due to the missing type parameter of MVector. It actually has two type parameters: MVector{S, T}, where S is the size and T is the element type.

This should help:

struct Foo{N, T}
    dev::MVector{N, T}
    Foo(v::A) where A <: Union{Vector, SVector, MVector} = 
        new{length(v), eltype(A)}(MVector{length(v)}(v))

This was exactly what was missing. Now performances are identical. Thanks!

You can also consider a definition more like

struct Foo{M<:MVector} # or just `struct Foo{M<:AbstractVector}`

where the vector type M itself is parameterized. I often find this more convenient if I don’t need to dispatch further on the field type’s parameters.

Going further, I’ll suggest a few more changes:

struct Foo{M<:MVector} # or just `struct Foo{M<:AbstractVector}`
  # no inner constructor that forces a copy of the input vector
Foo(v::AbstractVector) = Foo(MVector{length(v)}(v)) # this constructor creates a MVector copy of `v`
# if the input to `Foo(v)` is already an `MVector`, no copy is made

Base.copy(foo::Foo) = Foo(copy( # need to make a copy now because the inner constructor no longer does

Notice that I also changed your inner constructor to an outer constructor. The design I’m using here may be slightly more ergonomic and is something you should maybe consider, if it doesn’t conflict with your other usage. Notice that with your old definitions, Base.:+ creates a new MVector for +, then copies the result of that to yet-another new MVector in the Foo call’s inner constructor. Not having a constructor that can work without copying will often lead to performance blockers like this, if performance is something you care about in this situation.

Since you were happy to copy every input anyway, I’ll also make a plug for SVector over MVector. The only thing you lose is the ability to mutate entries. But for modest sizes it’s equally fast (and often faster, depending on context) to just make an entirely new SVector with one value changed (see setindex, not the usual setindex!, for a convenient function for this) than to mutate a MVector. This is because the immutable nature of SVector allows additional compiler optimizations. Also, MVectors usually require heap allocations to create but SVectors almost never do, which is another point is SVector’s favor.

I use SVectors all the time but I don’t think I’ve used an MVector more than once or twice. It’s possible I’ve never actually used one in β€œreal” code.


Trying to understand why the solution proposed by @Dan using this function
Tables.columntype(sch, :Close) is much faster than the others, I came to narrow down the reason for the difference in the use of structures in a similar way (if I understand correctly) to what was discussed here.

julia> using CSV, Tables, DataFrames, Dates, TSFrames

julia>"table.txt",DataFrame,delim=' ', ignorerepeated=true);    

julia> tsdf=TSFrame(df);

julia> sch = Tables.schema(tsdf)
 :Index   Date
 :Open    Float64
 :High    Float64
 :Low     Float64
 :Close   Float64
 :Volume  Float64

julia> using BenchmarkTools

julia> n,t= sch.names,sch.types
((:Index, :Open, :High, :Low, :Close, :Volume), (Date, Float64, Float64, Float64, Float64, Float64))

julia> @btime     t[findfirst(==(:Close),$n)]
  191.888 ns (0 allocations: 0 bytes)

julia> n,t= sch.names,Tuple{sch.types...}
((:Index, :Open, :High, :Low, :Close, :Volume), Tuple{Date, Vararg{Float64, 5}})

julia> @btime     fieldtype(t, findfirst(==(:Close),$n))
  10.010 ns (0 allocations: 0 bytes)

julia> @btime Tables.columntype(sch, :Close)
  12.813 ns (0 allocations: 0 bytes)

What I couldn’t delve into(*) is why the fieldtype(Tuple{t...},i) function is so much faster than getindex(t,i).

(*) ```
julia> @edit fieldtype(t, findfirst(==(:Close),n))
ERROR: could not determine location of method definition