Is this a good illustration of the principle "Don't store fields you can compute on demand"?

Motivation

I was reading @jakobnissen's blog post about optimizing Julia code and came across this interesting tip:

Do not store fields that can easily be computed on demand.

In my code, I am working on a numerical scheme whose input data is a set of vectors of equal length, say x and y. The calculations are deterministic, and certain quantities such as 1 - x[i] and x[i] * y[i] are used repeatedly throughout the algorithm. So I have written my code so that the user first reads x and y into a struct, whose inner constructor performs these preliminary calculations and stores the results as additional fields. Something like

struct HeavyVecPair
    x::Vector{Float64}
    oneminusx::Vector{Float64}
    twominusx::Vector{Float64}
    y::Vector{Float64}
    oneminusy::Vector{Float64}
    twominusy::Vector{Float64}
    xy::Vector{Float64}
    xplusy::Vector{Float64}

    function HeavyVecPair(x, y)
        @assert length(x) == length(y)
        new(
            x,
            1 .- x,
            2 .- x,
            y,
            1 .- y,
            2 .- y,
            x .* y,
            x .+ y
        )
    end
end

For comparison, here is one that just contains the vectors.

struct LightVecPair
    x::Vector{Float64}
    y::Vector{Float64}

    function LightVecPair(x, y)
        @assert length(x) == length(y)
        new(x, y)
    end
end
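As a side note (my sketch, not from the blog post), the derived quantities can also live as tiny accessor functions on the light struct; the compiler inlines them, so call sites read like the heavy version while nothing extra is stored:

```julia
# Same light struct as above, repeated so this snippet runs standalone.
struct LightVecPair
    x::Vector{Float64}
    y::Vector{Float64}

    function LightVecPair(x, y)
        @assert length(x) == length(y)
        new(x, y)
    end
end

# Derived quantities computed on demand; each inlines to scalar arithmetic.
oneminusx(v::LightVecPair, i) = 1 - v.x[i]
twominusx(v::LightVecPair, i) = 2 - v.x[i]
oneminusy(v::LightVecPair, i) = 1 - v.y[i]
twominusy(v::LightVecPair, i) = 2 - v.y[i]
xy(v::LightVecPair, i)        = v.x[i] * v.y[i]
xplusy(v::LightVecPair, i)    = v.x[i] + v.y[i]
```

The accessor names here are hypothetical; the point is only that "compute on demand" need not make call sites noisier than field access.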

Benchmark

To compare these two, I wrote the following benchmark to mimic my code.

using BenchmarkTools

function funnydot(v::HeavyVecPair)
    sum(
        v.x[i]
        * v.oneminusx[i]
        * v.twominusx[i]
        * v.y[i]
        * v.oneminusy[i]
        * v.twominusy[i]
        * v.xy[i]
        * v.xplusy[i]
        for i in eachindex(v.x)
    )
end

function funnydot(v::LightVecPair)
    sum(
        v.x[i]
        * (1 - v.x[i])
        * (2 - v.x[i])
        * v.y[i]
        * (1 - v.y[i])
        * (2 - v.y[i])
        * v.x[i] * v.y[i]
        * (v.x[i] + v.y[i])
        for i in eachindex(v.x)
    )
end

function main()
    @info "Short vectors"
    display(@benchmark funnydot(v) setup=(v = HeavyVecPair(rand(10), rand(10))))
    display(@benchmark funnydot(v) setup=(v = LightVecPair(rand(10), rand(10))))

    @info "Long vectors"
    display(@benchmark funnydot(v) setup=(v = HeavyVecPair(rand(1000), rand(1000))))
    display(@benchmark funnydot(v) setup=(v = LightVecPair(rand(1000), rand(1000))))
end

main()

The results suggest that the LightVecPair is better for both short and long vectors, as predicted by the post:

[ Info: Short vectors
BenchmarkTools.Trial: 10000 samples with 983 evaluations.
 Range (min … max):  58.111 ns … 105.739 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     58.769 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.131 ns ±   1.654 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  58.1 ns       Histogram: log(frequency) by time        64 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  39.406 ns … 89.627 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.686 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.134 ns ±  1.859 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  39.4 ns      Histogram: log(frequency) by time        46 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Long vectors
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.281 μs …   9.496 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.459 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.487 μs ± 258.464 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  3.28 μs      Histogram: log(frequency) by time      4.65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.792 μs …   6.521 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.814 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.870 μs ± 212.418 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  2.79 μs      Histogram: log(frequency) by time      3.58 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Note that these times do not include the time the HeavyVecPair constructor spends on the preliminary calculations, so in reality the performance cost of HeavyVecPair is even greater than the benchmark shows.
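To put a rough number on that constructor cost, here is a small standalone sketch (my addition, using only Base's @allocated) measuring what the heavy constructor's broadcasts allocate for length-1000 inputs:

```julia
# The six extra fields the heavy constructor materializes, as a standalone function.
precompute(x, y) = (1 .- x, 2 .- x, 1 .- y, 2 .- y, x .* y, x .+ y)

x, y = rand(1000), rand(1000)
precompute(x, y)                      # warm up so compilation is not counted
bytes = @allocated precompute(x, y)   # at least 6 vectors * 1000 * 8 bytes ≈ 48 kB
```

Each broadcast materializes a fresh 1000-element Vector{Float64}, so roughly 48 kB is allocated per construction before funnydot even runs.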

Conclusion

The advice in the blog post appears correct.

Questions

• Is this a well-designed benchmark?
• Are there any other factors in play here that I should consider before reworking the structs in my code base to be more like LightVecPair?
• Are there any other optimizations I can make to the code above?

Each of these operations should allocate a new array, so the results are somewhat strange. Are you sure the benchmarks are not inverted?

(For short arrays, if you use static arrays, computing on the fly may be advantageous, but for standard arrays only if the operations can be fused so as not to allocate intermediates.)
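To illustrate the fusion point: the light version's generator already recomputes everything scalar-by-scalar in a single pass, so no intermediate arrays ever exist. A standalone sketch of that pattern (function names are mine), checked with @allocated:

```julia
# One pass, all derived quantities recomputed as scalars: nothing to allocate.
function fused(x, y)
    s = 0.0
    for i in eachindex(x, y)
        s += x[i] * (1 - x[i]) * (2 - x[i]) *
             y[i] * (1 - y[i]) * (2 - y[i]) *
             (x[i] * y[i]) * (x[i] + y[i])
    end
    return s
end

# Measure inside a function so the inner call is statically dispatched.
alloc_of_fused(x, y) = @allocated fused(x, y)

x, y = rand(1000), rand(1000)
alloc_of_fused(x, y)           # warm up (first call compiles)
bytes = alloc_of_fused(x, y)   # expect zero heap allocation
```

This is exactly why the benchmarks above report "Memory estimate: 0 bytes" for both structs: neither funnydot builds any arrays during the sum.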

v.x[i] is a scalar, so that doesn't allocate. The fact that it says .- instead of - is sloppy clipboard work on my part, but I think the compiler takes care of it.


Yeah, so that is probably one case where recomputing is better because you have fewer memory accesses, which are slower than these cheap arithmetic operations. Maybe if you @inbounds the precomputed version the performance gets closer.

But is this line doing the same as the other version? I wouldn't expect allocations in either case.
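For reference, the @inbounds suggestion above might look like this sketch (assuming plain vectors rather than the struct, so it runs standalone; the function name is mine):

```julia
# Sum over precomputed fields with bounds checks removed and SIMD enabled.
function funnydot_fast(x, omx, tmx, y, omy, tmy, xy, xpy)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i] * omx[i] * tmx[i] * y[i] * omy[i] * tmy[i] * xy[i] * xpy[i]
    end
    return s
end

x, y = rand(1000), rand(1000)
result = funnydot_fast(x, 1 .- x, 2 .- x, y, 1 .- y, 2 .- y, x .* y, x .+ y)
```

Note that @simd may reassociate the sum, so the result can differ from the sequential one in the last bits.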

As the previous comment mentioned, I think this line is responsible for the large discrepancy. Do you want v.y[i] instead? Interestingly, making this change still isn't faster than the LightVecPair implementation in my quick test, but they are much closer.

Oh, facepalm. Yep, that's what I meant. I fixed this error and updated the results in the OP. Like you, I still find that the LightVecPair is faster. I wonder whether this would be different on a newer computer with higher memory latency relative to compute?