Is this a good illustration of the principle "Don't store fields you can compute on demand"?

Motivation

I was reading @jakobnissen's blog post about optimizing Julia code and came across this interesting tip:

Do not store fields that can easily be computed on demand.

In my code, I am working on a numerical scheme whose input data is a set of vectors of equal length, say x and y. The calculations are deterministic, and certain quantities such as 1 - x[i] and x[i] * y[i] are used repeatedly throughout the algorithm. So I have written my code so that the user first reads x and y into a struct, whose inner constructor performs these preliminary calculations and stores the results as additional fields. Something like

struct HeavyVecPair
    x::Vector{Float64}
    oneminusx::Vector{Float64}
    twominusx::Vector{Float64}
    y::Vector{Float64}
    oneminusy::Vector{Float64}
    twominusy::Vector{Float64}
    xy::Vector{Float64}
    xplusy::Vector{Float64}

    function HeavyVecPair(x, y)
        @assert length(x) == length(y)
        new(
            x,
            1 .- x,
            2 .- x,
            y,
            1 .- y,
            2 .- y,
            x .* y,
            x .+ y
        )
    end
end

For comparison, here is one that just contains the vectors.

struct LightVecPair
    x::Vector{Float64}
    y::Vector{Float64}

    function LightVecPair(x, y)
        @assert length(x) == length(y)
        new(x, y)
    end
end
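As a side note (my sketch, not from the blog post), the derived quantities can also live as tiny accessor functions on the light struct; the compiler inlines them, so call sites read like the heavy version while nothing extra is stored:

```julia
# Same light struct as above, repeated so this snippet runs standalone.
struct LightVecPair
    x::Vector{Float64}
    y::Vector{Float64}

    function LightVecPair(x, y)
        @assert length(x) == length(y)
        new(x, y)
    end
end

# Derived quantities computed on demand; each inlines to scalar arithmetic.
oneminusx(v::LightVecPair, i) = 1 - v.x[i]
twominusx(v::LightVecPair, i) = 2 - v.x[i]
oneminusy(v::LightVecPair, i) = 1 - v.y[i]
twominusy(v::LightVecPair, i) = 2 - v.y[i]
xy(v::LightVecPair, i)        = v.x[i] * v.y[i]
xplusy(v::LightVecPair, i)    = v.x[i] + v.y[i]
```

The accessor names here are hypothetical; the point is only that "compute on demand" need not make call sites noisier than field access.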

Benchmark

To compare these two, I wrote the following benchmark to mimic my code.

using BenchmarkTools

function funnydot(v::HeavyVecPair)
    sum(
        v.x[i]
        * v.oneminusx[i]
        * v.twominusx[i]
        * v.y[i]
        * v.oneminusy[i]
        * v.twominusy[i]
        * v.xy[i]
        * v.xplusy[i]
        for i in eachindex(v.x)
    )
end

function funnydot(v::LightVecPair)
    sum(
        v.x[i]
        * (1 - v.x[i])
        * (2 - v.x[i])
        * v.y[i]
        * (1 - v.y[i])
        * (2 - v.y[i])
        * v.x[i] * v.y[i]
        * (v.x[i] + v.y[i])
        for i in eachindex(v.x)
    )
end

function main()
    @info "Short vectors"
    display(@benchmark funnydot(v) setup=(v = HeavyVecPair(rand(10), rand(10))))
    display(@benchmark funnydot(v) setup=(v = LightVecPair(rand(10), rand(10))))

    @info "Long vectors"
    display(@benchmark funnydot(v) setup=(v = HeavyVecPair(rand(1000), rand(1000))))
    display(@benchmark funnydot(v) setup=(v = LightVecPair(rand(1000), rand(1000))))
end

main()

The results suggest that the LightVecPair is better for both short and long vectors, as predicted by the post:

[ Info: Short vectors
BenchmarkTools.Trial: 10000 samples with 983 evaluations.
 Range (min … max):  58.111 ns … 105.739 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     58.769 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.131 ns ±   1.654 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  58.1 ns       Histogram: log(frequency) by time        64 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  39.406 ns … 89.627 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.686 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.134 ns ±  1.859 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  39.4 ns      Histogram: log(frequency) by time        46 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Long vectors
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.281 μs …   9.496 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.459 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.487 μs ± 258.464 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  3.28 μs      Histogram: log(frequency) by time      4.65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.792 μs …   6.521 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.814 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.870 μs ± 212.418 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  2.79 μs      Histogram: log(frequency) by time      3.58 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Note that these times do not include the time the HeavyVecPair constructor spends on the preliminary calculations, so in reality the performance cost of HeavyVecPair is even greater than the benchmark shows.
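To put a rough number on that constructor cost, here is a small standalone sketch (my addition, using only Base's @allocated) measuring what the heavy constructor's broadcasts allocate for length-1000 inputs:

```julia
# The six extra fields the heavy constructor materializes, as a standalone function.
precompute(x, y) = (1 .- x, 2 .- x, 1 .- y, 2 .- y, x .* y, x .+ y)

x, y = rand(1000), rand(1000)
precompute(x, y)                      # warm up so compilation is not counted
bytes = @allocated precompute(x, y)   # at least 6 vectors * 1000 * 8 bytes ≈ 48 kB
```

Each broadcast materializes a fresh 1000-element Vector{Float64}, so roughly 48 kB is allocated per construction before funnydot even runs.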

Conclusion

The advice in the blog post appears correct.

Questions

• Is this a well-designed benchmark?
• Are there any other factors in play here that I should consider before reworking the structs in my code base to be more like LightVecPair?
• Are there any other optimizations I can make to the code above?

Each of these operations should allocate a new array, so the results are somewhat strange. Are you sure the benchmarks are not inverted?

(For short arrays, if you use static arrays, computing on the fly may be advantageous, but for standard arrays only if the operations can be fused so as not to allocate intermediates.)
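To illustrate the fusion point: the light version's generator already recomputes everything scalar-by-scalar in a single pass, so no intermediate arrays ever exist. A standalone sketch of that pattern (function names are mine), checked with @allocated:

```julia
# One pass, all derived quantities recomputed as scalars: nothing to allocate.
function fused(x, y)
    s = 0.0
    for i in eachindex(x, y)
        s += x[i] * (1 - x[i]) * (2 - x[i]) *
             y[i] * (1 - y[i]) * (2 - y[i]) *
             (x[i] * y[i]) * (x[i] + y[i])
    end
    return s
end

# Measure inside a function so the inner call is statically dispatched.
alloc_of_fused(x, y) = @allocated fused(x, y)

x, y = rand(1000), rand(1000)
alloc_of_fused(x, y)           # warm up (first call compiles)
bytes = alloc_of_fused(x, y)   # expect zero heap allocation
```

This is exactly why the benchmarks above report "Memory estimate: 0 bytes" for both structs: neither funnydot builds any arrays during the sum.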

v.x[i] is a scalar, so that doesn't allocate. The fact that it says .- instead of - is sloppy clipboard work on my part, but I think the compiler takes care of it.


Yeah, so that is probably one case where recomputing is better because you have fewer memory accesses, which are slower than these cheap arithmetic operations. Maybe if you @inbounds the precomputed version the performance gets closer.

But is this line doing the same as the other version? I wouldn't expect allocations in either case.
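For reference, the @inbounds suggestion above might look like this sketch (assuming plain vectors rather than the struct, so it runs standalone; the function name is mine):

```julia
# Sum over precomputed fields with bounds checks removed and SIMD enabled.
function funnydot_fast(x, omx, tmx, y, omy, tmy, xy, xpy)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i] * omx[i] * tmx[i] * y[i] * omy[i] * tmy[i] * xy[i] * xpy[i]
    end
    return s
end

x, y = rand(1000), rand(1000)
result = funnydot_fast(x, 1 .- x, 2 .- x, y, 1 .- y, 2 .- y, x .* y, x .+ y)
```

Note that @simd may reassociate the sum, so the result can differ from the sequential one in the last bits.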

As the previous comment mentioned, I think this line is responsible for the large discrepancy. Do you want v.y[i] instead? Interestingly, making this change still isn't faster than the LightVecPair implementation in my quick test, but they are much closer.

Oh, facepalm. Yep, that's what I meant. I fixed this error and updated the results in the OP. Like you, I still find that the LightVecPair is faster. I wonder whether this would be different on a newer computer with higher memory latency relative to compute?