Hello, I’m profiling my simulation with `@profview`, and this is the flame graph, where I’m pointing at the four function calls that seem to be among the most expensive. They all call the same function, `superbee!`, from different parts of the simulation. The function reads
```julia
# Slope limiters: return the larger/smaller magnitude carrying the common sign,
# and zero when the two arguments disagree in sign.
maxmod(a::Float64, b::Float64) = 0.5 * (sign(a) + sign(b)) * max(abs(a), abs(b))
minmod(a::Float64, b::Float64) = 0.5 * (sign(a) + sign(b)) * min(abs(a), abs(b))

function superbee!(s::T, a::T, b::T, h::Float64) where {T<:Vector{Float64}}
    for i in eachindex(s)
        @inbounds s[i] = maxmod(minmod(a[i], 2.0 * b[i]), minmod(2.0 * a[i], b[i])) * h
    end
end
```
where all the input vectors are pre-allocated and usually of length ~100 (the size is known only at runtime; it can grow to 500). Benchmarking this I get
```julia
julia> n = 100; s = rand(n); a = rand(n); b = rand(n); h = rand();

julia> @benchmark superbee!($s, $a, $b, $h)
BenchmarkTools.Trial: 10000 samples with 451 evaluations.
 Range (min … max):  227.541 ns … 716.091 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     229.091 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   242.514 ns ±  37.856 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▁ ▂ ▁                                                       ▁
  ████▇█▇██▇█▇▇███▇▇▆▆▅▆▆▇▇▇▆▆▆▇▆▆▆▇▆▇▇▇▇▇▇▇▇▆▇▆▅▆▅▆▄▅▅▅▅▅▃▄▄▅▄ █
  228 ns        Histogram: log(frequency) by time         403 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Nice, no allocs!
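For intuition, here is a quick scalar check of my own (not part of the profiled run) showing what the limiter computes:

```julia
julia> maxmod(minmod(1.0, 2.0 * 0.25), minmod(2.0 * 1.0, 0.25))    # same-sign slopes
0.5

julia> maxmod(minmod(-1.0, 2.0 * 0.25), minmod(2.0 * -1.0, 0.25))  # opposite signs limit to zero
0.0
```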
(Btw, I also tried to simply use broadcasting:

```julia
function superbeeb!(s::T, a::T, b::T, h::Float64) where {T<:Vector{Float64}}
    @. s = maxmod(minmod(a, 2.0 * b), minmod(2.0 * a, b)) * h
end
```

but this is slower than the `for`-loop one:
```julia
julia> @benchmark superbeeb!($s, $a, $b, $h)
BenchmarkTools.Trial: 10000 samples with 290 evaluations.
 Range (min … max):  282.717 ns … 877.279 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     284.676 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   294.799 ns ±  35.806 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▂▁ ▂ ▁                                                      ▁
  ███████▇▇█▆▆▇▇▇▇▇▇▇▇▅▆▆▆▆▅▆▅▅▆▅▆▆▅▅▅▅▄▅▆▆▅▅▅▄▄▅▅▅▅▅▄▅▅▅▄▅▄▄▄▄ █
  283 ns        Histogram: log(frequency) by time         464 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
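Both versions agree on the result, so the gap is purely performance; a quick check of my own:

```julia
julia> s2 = similar(s); superbee!(s, a, b, h); superbeeb!(s2, a, b, h);

julia> s ≈ s2
true
```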
Then I noticed that most of the time is spent in the `sign` function, since `minmod` is called twice per element. However, `sign` doesn’t care about the difference between `a` and `2.0*a`, so I refactored as
```julia
function superbee_refactor!(s::T, a::T, b::T, h::Float64) where {T<:Vector{Float64}}
    for i in eachindex(s)
        @inbounds begin
            # sign(x) == sign(2x), so compute the shared sign factor only once
            signab = 0.5 * (sign(a[i]) + sign(b[i]))
            absa = abs(a[i])
            absb = abs(b[i])
            minmoda2b = signab * min(absa, 2.0 * absb)  # == minmod(a[i], 2b[i])
            minmod2ab = signab * min(2.0 * absa, absb)  # == minmod(2a[i], b[i])
            s[i] = maxmod(minmoda2b, minmod2ab) * h
        end
    end
end
```
resulting in
```julia
julia> @benchmark superbee_refactor!($s, $a, $b, $h)
BenchmarkTools.Trial: 10000 samples with 661 evaluations.
 Range (min … max):  182.625 ns … 529.581 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     185.082 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   191.468 ns ±  22.264 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▃▁ ▁ ▁                                                      ▁
  █████████▇███▇███▇▇▇▇▆▆▆▆▆▆▆▅▆▆▅▅▅▆▆▆▅▆▅▅▅▆▄▄▄▆▅▅▅▄▄▃▅▅▅▅▄▄▃▅ █
  183 ns        Histogram: log(frequency) by time         301 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
This required a few scalar temporaries (cheap locals, not heap allocations), but it seems to be worth the effort.
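One further step I can see (a sketch of my own, not benchmarked against the full simulation): both arguments to `maxmod` carry the same sign factor `signab`, so `maxmod` can be folded away too, since the two `minmod` magnitudes are non-negative and vanish together (both are zero exactly when `a[i]` or `b[i]` is zero). The name `superbee_fold!` is just illustrative:

```julia
# Sketch: maxmod(signab*m1, signab*m2) == signab * max(m1, m2) here, because
# m1, m2 >= 0 and they are zero under exactly the same condition.
function superbee_fold!(s::T, a::T, b::T, h::Float64) where {T<:Vector{Float64}}
    for i in eachindex(s)
        @inbounds begin
            signab = 0.5 * (sign(a[i]) + sign(b[i]))
            absa = abs(a[i])
            absb = abs(b[i])
            s[i] = signab * max(min(absa, 2.0 * absb), min(2.0 * absa, absb)) * h
        end
    end
end
```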
Where to go now? I tried `@simd`, but no luck; my attempt looked roughly like the sketch below. Multi-threading likewise isn’t helpful with such a small number of elements in the vectors.
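Reconstructed for reference (the name `superbee_simd!` is just illustrative); the timings were essentially unchanged for me:

```julia
function superbee_simd!(s::T, a::T, b::T, h::Float64) where {T<:Vector{Float64}}
    @inbounds @simd for i in eachindex(s)
        s[i] = maxmod(minmod(a[i], 2.0 * b[i]), minmod(2.0 * a[i], b[i])) * h
    end
end
```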
Would StaticArrays.jl help here, even though I may have more than 100 elements at times?