Apologies for what is probably "yet another" allocations post, but I'm unclear what I have done wrong to cause the drop in performance I'm seeing. Below is the original snippet, which I then extend to MWE 2; I'd be grateful for input on how I might improve it. Essentially I'm taking a similar approach to "online stats", but with my own structs, partly as a learning exercise.
MWE1 - “Count”
using Parameters, BenchmarkTools, Random

abstract type ICompute end

@with_kw mutable struct Count <: ICompute
    valueCounter::Int64 = 0
end

a = Vector{Float32}(undef, 100_000)
rand!(a)

function StreamingUpdate!(cnt::Count, val::Float32)
    cnt.valueCounter += 1
end

c = Count()

aggregateStream(c::ICompute, a::Vector{Float32}) = ()
function aggregateStream(c::Count, a::Vector{Float32})
    for val in a
        StreamingUpdate!(c, val)
    end
end

@btime aggregateStream($c, $a)
@show c
21.300 μs (0 allocations: 0 bytes)
c = Count
  valueCounter: Int64 20813700000
I then extend to include another streaming compute type, "Sum":
using Parameters, BenchmarkTools, Random

abstract type ICompute end

@with_kw mutable struct Count <: ICompute
    valueCounter::Int64 = 0
end

@with_kw mutable struct Sum <: ICompute
    runningTotal::Float32 = 0
end

a = Vector{Float32}(undef, 100_000)
rand!(a)

function StreamingUpdate!(cnt::Count, val::Float32)
    cnt.valueCounter += 1
end

function StreamingUpdate!(cnt::Sum, val::Float32)
    cnt.runningTotal += val
end

c = Count()
s = Sum()

aggregateStream(c::Vector{ICompute}, a::Vector{Float32}) = ()
function aggregateStream(computes::Vector{ICompute}, a::Vector{Float32})
    for val in a
        for c in computes
            StreamingUpdate!(c, val)
        end
    end
end

@btime aggregateStream([$c, $s], $a)
@show c, s
273.600 μs (2 allocations: 96 bytes)
(c, s) = (Count
valueCounter: Int64 2505300000
, Sum
runningTotal: Float32 1.6777216f7
)
Version info:
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 6
Note that I actually don't know why count isn't returning 100_000, since that's what I expect given the setup, but the execution time is quite consistently 10x slower. I also have to iterate over the values and populate via a stream, as the data will be quite large (1M+) in my use case.
Thanks - will try later - but would you expect that to also cause the 10x drop in performance?
I’ve made the question title more explicit to show that I’m actually trying to understand where the drop in performance is coming from (along with the allocations, but potentially that’s secondary).
My guess is not entirely. The other thing is that MWE2 is looping over a Vector with an abstract element type, so the loop needs to dynamically dispatch StreamingUpdate! (dynamic dispatch takes up runtime). Dynamic dispatch also causes allocations sometimes, but I’m not sure in this case.
No. The comparison is not quite fair, because you benchmark Count vs Count+Sum, when Sum is much slower than Count.
Try [Count(), Count()] and see what the discrepancy is then. That said, @Benny makes an excellent point about dynamic dispatch. Suggestion: Turn the loops around, i.e. go over computes in the outer loop. (Spoiler: the slowdown is ~2x )
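The inverted iteration suggested above can be combined with a function barrier. A minimal sketch, with the thread's structs rewritten without Parameters.jl, and `aggregateOne!`/`aggregateStream!` being hypothetical names: the dynamic dispatch then happens once per compute instead of once per element.

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

mutable struct Sum <: ICompute
    runningTotal::Float32
end

StreamingUpdate!(c::Count, ::Float32) = (c.valueCounter += 1)
StreamingUpdate!(c::Sum, val::Float32) = (c.runningTotal += val)

# Function barrier: this function is compiled separately for each concrete
# type of c, so the hot inner loop dispatches StreamingUpdate! statically.
function aggregateOne!(c::ICompute, a::Vector{Float32})
    for val in a
        StreamingUpdate!(c, val)
    end
end

# Inverted iteration: computes in the outer loop, so there is only one
# dynamic dispatch (the call to aggregateOne!) per compute.
function aggregateStream!(computes::Vector{<:ICompute}, a::Vector{Float32})
    for c in computes
        aggregateOne!(c, a)
    end
end

a = rand(Float32, 100_000)
c, s = Count(0), Sum(0.0f0)
aggregateStream!(ICompute[c, s], a)
c.valueCounter  # 100_000 after one pass
```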
Dishonest? Certainly not my intention by any means - I’m looking to resolve this “toy box” example in a structured way exactly so that I can extend it with more complex IComputes.
@Benny - thanks will give it a go and create a list of functions first to try and mitigate dynamic dispatch
This would fail in MWE2 because the type is Vector{Count}, and there’s no method for it because it is not a subtype of the concrete Vector{ICompute}. Quick fix is making the argument type the abstract Vector{<:ICompute}.
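A minimal sketch of that invariance issue (`f` and `g` are hypothetical names): `Vector` is invariant in its element type, so `Vector{Count}` is not a subtype of `Vector{ICompute}`, and only the `<:` form accepts it.

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

# Accepts only a Vector whose element type is exactly ICompute:
f(computes::Vector{ICompute}) = length(computes)

# Accepts Vector{Count}, Vector{ICompute}, and any other Vector
# whose element type is a subtype of ICompute:
g(computes::Vector{<:ICompute}) = length(computes)

cs = [Count(0), Count(0)]   # inferred as Vector{Count}
g(cs)                       # works
# f(cs)                     # MethodError: Vector{Count} is not Vector{ICompute}
```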
I tried your suggestion of moving the array construction out of the benchmarked call, then modified the signature to Vector{<:ICompute}, and it did indeed remove all allocations: 277.300 μs (0 allocations: 0 bytes)
I also inverted the iteration (which unfortunately isn’t an option in my scenario), which did improve things: 112.000 μs (0 allocations: 0 bytes). That is comparable to running them individually (22.800 μs (0 allocations: 0 bytes) for “count” and 87.200 μs (0 allocations: 0 bytes) for “sum”).
I must admit I’m still a little perplexed as to why this takes so long, as I really don’t believe much is happening, and I’m following a similar pattern (I believe) to what I see in the source code for OnlineStats.jl.
Potentially I’m looking at this wrong, but iterating alone costs 14.314 ns, hence why I assumed a (low) multiple of that for something like Count.
Dynamic dispatch really is costly especially when it adds up in a loop, but if you really have to check arbitrary runtime types to figure out what method to use, you have to pay the cost. If you only have a handful of types to check, you could actually hardcode a redundant if-statement checking the types so each branch has all types constant at compile-time; ManualDispatch.jl provides a macro to make it easier to write and maintain such a loop.
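A sketch of that hardcoded if-statement approach, using the thread's two types (struct definitions simplified without Parameters.jl; `aggregateStream_manual!` is a hypothetical name). ManualDispatch.jl's macro generates essentially this shape for you:

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

mutable struct Sum <: ICompute
    runningTotal::Float32
end

StreamingUpdate!(c::Count, ::Float32) = (c.valueCounter += 1)
StreamingUpdate!(c::Sum, val::Float32) = (c.runningTotal += val)

# Redundant-looking run-time type checks: inside each branch the compiler
# knows the concrete type of c, so StreamingUpdate! dispatches statically.
function aggregateStream_manual!(computes::Vector{<:ICompute}, a::Vector{Float32})
    for val in a
        for c in computes
            if c isa Count
                StreamingUpdate!(c, val)
            elseif c isa Sum
                StreamingUpdate!(c, val)
            end
        end
    end
end

c, s = Count(0), Sum(0.0f0)
a = rand(Float32, 100_000)
aggregateStream_manual!(ICompute[c, s], a)
```

The obvious downside is maintenance: every new ICompute subtype needs another branch, which is exactly what the ManualDispatch.jl macro automates.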
I’ve never used these myself, but if you’re into optimizing and profiling, I’ve heard about packages like ProfileView.jl and Cthulhu.jl.
Many thanks for the response and information @Benny - I will try this and revert.
I must admit I didn’t realise that this is potentially problematic in Julia anyway (anti-Julian?) - is there a better way to write it? I assumed that, since I have concrete types, there wouldn’t be any dynamism going on after the first calls, but it appears I essentially need a method factory behind the scenes to establish which method to call.
I’ve also seen FunctionWranglers.jl, which allows building arrays of functions - I’ll try that too.
Are you conflating dynamic dispatch with compilation? Compilation of the methods will be done at the first calls. If your array can contain multiple types, then there’s no way to know a particular index’s type at compile-time, you have to check the type at run-time. If you call a function on a varying runtime-type, then the selection of the method (possibly already compiled) has to happen at run-time too; this specific step is the dynamic dispatch.
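To illustrate that distinction, a minimal sketch: the element type of the vector is all the compiler can ever know about `v[i]`; the concrete type of a particular element only exists at run time.

```julia
abstract type ICompute end
struct Count <: ICompute end
struct Sum <: ICompute end

v = ICompute[Count(), Sum()]

eltype(v)      # ICompute - all the compiler knows about any v[i]
typeof(v[1])   # Count    - only discoverable by a run-time check
typeof(v[2])   # Sum
```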
It’s not anti-Julian at all. Sure, it’s slower than static (compile-time) dispatch, and nobody wants their code to run slower if it doesn’t need to be, but a sequence where multiple types can be in any order definitely needs run-time checks. Run-time checks don’t necessarily have to be dynamic dispatch e.g. the if-statements I mentioned; there’s a nice summary of packages that aim to situationally reduce the cost of dynamic dispatch in this Catwalk.jl page.
FunctionWranglers.jl looks interesting, but I’m not sure it’ll help. It seems to deal with an array of functions, but you only have 1 function with 2 methods. You could refactor as 2 different functions? But bear in mind that going through your computes array to (somehow) generate the array of corresponding functions is also a run-time check, and it’ll definitely take up more memory and, depending on your function selection algorithm, might not take less time than dynamic dispatch.
Thanks for the response. Your comment - “If your array can contain multiple types, then there’s no way to know a particular index’s type at compile-time, you have to check the type at run-time” - is where the knowledge gap is for me, since my assumption was that it would end up being a free lookup after the first call. In particular, since they all inherit from an abstract type, I thought I was giving enough information (this is definitely a gap in how I reason about multiple dispatch), but I get your point.
The point about the “unknown index” is also interesting, since I thought the loop was fixed at the function barrier. Learning a lot here - many thanks.
I’ll try a few things and revert - then I’ll mark an answer.
That lookup is exactly the incurred overhead. If the type of every element was concrete and known at compile time (i.e. at first call), the compiler could select the appropriate method right away. No dispatch would need to happen during run time. With a vector of abstract type, this is not possible.
OnlineStats.jl solves this by dragging the type information around. For example, the Series type, which corresponds to your Vector{ICompute}, has all the information in its type, so the appropriate methods can be selected at compile time.
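A sketch of the same idea without OnlineStats.jl (struct definitions simplified; `update_all!` and `aggregateStream!` are hypothetical names): store the computes in a Tuple rather than a Vector{ICompute}. A `Tuple{Count, Sum}` records every element's concrete type in the container's own type, so recursing over it lets the compiler resolve every `StreamingUpdate!` call statically.

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

mutable struct Sum <: ICompute
    runningTotal::Float32
end

StreamingUpdate!(c::Count, ::Float32) = (c.valueCounter += 1)
StreamingUpdate!(c::Sum, val::Float32) = (c.runningTotal += val)

# Apply every compute to one value by recursing over the tuple; each
# recursion step sees a fully concrete tuple type, so every call is
# selected at compile time (no dynamic dispatch, no allocations).
update_all!(::Tuple{}, ::Float32) = nothing
function update_all!(computes::Tuple, val::Float32)
    StreamingUpdate!(first(computes), val)
    update_all!(Base.tail(computes), val)
end

function aggregateStream!(computes::Tuple, a::Vector{Float32})
    for val in a
        update_all!(computes, val)
    end
end

a = rand(Float32, 100_000)
c, s = Count(0), Sum(0.0f0)
aggregateStream!((c, s), a)
```

The trade-off is that the set of computes becomes part of the type, so each distinct combination triggers a fresh compilation, which is fine for a handful of fixed pipelines but not for combinations assembled arbitrarily at run time.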