Apologies for what is probably "yet another" allocations post, but I'm unclear what I have done wrong to cause the drop in performance I'm seeing. Below is the original snippet, which I then extend to MWE 2; I'd be grateful for input on how I might improve it. Essentially I'm taking a similar approach to "online stats", but with my own structs, partly as a learning exercise.
MWE1 - “Count”
using Parameters, BenchmarkTools, Random

abstract type ICompute end

@with_kw mutable struct Count <: ICompute
    valueCounter::Int64 = 0
end

a = Vector{Float32}(undef, 100_000)
rand!(a)

function StreamingUpdate!(cnt::Count, val::Float32)
    cnt.valueCounter += 1
end

c = Count()

aggregateStream(c::ICompute, a::Vector{Float32}) = ()
function aggregateStream(c::Count, a::Vector{Float32})
    for val in a
        StreamingUpdate!(c, val)
    end
end

@btime aggregateStream($c, $a)
@show c
21.300 μs (0 allocations: 0 bytes)
c = Count
  valueCounter: Int64 20813700000
I then extend to include another streaming compute type, "Sum":
using Parameters, BenchmarkTools, Random

abstract type ICompute end

@with_kw mutable struct Count <: ICompute
    valueCounter::Int64 = 0
end

@with_kw mutable struct Sum <: ICompute
    runningTotal::Float32 = 0
end

a = Vector{Float32}(undef, 100_000)
rand!(a)

function StreamingUpdate!(cnt::Count, val::Float32)
    cnt.valueCounter += 1
end

function StreamingUpdate!(cnt::Sum, val::Float32)
    cnt.runningTotal += val
end

c = Count()
s = Sum()

aggregateStream(c::Vector{ICompute}, a::Vector{Float32}) = ()
function aggregateStream(computes::Vector{ICompute}, a::Vector{Float32})
    for val in a
        for c in computes
            StreamingUpdate!(c, val)
        end
    end
end

@btime aggregateStream([$c, $s], $a)
@show c, s
273.600 μs (2 allocations: 96 bytes)
(c, s) = (Count
valueCounter: Int64 2505300000
, Sum
runningTotal: Float32 1.6777216f7
)
Version info:
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 6
Note that I actually don't know why count isn't returning 100_000, since that's what I expect given the setup, but the execution time is quite consistently 10x slower. I also have to iterate over the values and populate via a stream, as the data will be quite large (1M+) in my use case.
Thanks - will try later - but would you expect that to also cause the 10x drop in performance?
I’ve made the question title more explicit to show that I’m actually trying to understand where the drop in performance is coming from (along with the allocations, but potentially that’s secondary).
My guess is not entirely. The other thing is that MWE2 is looping over a Vector with an abstract element type, so the loop needs to dynamically dispatch StreamingUpdate! (dynamic dispatch takes up runtime). Dynamic dispatch also causes allocations sometimes, but I’m not sure in this case.
No. The comparison is not quite fair, because you benchmark Count vs Count+Sum, when Sum is much slower than Count.
Try [Count(), Count()] and see what the discrepancy is then. That said, @Benny makes an excellent point about dynamic dispatch. Suggestion: Turn the loops around, i.e. go over computes in the outer loop. (Spoiler: the slowdown is ~2x )
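The inverted iteration suggested above can be combined with a function barrier. A minimal sketch, with the thread's structs rewritten without Parameters.jl, and `aggregateOne!`/`aggregateStream!` being hypothetical names: the dynamic dispatch then happens once per compute instead of once per element.

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

mutable struct Sum <: ICompute
    runningTotal::Float32
end

StreamingUpdate!(c::Count, ::Float32) = (c.valueCounter += 1)
StreamingUpdate!(c::Sum, val::Float32) = (c.runningTotal += val)

# Function barrier: this function is compiled separately for each concrete
# type of c, so the hot inner loop dispatches StreamingUpdate! statically.
function aggregateOne!(c::ICompute, a::Vector{Float32})
    for val in a
        StreamingUpdate!(c, val)
    end
end

# Inverted iteration: computes in the outer loop, so there is only one
# dynamic dispatch (the call to aggregateOne!) per compute.
function aggregateStream!(computes::Vector{<:ICompute}, a::Vector{Float32})
    for c in computes
        aggregateOne!(c, a)
    end
end

a = rand(Float32, 100_000)
c, s = Count(0), Sum(0.0f0)
aggregateStream!(ICompute[c, s], a)
c.valueCounter  # 100_000 after one pass
```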
Dishonest? Certainly not my intention by any means - I’m looking to resolve this “toy box” example in a structured way exactly so that I can extend it with more complex IComputes.
@Benny - thanks will give it a go and create a list of functions first to try and mitigate dynamic dispatch
This would fail in MWE2 because the type is Vector{Count}, and there’s no method for it because it is not a subtype of the concrete Vector{ICompute}. Quick fix is making the argument type the abstract Vector{<:ICompute}.
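A minimal sketch of that invariance issue (`f` and `g` are hypothetical names): `Vector` is invariant in its element type, so `Vector{Count}` is not a subtype of `Vector{ICompute}`, and only the `<:` form accepts it.

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

# Accepts only a Vector whose element type is exactly ICompute:
f(computes::Vector{ICompute}) = length(computes)

# Accepts Vector{Count}, Vector{ICompute}, and any other Vector
# whose element type is a subtype of ICompute:
g(computes::Vector{<:ICompute}) = length(computes)

cs = [Count(0), Count(0)]   # inferred as Vector{Count}
g(cs)                       # works
# f(cs)                     # MethodError: Vector{Count} is not Vector{ICompute}
```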
I tried your suggestion of moving the array construction out of the benchmarked call, then modified the signature to Vector{<:ICompute}, and it did indeed remove all allocations: 277.300 μs (0 allocations: 0 bytes)
I also inverted the iteration (which unfortunately isn’t an option in my scenario), which did improve things: 112.000 μs (0 allocations: 0 bytes). That is comparable to running them individually (22.800 μs (0 allocations: 0 bytes) for “count” and 87.200 μs (0 allocations: 0 bytes) for “sum”).
I must admit I’m still a little perplexed as to why this takes so long, as I really don’t believe much is happening, and I’m following a similar pattern (I believe) to what I see in the source code for OnlineStats.jl.
Potentially I’m looking at this wrong, but iterating alone costs 14.314 ns, hence why I assumed a (low) multiple of that for something like Count.
Dynamic dispatch really is costly especially when it adds up in a loop, but if you really have to check arbitrary runtime types to figure out what method to use, you have to pay the cost. If you only have a handful of types to check, you could actually hardcode a redundant if-statement checking the types so each branch has all types constant at compile-time; ManualDispatch.jl provides a macro to make it easier to write and maintain such a loop.
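A sketch of that hardcoded if-statement approach, using the thread's two types (struct definitions simplified without Parameters.jl; `aggregateStream_manual!` is a hypothetical name). ManualDispatch.jl's macro generates essentially this shape for you:

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

mutable struct Sum <: ICompute
    runningTotal::Float32
end

StreamingUpdate!(c::Count, ::Float32) = (c.valueCounter += 1)
StreamingUpdate!(c::Sum, val::Float32) = (c.runningTotal += val)

# Redundant-looking run-time type checks: inside each branch the compiler
# knows the concrete type of c, so StreamingUpdate! dispatches statically.
function aggregateStream_manual!(computes::Vector{<:ICompute}, a::Vector{Float32})
    for val in a
        for c in computes
            if c isa Count
                StreamingUpdate!(c, val)
            elseif c isa Sum
                StreamingUpdate!(c, val)
            end
        end
    end
end

c, s = Count(0), Sum(0.0f0)
a = rand(Float32, 100_000)
aggregateStream_manual!(ICompute[c, s], a)
```

The obvious downside is maintenance: every new ICompute subtype needs another branch, which is exactly what the ManualDispatch.jl macro automates.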
I’ve never used these myself, but if you’re into optimizing and profiling, I’ve heard about packages like ProfileView.jl and Cthulhu.jl.
Many thanks for the response and information @Benny - I will try this and revert.
I must admit I didn’t realise that this is potentially problematic in Julia anyway (anti-Julian?) - is there a better way to write it? I assumed that, since I have concrete types, there wouldn’t be any dynamism going on after the first calls, but it appears I essentially need a method factory behind the scenes to establish which method to call.
I’ve also seen FunctionWranglers.jl, which allows building arrays of functions - I’ll try that too.
Are you conflating dynamic dispatch with compilation? Compilation of the methods will be done at the first calls. If your array can contain multiple types, then there’s no way to know a particular index’s type at compile-time, you have to check the type at run-time. If you call a function on a varying runtime-type, then the selection of the method (possibly already compiled) has to happen at run-time too; this specific step is the dynamic dispatch.
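To illustrate that distinction, a minimal sketch: the element type of the vector is all the compiler can ever know about `v[i]`; the concrete type of a particular element only exists at run time.

```julia
abstract type ICompute end
struct Count <: ICompute end
struct Sum <: ICompute end

v = ICompute[Count(), Sum()]

eltype(v)      # ICompute - all the compiler knows about any v[i]
typeof(v[1])   # Count    - only discoverable by a run-time check
typeof(v[2])   # Sum
```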
It’s not anti-Julian at all. Sure, it’s slower than static (compile-time) dispatch, and nobody wants their code to run slower if it doesn’t need to be, but a sequence where multiple types can be in any order definitely needs run-time checks. Run-time checks don’t necessarily have to be dynamic dispatch e.g. the if-statements I mentioned; there’s a nice summary of packages that aim to situationally reduce the cost of dynamic dispatch in this Catwalk.jl page.
FunctionWranglers.jl looks interesting, but I’m not sure it’ll help. It seems to deal with an array of functions, but you only have 1 function with 2 methods. You could refactor as 2 different functions? But bear in mind that going through your computes array to (somehow) generate the array of corresponding functions is also a run-time check, and it’ll definitely take up more memory and, depending on your function selection algorithm, might not take less time than dynamic dispatch.
Thanks for the response. Your comment - “If your array can contain multiple types, then there’s no way to know a particular index’s type at compile-time, you have to check the type at run-time” - is where the knowledge gap is for me, since my assumption was that it would end up being a free lookup after the first call. In particular, since they all inherit from an abstract type, I thought I was giving enough information (this is definitely a gap in how I reason about multiple dispatch), but I get your point.
The point about the “unknown index” is also interesting, since I thought the loop was fixed at the function barrier. Learning a lot here - many thanks.
I’ll try a few things and revert - then I’ll mark an answer.
That lookup is exactly the incurred overhead. If the type of every element was concrete and known at compile time (i.e. at first call), the compiler could select the appropriate method right away. No dispatch would need to happen during run time. With a vector of abstract type, this is not possible.
OnlineStats.jl solves this by dragging the type information around. For example, the Series type, which corresponds to your Vector{ICompute}, has all the information in its type, so the appropriate methods can be selected at compile time.
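A sketch of the same idea without OnlineStats.jl (struct definitions simplified; `update_all!` and `aggregateStream!` are hypothetical names): store the computes in a Tuple rather than a Vector{ICompute}. A `Tuple{Count, Sum}` records every element's concrete type in the container's own type, so recursing over it lets the compiler resolve every `StreamingUpdate!` call statically.

```julia
abstract type ICompute end

mutable struct Count <: ICompute
    valueCounter::Int64
end

mutable struct Sum <: ICompute
    runningTotal::Float32
end

StreamingUpdate!(c::Count, ::Float32) = (c.valueCounter += 1)
StreamingUpdate!(c::Sum, val::Float32) = (c.runningTotal += val)

# Apply every compute to one value by recursing over the tuple; each
# recursion step sees a fully concrete tuple type, so every call is
# selected at compile time (no dynamic dispatch, no allocations).
update_all!(::Tuple{}, ::Float32) = nothing
function update_all!(computes::Tuple, val::Float32)
    StreamingUpdate!(first(computes), val)
    update_all!(Base.tail(computes), val)
end

function aggregateStream!(computes::Tuple, a::Vector{Float32})
    for val in a
        update_all!(computes, val)
    end
end

a = rand(Float32, 100_000)
c, s = Count(0), Sum(0.0f0)
aggregateStream!((c, s), a)
```

The trade-off is that the set of computes becomes part of the type, so each distinct combination triggers a fresh compilation, which is fine for a handful of fixed pipelines but not for combinations assembled arbitrarily at run time.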