Allocations and slowdown when the number of types involved increases (slower than C++ virtual methods)

*Happy to pay someone to spend time investigating this

The context is @peremato trying to present a Julia case for detector simulation used at e.g. CERN (the LHC) and other accelerator/medical research places: https://indico.cern.ch/event/1202075/contributions/5056320/attachments/2511955/4317921/Geom4hep-202206.pdf


As far as we can tell, the allocation and speed difference (an order of magnitude) comes purely from the fact that one file contains more geometry types at run time than the other.

but I will summarize the findings and steps to reproduce:

  1. git clone https://github.com/peremato/Geom4hep/
  2. cd Geom4hep/
  3. open examples/XRay.jl and comment out everything below line 54 (only keep the definitions)
  4. julia --project=.
  5. ] instantiate
  6. julia> using Revise, Geom4hep
  7. julia> includet("./examples/XRay.jl")
  8. compare these two:
const full = processGDML("examples/trackML.gdml", Float64)
const volume = full[2,1]
const world = getWorld(volume)
const nav = BVHNavigator(world)
@time generateXRay(nav, world, 1e4, 1);
@time generateXRay(nav, world, 1e4, 1);
  0.013423 seconds (9 allocations: 80.453 KiB)

const full2 = processGDML("examples/cms2018.gdml", Float64)
const volume2 = full2[1,7,1]
const world2 = getWorld(volume2)
const nav2 = BVHNavigator(world2)
@time generateXRay(nav2, world2, 1e4, 1);
@time generateXRay(nav2, world2, 1e4, 1);
  2.723019 seconds (29.79 M allocations: 757.602 MiB, 2.67% gc time)

Other findings and breadcrumbs can be found here: Hunting allocation · Issue #1 · peremato/Geom4hep · GitHub


We think the problem lies in the various distanceToOut and distanceToIn methods, which take different types and are often called with arguments of differing types (because geometries can nest within each other),

and this seems to be a fundamental limitation of Julia's type system at this point.


Dynamic dispatch is slow in Julia — slower than C++ virtual methods, because C++ method dispatch only depends on a single object type (this) and hence can use vtable lookup.

High-performance code in Julia always relies on devirtualizing critical code. This also means that you can’t easily write performant geometry code in Julia by having an array of geometric objects of different types and relying on dynamic dispatch to execute different methods (determined at runtime) for different objects — you need to implement a different dispatch strategy.

See also this discussion: Union splitting vs C++
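To make the cost concrete, here is a minimal sketch (toy `Shape` types I made up for illustration, nothing from Geom4hep): summing over a `Vector{Shape}` forces a dynamic dispatch per element, while the same loop over a vector whose element type is a small `Union` lets the compiler union-split the calls statically.

```julia
abstract type Shape end
struct Box <: Shape
    w::Float64
end
struct Sphere <: Shape
    r::Float64
end

area(s::Box) = s.w^2
area(s::Sphere) = pi * s.r^2

# identical loop; performance depends entirely on the element type
function total_area(shapes)
    acc = 0.0
    for s in shapes
        # dynamic dispatch for Vector{Shape};
        # union-split (static) for Vector{Union{Box,Sphere}}
        acc += area(s)
    end
    acc
end

dynamic = Shape[Box(1.0), Sphere(2.0)]              # abstract element type
split   = Union{Box,Sphere}[Box(1.0), Sphere(2.0)]  # small concrete Union
```

After warm-up, `@allocated total_area(dynamic)` vs `@allocated total_area(split)` should show the difference.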


Ah yes, and that C++ single dispatch is much easier to optimize also makes sense.

The question is: what are some known solutions that don't break code organization and extensibility? Unrolling is probably not an option here.


I think you basically need to implement your own vtable as discussed in the other thread.

I haven't tried it, but maybe this will help?
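A stdlib-only sketch of the manual-vtable idea (toy types and names, not Geom4hep's): tag each object and keep homogeneous per-type storage, so the tag branch plays the role of the vtable lookup and every actual geometry call is statically dispatched on a concrete type.

```julia
struct Box
    half::Float64
end
struct Tube
    r::Float64
end

# per-type homogeneous storage: each vector is concretely typed
const BOXES = Box[]
const TUBES = Tube[]

# a concrete handle playing the role of a vtable entry:
# a type tag plus an index into the matching storage vector
struct ShapeRef
    tag::Int8   # 1 = Box, 2 = Tube
    idx::Int
end

addshape!(s::Box)  = (push!(BOXES, s); ShapeRef(Int8(1), length(BOXES)))
addshape!(s::Tube) = (push!(TUBES, s); ShapeRef(Int8(2), length(TUBES)))

dist(s::Box)  = s.half   # toy stand-ins for distanceToIn etc.
dist(s::Tube) = s.r

# the tag branch replaces the dynamic method lookup; inside each branch
# the element type is concrete, so `dist` is statically dispatched
function dist(ref::ShapeRef)
    ref.tag == Int8(1) ? dist(BOXES[ref.idx]) :
    ref.tag == Int8(2) ? dist(TUBES[ref.idx]) :
    error("unknown tag $(ref.tag)")
end
```

A `Vector{ShapeRef}` is then a flat, concretely-typed array that can be iterated without per-element dispatch on the shapes themselves.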


It’s good to have this confirmed, but while the topic was “when # of types increases”, I’m curious: 1) is it for sure slower when there’s actually only one type involved? And 2) with each extra type beyond two, do you think it gets rapidly slower? [I’m not too concerned about the comparison with C++, since multiple dispatch is more powerful than what C++ offers, and I’m not sure C++ has anything to emulate it faster (or slower, or at all), though Stroustrup had a proposal, not yet implemented.]

I’m guessing Julia is optimized for the multiple-dispatch case, not the single one, so Julia doesn’t use a vtable even where it could? So what’s the alternative in the single-dispatch case? I think basically a big old (C-like) switch (and @goto, though only via a macro; maybe a vtable could be built?).

Except Julia doesn’t have switch (yet) either, but a big if-else chain should be as fast:

Julia doesn’t have pattern matching built in either (unlike Scala, and as recently added to Python and Java), but it’s available with e.g. this package:

Do you know if that is the main package, or at least whether it (or any of the many alternatives) is as fast as C’s switch?


A sequence of if-elses, or short-circuit evaluation in place of a switch (I like this a lot), will lower to efficient structures (jump tables etc.) if LLVM thinks it will be faster.

See Do C++ performance optimisations apply to Julia? - #30 by gbaraldi
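For example, a dense chain of equality tests on an integer tag is exactly the shape LLVM can lower to a jump table (hypothetical tags, for illustration only):

```julia
# a dense if/else over an integer tag; when LLVM judges it profitable,
# this lowers to a jump table rather than sequential compares
function shape_name(tag::Int)
    if tag == 1
        "box"
    elseif tag == 2
        "tube"
    elseif tag == 3
        "cone"
    elseif tag == 4
        "trap"
    else
        "unknown"
    end
end
```

Inspecting `@code_llvm shape_name(2)` shows which lowering LLVM picked.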


It gets rapidly slower when reaching ~5 types; see the C++ union-splitting thread linked above. Otherwise it is comparable in performance, because static dispatch has no overhead at run time.


I have hoped that I can get away with something like:

function distanceToIn(shape, point, dir)
    if shape isa Trap
        distanceToIn_trap(shape, point, dir)
    elseif shape isa Trd
        distanceToIn_trd(shape, point, dir)
    elseif shape isa Cone
        distanceToIn_cone(shape, point, dir)
    elseif shape isa Box
        distanceToIn_box(shape, point, dir)
    elseif shape isa Tube
        distanceToIn_tube(shape, point, dir)
    elseif shape isa Volume
        distanceToIn_volume(shape, point, dir)
    elseif shape isa Polycone
        distanceToIn_polycone(shape, point, dir)
    elseif shape isa CutTube
        distanceToIn_cuttube(shape, point, dir)
    elseif shape isa Boolean
        distanceToIn_boolean(shape, point, dir)
    else
        error("unhandled shape type: $(typeof(shape))")
    end
end

but it made no difference. The example you linked seems to suggest this has to go inside the loop directly?
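For what it’s worth, my reading of that example is that the branch has to sit inside the hot loop itself, so each call site sees a concrete type; a toy sketch (stand-in types and methods, not the real Geom4hep ones):

```julia
struct Box
    half::Float64
end
struct Tube
    r::Float64
end

# toy stand-ins for the real distanceToIn methods
distanceToIn(s::Box, point, dir)  = s.half
distanceToIn(s::Tube, point, dir) = s.r

# the isa-branch lives inside the loop, so within each branch `shape`
# has a concrete type and the distanceToIn call is devirtualized;
# routing through a separate steering function outside the loop does
# not give the compiler the same information at the call site
function total_distance(shapes, point, dir)
    acc = 0.0
    for shape in shapes
        if shape isa Box
            acc += distanceToIn(shape, point, dir)
        elseif shape isa Tube
            acc += distanceToIn(shape, point, dir)
        else
            error("unhandled shape $(typeof(shape))")
        end
    end
    acc
end
```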


Add @nospecialize?

Around this steering function? No help.


Manual splitting actually made performance much worse: Trying manual splitting (regression) by Moelf · Pull Request #2 · peremato/Geom4hep · GitHub

split into Distance.jl and Inside.jl

Does it make a difference if you do distanceToIn_trap(shape::Trap, point, dir) to make sure that the compiler is using the type information here?

No difference.

Maybe I’m not finding the right function that causes the issue… but again, the profiler is not helpful here, so it has been really hard to pin down the problem.

By annotating the individual functions with types, I’m able to recover the original performance (i.e., still much slower than C++), and I feel like I have essentially re-created what Julia does, just manually and with no performance improvement.

We might have a lead: this change of 4 lines (Trying manual splitting by Moelf · Pull Request #2 · peremato/Geom4hep · GitHub)

seems to make the time 2x worse and the allocations 10x larger; all I’m doing here is manual splitting like I did for the other shapes.

Have you tried out GitHub - YingboMa/Unityper.jl? It has some annoying ergonomics, like requiring a default keyword-argument field for every struct, but it is designed to solve exactly the performance problem you’re having.


The function we call inside the loop is recursive and branches into different types; this is different from most manual union-splitting examples (including YingboMa’s).