Allocation and slow down when # of types involved increase (slower than C++ virtual methods)

gbaraldi · September 26, 2022, 10:32pm

You actually found one,

ianfiske · September 28, 2022, 12:41pm

For an outsider reading this with curiosity about takeaways for similar problems, would you mind sharing what, if any, general solution strategy you found? I tried reading through the branch diffs in Hunting allocation · Issue #1 · peremato/Geom4hep · GitHub to follow, but I got a bit lost in the details of the application. I totally understand if you don’t have time for such a request.

j-fu · September 28, 2022, 1:46pm

For another take on this topic, see https://j-fu.github.io/marginalia/julia/unionize/

jling · September 28, 2022, 1:53pm

so here’s the problem: non of the examples mentioned above including yours (which looks very nice btw) have MUTUALLY recursiveness among types and the function f()

if our case were like this, we’d easily just write a sumup_f_manual and re-write the whole workload as an explicit loop

ndinsmore · September 28, 2022, 8:49pm

It seems to me that there are a few issues with your boolean type:

Nested Parametric Type Explosion: There are 8 primitive types and 11 parametric types if you include the 3 boolean types(Union, Subtraction, and Intersection). Looking at the CMS2018 file it looks like there is typically at least 2 layers of nested booleans. So it seems like very quickly you could get more than 512( which is 8^3) different types. You might consider making this a less parametric type This seems very close to the pitfalls mentioned in:
The-dangers-of-abusing-multiple-dispatch-(aka,-more-on-types-with-values-as-parameters)
Not using bounding boxes for the initial screening: It seems like if you calculated and stored a bounding box for both the target and tool(I think you call them Left & Right) in the boolean object at creation time, so that you don’t have to dispatch on the sub shapes to get the bounding boxes. You could have a course outer method that works only on the bounding boxes, would be the same all the booleans regardless of the tool and target type, and likely intercept/screen 90+% calls to this function. Then have a more precise fine version of the method that actually dispatches to the primitives’ method.
- I think that at the top level you are doing this with Bounded Volume Hierarchy but if not something similar should be done there.

jling · September 29, 2022, 2:36am

this is great I guess @peremato has seen the Tips page but decided to implement according to the C++ design as a first draft; but yeah, we should try to fix both points

peremato · November 3, 2022, 4:09pm

Sorry for the long break to comeback to this thread.

The memory allocation problem is solved by having the 3 BooleanShapes types not parametric with left and right shapes but using Union{BaseShape{T}, BooleanShape{T}} for the left and right shapes. This was figured out by @grasph. Thanks!

However the performance is yet recovered.

The problem with the performance comes from the fact that each method of BooleanShape has to deal with dynamic dispatch of the sub-shapes (left and right). The only way to implement these methods is by using the methods of the sub-types. For instance, to find in a point in inside a BooleanUnion you need to ask if the point is in the Left shape or in the Right shape. You cannot not implement this method in any other way.

I made a test of calling the inside method of a 4-nested boolean structure with the current implementation using as type for left and right Union{BaseShape{T}, BooleanShape{T}} compared with the original parametric type definition like this:

struct BooleanShape{T<:AbstractFloat, SL<:AbstractShape{T}, SR<:AbstractShape{T}, OP} <: AbstractShape{T}
    left::SL                                     # the mother (or left) volume A in unplaced form
    right::SR                                    # (or right) volume B in placed form, acting on A with a boolean operation
    transformation::Transformation3D{T}          # placement of "right" with respect of "left" 
end

and can see almost a factor 2 in performance. This means that the weight of dynamic dispatch is enormous.

@ndinsmore is right, at creation time I know exactly the types for the sub-shapes. So, if I could store this information I would not need to make use of dynamic dispatch at run-time (which is exactly the case when we use the types of the sub-shapes as parameters of the BooleanShape type). The problem is that I do to know how to do it. I could store the actual method that I would need to call at run-time for example using left_inside = which(inside,(typeof(left), Point3{T})), but it seems that in Julia I cannot call directly methods. You get MethodError: objects of type Method are not callable.

Somehow what I want to reproduce is to build the equivalent of a C++ virtual table for the left and right sub-shapes. Does anybody knows if this is possible?

jling · November 4, 2022, 1:33pm

so I ran into @sjkelly yesterday at Julia user meeting, and we talked about the performance of this thread.

I think we may be just barely abusing types in this case. The Performance Tips pages in the manual use “car brands <=> types” as an example of type abusing. Here, the # of geometries is not a problem itself, but because we (have to) parametrize over the child types for meta geometries, the # of types explodes.

We could opt for more manual methods, like using Enum instead of types for the meta geometries (Boolean), so instead of putting child type into the parameter, we just store a Enum, and do if-else when a Boolean shape is encountered.

peremato · November 4, 2022, 2:50pm

This is exactly what I try to do with the latest version. There are in total 10 types (7 basic and 3 booleans) no more. The BooleanShapes encapsulate the abstraction with the left and right types but themselves are not abstract. The problem is the poor performance when calling a function using one of the booleans because of dynamic dispatch, which is a pity since I know exactly the types at construction time. If I could call the correct method stored at construction time would be solution.

With manual splitting if shape isa Box; elseif shape isa Tube; ... you get only a marginal 5% improvement, while getting rid completely of the dynamic dispatch (tested using parametric sub-shape types, which I cannot use in the final project) in a complex recursive structure you get easily 50% improvement. So, this is what I am aiming to recover all the lost performance due to dynamic dispatch.

jling · November 4, 2022, 3:26pm

hmm yeah FunctionWrapper mimicking C++'s single dispatch is possible

rdeits · November 4, 2022, 3:49pm

In case it’s helpful, I put together a simple example of what is effectively a c++ style vtable using FunctionWrappers here: ConcreteInterfaces.jl/demo.ipynb at master · rdeits/ConcreteInterfaces.jl · GitHub

peremato · November 7, 2022, 10:58am

Thank-you. It looks very interesting. I’ll give it a try.

peremato · November 8, 2022, 11:32am

@rdeits unfortunately rdeits/ConcreteInterfaces.jl is not working with a recent version of Julia. In any case, I have made a test with FunctionWrapper by hand instead of using the @interface macro, which looks very handy.

I have defined a partial interface to Shapes with only two methods (inside and extent) to start with.

#---Define shape interfaces  (aka C++ virtual table)------------------------------------------------
struct ShapeInterface{T<:AbstractFloat} <: AbstractInterface
    self::AbstractShape{T}
    inside::FunctionWrapper{Int64, Tuple{Point3{T}}}
    extent::FunctionWrapper{Tuple{Point3{T},Point3{T}},Tuple{}}
    ShapeInterface{T}(self::AbstractShape{T}) where T = new(self, (p::Point3{T} -> inside(self, p)), (()->extent(self)))
end
import Geom4hep: inside, extent
inside(i::ShapeInterface{T}, p::Point3{T}) = i.inside(p)
extent(i::ShapeInterface{T}) = i.extent()

then I define the new BooleanShape like this:

struct NBooleanShape{T<:AbstractFloat, OP} <: AbstractShape{T}
    left::ShapeInterface{T}   # the mother (or left) volume A in unplaced form
    right::ShapeInterface{T}  # (or right) volume B in placed form, acting on A with a boolean operation
    transformation::Transformation3D{T}          # placement of "right" with respect of "left" 
end

The NBooleanShape has good properties (type stability, no allocations at runtime, etc.) but the performance is not better. With a nested boolean shape you get a factor 2 worst performance.

with the original code (as in the head of repository)
  62.709 ns (0 allocations: 0 bytes)
with the new BooleanShape
  110.988 ns (0 allocations: 0 bytes)

full example

using Revise
using Geom4hep

using FunctionWrappers
import FunctionWrappers: FunctionWrapper
abstract type AbstractInterface end

#---Define shape interfaces  (aka C++ virtual table)------------------------------------------------
struct ShapeInterface{T<:AbstractFloat} <: AbstractInterface
    self::AbstractShape{T}
    inside::FunctionWrapper{Int64, Tuple{Point3{T}}}
    extent::FunctionWrapper{Tuple{Point3{T},Point3{T}},Tuple{}}
    ShapeInterface{T}(self::AbstractShape{T}) where T = new(self, (p::Point3{T} -> inside(self, p)), (()->extent(self)))
end

import Geom4hep: inside, extent
inside(i::ShapeInterface{T}, p::Point3{T}) = i.inside(p)
extent(i::ShapeInterface{T}) = i.extent()


#---New definition of Boolean Shapes--------------------------------------------------------------------
struct NBooleanShape{T<:AbstractFloat, OP} <: AbstractShape{T}
    left::ShapeInterface{T}   # the mother (or left) volume A in unplaced form
    right::ShapeInterface{T}  # (or right) volume B in placed form, acting on A with a boolean operation
    transformation::Transformation3D{T}          # placement of "right" with respect of "left" 
end
const NBooleanUnion{T} = NBooleanShape{T, :union}
const NBooleanSubtraction{T} =  NBooleanShape{T, :subtraction}
const NBooleanIntersection{T} = NBooleanShape{T, :intersection}

#---Constructor--------------------------------------------------------------------------------------
function NBooleanSubtraction(left::AbstractShape{T}, right::AbstractShape{T}, place::Transformation3D{T}=one(Transformation3D{T})) where T<:AbstractFloat
    NBooleanSubtraction{T}(ShapeInterface{T}(left), ShapeInterface{T}(right), place)
end

function inside(shape::NBooleanSubtraction{T}, point::Point3{T})::Int64
    (; left, right, transformation) = shape
    lpoint = transformation * point
    positionA = inside(left, point)
    positionA == kOutside && return kOutside
    positionB = inside(right, lpoint)
    if positionA == kInside && positionB == kOutside
        return kInside;
    elseif positionA == kInside && positionB == kSurface ||
           positionB == kOutside && positionA == kSurface
        return kSurface
    else
        return kOutside
    end
end

function extent(shape::NBooleanSubtraction{T})::Tuple{Point3{T},Point3{T}} where {T}
    extent(shape.left)
end


const T = Float64
#---New boolean instance
nboolean = NBooleanSubtraction(NBooleanSubtraction(NBooleanSubtraction(NBooleanSubtraction(Box{T}(2.035, 100.35000000000001, 0.5750000000000001),
                                                                                      Box{T}(0.3175, 100.35000000000001, 0.065),
                                                                                      Transformation3D{T}(1.7175000000000002, 0.0, 0.51)),
                                                                   Box{T}(0.3175, 100.35000000000001, 0.065),
                                                                   Transformation3D{T}(1.7175000000000002, 0.0, -0.51)),
                                                Box{Float64}(0.3175, 100.35000000000001, 0.065),
                                                Transformation3D{T}(-1.7175000000000002, 0.0, 0.51)),
                            Box{Float64}(0.3175, 100.35000000000001, 0.065),
                            Transformation3D{T}(-1.7175000000000002, 0.0, -0.51))
#---Old boolean instance
boolean = BooleanSubtraction(BooleanSubtraction(BooleanSubtraction(BooleanSubtraction(Box{T}(2.035, 100.35000000000001, 0.5750000000000001),
                                                                                      Box{T}(0.3175, 100.35000000000001, 0.065),
                                                                                      Transformation3D{T}(1.7175000000000002, 0.0, 0.51)),
                                                                   Box{T}(0.3175, 100.35000000000001, 0.065),
                                                                   Transformation3D{T}(1.7175000000000002, 0.0, -0.51)),
                                                Box{Float64}(0.3175, 100.35000000000001, 0.065),
                                                Transformation3D{T}(-1.7175000000000002, 0.0, 0.51)),
                             Box{Float64}(0.3175, 100.35000000000001, 0.065),
                             Transformation3D{T}(-1.7175000000000002, 0.0, -0.51))

point = Point3{T}(0,0,0)

# benchmaking                             
using BenchmarkTools
@btime inside(boolean, point)
@btime inside(nboolean, point)

ndinsmore · November 8, 2022, 6:02pm

@peremato what I was saying before was to store the actual bounding box of the left and right shape so that you could filter out maybe a high percentage of the multiple dispatch calls by boolean. So you would check if the ray or point was in the BBox before ever dispatching, but again the key to this is to store BBox before hand.

ndinsmore · November 8, 2022, 8:41pm

@peremato & @jling

I got a chance to actually try the code out and I don’t think that your problem has anything to do with dispatch. Apparently when you run @profview anything that shows up in red is a dispatch problem.

At least from a complexity point of view, the time difference between your two different examples is 100% related to how many times the method distanceToIn(...) gets called because of the way the geometry is built up.

With some poorly written instrumentation code, I counted the call to ``distanceToIn(…)``` for all the different types:
For “examples/trackML.gdml”

julia> Geom4hep.call_count
Dict{Any, Int64} with 2 entries:
  PlacedVolume => 123264
  Trd          => 23104

For “examples/cms2018.gdml”

julia> Geom4hep.call_count
Dict{Any, Int64} with 6 entries:
  Tube               => 108529
  PlacedVolume       => 44120474
  Box                => 47601314
  BooleanUnion       => 6553615
  BooleanSubtraction => 10157038
  Trap               => 4397782

Which means in the benchmarking just the PlaceVolume version of that method is getting called 357x more times and the ratios in your benchmarked times above are only 209x. You will know from benchmarking that the distanceToIn(...) method is the majority of the problem.

I hate to say it but I think that you have been chasing the wrong gremlin.

I can send you the instrumentation code if you like.

jling · November 8, 2022, 8:50pm

I’m not sure what you mean here. Yes we’re calling distanceToIn() many more times, but we know from a C++ counterpart that the cms2018 geometry shouldn’t be this slow. And it seems that as long as don’t have Boolean() shapes it’s not a problem – you can test this out by skip parsing Boolean geometries

peremato · November 9, 2022, 8:02am

Yes, you are right and we are already doing it. We have an implementation of BVH accelerating structure to avoid large searches, which speedup many factors (see BVH.jl). But at the end of the day, to make a precise determination of the distance to a surface you need to have the real shape and not an approximative bounding box.

ndinsmore · November 9, 2022, 7:38pm

First can we agree to a few things:

The majority of the computation time is spent in distanceToIn() for both models.
Your legacy C++ code runs about as fast as the julia version on the TrackML example
When you say that the Julia version is slow for the cms2018 example you mean 2x - 5x the time of C++ code not 200x the time.
Calling the the function that is most of the computing time 350x more times and including boolean types(ie multiple dispatch) but only increasing the run time 200x is a reasonable outcome.

Also when you say that not parsing the booleans does that mean the code runs 30-40% faster (which is about what I get on my machine 2.8s → 1.8s)? Because from the counting above if remove 16M boolean calls of distanceToIn() each of which takes with it a call of distanceToIn() to as primative type that accounts for the decrease in time alone.

If you run my instrumentation code on the cms2018 example with the boolean ignored you get:

Dict{Any, Int64} with 3 entries:
  PlacedVolume => 45856163
  Box          => 38865821
  Trap         => 6881813

I think this is an algorithm issue not a dispatch issue.

Topic		Replies	Views
Help with 0.5 slowdown General Usage question	25	2234	December 7, 2016
Understanding allocations in GeometryBasics.jl Performance	6	441	November 10, 2021
Raytracing weekend: debugging unexpected allocations New to Julia performance	3	453	June 22, 2022
Reducing allocations to optimize (to match Java speed) Performance	22	1330	December 10, 2018
Julia slowdown on long running programs with many allocations Performance memory-allocation , flux	6	1524	July 6, 2021

Allocation and slow down when # of types involved increase (slower than C++ virtual methods)

Related topics