Disabling allocations

Just my two cents. In some situations it can be beneficial to split different homogeneous containers for the different types and iterates over them separately. This way dynamic dispatch is avoided, while not having to do any cumbersome gymnastics to define “manual tagged union” types because of the lack of sum types in Julia.

Example code with 3 types of lights
const Vec{N,T} = SArray{Tuple{N},T,1,N}

abstract type AbstractLight end

struct PointLight{T,P} <: AbstractLight
    position::Vec{3,T}
    power::P
end

struct AreaLight{T,P,S} <: AbstractLight
    position::Vec{3,T}
    direction::Vec{3,T}
    power::P
    size::S
end

struct SpotLight{T,P,A} <: AbstractLight
    position::Vec{3,T}
    direction::Vec{3,T}
    power::P
    angle::A
end

point_lights::Vector{PointLight{Float64,Float64}} = ...
area_lights::Vector{AreaLight{Float64,Float64,Float64}} = ...
spot_lights::Vector{SpotLight{Float64,Float64}} = ...

irradiance(position, normal, light::PointLight) = ...
irradiance(position, normal, light::AreaLight) = ...
irradiance(position, normal, light::SpotLightLight) = ...

function irradiance(position, normal, point_lights, area_lights, spot_lights)
    irrad = 0.
    for light in point_lights
        irrad += irradiance(position, normal, light)
    end
    for light in area_lights
        irrad += irradiance(position, normal, light)
    end
    for light in spot_lights
        irrad += irradiance(position, normal, light)
    end
    irrad
end
2 Likes

But what kind of structure in a language that does not have dynamical dispatch allows this to be done in any way that cannot be done with Julia, resulting in more performant code? In other words, how the possibility of doing this using dynamical dispatch is detrimental?

(in Fortran I would just have different function names for every type of light and certainly I would iterate over each separately or add conditionals inside the loop but the fallback to that option in Julia seems a completely viable option. Are there other code structures that would perform better than that?).

I think the issue here is dynamic dispatch at run-time, not multiple dispatch at compile-time.

4 Likes

Is something like this what we are talking about?

using BenchmarkTools

function mysum(x)
  s = 0.
  for i in 1:length(x)
    s += x[i]
  end
  s
end

n = 1000

x = [ isodd(i) ? rand(1:10) : rand() for i in 1:n ]
@btime mysum(x)

x = [ rand(1:10) for i in 1:n ]
@btime mysum(x)

x = [ rand() for i in 1:n ]
@btime mysum(x)

That sum of course is much faster if performed on arrays of numbers of the same type. If this is an example of the sort of problem that may arise in a more complex code, other languages have smarter ways to deal with that?

I’m very interested in what problems you have encountered, and in what solutions you have found.

For dynamic dispatch over collections, I looked into things like FunctionWrapper to build dispatch tables, but finally settled on just storing items in a tuple (which is a heterogenous collection after all) and finally unrolling the loop with a macro. This makes the compiler generate code directly specialized to the contents of the scene. It works great for toy examples, although the type signatures are monstrous. I have yet to implement a BVH or anything, which will surely make things more complex.

I’m not sure I quite follow the stack allocation example. Can’t stack allocations be enforced by making the type immutable? Is the problem because your type requires mutability, or because of something else I don’t understand?

1 Like

If you try to implement a BVH for mixed types of geometries and primitives(for example, sphere, box, plane…), you will quickly encounter serious problems. They need to be stored in a heterogenous collection, so dynamic dispatch is needed. Union splitting is also possible, but it’s not extensible (though it works fine in the toy project).

That’s the problem. It’s possible to get around, but it’s getting harder when you add more features.

Immutable types which contain references to mutable types can cause allocations, though in newer version of Julia this is improved. Also, modifying and partially initializing a immutable type is hard. Another problem is that, a light/material type in raytracer may contain a lot of field, so copy of the struct is inefficient(which may happen when the type is immutable) , though I have no firm evidence for that. Using immutable type is totally possible, but it might be slower than the mutable type.
The true problem I encountered when I rewrite tinsel is that, I have to use abstract types somewhere(for example, in a bvh or light list or a hitrecord). Consider this code:

struct HitRecord
    ray::Ray
    hitpoint::Point3
    hitmaterial::AbstractMaterial
end
function raytracing()
    #...
    hitrec =  hitScene()
    #use the material to generate new rays 
    reflectRay(hitrec.ray,hitrec.material)
    #...
end

If I use imutable types for all materials, then a copy will be performed when you construct HitRecord, since you can’t just a store a pointer to the material (Since immutable type has no stable address.). To avoid copy, I can use mutable types or wrap the types in a Base.Ref, which then lead to allocations because it might escape…
A possible choice is to unsafely take the pointer of the struct and use GC.@preserve to use the pointer. After that the lifetime of the value is ended, so the memory is released.
In some C++ raytracer I have read about(Ray Tracing in One Weekend, pbrt,Mitsuba2), they just use single dispatch(virtual function) to solve the problems. So they store a shared_pointer(or even raw pointer) to the parent class(that is AbstractMaterial).

1 Like

It looks like in your example real problem is not stack allocation, but type instability of the hitScene. Is there any way to provide type stable version of hitScene?
Alternatively you can try to use union splitting

struct HitRecord{T <: AbstractMaterial}
    ray::Ray
    hitpoint::Point3
    hitmaterial::T
end
function raytracing()
    #...
    hitrec =  hitScene() # Here type of HitRecord is unknown
    #use the material to generate new rays 
    if hitrec isa HitRecord{Material1}
       reflectRay(hitrec.ray,hitrec.material)
    elseif hitrec isa HitRecord{Material2}
       reflectRay(hitrec.ray,hitrec.material)
    end
    #...
end
1 Like

This doesn’t work…Now the type of hitrec is HitRecord{T} where T<:AbstractMaterial, an abstract type, can you ensure that this dynamic type doesn’t cause any heap allocation?
Another problem is that, even you use a if-elseif-else branch to enforce type stability, is it faster or slower than the single dispatch approach in C++? Rendering a 500x500 picture with 64 samples for each pixel needs to invoke the raytracing function at least for 16_000_000 times. Not only materials, but also lights, textures and shapes can have differen types. Performance of dispatch at runtime(union splitting/dynamic dispatch/single dispatch) will affect the performance of the whole raytracer.

I do not quite understand. HitRecord{T} is a concrete type if T is a concrete type.

abstract type Foo end

struct Bar1 <: Foo
    x::Int
end

struct Bar2 <: Foo
    x::Float64
end

struct Baz{T <: Foo}
    y::T
end

julia> isconcretetype(Foo)
false

julia> isconcretetype(Bar1)
true

julia> isconcretetype(Bar2)
true

julia> isconcretetype(Baz)
false

julia> isconcretetype(Baz{Bar1})
true

Union splitting is rather fast as far as I know, but of course you should experiment with your real life structures to understand it’s influence on performance

function f(v)
    res = 0.0
    for z in v
        if z isa Baz{Bar1}
            res += z.y.x
        elseif z isa Baz{Bar2}
            res += z.y.x
        end
    end

    return res
end

v = Baz.(map(x -> x[1](x[2]), zip(rand([Bar1, Bar2], 100), rand(1:1000, 100))))

julia> @btime f($v)
  97.095 ns (0 allocations: 0 bytes)

At least it is non allocating.

But if you don’t know what the type of T is exactly, then it must be HitRecord{T} where T. Demonstration of what I said in your example:

julia> @code_warntype f(v)
Variables
  #self#::Core.Compiler.Const(f, false)
  v::Array{Baz,1}
  res::Float64
  @_4::Union{Nothing, Tuple{Baz,Int64}}
  z::Baz

Here type of z is Baz, and isconcretetype(Baz)==false

That’s the whole problem. There is little discussion about how to handle dynamic dispatch if dynamic dispatch is unavoidable. The common suggestions are union splitting and if-elseif-else to ensure type stability. However, when you write a (non-toy project) raytracer, dispatch at runtime is unavoidable. What should one do to avoid bad performance is unclear (for a normal user). Since there is no clear guideline and I don’t know how to solve these problems, I have to give up.

But how another language deals with that kind of data structure in a more performant manner? (a link for me to follow is appreciated).

Because I come for Fortran, and in Fortran certainly that kind of problem is rare. But not because Fortran allows something that Julia does not, but because writing a vector of mixed types in Fortran is so unnatural that one would rethink the data structure from start to not do that. And that, of course, can be done in Julia in the exact same way.

There is a package GitHub - SciML/RecursiveArrayTools.jl: Tools for easily handling objects like arrays of arrays and deeper nestings in scientific machine learning (SciML) and other applications that has something like a fast heterogenous array:

ArrayPartition(x::AbstractArray…)
An ArrayPartition A is an array, which is made up of different arrays A.x. These index like a single array, but each subarray may have a different type. However, broadcast is overloaded to loop in an efficient manner, meaning that A .+= 2.+B is type-stable in its computations, even if A.x[i] and A.x[j] do not match types. A full array interface is included for completeness, which allows this array type to be used in place of a standard array where such a type stable broadcast may be needed. One example is in heterogeneous differential equations for DifferentialEquations.jl.

An ArrayPartition acts like a single array. A[i] indexes through the first array, then the second, etc., all linearly. But A.x is where the arrays are stored. Thus, for:

using RecursiveArrayTools
A = ArrayPartition(y,z)
we would have A.x[1]==y and A.x[2]==z. Broadcasting like f.(A) is efficient.

I planned to use it myself at some point, but turned out I don’t need this. The package may fit your needs as well.

If you read @code_warntype further you will see that despite z is of the type Baz all calculations are type stable and it is supported by the fact that there are no allocations in @btime benchmark. That is the whole idea of union splitting, you take some non concrete type and split into concrete branches. When compiler knows that it is in this or that branch it know exactly which method to apply, so no dynamic dispatch is happening and as a result there are no allocations.

However, when you write a (non-toy project) raytracer, dispatch at runtime is unavoidable. What should one do to avoid bad performance is unclear (for a normal user).

Well, as I have said, you should try union splitting. It’s rather straightforward and removes the problem of dynamical dispatch and it helps you to avoid bad performance.

On the other hand, a proper data structure or type stable code can improve performance much better than union splitting or other fancy techniques. For example, if you can store all of your materials in a structure like [Material1[], Material2[], ...] i.e. as a heterogeneous vector of homogeneous arrays, then code performance will be improved and you avoid dynamic dispatch problem completely.

2 Likes

quote answer from Loki Asteri:

A virtual method call has a slightly higher cost as you need to look up the method to call before you call it (but this is a simple table look up not a search).

This is different for dynamic dispatch. Dynamic dispatch needs to search the methods table.

Yea and no. Even you can do something to get around this problem, it doesn’t scale well… The true problem occurs when I try to extend my render to support more materials, more textures and more lights.Another problem is that, currently we have no function types in Julia. In C language, we can use function pointer and lookup table to mimic C++'s single dispatch, which is still impossible in Julia(unless you use the FunctionWrapper).

1 Like

Thank you for your reply! I never tried to implement a ray tracer. But if I was to start that in Fortran probably I would start building independent lists of materials, rays, textures, and work with linked lists to deal of the combinations over which the calculations have to be performed. That can be done in Julia as well, but I can imagine how complex that can become and that there might more elegant solutions.

This is what I have done in the toy project. I currently have only one light type, one material type and three shape types. I do union splitting on shape types. But I want to get rid of this approach since it’s not extensible.

The problem here is not “avoid bad performance”. I already know all the tricks here to ensure type stability. The problem is to get a equal performance of C++/C. So if you think union splitting can solve all this proplems, then it should easy to answer these questions:
1.If I have a lot of material types, then is union splitting faster or slower than single dispatch in C++? If it’s slower, then how would it impact the performance of the whole raytracer?(significant or little impact?)
2.If I use a isa to check whether a value belongs to a specific type, what’s the complexity and cost of it? Is isa a trivial function that has almost no cost?How is the cost of it, compared to a linear lookup table and switch clause in C?
3.If I want to ensure that all the abstract types are splitted into concrete types before the computation, what should I do if I have multiple functions depending on a dynamic type which are invoked during different stages of rendering? For example:

function first_stage()
    #use material to calculate reflect direction
end
function second_stage()
   #use material to calculate transmission direction
end
function third_stage()
   #use materials and textures to calculate colors of light
end
function raytracers()
    first_stage()
    second_stage()
    third_stage()
end

So to use union splitting, I have to couple all the functions together, or do union splitting at every position where it’s used?

2 Likes
  1. I do not know, one should do benchmark to compare it with C/C++. And it doesn’t make sense to compare with single dispatch, you should compare performance of all code, since Julia compiler may apply additional tricks which can mitigate the effect of if branching. Of course, general rule that Julia code is 1.2-1.3 slower than it’s C/C++ counterpart, so you should keep that in mind, while benchmarking.

  2. AFAIK, it’s very fast for concrete types, just the time of pointer comparison. But you should look at the Julia internals to get additional information.

  3. In your example there are no types at all, so it’s hard to say. If it can be rewritten as

function raytracers()
   material = getmaterialfromscene()
   first_stage(material)
   second_stage(material)
   third_stage(material)
end

then you can do something like this

function stages(material)
   first_stage(material)
   second_stage(material)
   third_stage(material)
end

function raytracers()
   material = getmaterialfromscene()
   if material isa Material1
     stages(material)
   # continue with union splitting
   # ...
end

If one has hundreds of materials, and I suppose many functions which will have to be split like this, I guess having a list of the materials and writing macros that generate the codes for these functions would be one practical way to deal with this, isn’t it?

1 Like

Yes, macro is a reasonable way to generalize this approach. You can materialize conditions for all subtypes of the given abstract type, so in the end one can reach rather compact form.

Another thing occurred to me. If some types require function, for example, the material may define diffusion function, then there is a problem, that Julia does not dispatch on functions. But you can avoid it by using functors. I think it’s not a big problem to write a macro, that can transform something like

@material begin
a::Int 
b::Float32
abc_diffusion(x) = # some function of parameters a, b and value of x
end

to

struct Material_abc <: AbstractMaterial
   a::Int
   b::Float32
end
function (m::Material_abc)(x)
  # some function of m.a, m.b and x
end

Now you can dispatch on Material_abc and use it freely in scene rendering as usual function. With proper union splitting macro there is even no need to keep track of all types that were added.

1 Like

Very interesting. Using a simple eval did the trick:

abstract type Material end

struct Material1 <: Material
  m :: Float64
end

struct Material2 <: Material
  m :: Float64
end

struct HitPoint{T <: Material}
  p :: Float64
  r :: Float64
  m :: T
end

# the hit function
hit(p <: HitPoint) = p.r*p.p*p.m.m

# generate functors for each type
for T in subtypes(Material)
  eval(:((p::HitPoint{$T})() = p.r*p.p*p.m.m))
end

# naive sum
function hits(hitpoints)
  s = 0.
  for p in hitpoints 
    s += hit(p)
  end
  s
end

# with union spliting
function hits2(hitpoints)
  s = 0.
  for p in hitpoints
    if p isa HitPoint{Material1}
      s += hit(p)
    elseif p isa HitPoint{Material2}
      s += hit(p)
    end
  end
  s
end

# with functors
function hits3(hitpoints)
  s = 0.
  for p in hitpoints 
    s += p()
  end
  s
end

Benchmark:

julia> n = 1000
julia> hitpoints = [ isodd(i) ? HitPoint(rand(),rand(),Material1(rand())) : 
              HitPoint(rand(),rand(),Material2(rand())) for i in 1:n ]

julia> @btime hits($hitpoints)
  25.215 μs (2000 allocations: 31.25 KiB)
126.63951422494503

julia> @btime hits2($hitpoints)
  1.137 μs (0 allocations: 0 bytes)
126.63951422494503

julia> @btime hits3($hitpoints)
  1.254 μs (0 allocations: 0 bytes)
126.63951422494503

If one creates a vector of the elements of the same type, one can more or less estimate the overhead of these approaches:

julia> n = 1000;

julia> hitpoints_single = HitPoint{Material1}[ HitPoint(rand(),rand(),Material1(rand())) for i in 1:n ];

julia> @btime hits($hitpoints_single)
  991.833 ns (0 allocations: 0 bytes)
125.72367862942023

julia> @btime hits3($hitpoints_single)
  992.250 ns (0 allocations: 0 bytes)
125.72367862942023

Dealing with the mixed-type array takes, in this example, 15% more time with the union spliting or functor approaches, and all approaches are the same in this case.

Thus the functor approach behaves as nicely as the union splitting.

I apologize for intervening in the discussion without too much to contribute. I think I learnt a lot, and I am grateful to you all for the patience.

6 Likes