Disabling allocations

I know that might be hard due to code size and complexity, but if possible could point an example?

It is possible in Julia:

using BenchmarkTools, LoopVectorization, PaddedMatrices#v0.2.1

@noinline function dostuff(A, B)
    C = A*B
    s = zero(eltype(C))
    @avx for i in eachindex(C)
        s += C[i]
    end
    s
end
function main_v1()
    A = @StrideArray rand(8,8);
    B = @StrideArray rand(8,8);
    dostuff(A, B)
end
function main_v2()
    A = @StrideArray rand(8,8);
    B = @StrideArray rand(8,8);
    @gc_preserve dostuff(A, B)
end
@benchmark main_v1()
@benchmark main_v2()

Because a StrideArray is your typical mutable array:

julia> A = @StrideArray rand(2,5)
2×5 StrideMatrix{Tuple{StaticInt{2}, StaticInt{5}}, (true, true), Float64, 1, 0, (1, 2), Tuple{StaticInt{8}, StaticInt{16}}, Tuple{StaticInt{1}, StaticInt{1}}, PaddedMatrices.MemoryBuffer{10, Float64}} with indices StaticInt{1}():StaticInt{1}():StaticInt{2}()×StaticInt{1}():StaticInt{1}():StaticInt{5}():
 0.429701  0.318488  0.842704  0.0217103  0.212563
 0.82351   0.245693  0.890502  0.941539   0.626707

julia> A[1,3] = 8;

julia> A
2×5 StrideMatrix{Tuple{StaticInt{2}, StaticInt{5}}, (true, true), Float64, 1, 0, (1, 2), Tuple{StaticInt{8}, StaticInt{16}}, Tuple{StaticInt{1}, StaticInt{1}}, PaddedMatrices.MemoryBuffer{10, Float64}} with indices StaticInt{1}():StaticInt{1}():StaticInt{2}()×StaticInt{1}():StaticInt{1}():StaticInt{5}():
 0.429701  0.318488  8.0       0.0217103  0.212563
 0.82351   0.245693  0.890502  0.941539   0.626707

main_v1 of course causes allocations:

julia> @benchmark main_v1()
BenchmarkTools.Trial:
  memory estimate:  1.06 KiB
  allocs estimate:  2
  --------------
  minimum time:     71.279 ns (0.00% GC)
  median time:      91.891 ns (0.00% GC)
  mean time:        101.286 ns (10.74% GC)
  maximum time:     924.190 ns (77.74% GC)
  --------------
  samples:          10000
  evals/sample:     974

But main_v2 stack allocates them:

julia> @benchmark main_v2()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.155 ns (0.00% GC)
  median time:      40.206 ns (0.00% GC)
  mean time:        40.346 ns (0.00% GC)
  maximum time:     57.046 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

Also, for good measure, using immutable structs will normally be stack allocated:

using StaticArrays
function main_v3()
    A = @SMatrix rand(8,8);
    B = @SMatrix rand(8,8);
    dostuff(A, B)
end
@benchmark main_v3()

yielding

julia> @benchmark main_v3()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     186.780 ns (0.00% GC)
  median time:      187.906 ns (0.00% GC)
  mean time:        188.042 ns (0.00% GC)
  maximum time:     203.646 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     891

The macro PaddedMatrices.@gc_preserve works by

  1. Using GC.@preserve on all arguments to the call
  2. Tries to replace AbstractArrays with a PtrArray that holds a pointer to the original array, along with holding its size, strides, and offsets.

GC.@preserve protects the memory from getting collected, but it is still stack allocated as it cannot escape; the array itself is replaced with the PtrArray. Of course, you as the user have to guarantee that the PtrArray doesn’t escape, as it’s only valid for as long as GC.@preserve protects the data.

You could take a similar approach with whatever data structures you need.

Also, you should be able to avoid all dynamic dispatches. Try creating branches at the point a type could take one more than one value, and immediately call into those functions. That is, instead of

function foo(args...)
    # do stuff
    if hit_metalic_object
        thing_its_reflecting_off_of = MetalType()
    elseif #...
        thing_its_reflecting_off_of = #...
    else #...
        # ...
    end
     # computations continue
end

do something like

Now, I don’t know what your code looks like

function foo(args...)
    # do stuff
    if hit_metalic_object
        foo_continued(MetalType(), args...)
    elseif #...
        # ...
    else #...
        # ...
    end
 end
function foo_continued(thing_its_reflecting_off_of, args...)
    # computations continue
end

I don’t know what your code looks like, but you should be able to restructure/organize it in a way to avoid dynamic dispatches.

3 Likes

If I am not mistaken, I can first take a pointer of a mutable struct by wrapping the struct in Ref and RefValue and unsafely converting it to Ptr, then pass the pointer into a function and use GC.@preserve to prevent the pointer getting collected.

The problem here is that immediately call into those functions. is impossible. For example, a scene has a collection of lights with different light types. Dynamic dispatch is needed when iterating this list.

Do they have to be different types?

This is just a toy project. So it’s fine to build only one light type which contains all the fields of all the light types. Currently tinsel only supports one material type, which fortunately eliminates a lot of dynamic dispatch. But this solution doesn’t scale well when you have many different types of light ,shapes and materials. That’s one reason why I stop add more features to tinsel(It will become much slower and allocate a lot of memory).

Your claims seem to imply that you found a very deep and important problem in Julia, which impairs you from writing a performant code. I think we do not expect these problems to exist, such that your case might be interesting for some reasons:

  1. We can learn how to write the code in Julia avoiding those pitfalls, and be better Julia programmers.
  2. We find that doing (1) is indeed cumbersome, and that suggests improvements in the Julia language or package ecosystem.
  3. There is indeed an unsolvable problem (unlikely).

I do not consider myself a very experienced Julia programmer and I use these threads to learn a lot. I would be very grateful if examples of those issues are provided so we can learn from them how to do things better. I understand that providing such examples may be more work that you are willing to dedicate to this, but in that case it is a little bit frustrating that you claim a major problem in the language without actually providing the examples that show them.

3 Likes

I think I am not the only one who proposes the “single dispatch” problem. Of course, one can use NameTuple and FunctionWrapper to mimic a method table just like those in C++ (actually I have tried a prototype of it, but I finally gave up for complexity).
The problem here is that we try to eliminate dynamic dispatch, since dynamic dispatch is inefficient and allocates memory, whereas single dispatch in C++ and other languages also happens at runtime but the cost of single dispatch is much smaller.

It’s solvable, but the solution is frustrating. For example, Elrod suggests using only one light type. It’s doable in this toy projects and this is what I currently done in this project. But it won’t work well when you have a lots of light types. He also suggests that ensuring the production of different types followed immediately by the consumption of the types by using a if-elseif-else branching. However, for a raytracer, this is impossible, since when you create a list of lights for the scene, different type of lights are already mixed together. It’s also possible to use a large union of different types and let the compilers do the union splitting. But I don’t think this approach scales well…You have to modify the code of library to add more supports of different type.

That sounds strange to me, dynamic dispatch to a sum, for example, with 184 methods, does not introduce any problem. But I am not qualified to argue here, I would just want to understand if there is an issue that we must know.

edit: I think I will remain lost on this one :frowning: . I do not see how having the possibility of multiple dispatch can make things worse relative to not having it. Seems that macros could be used to generate the combinations of types and functions required for one single problem, which would be probably more efficient than other alternatives. But certainly I am not able to understand the issue here… If someone has the time to provide a mwe of the kind of issue that is being pointed here, it will be greatly appreciated.

FWIW, the one-type approach is what’s used in this simple path tracer for material types.

2 Likes

What’s being described is most certainly an issue for some people, which comes about due to the tradeoff that Julia makes as a language: writing code that has some dynamic parts is easy, however that means it’s easy to accidentally write dynamic code when you weren’t intending to. The same goes for allocations; we make allocating easy and don’t require explicit frees, but that also means it’s easy to make accidental allocations and also GC activity. Julia is not perfect, and cannot become so, because the free-lunch theorem says that every decision you make that improves some aspect, also harms some other aspect. I think it’s worth remembering that before telling someone that their issue isn’t actually an issue.

I agree with @Elrod that if/else followed by a function barrier is probably the way to handle this, since then union splitting can work decently well, but FunctionWrapper is also a good possibility.

I think it would be reasonable to have a utility that detects allocations in a piece of code and throws an error (with a stacktrace) when they try to happen. Cassette can work for this, but as @Mason points out, all sorts of things can allocate. Additionally, Cassette can be hard on the compiler and often produces horrible stacktraces. If it were me implementing this, I would probably feed the raytracer inner loop into GPUCompiler, and add an LLVM.jl pass that looks for all uses of gc_pool_alloc (which is an intrinsic that Julia inserts that indicates that a value must be heap allocated). That will also ensure that your code has no dynamic dispatch, since GPUCompiler errors when it encounters any such dynamic calls.

3 Likes

Oh, no, that was not what I meant. I think I was pretty clear to say that I do not understand how this issue arises. And I do not mean understand implying that the author might not know how to do things, I mean really, I do not understand how a issue like the one mentioned arises in a code. And if a small example of such problem could be written, perhaps that would help me use the language better.

I had already my own share of trying to find allocations, which I solved by properly defining types in my structures, and composed structures thereon (some that were not easy to find). But the problem here mentioned seems to be in a different level, which is probably beyond what I can grasp at the moment.

3 Likes

Just my two cents. In some situations it can be beneficial to split different homogeneous containers for the different types and iterates over them separately. This way dynamic dispatch is avoided, while not having to do any cumbersome gymnastics to define “manual tagged union” types because of the lack of sum types in Julia.

Example code with 3 types of lights
const Vec{N,T} = SArray{Tuple{N},T,1,N}

abstract type AbstractLight end

struct PointLight{T,P} <: AbstractLight
    position::Vec{3,T}
    power::P
end

struct AreaLight{T,P,S} <: AbstractLight
    position::Vec{3,T}
    direction::Vec{3,T}
    power::P
    size::S
end

struct SpotLight{T,P,A} <: AbstractLight
    position::Vec{3,T}
    direction::Vec{3,T}
    power::P
    angle::A
end

point_lights::Vector{PointLight{Float64,Float64}} = ...
area_lights::Vector{AreaLight{Float64,Float64,Float64}} = ...
spot_lights::Vector{SpotLight{Float64,Float64}} = ...

irradiance(position, normal, light::PointLight) = ...
irradiance(position, normal, light::AreaLight) = ...
irradiance(position, normal, light::SpotLightLight) = ...

function irradiance(position, normal, point_lights, area_lights, spot_lights)
    irrad = 0.
    for light in point_lights
        irrad += irradiance(position, normal, light)
    end
    for light in area_lights
        irrad += irradiance(position, normal, light)
    end
    for light in spot_lights
        irrad += irradiance(position, normal, light)
    end
    irrad
end
2 Likes

But what kind of structure in a language that does not have dynamical dispatch allows this to be done in any way that cannot be done with Julia, resulting in more performant code? In other words, how the possibility of doing this using dynamical dispatch is detrimental?

(in Fortran I would just have different function names for every type of light and certainly I would iterate over each separately or add conditionals inside the loop but the fallback to that option in Julia seems a completely viable option. Are there other code structures that would perform better than that?).

I think the issue here is dynamic dispatch at run-time, not multiple dispatch at compile-time.

4 Likes

Is something like this what we are talking about?

using BenchmarkTools

function mysum(x)
  s = 0.
  for i in 1:length(x)
    s += x[i]
  end
  s
end

n = 1000

x = [ isodd(i) ? rand(1:10) : rand() for i in 1:n ]
@btime mysum(x)

x = [ rand(1:10) for i in 1:n ]
@btime mysum(x)

x = [ rand() for i in 1:n ]
@btime mysum(x)

That sum of course is much faster if performed on arrays of numbers of the same type. If this is an example of the sort of problem that may arise in a more complex code, other languages have smarter ways to deal with that?

I’m very interested in what problems you have encountered, and in what solutions you have found.

For dynamic dispatch over collections, I looked into things like FunctionWrapper to build dispatch tables, but finally settled on just storing items in a tuple (which is a heterogenous collection after all) and finally unrolling the loop with a macro. This makes the compiler generate code directly specialized to the contents of the scene. It works great for toy examples, although the type signatures are monstrous. I have yet to implement a BVH or anything, which will surely make things more complex.

I’m not sure I quite follow the stack allocation example. Can’t stack allocations be enforced by making the type immutable? Is the problem because your type requires mutability, or because of something else I don’t understand?

1 Like

If you try to implement a BVH for mixed types of geometries and primitives(for example, sphere, box, plane…), you will quickly encounter serious problems. They need to be stored in a heterogenous collection, so dynamic dispatch is needed. Union splitting is also possible, but it’s not extensible (though it works fine in the toy project).

That’s the problem. It’s possible to get around, but it’s getting harder when you add more features.

Immutable types which contain references to mutable types can cause allocations, though in newer version of Julia this is improved. Also, modifying and partially initializing a immutable type is hard. Another problem is that, a light/material type in raytracer may contain a lot of field, so copy of the struct is inefficient(which may happen when the type is immutable) , though I have no firm evidence for that. Using immutable type is totally possible, but it might be slower than the mutable type.
The true problem I encountered when I rewrite tinsel is that, I have to use abstract types somewhere(for example, in a bvh or light list or a hitrecord). Consider this code:

struct HitRecord
    ray::Ray
    hitpoint::Point3
    hitmaterial::AbstractMaterial
end
function raytracing()
    #...
    hitrec =  hitScene()
    #use the material to generate new rays 
    reflectRay(hitrec.ray,hitrec.material)
    #...
end

If I use imutable types for all materials, then a copy will be performed when you construct HitRecord, since you can’t just a store a pointer to the material (Since immutable type has no stable address.). To avoid copy, I can use mutable types or wrap the types in a Base.Ref, which then lead to allocations because it might escape…
A possible choice is to unsafely take the pointer of the struct and use GC.@preserve to use the pointer. After that the lifetime of the value is ended, so the memory is released.
In some C++ raytracer I have read about(Ray Tracing in One Weekend, pbrt,Mitsuba2), they just use single dispatch(virtual function) to solve the problems. So they store a shared_pointer(or even raw pointer) to the parent class(that is AbstractMaterial).

1 Like

It looks like in your example real problem is not stack allocation, but type instability of the hitScene. Is there any way to provide type stable version of hitScene?
Alternatively you can try to use union splitting

struct HitRecord{T <: AbstractMaterial}
    ray::Ray
    hitpoint::Point3
    hitmaterial::T
end
function raytracing()
    #...
    hitrec =  hitScene() # Here type of HitRecord is unknown
    #use the material to generate new rays 
    if hitrec isa HitRecord{Material1}
       reflectRay(hitrec.ray,hitrec.material)
    elseif hitrec isa HitRecord{Material2}
       reflectRay(hitrec.ray,hitrec.material)
    end
    #...
end
1 Like

This doesn’t work…Now the type of hitrec is HitRecord{T} where T<:AbstractMaterial, an abstract type, can you ensure that this dynamic type doesn’t cause any heap allocation?
Another problem is that, even you use a if-elseif-else branch to enforce type stability, is it faster or slower than the single dispatch approach in C++? Rendering a 500x500 picture with 64 samples for each pixel needs to invoke the raytracing function at least for 16_000_000 times. Not only materials, but also lights, textures and shapes can have differen types. Performance of dispatch at runtime(union splitting/dynamic dispatch/single dispatch) will affect the performance of the whole raytracer.

I do not quite understand. HitRecord{T} is a concrete type if T is a concrete type.

abstract type Foo end

struct Bar1 <: Foo
    x::Int
end

struct Bar2 <: Foo
    x::Float64
end

struct Baz{T <: Foo}
    y::T
end

julia> isconcretetype(Foo)
false

julia> isconcretetype(Bar1)
true

julia> isconcretetype(Bar2)
true

julia> isconcretetype(Baz)
false

julia> isconcretetype(Baz{Bar1})
true

Union splitting is rather fast as far as I know, but of course you should experiment with your real life structures to understand it’s influence on performance

function f(v)
    res = 0.0
    for z in v
        if z isa Baz{Bar1}
            res += z.y.x
        elseif z isa Baz{Bar2}
            res += z.y.x
        end
    end

    return res
end

v = Baz.(map(x -> x[1](x[2]), zip(rand([Bar1, Bar2], 100), rand(1:1000, 100))))

julia> @btime f($v)
  97.095 ns (0 allocations: 0 bytes)

At least it is non allocating.