Ray Tracing in a week-end - Julia vs SIMD-optimized C++

Great thread! Fun fact: The design of StaticArrays was indirectly informed by the arrays-for-geometry-types I wish I’d had back when I used to write a lot of C++ rendering code (in the aqsis project and several toy renderers).

Some thoughts after looking over the project:

StaticArrays usage

In the README that you’ve got a section “using Base.getproperty() for vec.x instead of vec[1]”. You could do this very easily using StaticArrays.FieldVector:

struct Vec3{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

On the other hand, there’s something quite nice about using plain SVector. For instance, you can use the short syntax:

  • SA[1,2,3] is a shorter version of @SVector[1,2,3]
  • SA{T}[1,2,3] is shorthand for @SVector T[1,2,3]
  • There’s also aliases SA_F64 and SA_F32 for SA{Float64} and SA{Float32} if you’re writing concrete code.

Actually… here’s a nice idea — what if we added a getproperty feature to StaticArrays so that v.x, v.y and v.z “just worked” by default like everyone wants them to? Here, I grant you a letter of marque :scroll: to use the following getproperty implementation:

function Base.getproperty(v::SVector{3}, name::Symbol)
    name === :x ? getfield(v, :data)[1] :
    name === :y ? getfield(v, :data)[2] :
    name === :z ? getfield(v, :data)[3] : getfield(v, name)
end

Notes on the code

It’s been mentioned above, but to mention it again - const _rng = Xoroshiro128Plus() is no good with threads. (There’s a lot of globals though! I’m not sure it’s used from your threads section?)

A few thoughts about the loop here: https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L242

A general thought about SIMD (on which I’m not an expert!) — data layout can be important and 3-vectors and arrays of structs aren’t a great data layout. The problem is that such structs are a great programming abstraction! Packages like StructOfArrays can help to change the data layout into a SIMD-friendly struct-of-arrays format while preserving the array-of-structs abstraction you want to use in the code.

Making the materials and hit system fully concrete with tight inner loops

If you’re willing to make larger changes to your ray_color function and replace the (intuitive!) recursive formulation with a ray batch system, you could have fully concrete inner loops and you’d have a much higher chance of getting really tight loops which SIMD properly.

I think there’s a few steps to this:

  1. Replace the recursion in ray_color with a loop and a stack of hit events (make sure that the stack gets reused to avoid allocations)!

  2. Walk batches of rays together. Each time there’s a hit, instead of recursing into it immediately, store the HitRecord in the batch.

  3. Remove the abstract Material from HitRecord, and replace it with something like the following concrete MaterialRef type. This allows hits to be grouped into like materials, with all the hits for the same type of material processed together in a batch.

struct MaterialRef
    UInt16 material_type_index
    UInt16 index
end

Now you have the following (untested!) code sketch to store (“intern”) a new material into some existing arrays:

function store_material(material_types, all_materials, new_material)
    material_type_index = findfirst(material_types, typeof(new_material))
    storage = all_materials[]
    push!(storage, new_material)
    return MaterialRef(material_type_index, lastindex(storage))
end

# Storage arrays for materials can be created like such
material_types = subtypes(Material)
all_materials = collect(Vector{T}() for T in material_types)

The point of this is that you can act on all materials of the same type in batch mode in a tight loop without dynamic dispatch and with inlining:

function process_hits(hits, material_storage, material_index)
    for hit in hits
        if hit.material_ref.material_type_index == material_index
            material = mat_storage[hit.material_ref.index]
            # Do whatever you want with the material and the hit
            # The compiler knows the concrete type of `material_storage`
            # so use of `material` will be fast and inlined.
        end
    end
end

# Very rough usage
mat_idx = 1
mat_storage = all_materials[mat_idx]
process_hits(hits, all_materials[mat_idx], mat_idx)

This kind of batch-based pattern is likely to be helpful (perhaps necessary?) on the GPU.

Sampling patterns

You’re using rand() without stratification inside render() (and elsewhere like in random_vec2_in_disk) but this is very noisy for a given number of samples per pixel and you can consider stratified sampling. Simple NxN jittered sampling would be a lot better and is easy to implement, at least within render(), though it restricts n_samples = N*N for some N.

Alternatively it would be interesting to try a low discrepancy sequence from a package like https://github.com/stevengj/Sobol.jl.

21 Likes