Ray Tracing in a week-end - Julia vs SIMD-optimized C++

gbaraldi · December 16, 2021, 2:20am

The 4x speedup feels like your code isn’t SIMDing while the ISPC has manual SIMD so it does utilize it. I don’t know how your code is right now but maybe LoopVectorization.jl can work.

c42f · December 16, 2021, 3:19am

Great thread! Fun fact: The design of StaticArrays was indirectly informed by the arrays-for-geometry-types I wish I’d had back when I used to write a lot of C++ rendering code (in the aqsis project and several toy renderers).

Some thoughts after looking over the project:

StaticArrays usage

In the README that you’ve got a section “using Base.getproperty() for vec.x instead of vec[1]”. You could do this very easily using StaticArrays.FieldVector:

struct Vec3{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

On the other hand, there’s something quite nice about using plain SVector. For instance, you can use the short syntax:

SA[1,2,3] is a shorter version of @SVector[1,2,3]
SA{T}[1,2,3] is shorthand for @SVector T[1,2,3]
There’s also aliases SA_F64 and SA_F32 for SA{Float64} and SA{Float32} if you’re writing concrete code.

Actually… here’s a nice idea — what if we added a getproperty feature to StaticArrays so that v.x, v.y and v.z “just worked” by default like everyone wants them to? Here, I grant you a letter of marque to use the following getproperty implementation:

function Base.getproperty(v::SVector{3}, name::Symbol)
    name === :x ? getfield(v, :data)[1] :
    name === :y ? getfield(v, :data)[2] :
    name === :z ? getfield(v, :data)[3] : getfield(v, name)
end

Notes on the code

It’s been mentioned above, but to mention it again - const _rng = Xoroshiro128Plus() is no good with threads. (There’s a lot of globals though! I’m not sure it’s used from your threads section?)

A few thoughts about the loop here: https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L242

If column major vs row major in the output image matters for perf… you know it’s a toy renderer or toy scene as this is the outer loop
Your u,v coordinates for your samples are “off by half a pixel” - you may want u = (j-T(0.5))/nx instead. If you’re intent on microoptimizing this outer loop, you could consider computing dx = T(1)/nx outside the loop doing u = (j-T(0.5))*dx to avoid the division.
The dot product at https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L277 seems unnecessary if your ray directions are normalized?
Is there a reason to have T in the abstract Material type? You can probably remove & simplify https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L338
The comment at https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L342 seems a bit misleading. It’s the distance, not the vector?
There may be a performance benefit from having the abstract Material{T} at https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L352 due to the small Union optimization. But this optimization may go away and performance fall off a cliff if you increase the number of Material subtypes beyond a few.
HittableList = Vector{Hittable} has abstractly typed elements, even though there’s only one concretely typed Sphere <: Hittable. If you add other hittables you’ll probably find a big performance reduction here. What happens if you replace Vector{Hittable} with Vector{Sphere}? (Yes this makes things less flexible but if there’s a huge performance boost we can recover the abstraction with an alternative method.)
Why are all your Material subtypes mutable? This seems unnecessary and might have a performance impact.
Does this @simd actually do anything https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L447 ? I’d guess not because Vector{Hittable} has an abstract element type?

A general thought about SIMD (on which I’m not an expert!) — data layout can be important and 3-vectors and arrays of structs aren’t a great data layout. The problem is that such structs are a great programming abstraction! Packages like StructOfArrays can help to change the data layout into a SIMD-friendly struct-of-arrays format while preserving the array-of-structs abstraction you want to use in the code.

Making the materials and hit system fully concrete with tight inner loops

If you’re willing to make larger changes to your ray_color function and replace the (intuitive!) recursive formulation with a ray batch system, you could have fully concrete inner loops and you’d have a much higher chance of getting really tight loops which SIMD properly.

I think there’s a few steps to this:

Replace the recursion in ray_color with a loop and a stack of hit events (make sure that the stack gets reused to avoid allocations)!
Walk batches of rays together. Each time there’s a hit, instead of recursing into it immediately, store the HitRecord in the batch.
Remove the abstract Material from HitRecord, and replace it with something like the following concrete MaterialRef type. This allows hits to be grouped into like materials, with all the hits for the same type of material processed together in a batch.

struct MaterialRef
    UInt16 material_type_index
    UInt16 index
end

Now you have the following (untested!) code sketch to store (“intern”) a new material into some existing arrays:

function store_material(material_types, all_materials, new_material)
    material_type_index = findfirst(material_types, typeof(new_material))
    storage = all_materials[]
    push!(storage, new_material)
    return MaterialRef(material_type_index, lastindex(storage))
end

# Storage arrays for materials can be created like such
material_types = subtypes(Material)
all_materials = collect(Vector{T}() for T in material_types)

The point of this is that you can act on all materials of the same type in batch mode in a tight loop without dynamic dispatch and with inlining:

function process_hits(hits, material_storage, material_index)
    for hit in hits
        if hit.material_ref.material_type_index == material_index
            material = mat_storage[hit.material_ref.index]
            # Do whatever you want with the material and the hit
            # The compiler knows the concrete type of `material_storage`
            # so use of `material` will be fast and inlined.
        end
    end
end

# Very rough usage
mat_idx = 1
mat_storage = all_materials[mat_idx]
process_hits(hits, all_materials[mat_idx], mat_idx)

This kind of batch-based pattern is likely to be helpful (perhaps necessary?) on the GPU.

Sampling patterns

You’re using rand() without stratification inside render() (and elsewhere like in random_vec2_in_disk) but this is very noisy for a given number of samples per pixel and you can consider stratified sampling. Simple NxN jittered sampling would be a lot better and is easy to implement, at least within render(), though it restricts n_samples = N*N for some N.

Alternatively it would be interesting to try a low discrepancy sequence from a package like https://github.com/stevengj/Sobol.jl.

c42f · December 16, 2021, 3:48am

github.com/JuliaArrays/StaticArrays.jl

Named access with .{x,y,z,w} to elements of SVector and MVector

JuliaArrays:master ← JuliaArrays:cjf/svector-getproperty

opened 03:39AM - 16 Dec 21 UTC

c42f

+86 -29

Having access with v.x etc makes the *concrete* SVector interface much more usa…ble for geometry; indeed, it will "just work like you expect" without the need to resort to a custom FieldVector type. This makes the *concrete* SVector/MVector interface fatter, but OTOH I don't think it can do any harm to people who don't want to use it. If we do this, it should be made to work at least for SVector{1} SVector{2} and SVector{3}, but I saw no reason to disallow the use of these in higher dimensions and overcomplicate the implementation. Inspired by discussion at https://discourse.julialang.org/t/ray-tracing-in-a-week-end-julia-vs-simd-optimized-c/72958 I'll add tests and some docs if we decide this is a good idea!

Let’s see what the other StaticArrays contributors think. Note that this would be an interface for the concrete SVector/MVector only. For general StaticVector indexing would remain as the abstract interface.

viralbshah · December 16, 2021, 5:12am

It would be great to see what it takes for us to match ISPC performance.

c42f · December 16, 2021, 6:19am

I haven’t read up on ISPC, but is the goal “nice idiomatic-looking code” with the compiler doing fancy tricks for SPMD-style programs to get good SIMD speedups? If that’s the high level goal then some of my suggestions above make less sense because they require quite a bit of restructuring.

However I think we could possibly make good progress by using data structures which optimize the data layout internally, while keeping the same surface-level API. This is roughly along the lines of the StructOfArrays package, and the same idea as a lot of the optimizations Keno did for the Celeste.jl project, if I remember correctly. Deep ray trees in a naive recursive ray tracer will likely be bad for any SPMD-style program. Ray divergence is a killer…

Elrod · December 16, 2021, 6:12pm

The Story of ISPC is a good read, in case anyone hasn’t read it yet.

I think we should work on an LLVM pass to do this.
We want access to target information when making these transforms without running into any precompilation or multi-versioning information, and we wouldn’t want egregious latency.
Working on typed Julia IR would be better than LoopVectorization, but LLVM already has lots of analysis infrastructure, particularly for generating target-specific code.
Later in the pipeline should also have less latency for passes likely to increase the amount of code everything downstream must work with.

ISPC is also open source with a 3-clause BSD license, so we could look through it. In particular, opt.cpp.

Also, interesting that since then, the Intel compilers have all switched to using LLVM. It will be interesting to see if they start making ISPC-like transformations in the future.

claforte · December 16, 2021, 8:53pm

c42f:

function Base.getproperty(v::SVector{3}, name::Symbol)
    name === :x ? getfield(v, :data)[1] :
    name === :y ? getfield(v, :data)[2] :
    name === :z ? getfield(v, :data)[3] : getfield(v, name)
end

Thanks for sharing this @Elrod! Wow, I hadn’t realized that Matt Phar (a legend in the Computer Graphics world) was the original author of ISPC… no wonder it does a superb job at accelerating at this workload!

claforte · December 16, 2021, 8:59pm

Thanks a lot for the great ideas! I started integrating some…

Removing mutable from the Camera and Materials made no perf difference, so I’m keeping this simplification.

The simplification of @SVector/etc. helps a lot to make the code more readable.

I adapted your Base.getproperty idea, and w/ @inline and @inbounds, it seems to give a consistent 3-5% gain in smaller benchmark. Yeah!

# Adapted from @Christ_Foster's: https://discourse.julialang.org/t/ray-tracing-in-a-week-end-julia-vs-simd-optimized-c/72958/40
@inline @inbounds function Base.getproperty(v::SVector{3}, name::Symbol)
    name === :x || name === :r ? getfield(v, :data)[1] :
    name === :y || name === :g ? getfield(v, :data)[2] :
    name === :z || name === :b ? getfield(v, :data)[3] : getfield(v, name)
end

@inline @inbounds function Base.getproperty(v::SVector{2}, name::Symbol)
    name === :x || name === :r ? getfield(v, :data)[1] :
    name === :y || name === :g ? getfield(v, :data)[2] : getfield(v, name)
end

I can live with that duplication, but is there any smart way to simplify this code a bit further?

BTW IMHO the ideal implementation would support mutable Vectors and swizzling: Swizzling (computer graphics) - Wikipedia … then we could run shader code almost as-is.

claforte · December 16, 2021, 9:28pm

Eliminate off-by-half-a-pixel offset: ~2.5% faster
- brought up by @Christ_Foster, and on 2nd thought…
- this is an unnecessary deviation from Peter Shirley’s original
- ran multiple times, seems like ~2.5% speed-up on small benchmarks that has 32 samples, i.e. should be sample-bound

claforte · December 16, 2021, 9:46pm

I tried it (and kept @simd in the later for loop), but if it makes any difference, it’s less than 1% speed-up… so I reverted that change. If I have time later, I want to support Polygon Meshes, Distance Fields, etc.

I tried removing that @simd, it made no difference… so I commented it out for now…

Thanks a lot for your ideas on making the materials and hit system fully concrete… I’ll consider these later, for now I don’t want to deviate unnecessarily from Peter Shirley’s implementation, since I only implemented the equivalent of book 1… but I’ll keep these ideas in mind if/when I implement a GPU-optimized version.

BTW this is becoming a time-consuming side-project, if anybody is willing to directly help improve this code, your help would be welcome!

sashmit · December 16, 2021, 10:16pm

BTW this is becoming a time-consuming side-project,

Ray tracing in a weekend or two?

claforte · December 17, 2021, 12:36am

Latest commit: https://github.com/claforte/RayTracingWeekend.jl/commit/6162fc0bd33020ecf23dad68998feed702094c6b
Implement per-thread RNGs and fixed seeds

this isn’t any faster though… not sure if I did something wrong…
it would great if the right way to implement this was documented in the Julia docs
execution and images are now fully reproducible… speed still varies nearly 1% though
also ran the full benchmark…
we’re now ~5.6% faster than GCC w/ 14 threads

Can an expert (e.g. @c42f) take a quick look at this attempt to get fast per-thread RNGs? Thanks!

c42f · December 17, 2021, 2:50am

This wasn’t a performance optimization, more just a fix to make the broken code correct. You can’t share mutable RNG state between threads and expect the RNG to produce valid samples. For example the RNG state could get corrupted and result in non-random samples or even program crashes. (The exact behavior depends on the way the RNG works.)

If your RNGs are taking a significant amount of time, you could experiment with cheaper lower-quality generator algorithms. For computer graphics applications the RNG doesn’t exactly need to produce high quality random numbers. The ultimate test is whether the final image looks good.

Your trand() function looks correct to me. This implementation with the nthreads-sized array is what the Random standard library did until recently when we got per-task RNGs. The alternative which doesn’t rely on global state is to pass the RNG as an argument to any function that needs it.

c42f · December 17, 2021, 3:05am

Yeah this isn’t fully surprising, but I predict that you’ll see a big speed bump as soon as you introduce another few Hittable types. (You don’t even need to use any of these other types in a scene, just defining them in your Julia session is enough to force dynamic dispatch and loose the chance of inlining.) Anyway, not to scare monger You can easily test it out by defining a few such dummy types <: Hittable. ~~No need to actually implement any functions for them.~~ (Actually you might need placeholder implementations of hit(), at this point my knowledge of type inference starts to get hazy.)

c42f · December 17, 2021, 3:33am

Ah yes I’ve read “The story of ISPC” before but apparently I forgot all about it. The ISPC approach does seems surprisingly straightforward and elegant. A quote from The story of ispc: C’s influence and implementing SPMD on SIMD (part 4)

… the problem with auto-vectorization: it’s a complex optimization, full of heuristics, and you can’t be sure where it’s going to end up. The compiler is trying to reason about the safety of vectorizing a loop—are there any loop-carried dependencies? One can only go so far with reasoning about arbitrary programs with a computer program (remember that pesky halting problem), so auto-vectorizers are doomed to be brittle and unpredictable to the user.

SPMD on SIMD? That’s a transformation . We just saw how to do it. It’s mechanical. If you’ve automated it, there’s no reason it won’t always work.

claforte · December 17, 2021, 3:40am

The results of my latest commit confuses me. I’m trying a non-mutable HitRecord but I need to use a Union{HitRecord, Nothing} to return no-hits cheaply.

Here’s the commit:

Minor slowdown? try non-mutable HitRecord
- heavily inspired by @paulmelis' hittable.jl
- It seems >2X faster in small scenes without reflections/refractions... 
  **951us vs 2.225ms!**
- ... but it's slower for larger scene:
`@btime render($_scene_random_spheres, $t_cam1, 200, 32) ` 
consistently increases from 286ms to 300ms?!

I also committed the .mem file, in case someone can take a quick look and point to me what kind of newbie mistake I must be making…?
https://github.com/claforte/RayTracingWeekend.jl/blob/main/examples/proto.mem_immutable_hitrecord.jl

paulmelis · December 17, 2021, 6:45am

Warning ahead, it’s early morning that I look into this, due to insomnia, so I might be a bit fuzzy in my reasoning

The allocations happening when this function is called:

github.com

claforte/RayTracingWeekend.jl/blob/95c986b7081e0cbe4fa4b1eb8d6ae4af1c5d704d/examples/proto.mem_immutable_hitrecord.jl#L596-L596


      
          524336 	rec = hit(world, r, T(1e-4), typemax(T))

seem to be masked by the function being marked @inline. Removing some of those gives me this change:

paulm@l0420007 07:41:~/c/RayTracingWeekend.jl$ diff -ur src/proto.jl.4010.mem src/proto.jl.4359.mem 
--- src/proto.jl.4010.mem	2021-12-17 07:27:17.968053056 +0100
+++ src/proto.jl.4359.mem	2021-12-17 07:35:16.574703751 +0100
@@ -396,7 +396,8 @@
         - # """
         - 
         - """Equivalent to `hit_record.set_face_normal()`"""
-        - @inline @fastmath function ray_to_HitRecord(t::T, p, outward_n⃗, r_dir::Vec3{T}, mat::Material{T})::Union{HitRecord,Nothing} where T
+        - #@inline 
+        - @fastmath function ray_to_HitRecord(t::T, p, outward_n⃗, r_dir::Vec3{T}, mat::Material{T})::Union{HitRecord,Nothing} where T
         - 	front_face = r_dir ⋅ outward_n⃗ < 0
         - 	n⃗ = front_face ? outward_n⃗ : -outward_n⃗
         - 	HitRecord(t,p,n⃗,front_face,mat)
@@ -442,46 +443,48 @@
         - 	return Scatter(scattered_r, attenuation)
         - end
         - 
-        - @inline @fastmath function hit(s::Sphere{T}, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
-        -     oc = r.origin - s.center
+        - #@inline 
+        - @fastmath function hit(s::Sphere{T}, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
+        0     oc = r.origin - s.center
         -     #a = r.dir ⋅ r.dir # unnecessary since `r.dir` is normalized
         - 	a = 1
-        - 	half_b = oc ⋅ r.dir
-        -     c = oc⋅oc - s.radius^2
-        -     discriminant = half_b^2 - a*c
-        - 	if discriminant < 0 return nothing end # no hit!
-        - 	sqrtd = √discriminant
+        0 	half_b = oc ⋅ r.dir
+        0     c = oc⋅oc - s.radius^2
+        0     discriminant = half_b^2 - a*c
+        0 	if discriminant < 0 return nothing end # no hit!
+        0 	sqrtd = √discriminant
         - 	
         - 	# Find the nearest root that lies in the acceptable range
-        - 	root = (-half_b - sqrtd) / a	
-        - 	if root < tmin || tmax < root
-        - 		root = (-half_b + sqrtd) / a
-        - 		if root < tmin || tmax < root
-        - 			return nothing # no hit!
+        0 	root = (-half_b - sqrtd) / a	
+        0 	if root < tmin || tmax < root
+        0 		root = (-half_b + sqrtd) / a
+        0 		if root < tmin || tmax < root
+        0 			return nothing # no hit!
         - 		end
         - 	end
         - 	
-        - 	t = root
-        - 	p = point(r, t)
-        - 	n⃗ = (p - s.center) / s.radius
-        - 	return ray_to_HitRecord(t, p, n⃗, r.dir, s.mat)
+        0 	t = root
+        0 	p = point(r, t)
+        0 	n⃗ = (p - s.center) / s.radius
+   123680 	return ray_to_HitRecord(t, p, n⃗, r.dir, s.mat)
         - end
         - 
         - const HittableList = Vector{Hittable}
         - 
         - #"""Find closest hit between `Ray r` and a list of Hittable objects `h`, within distance `tmin` < `tmax`"""
-        - @inline function hit(hittables::HittableList, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
+        - #@inline 
+        - function hit(hittables::HittableList, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
         -     closest = tmax # closest t so far
         -     best_rec::Union{HitRecord,Nothing} = nothing # by default, no hit
-        -     @inbounds for i in eachindex(hittables) # @paulmelis reported gave him a 4X speedup?!
-        - 		h = hittables[i]
-        -         rec = hit(h, r, tmin, closest)
-        -         if rec !== nothing
-        -             best_rec = rec
-        -             closest = best_rec.t # i.e. ignore any further hit > this one's.
+        0     @inbounds for i in eachindex(hittables) # @paulmelis reported gave him a 4X speedup?!
+        0 		h = hittables[i]
+        0         rec = hit(h, r, tmin, closest)
+        0         if rec !== nothing
+        0             best_rec = rec
+        0             closest = best_rec.t # i.e. ignore any further hit > this one's.
         -         end
         -     end
-        -     best_rec
+        0     best_rec
         - end
         - 
         - @inline color_vec3_in_rgb(v::Vec3{T}) where T = 0.5normalize(v) + SA{T}[0.5,0.5,0.5]
@@ -592,7 +595,7 @@
         0 		return SA{T}[0,0,0]
         - 	end
         - 		
-    33040 	rec::Union{HitRecord,Nothing} = hit(world, r, T(1e-4), typemax(T))
+        0 	rec::Union{HitRecord,Nothing} = hit(world, r, T(1e-4), typemax(T))
         0     if rec !== nothing
         - 		# For debugging, represent vectors as RGB:
         - 		# claforte TODO: adapt to latest code!

So the real culprit seems to be the ray_to_HitRecord call, specifically it’s s.mat argument. Apparently because Material is an abstract type and so s.mat causes allocations?

Also see the remarks above by @c42f in Ray Tracing in a week-end - Julia vs SIMD-optimized C++ - #42 by c42f, precisely on making Material concrete.

Edit: slight update for clarity, added link to earlier comment

claforte · December 17, 2021, 3:03pm

Thanks @paulmelis, I’ll probably take a closer look in a couple days… before that though, I’ll update the Pluto.jl notebook and start broadcasting its existence on Twitter and at work, to try and get some extra visibilty for JuliaLang.

claforte · December 17, 2021, 11:52pm

I updated the Pluto notebook, it’s super fast now.

I also just posted this project to Twitter: https://twitter.com/chrlaf/status/1471991209203843074

Thanks everyone for your help! I learnt a lot and I’m eager to repeat the experience!

claforte · December 19, 2021, 8:39pm

I’m cleaning up the code so it can serve as a decent self-contained example, but I ran into a multithreading crash… I’m not sure yet who to report it to.

The steps to repro are documented in src/proto/repro.jl in this branch:
https://github.com/claforte/RayTracingWeekend.jl/tree/claforte/threads_crash1

UPDATE: I opened an issue: https://github.com/JuliaLang/julia/issues/43485

Clean-up

I created a new @ballocs macro returns number of heap allocations. It was much easier than I thought!
A new @test_no_allocs macro verifies expressions causes no heap allocations
Many of my tests are of the form: @test_no_allocs rgb($t_col), so I can later prevent regression in performance