Ray Tracing in a week-end - Julia vs SIMD-optimized C++

The 4x speedup feels like your code isn’t SIMDing while the ISPC has manual SIMD so it does utilize it. I don’t know how your code is right now but maybe LoopVectorization.jl can work.

Great thread! Fun fact: The design of StaticArrays was indirectly informed by the arrays-for-geometry-types I wish I’d had back when I used to write a lot of C++ rendering code (in the aqsis project and several toy renderers).

Some thoughts after looking over the project:

StaticArrays usage

In the README that you’ve got a section “using Base.getproperty() for vec.x instead of vec[1]”. You could do this very easily using StaticArrays.FieldVector:

struct Vec3{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

On the other hand, there’s something quite nice about using plain SVector. For instance, you can use the short syntax:

  • SA[1,2,3] is a shorter version of @SVector[1,2,3]
  • SA{T}[1,2,3] is shorthand for @SVector T[1,2,3]
  • There’s also aliases SA_F64 and SA_F32 for SA{Float64} and SA{Float32} if you’re writing concrete code.

Actually… here’s a nice idea — what if we added a getproperty feature to StaticArrays so that v.x, v.y and v.z “just worked” by default like everyone wants them to? Here, I grant you a letter of marque :scroll: to use the following getproperty implementation:

function Base.getproperty(v::SVector{3}, name::Symbol)
    name === :x ? getfield(v, :data)[1] :
    name === :y ? getfield(v, :data)[2] :
    name === :z ? getfield(v, :data)[3] : getfield(v, name)
end

Notes on the code

It’s been mentioned above, but to mention it again - const _rng = Xoroshiro128Plus() is no good with threads. (There’s a lot of globals though! I’m not sure it’s used from your threads section?)

A few thoughts about the loop here: https://github.com/claforte/RayTracingWeekend.jl/blob/9d88dfa7e48e82101e87988ee73305f130121f66/src/proto.jl#L242

A general thought about SIMD (on which I’m not an expert!) — data layout can be important and 3-vectors and arrays of structs aren’t a great data layout. The problem is that such structs are a great programming abstraction! Packages like StructOfArrays can help to change the data layout into a SIMD-friendly struct-of-arrays format while preserving the array-of-structs abstraction you want to use in the code.

Making the materials and hit system fully concrete with tight inner loops

If you’re willing to make larger changes to your ray_color function and replace the (intuitive!) recursive formulation with a ray batch system, you could have fully concrete inner loops and you’d have a much higher chance of getting really tight loops which SIMD properly.

I think there’s a few steps to this:

  1. Replace the recursion in ray_color with a loop and a stack of hit events (make sure that the stack gets reused to avoid allocations)!

  2. Walk batches of rays together. Each time there’s a hit, instead of recursing into it immediately, store the HitRecord in the batch.

  3. Remove the abstract Material from HitRecord, and replace it with something like the following concrete MaterialRef type. This allows hits to be grouped into like materials, with all the hits for the same type of material processed together in a batch.

struct MaterialRef
    UInt16 material_type_index
    UInt16 index
end

Now you have the following (untested!) code sketch to store (“intern”) a new material into some existing arrays:

function store_material(material_types, all_materials, new_material)
    material_type_index = findfirst(material_types, typeof(new_material))
    storage = all_materials[]
    push!(storage, new_material)
    return MaterialRef(material_type_index, lastindex(storage))
end

# Storage arrays for materials can be created like such
material_types = subtypes(Material)
all_materials = collect(Vector{T}() for T in material_types)

The point of this is that you can act on all materials of the same type in batch mode in a tight loop without dynamic dispatch and with inlining:

function process_hits(hits, material_storage, material_index)
    for hit in hits
        if hit.material_ref.material_type_index == material_index
            material = mat_storage[hit.material_ref.index]
            # Do whatever you want with the material and the hit
            # The compiler knows the concrete type of `material_storage`
            # so use of `material` will be fast and inlined.
        end
    end
end

# Very rough usage
mat_idx = 1
mat_storage = all_materials[mat_idx]
process_hits(hits, all_materials[mat_idx], mat_idx)

This kind of batch-based pattern is likely to be helpful (perhaps necessary?) on the GPU.

Sampling patterns

You’re using rand() without stratification inside render() (and elsewhere like in random_vec2_in_disk) but this is very noisy for a given number of samples per pixel and you can consider stratified sampling. Simple NxN jittered sampling would be a lot better and is easy to implement, at least within render(), though it restricts n_samples = N*N for some N.

Alternatively it would be interesting to try a low discrepancy sequence from a package like https://github.com/stevengj/Sobol.jl.

21 Likes

Let’s see what the other StaticArrays contributors think. Note that this would be an interface for the concrete SVector/MVector only. For general StaticVector indexing would remain as the abstract interface.

2 Likes

It would be great to see what it takes for us to match ISPC performance.

5 Likes

I haven’t read up on ISPC, but is the goal “nice idiomatic-looking code” with the compiler doing fancy tricks for SPMD-style programs to get good SIMD speedups? If that’s the high level goal then some of my suggestions above make less sense because they require quite a bit of restructuring.

However I think we could possibly make good progress by using data structures which optimize the data layout internally, while keeping the same surface-level API. This is roughly along the lines of the StructOfArrays package, and the same idea as a lot of the optimizations Keno did for the Celeste.jl project, if I remember correctly. Deep ray trees in a naive recursive ray tracer will likely be bad for any SPMD-style program. Ray divergence is a killer…

1 Like

The Story of ISPC is a good read, in case anyone hasn’t read it yet.

I think we should work on an LLVM pass to do this.
We want access to target information when making these transforms without running into any precompilation or multi-versioning information, and we wouldn’t want egregious latency.
Working on typed Julia IR would be better than LoopVectorization, but LLVM already has lots of analysis infrastructure, particularly for generating target-specific code.
Later in the pipeline should also have less latency for passes likely to increase the amount of code everything downstream must work with.

ISPC is also open source with a 3-clause BSD license, so we could look through it. In particular, opt.cpp.

Also, interesting that since then, the Intel compilers have all switched to using LLVM. It will be interesting to see if they start making ISPC-like transformations in the future.

9 Likes

Thanks for sharing this @Elrod! Wow, I hadn’t realized that Matt Phar (a legend in the Computer Graphics world) was the original author of ISPC… no wonder it does a superb job at accelerating at this workload!

Thanks a lot for the great ideas! I started integrating some…

Removing mutable from the Camera and Materials made no perf difference, so I’m keeping this simplification.

The simplification of @SVector/etc. helps a lot to make the code more readable. :slight_smile:

I adapted your Base.getproperty idea, and w/ @inline and @inbounds, it seems to give a consistent 3-5% gain in smaller benchmark. Yeah! :slight_smile:

# Adapted from @Christ_Foster's: https://discourse.julialang.org/t/ray-tracing-in-a-week-end-julia-vs-simd-optimized-c/72958/40
@inline @inbounds function Base.getproperty(v::SVector{3}, name::Symbol)
    name === :x || name === :r ? getfield(v, :data)[1] :
    name === :y || name === :g ? getfield(v, :data)[2] :
    name === :z || name === :b ? getfield(v, :data)[3] : getfield(v, name)
end

@inline @inbounds function Base.getproperty(v::SVector{2}, name::Symbol)
    name === :x || name === :r ? getfield(v, :data)[1] :
    name === :y || name === :g ? getfield(v, :data)[2] : getfield(v, name)
end

I can live with that duplication, but is there any smart way to simplify this code a bit further?

BTW IMHO the ideal implementation would support mutable Vectors and swizzling: Swizzling (computer graphics) - Wikipedia … then we could run shader code almost as-is. :slight_smile:

Eliminate off-by-half-a-pixel offset: ~2.5% faster
- brought up by @Christ_Foster, and on 2nd thought…
- this is an unnecessary deviation from Peter Shirley’s original
- ran multiple times, seems like ~2.5% speed-up on small benchmarks that has 32 samples, i.e. should be sample-bound

I tried it (and kept @simd in the later for loop), but if it makes any difference, it’s less than 1% speed-up… so I reverted that change. If I have time later, I want to support Polygon Meshes, Distance Fields, etc.

I tried removing that @simd, it made no difference… so I commented it out for now…

Thanks a lot for your ideas on making the materials and hit system fully concrete… I’ll consider these later, for now I don’t want to deviate unnecessarily from Peter Shirley’s implementation, since I only implemented the equivalent of book 1… but I’ll keep these ideas in mind if/when I implement a GPU-optimized version. :slight_smile:

BTW this is becoming a time-consuming side-project, if anybody is willing to directly help improve this code, your help would be welcome!

BTW this is becoming a time-consuming side-project,

Ray tracing in a weekend or two? :wink:

8 Likes

Latest commit: https://github.com/claforte/RayTracingWeekend.jl/commit/6162fc0bd33020ecf23dad68998feed702094c6b
Implement per-thread RNGs and fixed seeds

  • this isn’t any faster though… not sure if I did something wrong…
  • it would great if the right way to implement this was documented in the Julia docs
  • execution and images are now fully reproducible… speed still varies nearly 1% though
  • also ran the full benchmark…
  • we’re now ~5.6% faster than GCC w/ 14 threads

Can an expert (e.g. @c42f) take a quick look at this attempt to get fast per-thread RNGs? Thanks!

7 Likes

This wasn’t a performance optimization, more just a fix to make the broken code correct. You can’t share mutable RNG state between threads and expect the RNG to produce valid samples. For example the RNG state could get corrupted and result in non-random samples or even program crashes. (The exact behavior depends on the way the RNG works.)

If your RNGs are taking a significant amount of time, you could experiment with cheaper lower-quality generator algorithms. For computer graphics applications the RNG doesn’t exactly need to produce high quality random numbers. The ultimate test is whether the final image looks good.

Your trand() function looks correct to me. This implementation with the nthreads-sized array is what the Random standard library did until recently when we got per-task RNGs. The alternative which doesn’t rely on global state is to pass the RNG as an argument to any function that needs it.

3 Likes

Yeah this isn’t fully surprising, but I predict that you’ll see a big speed bump as soon as you introduce another few Hittable types. (You don’t even need to use any of these other types in a scene, just defining them in your Julia session is enough to force dynamic dispatch and loose the chance of inlining.) Anyway, not to scare monger :slight_smile: You can easily test it out by defining a few such dummy types <: Hittable. No need to actually implement any functions for them. (Actually you might need placeholder implementations of hit(), at this point my knowledge of type inference starts to get hazy.)

1 Like

Ah yes I’ve read “The story of ISPC” before but apparently I forgot all about it. The ISPC approach does seems surprisingly straightforward and elegant. A quote from The story of ispc: C’s influence and implementing SPMD on SIMD (part 4)

… the problem with auto-vectorization: it’s a complex optimization, full of heuristics, and you can’t be sure where it’s going to end up. The compiler is trying to reason about the safety of vectorizing a loop—are there any loop-carried dependencies? One can only go so far with reasoning about arbitrary programs with a computer program (remember that pesky halting problem), so auto-vectorizers are doomed to be brittle and unpredictable to the user.

SPMD on SIMD? That’s a transformation . We just saw how to do it. It’s mechanical. If you’ve automated it, there’s no reason it won’t always work.

2 Likes

The results of my latest commit confuses me. I’m trying a non-mutable HitRecord but I need to use a Union{HitRecord, Nothing} to return no-hits cheaply.

Here’s the commit:

Minor slowdown? try non-mutable HitRecord
- heavily inspired by @paulmelis' hittable.jl
- It seems >2X faster in small scenes without reflections/refractions... 
  **951us vs 2.225ms!**
- ... but it's slower for larger scene:
`@btime render($_scene_random_spheres, $t_cam1, 200, 32) ` 
consistently increases from 286ms to 300ms?!

I also committed the .mem file, in case someone can take a quick look and point to me what kind of newbie mistake I must be making…?
https://github.com/claforte/RayTracingWeekend.jl/blob/main/examples/proto.mem_immutable_hitrecord.jl

Warning ahead, it’s early morning that I look into this, due to insomnia, so I might be a bit fuzzy in my reasoning :wink:

The allocations happening when this function is called:

seem to be masked by the function being marked @inline. Removing some of those gives me this change:

paulm@l0420007 07:41:~/c/RayTracingWeekend.jl$ diff -ur src/proto.jl.4010.mem src/proto.jl.4359.mem 
--- src/proto.jl.4010.mem	2021-12-17 07:27:17.968053056 +0100
+++ src/proto.jl.4359.mem	2021-12-17 07:35:16.574703751 +0100
@@ -396,7 +396,8 @@
         - # """
         - 
         - """Equivalent to `hit_record.set_face_normal()`"""
-        - @inline @fastmath function ray_to_HitRecord(t::T, p, outward_n⃗, r_dir::Vec3{T}, mat::Material{T})::Union{HitRecord,Nothing} where T
+        - #@inline 
+        - @fastmath function ray_to_HitRecord(t::T, p, outward_n⃗, r_dir::Vec3{T}, mat::Material{T})::Union{HitRecord,Nothing} where T
         - 	front_face = r_dir ⋅ outward_n⃗ < 0
         - 	n⃗ = front_face ? outward_n⃗ : -outward_n⃗
         - 	HitRecord(t,p,n⃗,front_face,mat)
@@ -442,46 +443,48 @@
         - 	return Scatter(scattered_r, attenuation)
         - end
         - 
-        - @inline @fastmath function hit(s::Sphere{T}, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
-        -     oc = r.origin - s.center
+        - #@inline 
+        - @fastmath function hit(s::Sphere{T}, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
+        0     oc = r.origin - s.center
         -     #a = r.dir ⋅ r.dir # unnecessary since `r.dir` is normalized
         - 	a = 1
-        - 	half_b = oc ⋅ r.dir
-        -     c = oc⋅oc - s.radius^2
-        -     discriminant = half_b^2 - a*c
-        - 	if discriminant < 0 return nothing end # no hit!
-        - 	sqrtd = √discriminant
+        0 	half_b = oc ⋅ r.dir
+        0     c = oc⋅oc - s.radius^2
+        0     discriminant = half_b^2 - a*c
+        0 	if discriminant < 0 return nothing end # no hit!
+        0 	sqrtd = √discriminant
         - 	
         - 	# Find the nearest root that lies in the acceptable range
-        - 	root = (-half_b - sqrtd) / a	
-        - 	if root < tmin || tmax < root
-        - 		root = (-half_b + sqrtd) / a
-        - 		if root < tmin || tmax < root
-        - 			return nothing # no hit!
+        0 	root = (-half_b - sqrtd) / a	
+        0 	if root < tmin || tmax < root
+        0 		root = (-half_b + sqrtd) / a
+        0 		if root < tmin || tmax < root
+        0 			return nothing # no hit!
         - 		end
         - 	end
         - 	
-        - 	t = root
-        - 	p = point(r, t)
-        - 	n⃗ = (p - s.center) / s.radius
-        - 	return ray_to_HitRecord(t, p, n⃗, r.dir, s.mat)
+        0 	t = root
+        0 	p = point(r, t)
+        0 	n⃗ = (p - s.center) / s.radius
+   123680 	return ray_to_HitRecord(t, p, n⃗, r.dir, s.mat)
         - end
         - 
         - const HittableList = Vector{Hittable}
         - 
         - #"""Find closest hit between `Ray r` and a list of Hittable objects `h`, within distance `tmin` < `tmax`"""
-        - @inline function hit(hittables::HittableList, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
+        - #@inline 
+        - function hit(hittables::HittableList, r::Ray{T}, tmin::T, tmax::T)::Union{HitRecord,Nothing} where T
         -     closest = tmax # closest t so far
         -     best_rec::Union{HitRecord,Nothing} = nothing # by default, no hit
-        -     @inbounds for i in eachindex(hittables) # @paulmelis reported gave him a 4X speedup?!
-        - 		h = hittables[i]
-        -         rec = hit(h, r, tmin, closest)
-        -         if rec !== nothing
-        -             best_rec = rec
-        -             closest = best_rec.t # i.e. ignore any further hit > this one's.
+        0     @inbounds for i in eachindex(hittables) # @paulmelis reported gave him a 4X speedup?!
+        0 		h = hittables[i]
+        0         rec = hit(h, r, tmin, closest)
+        0         if rec !== nothing
+        0             best_rec = rec
+        0             closest = best_rec.t # i.e. ignore any further hit > this one's.
         -         end
         -     end
-        -     best_rec
+        0     best_rec
         - end
         - 
         - @inline color_vec3_in_rgb(v::Vec3{T}) where T = 0.5normalize(v) + SA{T}[0.5,0.5,0.5]
@@ -592,7 +595,7 @@
         0 		return SA{T}[0,0,0]
         - 	end
         - 		
-    33040 	rec::Union{HitRecord,Nothing} = hit(world, r, T(1e-4), typemax(T))
+        0 	rec::Union{HitRecord,Nothing} = hit(world, r, T(1e-4), typemax(T))
         0     if rec !== nothing
         - 		# For debugging, represent vectors as RGB:
         - 		# claforte TODO: adapt to latest code!

So the real culprit seems to be the ray_to_HitRecord call, specifically it’s s.mat argument. Apparently because Material is an abstract type and so s.mat causes allocations?

image

Also see the remarks above by @c42f in Ray Tracing in a week-end - Julia vs SIMD-optimized C++ - #42 by c42f, precisely on making Material concrete.

Edit: slight update for clarity, added link to earlier comment

3 Likes

Thanks @paulmelis, I’ll probably take a closer look in a couple days… before that though, I’ll update the Pluto.jl notebook and start broadcasting its existence on Twitter and at work, to try and get some extra visibilty for JuliaLang. :slight_smile:

5 Likes

I updated the Pluto notebook, it’s super fast now. :slight_smile:

I also just posted this project to Twitter: https://twitter.com/chrlaf/status/1471991209203843074

Thanks everyone for your help! I learnt a lot and I’m eager to repeat the experience! :slight_smile:

20 Likes

I’m cleaning up the code so it can serve as a decent self-contained example, but I ran into a multithreading crash… I’m not sure yet who to report it to.

The steps to repro are documented in src/proto/repro.jl in this branch:
https://github.com/claforte/RayTracingWeekend.jl/tree/claforte/threads_crash1

UPDATE: I opened an issue: https://github.com/JuliaLang/julia/issues/43485

Clean-up

  • I created a new @ballocs macro returns number of heap allocations. It was much easier than I thought!
  • A new @test_no_allocs macro verifies expressions causes no heap allocations
  • Many of my tests are of the form: @test_no_allocs rgb($t_col), so I can later prevent regression in performance
2 Likes