Ray Tracing in a week-end - Julia vs SIMD-optimized C++

I captured a rr trace after applying the zen_workaround.py: https://s3.amazonaws.com/julialang-dumps/reports/2021-12-19T21-16-29-claforte.tar.zst

I thought that’d be a lot harder! :slight_smile: Kudos to whoever spent the time automating this process!

I feel like special purpose swizzling syntax from graphics is a bit iffy for a general purpose array library.

However, we can use the observation that swizzling is just multiplication by a permutation matrix to get something almost the same without fattening up the interface with graphics-specific concepts.

The following works right now, but is not a cost-free abstraction:

julia> const zy = SA[0 0 1; 0 1 0]
2×3 SMatrix{2, 3, Int64, 6} with indices SOneTo(2)×SOneTo(3):
 0  0  1
 0  1  0

julia> v = SA[1,2,3]
3-element SVector{3, Int64} with indices SOneTo(3):
 1
 2
 3

julia> zy*v
2-element SVector{2, Int64} with indices SOneTo(2):
 3
 2

To make it a free abstraction we’d need a StaticPermutation type along the lines of the following rough sketch:

using StaticArrays
using LinearAlgebra

# Carries the permutation P (a tuple of indices) in the type, so that
# multiplication can compile down to pure indexing.
struct StaticPermutation{N,M,P} <: StaticMatrix{N,M,Bool}
end

Base.getindex(p::StaticPermutation{N,M,P}, i::Int) where {N,M,P} = Tuple(p)[i]
# Materialize as the corresponding rows of the identity matrix (used for display).
Base.Tuple(p::StaticPermutation{N,M,P}) where {N,M,P} =
    Tuple(SMatrix{max(N,M),max(N,M)}(I)[SVector(P), :])

# The key method: multiplying by the permutation is just indexing.
function Base.:*(p::StaticPermutation{N,M,P}, v::StaticVector{M}) where {N,M,P}
    v[SVector(P)]
end

then we get:

julia> zy = StaticPermutation{2,3,(3,2)}()
2×3 StaticPermutation{2, 3, (3, 2)} with indices SOneTo(2)×SOneTo(3):
 0  0  1
 0  1  0

julia> v = SA[1,2,3]
3-element SVector{3, Int64} with indices SOneTo(3):
 1
 2
 3

julia> zy*v
2-element SVector{2, Int64} with indices SOneTo(2):
 3
 2

the generated code looks pretty great:

julia> @code_native zy*v
	.text
; ┌ @ REPL[6]:1 within `*'
	movq	%rdi, %rax
; │ @ REPL[6]:2 within `*'
; │┌ @ indexing.jl:117 within `getindex'
; ││┌ @ indexing.jl:120 within `_getindex'
; │││┌ @ indexing.jl:124 within `macro expansion'
	vpermilps	$78, 8(%rsi), %xmm0     # xmm0 = mem[2,3,0,1]
; │└└└
	vmovups	%xmm0, (%rdi)
	retq
	nop
; └

Of course, this isn’t quite 100% what you’d like for swizzling:

  • zy and other swizzling matrices are independent symbols you need to bring into scope separately — they are not properties of the vector v.
  • The syntax is less familiar. It might be nice if we could generalize ⋅ so that v⋅zy means (v'*zy)', but this would be slightly iffy even if it didn’t conflict with the existing meaning of the scalar product for matrices (the dot product of the flattened list of elements).

There’s plenty of other multiplication-like operators, so I guess one of those could be used for swizzling…

julia> ⋄(v::StaticVector, m::StaticPermutation) = m*v
⋄ (generic function with 1 method)

julia> v⋄zy
2-element SVector{2, Int64} with indices SOneTo(2):
 3
 2

probably better to just use matrix multiplication though.

2 Likes

Not sure if any of these provide the notation you’re looking for but there are a number of array notation libraries, discussed at

https://github.com/mcabbott/TensorCast.jl#elsewhere

1 Like

With some help from @paulmelis and @vchuravy, I pinpointed and fixed the source of the segmentation fault… lessons learnt:

  • when a segmentation fault occurs, turn off all the optimizations through Julia command-line flags (e.g. --inline=no, --check-bounds=yes), but keep the thread count at >1.
  • nthreads() isn’t valid during precompilation… need to allocate an empty array, then resize!() it later and fill it up with valid thread-specific RNGs.
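The second point can be sketched roughly like this (the names are illustrative, not the actual code from the repo):

```julia
using Random

# Module-level container left empty at precompile time, because
# Threads.nthreads() is not meaningful during precompilation.
const THREAD_RNGS = Random.MersenneTwister[]

# Called at runtime (e.g. from __init__) to give each thread its own RNG.
function init_rngs!(seed::Integer = 0)
    resize!(THREAD_RNGS, Threads.nthreads())
    for i in 1:Threads.nthreads()
        THREAD_RNGS[i] = Random.MersenneTwister(seed + i)
    end
    return THREAD_RNGS
end
```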

https://github.com/claforte/RayTracingWeekend.jl/commit/177d7c803d067be1c7c251ab7e9fb9a360f98e5c

Next: more clean-up, then start implementing Peter Shirley’s book 2. (I need BVHs for my day job)

4 Likes

Hi,

I just posted my version of this, which is a basic transcode of the C++ code - someone linked me to this thread.

My single threaded version renders 1200x675 using Float64 in 2m30s, compared to 9mins for the C++ on the same machine.

Using 40 threads, one per scanline, the same scene takes 15s - including compilation.
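A one-scanline-per-thread loop of that shape can be sketched like this (trace_pixel is a stand-in for the actual per-pixel ray tracing):

```julia
# Minimal sketch of a scanline-parallel render loop; trace_pixel(x, y)
# is assumed to return the pixel color (here an RGB tuple).
function render_threaded(trace_pixel, width::Int, height::Int)
    img = Matrix{NTuple{3,Float64}}(undef, height, width)
    Threads.@threads for y in 1:height   # one scanline per task
        for x in 1:width
            img[y, x] = trace_pixel(x, y)
        end
    end
    return img
end
```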

https://discourse.julialang.org/t/i-translater-peter-shirleys-raytracer-to-julia-c-9mins-julia-2m30s-15s-threads/

1 Like

Well done! But I’m a little curious about your C++ times. For me, the basic C++ version (first weekend) runs in 58s on a single core of an i5-4460 CPU @ 3.20GHz, where my own Julia version takes 84.2s (single-threaded). Did you perhaps not compile the C++ version in release mode, i.e. with optimizations?

probably not, I just did g++ main.cc && ./a.out > r.ppm

I wasn’t going for speed. I can re-try it with better settings if you post them - my C++ coding experience is from before the STL was introduced!

You can try g++ -O3 -march=native main.cc for starters, that should already make quite a difference :slight_smile:

it sure did

$ time ./a.out > a.out.ppm 
Scanlines remaining: 0   
Done.

real	0m48.668s
user	0m48.639s
sys	0m0.028s

Is that single threaded, i.e. the comparison is 150 seconds for the Julia version?

my link had a typo, an admin fixed it but it broke the link :slight_smile:

I had bad C++ compiler flags and that version now runs in about 1 minute, not 9

So my single-threaded Julia, at 2m30s, is now not as good as I thought.

but the code is well suited to multi-threading

10 threads take 27s real, 20 threads take 17s real, and 40 threads (my max) take 15s real - so clearly it is not a linear speedup.

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
Environment:
  JULIA_CPU_THREADS = 40
  JULIA_NUM_THREADS = 40

It has two physical CPUs

1 Like

Two separate sockets each with 20 cores? Or with 10 cores and hyperthreading? Going above the physical core count is usually of little value in compute-bound tasks; it works better for things like packet shuffling or IO-heavy tasks, where hyperthreading allows a very fast context switch when a thread stalls.

Two sockets

and you are correct, the performance curve flattens with more threads

Is your code NUMA-aware? It might only be using one socket at the moment.

I don’t know how to determine that. Julia makes no mention of NUMA that I have seen. That said, Linux reports full CPU utilization.

I wanted to try this for the longest time and finally got around to it :slight_smile:

My implementation is a fairly naive translation of the original C++ code from the book, down to the file structure & struct names. The only major differences are that I added threading to the outermost loop, like @claforte did in their final version, and that each material is kept in its own typed container in hittable_list (aggregated using a Dict).
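The “one typed container per material” idea can be sketched like this (type names and fields are illustrative, not the actual code):

```julia
abstract type Material end
struct Lambertian <: Material; albedo::NTuple{3,Float64}; end
struct Metal <: Material; albedo::NTuple{3,Float64}; fuzz::Float64; end

struct Sphere{M<:Material}
    center::NTuple{3,Float64}
    radius::Float64
    material::M
end

# Aggregate one concretely-typed vector per material type, so the hot
# hit-test loop over each vector stays type-stable.
const HittableList = Dict{DataType,Vector}

function add!(list::HittableList, s::Sphere{M}) where {M}
    push!(get!(list, M, Sphere{M}[]), s)
    return list
end
```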

As an added challenge, I restricted myself to only Base and stdlibs (no StaticArrays!), resulting in using the Sampler API of the Random stdlib for the various restricted generation modes. This was very fun, as I had to learn a (to me) not that familiar/often used API! I’ll definitely use that more often in the future though and already have ideas on how to incorporate it in one of my other projects…
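For reference, hooking a custom restricted-generation mode into the Random stdlib looks roughly like this (UnitDisk is a made-up example, not from the code above):

```julia
using Random

# Marker type for "a random point in the unit disk", a typical
# restricted generation mode in a ray tracer.
struct UnitDisk end
Base.eltype(::Type{UnitDisk}) = NTuple{2,Float64}

# The Sampler API hook for generating values "of a type":
function Random.rand(rng::AbstractRNG, ::Random.SamplerType{UnitDisk})
    while true  # rejection sampling
        x, y = 2rand(rng) - 1, 2rand(rng) - 1
        x^2 + y^2 < 1 && return (x, y)
    end
end
```

After that, rand(UnitDisk), rand(rng, UnitDisk), and rand(UnitDisk, n) all work through the stdlib machinery.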

Anyway, here’s the juicy timing stuff that people care about. The first two runs are just the code by @claforte, directly cloned from the GitHub repo, the last two runs are from my implementation. I hardcoded the samples & other config, but I assure you the image parameters are the same :slight_smile: I chose to use the parameters from the last example in the pluto notebook, since that was the easiest to find & fast to test against.

julia> @time render(scene_random_spheres(; elem_type=Float64), t_cam1, 320, 32); 
  5.443237 seconds (6.15 M allocations: 449.467 MiB, 2.31% gc time, 16.10% compilation time)

julia> @time render(scene_random_spheres(; elem_type=Float64), t_cam1, 320, 32); 
  4.533576 seconds (4.77 M allocations: 365.095 MiB, 1.62% gc time)

julia> @time WeekendRaytracer.main(devnull)
World generation took 8 milliseconds.
Took 4877 milliseconds to render and 34 milliseconds to save.
  4.976668 seconds (20.15 M allocations: 928.291 MiB, 3.51% gc time, 6.14% compilation time)

julia> @time WeekendRaytracer.main(devnull)
World generation took 0 milliseconds.
Took 4737 milliseconds to render and 32 milliseconds to save.
  4.768643 seconds (20.25 M allocations: 926.917 MiB, 3.25% gc time)

You can find my code on github as well, if you want to take a look!

9 Likes

I also implemented this book in Julia, here: Renderer/RayTracingInOneWeekend.jl at master · Zentrik/Renderer (github.com).

I think this is the fastest Julia implementation of this book (If you know of a faster one please send it to me) though it isn’t the most faithful.

On the example in claforte’s repo (1920x1080x1000), my code ran in 753s (Just run clarforte()), compared to 1663s for claforte’s code on my machine.

10 Likes

Using SIMD and making some other minor improvements the code now runs in 200s on my pc.
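For context, the kind of loop-level SIMD Julia can do with just Base looks like this (a generic illustration, not the actual code from the repo):

```julia
# Generic illustration: @simd plus @inbounds lets LLVM vectorize the
# reduction in a hot inner loop.
function dot_simd(a::Vector{Float32}, b::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end
```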

6 Likes

Have you checked out The Cherno’s channel? He has a whole series on ray tracing, and he was able to render that image in 22s.

yt video improving ray tracing program

In his recent video he said each thread needs a separate instance of rand, and implementing that gave him a 2.5x speed up.
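For what it’s worth, Julia 1.7+ sidesteps this particular problem by giving every Task its own RNG state, so rand() inside a threaded loop doesn’t contend on a shared generator:

```julia
# Each task spawned by Threads.@threads has its own RNG state in
# Julia 1.7+, so these rand() calls need no locking or manual setup.
function mean_rand_threaded(n_per_task::Int)
    partials = zeros(Threads.nthreads())
    Threads.@threads for t in 1:Threads.nthreads()
        s = 0.0
        for _ in 1:n_per_task
            s += rand()   # task-local RNG, no contention
        end
        partials[t] = s
    end
    return sum(partials) / (n_per_task * Threads.nthreads())
end
```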
