Ray Tracing in a week-end - Julia vs SIMD-optimized C++

If instead of a abstract type of material you use a union type, using:

begin
	struct NoMaterial end
	const _no_material = NoMaterial()
	const _y_up = @SVector[0f0,1f0,0f0]
end

struct Metal
	albedo::SVector{3,Float32}
	fuzz::Float32 # how big the sphere used to generate fuzzy reflection rays. 0=none
	Metal(a,f=0.0) = new(a,f)
end

struct Lambertian
	albedo::SVector{3,Float32}
end

struct Dielectric
	ir::Float32 # index of refraction, i.e. Ξ·.
end

const Material = Union{NoMaterial,Metal,Lambertian,Dielectric}

and make all structs immutable, you get almost no allocations in the first render function:

julia> c = default_camera()
Camera(Float32[0.0, 0.0, 0.0], Float32[-1.7777778, -1.0, -1.0], Float32[3.5555556, 0.0, 0.0], Float32[0.0, 2.0, 0.0], Float32[1.0, 0.0, 0.0], Float32[0.0, 1.0, 0.0], Float32[0.0, 0.0, 1.0], 0.0f0)

julia> s = scene_4_spheres()
HittableList(Hittable[Sphere(Float32[0.0, 0.0, -1.0], 0.5f0, Lambertian(Float32[0.7, 0.3, 0.3])), Sphere(Float32[0.0, -100.5, -1.0], 100.0f0, Lambertian(Float32[0.8, 0.8, 0.0])), Sphere(Float32[-1.0, 0.0, -1.0], 0.5f0, Metal(Float32[0.8, 0.8, 0.8], 0.3f0)), Sphere(Float32[1.0, 0.0, -1.0], 0.5f0, Metal(Float32[0.8, 0.6, 0.2], 0.8f0))])

julia> @benchmark render($s,$c, 96, 16)
BenchmarkTools.Trial: 187 samples with 1 evaluation.
 Range (min … max):  26.035 ms …  36.614 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     26.559 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   26.749 ms Β± 929.174 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–ƒβ–‚β–β–β–„β–ƒβ–…β–ˆ ▆▂▁▁  ▁                                            
  β–†β–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–…β–ˆβ–„β–„β–„β–„β–β–ƒβ–β–ƒβ–β–ƒβ–ƒβ–„β–ƒβ–ƒβ–β–β–β–β–β–β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–ƒβ–ƒβ–ƒβ–„β–β–β–β–β–β–ƒ β–ƒ
  26 ms           Histogram: frequency by time         29.2 ms <

 Memory estimate: 60.80 KiB, allocs estimate: 2.

This means almost certainly that we got rid of dynamic dispatch and type instabilities. The performance in this small example does not improve (actually it is slightly worse here), but for a larger problem that may be much better. To test.

The use of a union instead of an abstract type is another workaround to suggest to the compiler to do automatic union splitting, but of course it is again less elegant.

For the records, with the abstract material type, I had:

julia> @benchmark render($s,$c, 96, 16)
BenchmarkTools.Trial: 240 samples with 1 evaluation.
 Range (min … max):  19.864 ms … 29.133 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     20.285 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   20.910 ms Β±  1.521 ms  β”Š GC (mean Β± Οƒ):  1.83% Β± 4.79%

  β–ˆβ–ˆβ–„β–‚β–‚β–ƒβ–†β– ▁                                                   
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–ˆβ–β–†β–†β–„β–„β–†β–„β–„β–ˆβ–†β–†β–β–‡β–β–„β–†β–β–β–„β–β–β–β–β–„β–„β–†β–„β–„β–‡β–†β–‡β–„β–β–β–‡β–β–†β–„β–„β–β–„β–β–β–β–β–„ β–†
  19.9 ms      Histogram: log(frequency) by time      25.9 ms <

 Memory estimate: 6.40 MiB, allocs estimate: 138409.

With the union type it is slower in this small test, but note the huge difference in the allocations.

3 Likes

FYI, I finished integrating @woclass’ changes and pushed them to github, in the proto.jl file. I plan to cut-and-paste them in the Pluto.jl notebook next week-end.

2 Likes

Thanks DNF for bringing this up! I tried Xoroshiro, here’s my latest commit:

Use Xoroshiro128Plus RNG
- large speed-up for small random functions…
- small speed up in some small scenes
- …but no noticeable difference in large scenes with high sample-count

… so the bottleneck appears to lie somewhere else (HitRecord allocations?)

Thanks anyway!

1 Like

My quick test suggested that indeed the bottleneck is there.

Probably profiling is what is needed here. It is easy to do that with the vscode extension..

1 Like

My latest commit:

4.35X speed-up using 16 threads
- 20.16 minutes for the 1920x1080x1000 random_spheres!
- Easiest optimization ever… not even a whole line of code! :slight_smile:

I re-located the SIMD-optimized C++ implementation (GitHub - GPSnoopy/RayTracingInOneWeekend: Implementation of Peter Shirley's Ray Tracing In One Weekend book.) and I was remembering incorrectly. GPSnoopy reports:

The cover picture was rendered in about 1h running 16 threads on a Core i9-9900K @ 4.7GHz at 3840x2160 with 1024 samples per pixel. Clocking the CPU at 4.8GHz when running AVX code would cause it to reach over 90C, so be careful when running this code on an overclocked processor.

I tried to update my original description accordingly, but it looks like I don’t have the permission.

My tests were on 1920x1080 (i.e. 4 times less pixels), so my 20mins current record on a Ryzen2 is probably roughly equivalent to his 1 hour on his overpriced i9. :wink:

I’ll try to run his code next, and hopefully their C++ compiler makes an attempt to be competitive on my AMD processor.

Stay tuned! :slight_smile:

11 Likes

Just tried your Pluto notebook, really cool way to teach RT (and specifically the Raytracing in One Weekend series) like this! I do wonder how well the code edits and optimizations you’re doing above work in the Pluto notebook. Or are you doing those β€œoffline”, run the tests, and only then check in the notebook again?

1 Like

how well the code edits and optimizations you’re doing above work in the Pluto notebook. Or are you doing those β€œoffline”, run the tests, and only then check in the notebook again?

I did all the initial implementation in Pluto.jl, I found it worked pretty well although it was getting long (and Pluto doesn’t allow us to cut-and-paste more than one cell at a time, which was painful to reorganize the code).

Now I’m using vscode to edit proto.jl so I can run @btime and get precise measurements. I checked long ago and this didn’t seem to work in Pluto.

Later today I plan to update the Pluto code with the massive (~12X) speed-ups from the past few days…

i think pluto does allow you to copy/paste multiple cells. you just have to start a mouse drag outside of a cell and then move your mouse into the range of multiple cells, they should highlight to indicate that they are in the selection. then you can use your keyboard to do the usual clipboard commands (for me, it’s command-c and command-v on a mac).

I’ve been enjoying this thread. So many twists and turns. I feel like the journey from the starting code and all the changes that incrementally improve performance could be a great narrative would make for a really good youtube video.

Thinking out aloud, perhaps even good to have a session at JuliaCon for such narratives.

19 Likes

Funny, I was thinking in asking if at the end some summary could be made about the changes that lead to a ~12x speed-up.

I see that speed-up as good and not-so-good news. Good because, well, a 10 fold speed up is fantastic. Not-so-good because I feel I belong to those who think they have clever code but it turns out that deep tricks can make a huge difference.

8 Likes

This is why all benchmarks comparing compiled languages end up being senseless when looked only from the point of view of the final performance. If some code is very optimized in one language, there is one effort that is necessary to achieve something reasonable (within a factor ~2). But the effort to get the last bit of performance can be huge, and worst, can be platform-dependent, test case dependent, etc.

That other thread about ray tracing was very curious in this aspect: low-level tricks were used in the Nim implementation and with that it beats all other languages. However, that was only for the exact same problem that was being benchmarked. If one changes somewhat the scene, the profiles change.

What I think is important in a study like this is go back and see what really made a difference. For example, inlining which function? Why? I have many times the experience that a lot of β€œcode optimization” that I thought was being useful was actually just bad benchmarking, and a few important tips and changes make the greatest difference.

9 Likes

Defining a global:

using RandomNumbers.Xorshifts
const _rng = Xoroshiro128Plus()

is almost certainly a bad idea in the multithreaded code.
This is not thread safe, and will also hurt performance.

Also, disappointing that the TaskLocal rng in Julia 1.7 is slower than this, but Random.Xoshiro works well.

So I’d suggest allocating one Xoroshiro128Plus() or a Random.Xoshiro per thread, and passing it around.

2 Likes

Hi!
If you are curious, I implemented ray tracing some time ago in Julia as well: GitHub - pxl-th/Trace.jl: Physically-based ray tracing on CPU
Mostly following this excellent book: Physically Based Rendering: From Theory to Implementation.

It also has multithreading support and BVH acceleration structure. I also spend some time trying to improve performance. Most of performance improvements I got was from the eliminations of allocations in ray β†’ object intersections routines. But it still has a lot of stuff that can be improved, ideally I thought of writing some kind of custom allocator to remove all alocations, but I got lazy shortly after :slight_smile: .

It is able to simulate caustic effects using Stochastic Progressive Photon Mapping, though, which was my main goal (see image below).

23 Likes

Heh, I was wondering if porting PBRT to Julia was going to give cleaner (and fewer lines of) code, while still providing comparable performance. Impressive what you managed to achieve!

And since we’re posting ray-tracing-in-Julia projects, here’s another one (substantially less ambitious), which was an experiment of porting existing Python raytracing code (don’t ask) to Julia. Back then I wrote down details of the optimization journey.

2 Likes

I’m up for a JuliaCon talk if others are interested. :slight_smile: I’m planning to present the results internally at AMD to hopefully rally more Julia supporters.

18 Likes

I’d like that as well, but I didn’t have time to research how to do it… if you happen to know, can you show me?

IMHO just proving that, with some moderate level of expertise, it’s possible to match or exceed the performance of reasonably optimized C++ code, is a necessary step to convince C++ -centric organizations like my employer to take a closer look and back new projects that want to use Julia.

BTW, I now have more concrete comparison, with the C++ (GCC) and an ISPC implementation. The details are here:

https://github.com/claforte/RayTracingWeekend.jl#competitive-targets-gpsnoopys-ispc-c-and-vulkan-implementations

The conclusion is:

This Julia implementation, with 14 threads, runs in 22m21s , i.e. now a bit faster than the equivalent GCC C++ version! (22m59s)

However the ISPC version is incredibly fast in comparison: 6m5s.

I also briefly reviewed the code of each version, to confirm that we’re implementing the same thing, i.e. same max number of bounces, no BVH, etc.

I also tried to enable various Julia optimizer settings, -O3 -ffast-math -march=znver2 and peppered @fastmath, @simd, @inbounds, and it made no difference. Memory allocations/traversal still seem to be the primary bottleneck.

7 Likes

Any chance we can get a disassembled version of the ISPC assembly? I’d be curious to see what it does differently.

1 Like

I don’t know how, but you should be able compile yourself pretty easily: GitHub - GPSnoopy/RayTracingInOneWeekend: Implementation of Peter Shirley's Ray Tracing In One Weekend book.

Is it easy to test how the codes scale with the number of spheres?

These small tests may fall into the categories of problems that fall within a compiler heuristic and not the other to do union splitting of classes/types. Maybe this large difference are not sustained for more realistic problems.

1 Like