I translated Peter Shirley's Raytracer to Julia. C++ 1min -> Julia 2m30s -> 15s (@threads)

Hi,

I like raytracing, and there is a nice set of books:

https://github.com/RayTracing/raytracing.github.io

Now that I’m a bit better at Julia, I transcoded the C++ into Julia and got some nice speedups.

The single-threaded version went from 9m in C++ to 2m30s in Julia (with no tricks - I don’t know any!)

(The C++ was originally built with no optimisation flags; after using -O3 and -march=native it was down to about 1 minute.)

And when I added threads (40 of them :slight_smile: ) I got it down to 15s, including compilation.
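
Roughly, the threading amounts to splitting the scanlines across threads. A minimal sketch of the idea (trace_pixel is a stand-in for the per-pixel sampling loop, not the exact name in my code):

using Base.Threads

# Each thread renders whole rows independently, so no locking is needed
# as long as every thread writes to different rows of `img`.
function render!(img, scene, camera; samples_per_pixel=10, max_depth=50)
    height, width = size(img)
    @threads for j in 1:height
        for i in 1:width
            img[j, i] = trace_pixel(scene, camera, i, j, samples_per_pixel, max_depth)
        end
    end
    return img
end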

Here’s my Julia code

And here’s the book’s GitHub Discussions page, where I made a similar post to this.

Sorry for the noob question, but does your Julia code produce images like the one in the “Ray Tracing in One Weekend” book?

yes, that exact one (modulo the rand()s)

Wow, I’m speechless.

Yes, I’m quite pleased. I wasn’t doing it to make it faster, I just wanted it to produce results :slight_smile:

But then I used a bit of @code_warntype and @time and reduced the allocations from tens of millions to tens of thousands.
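
For anyone who hasn’t used them, the loop is roughly: @time a render to see the allocation count, then point @code_warntype at the hot functions and fix anything flagged in red. Something like this, where main and ray_color are just placeholders for whatever is hot in your render loop:

@time main()                                  # watch the allocation count and GC time
@code_warntype ray_color(scene, ray, depth)   # red ::Any / ::Union entries mark type instabilities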

FYI, searching through Discourse, there seem to be a few related threads. I’m linking one of them.

4 hours!

Cool! Some small styling things:

const Vec3 = SVector{3,Float64}

is the same as SArray{Tuple{3}, Float64, 1, 3}.
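
You can check the alias in the REPL:

julia> using StaticArrays

julia> SVector{3,Float64} === SArray{Tuple{3},Float64,1,3}
true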

And note that you don’t need to define magnitude; you already have:

using LinearAlgebra: norm
x = rand(Vec3)
norm(x)

and also you could do:

julia> randf(::Type{T},fmin,fmax) where T = fmin .+ (fmax-fmin) .* rand(T)   
randf (generic function with 1 method)

julia> randf(Vec3,-1,1)
3-element SVector{3, Float64} with indices SOneTo(3):
 -0.3009982992786695
 -0.15959842094608057
  0.27936134566733006

julia> randf(SVector{2,Float64},-1,1)
2-element SVector{2, Float64} with indices SOneTo(2):
 -0.8267854719262475
 -0.9675131426873447

julia> randf(Float32,-1,1)
0.52987814f0

magnitude(x,y,z) is quicker than norm([x,y,z]) because of the extra allocation for the temporary array

magnitude(v) and norm(v) are the same

norm is a strange name too, Shirley had length and I don’t like that either :slight_smile:

I did swap out unit_vector for LinearAlgebra.normalize

That randf is cool too, but again, it creates a [x,y,z] vector rather than plain x, y, z values,
and this is all about reducing those allocations.

I moved the code; git was being annoying with the fork:

https://github.com/lawless-m/ShirleyRenderer.jl

If you use StaticArrays none of these allocate.
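
A quick way to see it, if you have BenchmarkTools installed (@btime prints the allocation count next to the timing):

julia> using BenchmarkTools, LinearAlgebra, StaticArrays

julia> v = SVector(1.0, 2.0, 3.0);

julia> @btime norm($v)              # 0 allocations: the SVector stays on the stack

julia> @btime norm([1.0, 2.0, 3.0]) # 1 allocation for the temporary Vector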

it was still slightly slower

perhaps I shall rework it all the way through and see properly

There’s no reason why Julia should be so much slower (it’s not GC, not String handling, not I/O, not byte wrangling) - well, I know it’s easy for me to say - but I hope we could get within 125% of the C++ time. This is a micro-benchmark big enough that people could point to it as a “composite” test of suitability.

I added support for Images.jl output instead of PPM, so now I can upload this:

It uses more samples per pixel (150) and a greater max depth (200); the defaults are 10 and 50.

Again with 40 threads. I’m not going to chase more performance for the moment - that’s not the purpose of this - although I hope to experiment with some GPU offloading. I can’t use any CPU specialisations: Julia complains about a missing cx16 instruction when I try to use -C.

matt@pox:~/GitHub/ShirleyRenderer.jl/src$ julia --project=.. -L RandomScene.jl -e "@time main(;samples_per_pixel=150, max_depth=200)"
123.419846 seconds (328.84 M allocations: 24.327 GiB, 15.28% gc time, 1.76% compilation time)

150% of the C++ time is the best I have managed single-threaded:

matt@pox:~/GitHub/ShirleyRenderer.jl$ time julia -t 1 --project=. -L src/RandomScene.jl -e "main()"

real 1m13.689s
user 1m12.059s
sys 0m1.360s

matt@pox:~/GitHub/raytracing.github.io/src/$ time ./a.out > a.out.ppm
Scanlines remaining: 0
Done.

real 0m49.599s
user 0m49.570s
sys 0m0.028s

Your example has 24 GiB of allocations and 15% GC time. That’s probably the problem. You could preallocate the arrays used by each thread and reuse them. That much allocation will probably hurt your cache performance too.
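
Something along these lines (a sketch, not your actual code; render_row! is a placeholder for whatever the inner loop does, and it is assumed to fill its buffer in place):

using Base.Threads

function render!(img, scene)
    # one scratch buffer per thread, allocated once and reused across rows
    buffers = [similar(img, size(img, 2)) for _ in 1:nthreads()]
    @threads :static for row in 1:size(img, 1)
        buf = buffers[threadid()]     # indexing by threadid is safe with :static scheduling
        render_row!(buf, scene, row)  # placeholder: overwrites buf, allocates nothing
        img[row, :] .= buf
    end
    return img
end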

yep, but I think that’s quite a bit of re-work

I’ll spend some more time on it once I have the other code ported

It turned out the design required doing the per-thread work anyway.

I used two mutable structs to hold the intermediate results, which reduced allocations significantly.

e.g.

mutable struct Hit
	p::Point3
	normal::Vec3
	t::Float64
	front_face::Bool
	material::Material
	# material is left unassigned here and gets filled in when a hit is recorded
	Hit() = new(zero(Point3), zero(Vec3), 0, true)
end
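
For context, here is a sketch of how a struct like that gets reused per ray, in the spirit of the book’s sphere test (Ray, Sphere, at and hit! are simplified stand-ins, not the exact code in the repo):

using StaticArrays, LinearAlgebra

# minimal stand-ins for the package's types
const Vec3 = SVector{3,Float64}
const Point3 = Vec3
abstract type Material end

struct Ray
	origin::Point3
	direction::Vec3
end

at(r::Ray, t) = r.origin + t * r.direction

struct Sphere
	center::Point3
	radius::Float64
	material::Material
end

# Writes the intersection into the preallocated rec instead of constructing
# (and heap-allocating) a new Hit for every ray-sphere test.
function hit!(rec::Hit, s::Sphere, r::Ray, tmin, tmax)
	oc = r.origin - s.center
	a = dot(r.direction, r.direction)
	half_b = dot(oc, r.direction)
	c = dot(oc, oc) - s.radius^2
	disc = half_b^2 - a * c
	disc < 0 && return false
	sqrtd = sqrt(disc)
	root = (-half_b - sqrtd) / a
	if root < tmin || root > tmax
		root = (-half_b + sqrtd) / a
		(root < tmin || root > tmax) && return false
	end
	rec.t = root
	rec.p = at(r, root)
	outward = (rec.p - s.center) / s.radius
	rec.front_face = dot(r.direction, outward) < 0
	rec.normal = rec.front_face ? outward : -outward
	rec.material = s.material
	return true
end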

However, the effect on the time was a few seconds at most:

$ time julia --project=. -t 1 -L src/RandomScene.jl -e "@time main()"
 62.763695 seconds (6.64 M allocations: 330.171 MiB, 0.53% gc time, 3.15% compilation time)

real    1m10.922s
user    1m9.625s
sys     0m1.029s

and only about 1 second of wall time for the 40-thread version:

matt@pox:~/GitHub/ShirleyRenderer.jl$ time julia --project=. -t 40 -L src/RandomScene.jl -e "@time main()"
  5.969525 seconds (6.64 M allocations: 330.225 MiB, 3.03% gc time, 35.72% compilation time)

real    0m14.056s
user    2m48.924s
sys     0m0.988s

I should take my own advice and just try it, but are mutable structs significantly slower than if I kept a set of per-thread vectors of the fields and updated those?

ps = Vector{Point3}(undef, nthreads())
normals = Vector{Point3}(undef, nthreads())
...

Immutable is usually better, but more importantly, type-stable is best. You need to run Cthulhu.jl and ProfileView.jl on everything and get rid of instabilities. At a glance, Material is an abstract type, but you have it as a field type. Instead, maybe use a type parameter M; or, if you need to mix materials in a vector, don’t use a type to specify them at all.
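
For example, something like this (sketched with a made-up material type, not your actual ones):

using StaticArrays

const Vec3 = SVector{3,Float64}   # as earlier in the thread
const Point3 = Vec3

abstract type Material end

struct Lambertian <: Material
    albedo::Vec3
end

# The concrete material type is part of Sphere's type, so the compiler knows
# exactly what sphere.material is. The trade-off is that a Vector holding
# differently parameterised Spheres is itself abstractly typed.
struct Sphere{M<:Material}
    center::Point3
    radius::Float64
    material::M
end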

Just to let you know, I’ve been playing with your code too: https://github.com/cdsousa/ShirleyRenderer.jl

One thing I noticed is that having (1) materials as an abstract-typed field of Sphere and (2) a vector of abstract Hitables in Scene kind of hits a sweet spot in the compiler, which produces well-optimized code.
