Hi, after I talked about Julia to colleagues some weeks ago, I somehow wanted to plot Julia sets (Wikipedia), simply because I thought it would be a fun example.
To play with it I used Interact.jl.
The current result is this 82 liner: Github gist
I tried to get rid of slow things I knew about:
use existing arrays instead of creating new ones
tried to avoid type changes
Then I attempted to utilize my 8-core Ryzen with 15 threads by partitioning the image (its height) into batches and using @threads
Check out the performance tips, particularly the very first one, which is “Avoid global variables”. Your code is entirely based on non-constant global variables, which have a huge negative performance impact. If you can rewrite it to avoid global variables entirely you should see a significant improvement.
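As an illustrative sketch of why this matters (the names below are invented, not from the gist): a non-const global is looked up and dispatched dynamically on every access, while a function argument is a typed local the compiler can optimize.

```julia
# Illustrative sketch (names invented, not from the gist): a non-const
# global forces dynamic lookup and dispatch inside the hot loop, while a
# function argument has a concrete inferred type.
limit = 100_000                # non-const global: slow to use in loops

function sum_via_global()
    s = 0
    for i in 1:limit           # `limit` is looked up dynamically each time
        s += i
    end
    return s
end

function sum_via_argument(n)
    s = 0
    for i in 1:n               # `n` is a typed local: compiles to tight code
        s += i
    end
    return s
end
```

Both compute the same sum, but `sum_via_argument` is what the compiler can actually specialize; declaring the global `const` would have a similar effect.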
Check out the Julia fractal code from @sdanisch here, but keep in mind that this was pre-1.0 and I don't know the current state of GPUArrays; the same code should also work with CuArrays.
function juliaset(z0, c, maxiter)
    z = z0
    for i in 1:maxiter
        abs2(z) > 4f0 && return (i - 1) % UInt8
        z = z * z + c
    end
    return maxiter % UInt8 # % is used to convert without overflow check
end

function threaded(output, input, c)
    Threads.@threads for i in eachindex(input)
        @inbounds output[i] = juliaset(input[i], c, 255)
    end
    return output
end
# scale should change with picture size
scale = 200.0 # the greater the more details
width, height = 800, 700
x = -0.5; y = 1.2
c = x+y*im
xvalues = LinRange(-width/2, width/2, width) ./ scale
yvalues = LinRange(-height/2, height/2, height) ./ scale
inputmatrix = xvalues' .+ yvalues .* im
result = fill(UInt8(0), height, width)
@time threaded(result, inputmatrix, c)
That should be faster… You could make it even faster with an inner loop that can be SIMD accelerated
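One possible reading of that suggestion (a sketch, not code from the thread) is to remove the early return so the compiler can vectorize across pixels, counting iterations branchlessly with `ifelse`:

```julia
# Sketch of a branchless variant (my assumption of what a "SIMD
# accelerated inner loop" could mean here): no early return, so the
# compiler is free to vectorize. Produces the same counts as juliaset.
function juliaset_branchless(z0, c, maxiter)
    z = z0
    count = 0x00                          # UInt8 iteration counter
    for _ in 1:maxiter
        inside = abs2(z) <= 4.0           # still bounded?
        count += inside % UInt8           # count steps before escape
        z = ifelse(inside, z * z + c, z)  # freeze z once it has escaped
    end
    return count
end
```

The trade-off: every pixel now always runs `maxiter` steps, so this only wins when the removed branch lets SIMD pay for the extra work.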
Wrong: using your version of juliaset(…) as-is was actually slower, and @code_warntype pointed out that in z = z0 (and thus in abs2(z)) the type was inferred as Any. I take this paragraph back: I tried to reproduce it in a fresh REPL, and it was fast now. Sorry.
Using your function as function juliaset(z0::Complex{Float64}, c, maxiter) is three times faster than mine. This is cool!
Make sure you have -O3 as the optimization level
Might not change anything, but I remember that -O3 pretty much only enables better optimizations for reducing over a variable in a loop, which we are actually doing here…
I come to the conclusion that I don’t know yet what I’m doing; getting results all over the place.
So finally updated the gist again. Gist
j(…) is now a comparison test, calling @btime for both our code versions.
simon:
12.790 ms (116 allocations: 13.72 KiB)
zenon:
34.274 ms (2085420 allocations: 57.48 MiB)
Well this is AFTER I changed my code:
corrected UInt to a concrete UInt32 in my inner loop (changing 8 to 32 seems to do no harm).
replaced z^2 with z*z (from your code) in the inner loop.
replaced abs with abs2, which seems to be much faster for Complex{Float64} (from your code), in the inner loop.
used @inbounds (from your code), but that seemed not to change anything.
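Putting those changes together, the revised inner loop presumably looks something like this (a reconstruction, since the gist itself isn't quoted here; the function name is mine):

```julia
# Hypothetical reconstruction of the revised inner loop, combining the
# changes listed above: concrete UInt32 return type, z * z instead of z^2,
# abs2 instead of abs. (The name julia_iter is illustrative.)
function julia_iter(z0::Complex{Float64}, c::Complex{Float64}, maxiter::Integer)
    z = z0
    @inbounds for i in 1:maxiter
        abs2(z) > 4.0 && return UInt32(i - 1)  # escaped after i-1 steps
        z = z * z + c
    end
    return UInt32(maxiter)                     # never escaped
end
```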
So. Many thanks.
I have to leave the large computer again, for 3 days. That was fun!
Having a quick look at the code, a couple of thoughts:
batchify could be replaced by Iterators.partition? But it should also be unnecessary, because @threads already statically partitions the work into nthreads() batches, I think.
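For reference, Iterators.partition chops an iterable into consecutive chunks, which is essentially what a hand-rolled batchify would do:

```julia
# Iterators.partition splits an iterable into consecutive chunks of a
# given maximum length; the last chunk may be shorter.
chunks = collect(Iterators.partition(1:10, 4))

println(length(chunks))          # number of chunks
println(collect(chunks[end]))    # the last, shorter chunk
```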
In juliaSet you use for y = 1:width. This likely kills performance because width is a non-const global variable. You can use size(dataGrid, 2) instead.
In the latest version I can see, there’s still the use of global width in:
function juliaSet(batches, valueGrid, dataGrid, c)
    @threads for batch in batches
        for x in batch
            for y = 1:width
                @inbounds dataGrid[x,y] = julia(valueGrid[x,y], c, MAXITER)
            end
        end
    end
end
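A hedged sketch of the same function with the global removed; a minimal `julia` kernel and a `const MAXITER` are included here only so the snippet is self-contained (in the gist they exist separately):

```julia
using Base.Threads: @threads

const MAXITER = 255  # const, so its type is fixed (assumed value)

# Minimal escape-time kernel, as used elsewhere in the thread, included
# only to make this sketch runnable on its own.
function julia(z0, c, maxiter)
    z = z0
    for i in 1:maxiter
        abs2(z) > 4.0 && return (i - 1) % UInt8
        z = z * z + c
    end
    return maxiter % UInt8
end

# Same loop as above, but the trip count comes from the array itself
# instead of the non-const global `width`.
function juliaSet(batches, valueGrid, dataGrid, c)
    @threads for batch in batches
        for x in batch
            for y in 1:size(dataGrid, 2)   # was: 1:width
                @inbounds dataGrid[x, y] = julia(valueGrid[x, y], c, MAXITER)
            end
        end
    end
    return dataGrid
end
```

With this change the function depends only on its arguments and the `const` global, so type inference can do its job.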