Julia sets in Julia, using Interact.jl - I'd like to get it faster

Hi, after I talked about Julia to collegues some weeks ago, I somehow wanted to plot Julia sets (Wikipedia), simply because I thougt it to be a funny example.

To play with it I used Interact.jl.
The current result is this 82 liner: Github gist

I tried to get rid of slow things I knew about:

  • use existing arrays instead of creating new ones
  • tried to avoid type changes

Then I attempted to utilize my 8 core Ryzen with 15 threads, by partitioning the image (its hight) into batches, and use @threads

This is faster, but much less than I hoped for:

With threads:

julia> @time juliaSet(-0.62+0.42im)
0.442297 seconds (19.41 M allocations: 321.882 MiB, 22.05% gc time)

Without:
julia> @time juliaSet(-0.62+0.42im)
1.208751 seconds (19.12 M allocations: 317.792 MiB, 4.61% gc time)

In addition the plot takes time of course.

Can I do better?

(This is rather a fun example, so I mainly do it for the fun of it, and to learn.)

Kind greetings, z.

Check out the performance tips, particularly the very first one, which is “Avoid global variables”. Your code is entirely based on non-constant global variables, which have a huge negative performance impact. If you can rewrite it to avoid global variables entirely you should see a significant improvement.

Ah. Great resource, thank you!

Updated to local variables now. The line

@time juliaSet(batches, valueGrid, dataGrid, c)

is now called inside a function.

julia> j()
0.351686 seconds (19.11 M allocations: 317.267 MiB)

Slightly faster.

I should add:

julia> versioninfo()
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 7 3700X 8-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
JULIA_NUM_THREADS = 16

I will try on tomorrow …

Checkout the julia fractal code from @sdanisch here, but keep in mind that this was pre 1.0 and I don’t know the current state of GPUArrays, but same code should also work with CUArrays.

Interesting. I got a AMD card, so I’ll see … but not today.
Thank you!

function juliaset(z0, c, maxiter)
    z = z0
    for i in 1:maxiter
        abs2(z) > 4f0 && return (i - 1) % UInt8
        z = z * z + c
    end
    return maxiter % UInt8 # % is used to convert without overflow check
end

function threaded(output, input, c)
    Threads.@threads for i in eachindex(input)
        @inbounds output[i] = juliaset(input[i], c, 255)
    end
    return output
end
# scale should change with picture size
scale = 200.0 # the greater the more details
width, height = 800, 700
x = -0.5; y = 1.2
c = x+y*im
xvalues = LinRange(-width/2, width/2, width) ./ scale
yvalues = LinRange(-height/2, height/2, height) ./ scale
inputmatrix = xvalues' .+ yvalues .* im 
result = fill(UInt(8), height, width)

@time threaded(result, inputmatrix, c)

That should be faster… You could make it even faster with an inner loop, that can be SIMD accelerated

There is also GPU implementation of Mandelbrot set, which can easily be adapted to your needs Coding the Mandelbrot Set in Julia (with GPU) - YouTube

Wrong: Using your version of juliaset(…) immediately was actually slower, but @code_warntype pointed out that in z = z0 and thus in abs2(z) the type was assumed to be Any.
I take this paragraph back. Tried to reproduce this with a fresh REPL, and it was fast now. Sorry.

Using your function as function juliaset(z0 :: Complex{Float64}, c, maxiter) is three times faster than mine. This is cool!

Thank you!

Only 3 times?
I get:

@btime juliaSet($batches, $valueGrid, $dataGrid, $c)
51ms
@btime threaded($result, $inputmatrix, $c)
729.403 ÎĽs

Which is a speed up of ~70x

To what number did you set JULIA_NUM_THREADS before running Julia?
And are you using julia -O3 ?

I’m not at that computer in the moment, but for Thread.nthreads() I got 16.

Shouldn’t that be enough? (I did it in Juno.)

(Until now I only played with your version of the inner loop, as I don’t have an interesting amount of threads on my current machine.)

Alright, the inner version shouldnt be that different performance wise!


Make sure you have 3 for optimization level :wink:
Might not change anything, but I remember that 03 pretty much only enabled better optimizations for reducing over a variable in a loop, which we are actually doing here…

I don’t think -O3 does really anything right now. At least I haven’t seen a single example where it is faster since SLP got enabled with -O2.

Yeah I thought almost nothing changes with -O3 but this one thing… but my information is pretty old :wink:

I come to the conclusion that I don’t know yet what I’m doing; getting results all over the place.

So finally updated the gist again. Gist
j(…) is now a comparing test, calling (at)btimes for both our code versions.

simon:
12.790 ms (116 allocations: 13.72 KiB)
zenon:
34.274 ms (2085420 allocations: 57.48 MiB)

Well this is AFTER I changed my code:

  1. corrected UInt to a concrete UInt32 (change to 32 from 8, seems to do no harm) in my inner loop.
  2. replaced z^2 by z*z (from your code), inner loop
  3. replaced abs by abs2, which seems to be much faster for Complex{Float64} (from your code), inner loop
    4.used (at)inbounds (from your code), but that seemed not to change anything.

So. Many thanks.
I have to leave the large computer again, for 3 days. That was fun!

Kind greetings, z.

Having a quick look at the code, a couple of thoughts:

  • batchify could be replaced by Iterators.partition? But it also should be unnecessary because @threads already statically partitions the work into nthreads batches, I think.
  • In juliaSet you use for y = 1:width. This likely kills performance because width is a global variable. You can use size(dataGrid,2) instead.

Ah, thank you!
I assumed there must be sth like batchify already, I just didn’t know where to find it :slight_smile:

The global vars are gone in rev. 2 and 3 of the gist.

In the latest version I can see, there’s still the use of global width in:

function juliaSet(batches, valueGrid, dataGrid, c)
    @threads for batch in batches
        for x in batch
            for y = 1:width
                @inbounds dataGrid[x,y] = julia(valueGrid[x,y], c, MAXITER)
            end
        end
    end
end

Uuups. Thank you!, escaped me.

If you are doing Julia sets in Julia you might be tempted to replace z*z by z*z*z…

https://github.com/JuliaLang/julia/issues/16357#issuecomment-219693645