Hi, after I talked about Julia to colleagues some weeks ago, I somehow wanted to plot Julia sets (Wikipedia), simply because I thought it would be a fun example.
To play with it I used Interact.jl.
The current result is this 82 liner: Github gist
I tried to get rid of slow things I knew about:
use existing arrays instead of creating new ones
tried to avoid type changes
Then I attempted to utilize my 8-core Ryzen with 15 threads by partitioning the image (its height) into batches and using @threads
Check out the performance tips, particularly the very first one, which is “Avoid global variables”. Your code is entirely based on non-constant global variables, which have a huge negative performance impact. If you can rewrite it to avoid global variables entirely you should see a significant improvement.
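As an illustrative sketch of why this matters (the names below are invented, not from the gist): a non-const global is looked up and dispatched dynamically on every access, while a function argument is a typed local the compiler can optimize.

```julia
# Illustrative sketch (names invented, not from the gist): a non-const
# global forces dynamic lookup and dispatch inside the hot loop, while a
# function argument has a concrete inferred type.
limit = 100_000                # non-const global: slow to use in loops

function sum_via_global()
    s = 0
    for i in 1:limit           # `limit` is looked up dynamically each time
        s += i
    end
    return s
end

function sum_via_argument(n)
    s = 0
    for i in 1:n               # `n` is a typed local: compiles to tight code
        s += i
    end
    return s
end
```

Both compute the same sum, but `sum_via_argument` is what the compiler can actually specialize; declaring the global `const` would have a similar effect.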
Check out the Julia fractal code from @sdanisch here, but keep in mind that this was pre-1.0 and I don't know the current state of GPUArrays; the same code should also work with CuArrays.
function juliaset(z0, c, maxiter)
    z = z0
    for i in 1:maxiter
        abs2(z) > 4f0 && return (i - 1) % UInt8
        z = z * z + c
    end
    return maxiter % UInt8 # % is used to convert without overflow check
end

function threaded(output, input, c)
    Threads.@threads for i in eachindex(input)
        @inbounds output[i] = juliaset(input[i], c, 255)
    end
    return output
end
# scale should change with picture size
scale = 200.0 # the greater the more details
width, height = 800, 700
x = -0.5; y = 1.2
c = x+y*im
xvalues = LinRange(-width/2, width/2, width) ./ scale
yvalues = LinRange(-height/2, height/2, height) ./ scale
inputmatrix = xvalues' .+ yvalues .* im
result = fill(UInt8(0), height, width)
@time threaded(result, inputmatrix, c)
That should be faster… You could make it even faster with an inner loop that can be SIMD accelerated
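One possible reading of that suggestion (a sketch, not code from the thread) is to remove the early return so the compiler can vectorize across pixels, counting iterations branchlessly with `ifelse`:

```julia
# Sketch of a branchless variant (my assumption of what a "SIMD
# accelerated inner loop" could mean here): no early return, so the
# compiler is free to vectorize. Produces the same counts as juliaset.
function juliaset_branchless(z0, c, maxiter)
    z = z0
    count = 0x00                          # UInt8 iteration counter
    for _ in 1:maxiter
        inside = abs2(z) <= 4.0           # still bounded?
        count += inside % UInt8           # count steps before escape
        z = ifelse(inside, z * z + c, z)  # freeze z once it has escaped
    end
    return count
end
```

The trade-off: every pixel now always runs `maxiter` steps, so this only wins when the removed branch lets SIMD pay for the extra work.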
Wrong: using your version of juliaset(…) as-is was actually slower, and @code_warntype pointed out that in z = z0 (and thus in abs2(z)) the type was inferred as Any. I take this paragraph back: I tried to reproduce it in a fresh REPL, and it was fast now. Sorry.
Using your function as function juliaset(z0::Complex{Float64}, c, maxiter) is three times faster than mine. This is cool!
Make sure you have -O3 as the optimization level
Might not change anything, but I remember that -O3 pretty much only enables better optimizations for reducing over a variable in a loop, which we are actually doing here…
I come to the conclusion that I don’t know yet what I’m doing; getting results all over the place.
So finally updated the gist again. Gist
j(…) is now a comparison test, calling @btime for both our code versions.
simon:
12.790 ms (116 allocations: 13.72 KiB)
zenon:
34.274 ms (2085420 allocations: 57.48 MiB)
Well this is AFTER I changed my code:
corrected UInt to a concrete UInt32 in my inner loop (changing 8 to 32 seems to do no harm).
replaced z^2 with z*z (from your code) in the inner loop.
replaced abs with abs2, which seems to be much faster for Complex{Float64} (from your code), in the inner loop.
used @inbounds (from your code), but that seemed not to change anything.
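Putting those changes together, the revised inner loop presumably looks something like this (a reconstruction, since the gist itself isn't quoted here; the function name is mine):

```julia
# Hypothetical reconstruction of the revised inner loop, combining the
# changes listed above: concrete UInt32 return type, z * z instead of z^2,
# abs2 instead of abs. (The name julia_iter is illustrative.)
function julia_iter(z0::Complex{Float64}, c::Complex{Float64}, maxiter::Integer)
    z = z0
    @inbounds for i in 1:maxiter
        abs2(z) > 4.0 && return UInt32(i - 1)  # escaped after i-1 steps
        z = z * z + c
    end
    return UInt32(maxiter)                     # never escaped
end
```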
So. Many thanks.
I have to leave the large computer again, for 3 days. That was fun!
Having a quick look at the code, a couple of thoughts:
batchify could be replaced by Iterators.partition? But it should also be unnecessary, because @threads already statically partitions the work into nthreads() batches, I think.
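For reference, Iterators.partition chops an iterable into consecutive chunks, which is essentially what a hand-rolled batchify would do:

```julia
# Iterators.partition splits an iterable into consecutive chunks of a
# given maximum length; the last chunk may be shorter.
chunks = collect(Iterators.partition(1:10, 4))

println(length(chunks))          # number of chunks
println(collect(chunks[end]))    # the last, shorter chunk
```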
In juliaSet you use for y = 1:width. This likely kills performance because width is a non-const global variable. You can use size(dataGrid, 2) instead.
In the latest version I can see, there’s still the use of global width in:
function juliaSet(batches, valueGrid, dataGrid, c)
    @threads for batch in batches
        for x in batch
            for y = 1:width
                @inbounds dataGrid[x,y] = julia(valueGrid[x,y], c, MAXITER)
            end
        end
    end
end
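A hedged sketch of the same function with the global removed; a minimal `julia` kernel and a `const MAXITER` are included here only so the snippet is self-contained (in the gist they exist separately):

```julia
using Base.Threads: @threads

const MAXITER = 255  # const, so its type is fixed (assumed value)

# Minimal escape-time kernel, as used elsewhere in the thread, included
# only to make this sketch runnable on its own.
function julia(z0, c, maxiter)
    z = z0
    for i in 1:maxiter
        abs2(z) > 4.0 && return (i - 1) % UInt8
        z = z * z + c
    end
    return maxiter % UInt8
end

# Same loop as above, but the trip count comes from the array itself
# instead of the non-const global `width`.
function juliaSet(batches, valueGrid, dataGrid, c)
    @threads for batch in batches
        for x in batch
            for y in 1:size(dataGrid, 2)   # was: 1:width
                @inbounds dataGrid[x, y] = julia(valueGrid[x, y], c, MAXITER)
            end
        end
    end
    return dataGrid
end
```

With this change the function depends only on its arguments and the `const` global, so type inference can do its job.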