Slow parallel fractals

I'm trying @threads and @distributed on my favourite parallel test case, and I can't figure out why @threads is so slow and allocates so much memory, or why @distributed is also slow.

using Base.Threads

function genFractal(α, n=100)
    niter = 255 # max number of iterations
    threshold = 10.0 # limit of |z| before labelling as divergent
    len = 3.0  # len^2 is the area of the picture
    xmin, ymin = -1.5, -1.5
    ymax = ymin + len
    ax = len/n
    z::Complex = 0.0
    zinit::Complex = α > 0 ? 0.0 : 1.0+im
    count = Array{Int,2}(undef,n,n)
    @threads for j in 1:n
        cy = ymax - ax*j
        for i in 1:n
            cx = ax*i + xmin
            c = cx + im*cy
            nk = niter
            z = zinit
            for k in 1:niter
                if abs(z) < threshold
                    z = z^α + c
                    nk = k-1
                end
            end
            @inbounds count[i,j] = nk
        end
    end
    return count
end

frac = genFractal(2,1000);
@time genFractal(2,1000);

Running this without @threads on my MacBook Pro gives:

1.609887 seconds (6 allocations: 7.630 MiB)

Now with @threads and 2 threads:

6.242670 seconds (125.02 M allocations: 3.098 GiB, 4.51% gc time)

Look at the allocations and memory used! I’ve got much better results by turning that outer loop into recursive calls and using @spawn, getting 0.815582 seconds for 2 threads.
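A minimal sketch of that divide-and-conquer @spawn pattern (the names here are mine, and the loop body is a placeholder standing in for the per-pixel fractal work):

```julia
using Base.Threads: @spawn

# fillrows! stands in for the per-row fractal work above.
function fillrows!(count, jlo, jhi)
    for j in jlo:jhi, i in 1:size(count, 1)
        count[i, j] = i + j   # placeholder for the per-pixel iteration
    end
end

# Split the row range in half, @spawn one half as a task,
# recurse on the other half, then wait for the spawned task.
function parfill!(count, jlo=1, jhi=size(count, 2), grain=64)
    if jhi - jlo < grain
        fillrows!(count, jlo, jhi)
    else
        mid = (jlo + jhi) ÷ 2
        t = @spawn parfill!(count, jlo, mid, grain)
        parfill!(count, mid + 1, jhi, grain)
        wait(t)
    end
    return count
end
```

Each variable here is local to its own call frame, so nothing is shared between tasks except the output array, and each task writes to a disjoint range of columns.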

@distributed is not much better. The array count is now a SharedArray:

count = SharedArray{Int,2}(n,n)
@sync @distributed for j in 1:n
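For completeness, here is the shape of the full distributed setup as I understand it (placeholder loop body; assumes workers were added with -p N or addprocs, and note the extra packages the snippet needs):

```julia
using Distributed, SharedArrays

# SharedArray is visible from all local worker processes;
# @sync waits until every worker has finished its chunk of rows.
n = 100
count = SharedArray{Int,2}(n, n)
@sync @distributed for j in 1:n
    for i in 1:n
        count[i, j] = i + j   # stands in for the fractal iteration
    end
end
```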

Starting Julia with -p 1 gives:

33.244954 seconds (1.26 k allocations: 67.547 KiB)

and starting with -p 2 gives:

16.797049 seconds (598 allocations: 24.391 KiB)

So it scales, but is much slower than the sequential run.

Just delete this line of code:

    z::Complex = 0.0

That’s forcing z to be boxed and shared across all threads/workers — with all the contention and overhead that it incurs.

Edit: and it’s also giving you the wrong answer because all the workers are racing to update/use the same z.
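Concretely, the fix is just to give each loop body its own z (a sketch: genFractalFixed is my name for it, the rest mirrors the original):

```julia
using Base.Threads

function genFractalFixed(α, n=100)
    niter = 255        # max number of iterations
    threshold = 10.0   # limit of |z| before labelling as divergent
    len = 3.0
    xmin, ymin = -1.5, -1.5
    ymax = ymin + len
    ax = len / n
    zinit = α > 0 ? complex(0.0) : 1.0 + im
    count = Array{Int,2}(undef, n, n)
    @threads for j in 1:n
        cy = ymax - ax*j
        for i in 1:n
            c = (ax*i + xmin) + im*cy
            z = zinit   # local to the loop body: no boxing, no sharing
            nk = niter
            for k in 1:niter
                if abs(z) < threshold
                    z = z^α + c
                    nk = k - 1
                end
            end
            @inbounds count[i, j] = nk
        end
    end
    return count
end
```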


Is there a way to tell when the compiler is doing this for you? I'd almost prefer the compiler to just refuse to generate the code and tell me I screwed up in my attempt to use multiple threads.

Having the compiler "do stuff" for me is nice, but figuring out what is slowing the process down is much harder when the synchronization happens invisibly.

When you put z outside of the loop like that, you make the compiler think you want to keep it after the loop as well. That isn't "wrong", so the program definitely should not refuse to run – there are many cases where that is exactly what you want.

In reality, z is just a temporary variable that exists only within your loop. So, in this case, no it isn’t the right thing to do. The reason why was already mentioned.
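To make the capture rule concrete, here is a toy pair (my own example, not from the thread above): reassigning a captured variable inside a closure forces a Core.Box, while a variable local to the closure stays concretely typed:

```julia
# `z` is assigned both outside and inside the closure, so Julia
# boxes it; @code_warntype boxed() reports z::Core.Box.
function boxed()
    z = 0.0
    f = () -> (z += 1.0; z)
    f()
end

# Here `z` exists only inside the closure, so no boxing occurs.
function unboxed()
    f = () -> (z = 1.0; z + 0.0)
    f()
end
```

The @threads macro turns the loop body into a closure behind the scenes, which is why the outer z declaration in genFractal hits exactly this case.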

I mean, any of our performance introspection tools will flag this for you. That’s how I quickly identified it.

julia> @code_warntype genFractal(2, 1000)
  z::Core.Box # in big bold red

It’s nice to be able to do this when you want to — but it is easy to incur race conditions. Fortunately those race conditions are quite often performance bugs, too, making them easy to spot (as in this case).


:slight_smile: Okay that makes sense, thanks.

That did it, for both the threaded and distributed versions. That was a dumb race condition I set up :roll_eyes:, though the results didn't seem to be affected (at least not the fractal, visually). Now the @threads macro with 2 threads gives:

0.818296 seconds (22 allocations: 7.631 MiB)

and @distributed with 2 processes gives:

0.815279 seconds (587 allocations: 24.156 KiB)


Good to know about this. Now I’ve got to study the Performance Tips page carefully!