Hard to beat Numba / JAX loop for generating Mandelbrot

Find the Numba code in mandelbrot-on-all-accelerators.ipynb · GitHub, under the heading “Numba: imperative in a compiled subset of Python”. On my laptop, the Numba code reports:

In [30]: %%timeit
    ...: run_numba(2000, 3000)
144 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [31]: %%timeit
    ...: run_numba(4000, 6000)
579 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Julia attempt in 1.8-rc1:

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    @fastmath @inbounds for h in 1:height
        for w in 1:width
            _c = c[h, w]
            z = _c
            for i in 1:20
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end

julia> @btime run_julia(2000, 3000);
  174.944 ms (2 allocations: 22.89 MiB)

julia> @btime run_julia(4000, 6000);
  704.037 ms (2 allocations: 91.55 MiB)

Invert the h/w loop order maybe?

In this case it doesn’t seem to matter at all:

# with correct w/h order
julia> @btime run_julia(2000, 3000);
  175.864 ms (4 allocations: 68.66 MiB)

it seems that 1.7.3 is significantly faster:

julia> @btime run_julia(2000, 3000);
  166.743 ms (4 allocations: 68.66 MiB)

It matters for me:

Inverted: 433.209 ms (4 allocations: 274.66 MiB)
As written: 706.969 ms (4 allocations: 274.66 MiB)

(using 1.7.3)
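
The asymmetry comes from memory layout: Julia arrays are column-major, so the first (row) index should vary in the innermost loop. A minimal illustration (variable names are my own):

```julia
# Julia arrays are column-major: elements of one column are contiguous
# in memory, so the row index should be the innermost loop variable.
A = zeros(Int, 3, 4)

# Cache-friendly order: outer loop over columns (w), inner loop over rows (h).
for w in 1:size(A, 2), h in 1:size(A, 1)
    A[h, w] = h + 10w
end
```

In `run_julia` as written, `h` is the outer loop, so successive writes to `fractal[h, w]` stride across columns instead of walking down one.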

Can confirm: the 166.743 ms on 1.7.3 slows down to

julia> @btime run_julia(2000, 3000);
  181.135 ms (4 allocations: 68.66 MiB)
julia> include("mandelbrot.jl");
  435.391 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.8.0-beta3"
julia> include("mandelbrot.jl");
  427.554 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.7.3"
`mandelbrot.jl`:
using BenchmarkTools

@inline function mandel_step(c, h, w, fractal)
    @inbounds _c = c[h, w]
    z = _c
    @fastmath @inbounds for i in Int32(1):Int32(20)
        z = z^2 + _c
        if abs2(z) > 4
            fractal[h, w] = i
            break
        end
    end
end

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    for w in 1:width
        for h in 1:height
            mandel_step(c, h, w, fractal)
        end
    end
    return fractal
end

@btime run_julia(4000, 6000);

Output of `versioninfo()`:
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_EDITOR = nvim

Is that faster than Numba on your local computer?

Threading helps

julia> @btime run_julia(2000, 3000); # as written
  206.204 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # inverted loops
  175.691 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # threads
  56.702 ms (47 allocations: 68.67 MiB)

julia> @btime run_julia(2000, 3000); # Polyester.@batch
  41.661 ms (6 allocations: 68.66 MiB)
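
For reference, one way to get the threaded numbers above is to parallelize the outer (column) loop with `Threads.@threads` — a sketch, not necessarily the exact code benchmarked above:

```julia
using Base.Threads

# Sketch of a threaded variant: each thread handles a chunk of columns,
# so no two threads ever write to the same element of `fractal`.
function run_julia_threaded(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    fractal = fill(Int32(20), height, width)
    @threads for w in 1:width
        _x = x[w]
        @inbounds for h in 1:height
            _c = complex(_x, y[h])
            z = _c
            for i in Int32(1):Int32(20)
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end
```

Start Julia with `-t 4` (or set `JULIA_NUM_THREADS`) for this to use multiple threads.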

Yeah, but isn’t that cheating compared to Numba / JAX?

In [3]: %%timeit
   ...:  run_numba(4000, 6000)
   ...: 
430 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So pretty much equal.

Why wouldn’t they use threads? The goal is to make the computation fast right?

So Numba uses 64-bit floats here; you should use 64-bit floats in Julia as well.
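
Concretely, that means dropping the `f0` suffixes so the ranges (and hence `c` and `z`) become `Float64` — a quick check:

```julia
# The `f0` suffix makes a Float32 literal; plain decimals are Float64.
y32 = range(-1.0f0, 0.0f0; length = 4)
y64 = range(-1.0, 0.0; length = 4)

# The element type of the ranges drives the precision of the whole computation.
eltype(y32)  # Float32
eltype(y64)  # Float64
```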

I get ~560 ms for Numba and ~805 ms for Julia (column major ordered). They generate different code. Inspecting the assembly, I think the difference may be that Julia doesn’t use fma for squaring the complex numbers in the inner loop, whereas Numba does. I’m not confident that is correct, though.

Edit: Replacing the squaring operation with this:

@inline function f(x)
    @fastmath begin
    r = real(x)
    i = imag(x)
    Complex(fma(r, r, -i * i), fma(r, i, i * r))
    end
end

I get 610 ms for Julia. So not quite all the difference explained, but most of it.
Edit2: Fixed typo in fma computation, thanks to @JM_Beckers

The square of the complex number is just lowered to x*x. Could that be an alternative default implementation, or is there a tradeoff?
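
For reference, `z*z` for `z = a + b*im` expands to `(a² − b²) + (2ab)i`, which is exactly what the `fma` version above computes, just with fused roundings:

```julia
# Complex squaring by hand: (a + b*im)^2 == (a^2 - b^2) + 2a*b*im.
a, b = 1.25, -0.5
z  = complex(a, b)
sq = complex(a*a - b*b, 2*a*b)   # what z*z lowers to (up to rounding)
```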

Complex(fma(r, i, -i * i), fma(r, i, i * r))

Should be

Complex(fma(r, r, -i * i), fma(r, i, i * r))

I thought so too, and indeed, on my machine using z*z in conjunction with @fastmath shaves off significant time (~15%), but it also becomes incorrect :sweat_smile:

julia> function run_julia_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
                   _c = c[p]
                   z = _c
                   for i in 1:20
                       z = z^2 + _c
                       if real(z)*real(z)+imag(z)*imag(z) > 4
                           fractal[p] = i
                           break
                       end
                   end
               end
               return fractal
       end
run_julia_fast (generic function with 1 method)

julia> function run_julia2_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
                   _c = c[p]
                   z = _c
                   for i in 1:20
                       z = z*z + _c
                       if real(z)*real(z)+imag(z)*imag(z) > 4
                           fractal[p] = i
                           break
                       end
                   end
               end
               return fractal
       end
run_julia2_fast (generic function with 1 method)
julia> frac1_fast = run_julia_fast(2000,3000);

julia> frac2_fast = run_julia2_fast(2000,3000);

julia> findall(!=(0), frac1_fast - frac2_fast)
11-element Vector{CartesianIndex{2}}:
 CartesianIndex(1715, 105)
 CartesianIndex(1900, 153)
 CartesianIndex(1763, 206)
 CartesianIndex(1911, 223)
 CartesianIndex(1781, 356)
 CartesianIndex(1607, 451)
 CartesianIndex(1238, 466)
 CartesianIndex(1676, 507)
 CartesianIndex(1238, 512)
 CartesianIndex(1529, 1335)
 CartesianIndex(294, 2485)

The z^2 version returns the same result with and without fastmath.

Creating c is an unnecessary allocation. Instead, you can just iterate over the ranges:

@inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0, width))
        for (h, y) in pairs(range(-1.0f0, 0.0f0, height))
            z = _c = complex(x, y)

Also,

Maybe Int32(1):Int32(20) can help avoid a conversion when writing into fractal.
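
Putting both suggestions together, a full allocation-free variant might look like this (my own sketch, not benchmarked here):

```julia
# Sketch: iterate the coordinate ranges directly instead of materializing `c`,
# and use an Int32 loop counter so no conversion is needed on assignment.
function run_julia_noalloc(height, width)
    fractal = fill(Int32(20), height, width)
    @inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0; length = width))
        for (h, y) in pairs(range(-1.0f0, 0.0f0; length = height))
            z = _c = complex(x, y)
            for i in Int32(1):Int32(20)
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end
```

Note the loop nest is already in the column-major-friendly order (`w` outer, `h` inner), so this also avoids the strided writes discussed above.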

I thought fma was for accuracy and muladd was for performance:

fma(x, y, z)
  Computes x*y+z without rounding the intermediate result x*y. On some systems this is
  significantly more expensive than x*y+z. fma is used to improve accuracy in certain algorithms.
muladd(x, y, z)
  Combined multiply-add: computes x*y+z, but allowing the add and multiply to be merged with each
  other or with surrounding operations for performance.

Edit: It seems fma isn’t even defined for complex numbers, while muladd is. So you can try

z = muladd(z, z, _c)
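
A quick check that `muladd` accepts `Complex` arguments and agrees with the plain expression:

```julia
# muladd(z, z, c) computes z*z + c, allowing the compiler to fuse the
# component-wise multiplies and adds where profitable.
z = 0.3 + 0.4im
c = -0.7 + 0.2im
plain = z * z + c
fused = muladd(z, z, c)
```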

The Python code does allocate this, so isn’t it reasonable to allocate it in Julia as well when comparing performance?

Then it should be fixed in both places. Pointless allocations are inexcusable.