Hard to beat Numba / JAX loop for generating Mandelbrot

You can find the Numba code in mandelbrot-on-all-accelerators.ipynb · GitHub, under the heading “Numba: imperative in a compiled subset of Python”. On my laptop, the Numba code reports:

In [30]: %%timeit
    ...: run_numba(2000, 3000)
144 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [31]: %%timeit
    ...: run_numba(4000, 6000)
579 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Julia attempt in 1.8-rc1:

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    @fastmath @inbounds for h in 1:height
        for w in 1:width
            _c = c[h, w]
            z = _c
            for i in 1:20
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end

julia> @btime run_julia(2000, 3000);
  174.944 ms (2 allocations: 22.89 MiB)

julia> @btime run_julia(4000, 6000);
  704.037 ms (2 allocations: 91.55 MiB)

Invert the h/w loop order maybe?
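
For reference, a minimal sketch of the inverted order (same body as above, only the two loops swapped so the inner loop walks down a column, matching Julia’s column-major layout):

@fastmath @inbounds for w in 1:width
    for h in 1:height              # inner loop over rows: contiguous in memory
        _c = c[h, w]
        z = _c
        for i in 1:20
            z = z^2 + _c
            if abs2(z) > 4
                fractal[h, w] = i
                break
            end
        end
    end
end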


In this case it doesn’t seem to matter at all:

# with correct w/h order
julia> @btime run_julia(2000, 3000);
  175.864 ms (4 allocations: 68.66 MiB)


It seems that 1.7.3 is significantly faster:

julia> @btime run_julia(2000, 3000);
  166.743 ms (4 allocations: 68.66 MiB)

The loop order matters for me:

Inverted: 433.209 ms (4 allocations: 274.66 MiB)
As written: 706.969 ms (4 allocations: 274.66 MiB)

(using 1.7.3)


Can confirm; on 1.7.3:

166.743 ms slows down to

julia> @btime run_julia(2000, 3000);
  181.135 ms (4 allocations: 68.66 MiB)

julia> include("mandelbrot.jl");
  435.391 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.8.0-beta3"

julia> include("mandelbrot.jl");
  427.554 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.7.3"

`mandelbrot.jl`
using BenchmarkTools

@inline function mandel_step(c, h, w, fractal)
    @inbounds _c = c[h, w]
    z = _c
    @fastmath @inbounds for i in Int32(1):Int32(20)
        z = z^2 + _c
        if abs2(z) > 4
            fractal[h, w] = i
            break
        end
    end
end

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    for w in 1:width
        for h in 1:height
            mandel_step(c, h, w, fractal)
        end
    end
    return fractal
end

@btime run_julia(4000, 6000);

Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_EDITOR = nvim

Is that faster than Numba on your local computer?

Threading helps (see the threaded sketch after these timings):

julia> @btime run_julia(2000, 3000); # as written
  206.204 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # inverted loops
  175.691 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # threads
  56.702 ms (47 allocations: 68.67 MiB)

julia> @btime run_julia(2000, 3000); # Polyester.@batch
  41.661 ms (6 allocations: 68.66 MiB)
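
For reference, a minimal sketch of the threaded variant (the name run_julia_threaded is mine; it assumes Julia was started with several threads, and the Polyester run simply swaps Threads.@threads for Polyester.@batch):

function run_julia_threaded(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    Threads.@threads for w in 1:width          # parallelize over columns
        @fastmath @inbounds for h in 1:height
            _c = c[h, w]
            z = _c
            for i in 1:20
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end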

Yeah, but that’s cheating compared to Numba / JAX, no?

In [3]: %%timeit
   ...:  run_numba(4000, 6000)
   ...: 
430 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So pretty much equal.

Why wouldn’t they use threads? The goal is to make the computation fast, right?


Numba uses 64-bit floats here, so you should use those in Julia as well.
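
Concretely, that just means dropping the f0 suffixes so the grid becomes Complex{Float64}; a sketch of the two changed lines in run_julia:

y = range(-1.0, 0.0; length = height)   # Float64 instead of Float32
x = range(-1.5, 0.0; length = width)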

I get ~560 ms for Numba and ~805 ms for Julia (column-major loop order). They generate different code. Inspecting the assembly, I think the difference may be that Julia doesn’t use fma for squaring the complex numbers in the inner loop, whereas Numba does. I’m not confident that is correct, though.

Edit: Replacing the squaring operation with this:

@inline function f(x)
    @fastmath begin
    r = real(x)
    i = imag(x)
    Complex(fma(r, r, -i * i), fma(r, i, i * r))
    end
end

I get 610 ms for Julia. So not quite all the difference explained, but most of it.
Edit2: Fixed typo in fma computation, thanks to @JM_Beckers
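
For clarity, my reading of where this plugs in: the inner-loop update z = z^2 + _c becomes

z = f(z) + _c    # square z via the fma-based helper above, then add the constant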


The square of the complex number is just lowered to x*x. Could that be an alternative default implementation, or is there a tradeoff?

Complex(fma(r, i, -i * i), fma(r, i, i * r))

Should be

Complex(fma(r, r, -i * i), fma(r, i, i * r))

I thought so too, and indeed, on my machine using z*z in conjunction with @fastmath shaves off significant time (~15%), but it also becomes incorrect :sweat_smile:

julia> function run_julia_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
               _c = c[p]
               z = _c
               for i in 1:20
                   z = z^2 + _c
                   if real(z)*real(z)+imag(z)*imag(z) > 4
                       fractal[p] = i
                       break
                   end
               end
           end
           return fractal
       end
run_julia_fast (generic function with 1 method)

julia> function run_julia2_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
               _c = c[p]
               z = _c
               for i in 1:20
                   z = z*z + _c
                   if real(z)*real(z)+imag(z)*imag(z) > 4
                       fractal[p] = i
                       break
                   end
               end
           end
           return fractal
       end
run_julia2_fast (generic function with 1 method)

julia> frac1_fast = run_julia_fast(2000,3000);

julia> frac2_fast = run_julia2_fast(2000,3000);

julia> findall(!=(0), frac1_fast - frac2_fast)
11-element Vector{CartesianIndex{2}}:
 CartesianIndex(1715, 105)
 CartesianIndex(1900, 153)
 CartesianIndex(1763, 206)
 CartesianIndex(1911, 223)
 CartesianIndex(1781, 356)
 CartesianIndex(1607, 451)
 CartesianIndex(1238, 466)
 CartesianIndex(1676, 507)
 CartesianIndex(1238, 512)
 CartesianIndex(1529, 1335)
 CartesianIndex(294, 2485)

The z^2 version returns the same result with and without fastmath.

It is not necessary to create c; that’s an unnecessary allocation. Instead, you can just iterate over the ranges:

@inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0, width))
        for (h, y) in pairs(range(-1.0f0, 0.0f0, height))
            z = _c = complex(x, y)

Also, maybe Int32(1):Int32(20) can help avoid a conversion when writing into fractal.
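
Putting both suggestions together, a rough sketch of a version that skips the c matrix (the name run_julia_noalloc is mine and I haven’t benchmarked it):

function run_julia_noalloc(height, width)
    fractal = fill(Int32(20), height, width)
    @inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0, width))
        for (h, y) in pairs(range(-1.0f0, 0.0f0, height))
            z = _c = complex(x, y)
            for i in Int32(1):Int32(20)        # Int32 counter: no conversion on write
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end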

I thought fma was for accuracy, and muladd was for performance:

fma(x, y, z)
  Computes x*y+z without rounding the intermediate result x*y. On some systems this is
  significantly more expensive than x*y+z. fma is used to improve accuracy in certain algorithms.

muladd(x, y, z)
  Combined multiply-add: computes x*y+z, but allowing the add and multiply to be merged with each
  other or with surrounding operations for performance.

Edit: It seems fma isn’t even defined for complex numbers, while muladd is. So you can try

z = muladd(z, z, _c)
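
For concreteness, a sketch of the inner loop with that change (same structure as the loops above):

for i in Int32(1):Int32(20)
    z = muladd(z, z, _c)    # z*z + _c, letting the multiply and add fuse
    if abs2(z) > 4
        fractal[h, w] = i
        break
    end
end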

The Python code does allocate this, so isn’t it reasonable to allocate it in Julia as well when comparing performance?

Then it should be fixed in both places. Pointless allocations are inexcusable.
