Hard to beat Numba / JAX loop for generating Mandelbrot

Find the Numba code in mandelbrot-on-all-accelerators.ipynb · GitHub, under the heading “Numba: imperative in a compiled subset of Python”. On my laptop, the Numba code reports:

In [30]: %%timeit
    ...: run_numba(2000, 3000)
144 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [31]: %%timeit
    ...: run_numba(4000, 6000)
579 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Julia attempt in 1.8-rc1:

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    @fastmath @inbounds for h in 1:height
        for w in 1:width
            _c = c[h, w]
            z = _c
            for i in 1:20
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end

julia> @btime run_julia(2000, 3000);
  174.944 ms (2 allocations: 22.89 MiB)

julia> @btime run_julia(4000, 6000);
  704.037 ms (2 allocations: 91.55 MiB)

Invert the h/w loop order maybe?

In this case it doesn’t seem to matter at all:

# with correct w/h order
julia> @btime run_julia(2000, 3000);
  175.864 ms (4 allocations: 68.66 MiB)

it seems that 1.7.3 is significantly faster:

julia> @btime run_julia(2000, 3000);
  166.743 ms (4 allocations: 68.66 MiB)

It matters for me:

Inverted: 433.209 ms (4 allocations: 274.66 MiB)
As written: 706.969 ms (4 allocations: 274.66 MiB)

(using 1.7.3)
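
The asymmetry comes from memory layout: Julia arrays are column-major, so the first (row) index should vary in the innermost loop. A minimal illustration (variable names are my own):

```julia
# Julia arrays are column-major: elements of one column are contiguous
# in memory, so the row index should be the innermost loop variable.
A = zeros(Int, 3, 4)

# Cache-friendly order: outer loop over columns (w), inner loop over rows (h).
for w in 1:size(A, 2), h in 1:size(A, 1)
    A[h, w] = h + 10w
end
```

In `run_julia` as written, `h` is the outer loop, so successive writes to `fractal[h, w]` stride across columns instead of walking down one.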

Can confirm: the 166.743 ms on 1.7.3 slows down to

julia> @btime run_julia(2000, 3000);
  181.135 ms (4 allocations: 68.66 MiB)
julia> include("mandelbrot.jl");
  435.391 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.8.0-beta3"
julia> include("mandelbrot.jl");
  427.554 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.7.3"
`mandelbrot.jl`:
using BenchmarkTools

@inline function mandel_step(c, h, w, fractal)
    @inbounds _c = c[h, w]
    z = _c
    @fastmath @inbounds for i in Int32(1):Int32(20)
        z = z^2 + _c
        if abs2(z) > 4
            fractal[h, w] = i
            break
        end
    end
end

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    for w in 1:width
        for h in 1:height
            mandel_step(c, h, w, fractal)
        end
    end
    return fractal
end

@btime run_julia(4000, 6000);

Output of `versioninfo()`:
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_EDITOR = nvim

Is that faster than Numba on your local computer?

Threading helps

julia> @btime run_julia(2000, 3000); # as written
  206.204 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # inverted loops
  175.691 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # threads
  56.702 ms (47 allocations: 68.67 MiB)

julia> @btime run_julia(2000, 3000); # Polyester.@batch
  41.661 ms (6 allocations: 68.66 MiB)
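
For reference, one way to get the threaded numbers above is to parallelize the outer (column) loop with `Threads.@threads` — a sketch, not necessarily the exact code benchmarked above:

```julia
using Base.Threads

# Sketch of a threaded variant: each thread handles a chunk of columns,
# so no two threads ever write to the same element of `fractal`.
function run_julia_threaded(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    fractal = fill(Int32(20), height, width)
    @threads for w in 1:width
        _x = x[w]
        @inbounds for h in 1:height
            _c = complex(_x, y[h])
            z = _c
            for i in Int32(1):Int32(20)
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end
```

Start Julia with `-t 4` (or set `JULIA_NUM_THREADS`) for this to use multiple threads.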

Yeah, but isn’t that cheating compared to Numba / JAX?

In [3]: %%timeit
   ...:  run_numba(4000, 6000)
   ...: 
430 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So pretty much equal.

Why wouldn’t they use threads? The goal is to make the computation fast right?

So Numba uses 64-bit floats here; you should use 64-bit floats in Julia as well.
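
Concretely, that means dropping the `f0` suffixes so the ranges (and hence `c` and `z`) become `Float64` — a quick check:

```julia
# The `f0` suffix makes a Float32 literal; plain decimals are Float64.
y32 = range(-1.0f0, 0.0f0; length = 4)
y64 = range(-1.0, 0.0; length = 4)

# The element type of the ranges drives the precision of the whole computation.
eltype(y32)  # Float32
eltype(y64)  # Float64
```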

I get ~560 ms for Numba and ~805 ms for Julia (column major ordered). They generate different code. Inspecting the assembly, I think the difference may be that Julia doesn’t use fma for squaring the complex numbers in the inner loop, whereas Numba does. I’m not confident that is correct, though.

Edit: Replacing the squaring operation with this:

@inline function f(x)
    @fastmath begin
    r = real(x)
    i = imag(x)
    Complex(fma(r, r, -i * i), fma(r, i, i * r))
    end
end

I get 610 ms for Julia. So not quite all the difference explained, but most of it.
Edit2: Fixed typo in fma computation, thanks to @JM_Beckers

The square of the complex number is just lowered to x*x. Could that be an alternative default implementation, or is there a tradeoff?
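
For reference, `z*z` for `z = a + b*im` expands to `(a² − b²) + (2ab)i`, which is exactly what the `fma` version above computes, just with fused roundings:

```julia
# Complex squaring by hand: (a + b*im)^2 == (a^2 - b^2) + 2a*b*im.
a, b = 1.25, -0.5
z  = complex(a, b)
sq = complex(a*a - b*b, 2*a*b)   # what z*z lowers to (up to rounding)
```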

Complex(fma(r, i, -i * i), fma(r, i, i * r))

Should be

Complex(fma(r, r, -i * i), fma(r, i, i * r))

I thought so too, and indeed, on my machine using z*z in conjunction with @fastmath shaves off significant time (~15%), but it also becomes incorrect :sweat_smile:

julia> function run_julia_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
                   _c = c[p]
                   z = _c
                   for i in 1:20
                       z = z^2 + _c
                       if real(z)*real(z)+imag(z)*imag(z) > 4
                           fractal[p] = i
                           break
                       end
                   end
               end
               return fractal
       end
run_julia_fast (generic function with 1 method)

julia> function run_julia2_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
                   _c = c[p]
                   z = _c
                   for i in 1:20
                       z = z*z + _c
                       if real(z)*real(z)+imag(z)*imag(z) > 4
                           fractal[p] = i
                           break
                       end
                   end
               end
               return fractal
       end
run_julia2_fast (generic function with 1 method)
julia> frac1_fast = run_julia_fast(2000,3000);

julia> frac2_fast = run_julia2_fast(2000,3000);

julia> findall(!=(0), frac1_fast - frac2_fast)
11-element Vector{CartesianIndex{2}}:
 CartesianIndex(1715, 105)
 CartesianIndex(1900, 153)
 CartesianIndex(1763, 206)
 CartesianIndex(1911, 223)
 CartesianIndex(1781, 356)
 CartesianIndex(1607, 451)
 CartesianIndex(1238, 466)
 CartesianIndex(1676, 507)
 CartesianIndex(1238, 512)
 CartesianIndex(1529, 1335)
 CartesianIndex(294, 2485)

The z^2 version returns the same result with and without fastmath.

Creating c is an unnecessary allocation. Instead, you can just iterate over the ranges:

@inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0, width))
        for (h, y) in pairs(range(-1.0f0, 0.0f0, height))
            z = _c = complex(x, y)

Also,

Maybe Int32(1):Int32(20) can help avoid a conversion when writing into fractal.
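
Putting both suggestions together, a full allocation-free variant might look like this (my own sketch, not benchmarked here):

```julia
# Sketch: iterate the coordinate ranges directly instead of materializing `c`,
# and use an Int32 loop counter so no conversion is needed on assignment.
function run_julia_noalloc(height, width)
    fractal = fill(Int32(20), height, width)
    @inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0; length = width))
        for (h, y) in pairs(range(-1.0f0, 0.0f0; length = height))
            z = _c = complex(x, y)
            for i in Int32(1):Int32(20)
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end
```

Note the loop nest is already in the column-major-friendly order (`w` outer, `h` inner), so this also avoids the strided writes discussed above.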

I thought fma was for accuracy and muladd was for performance:

fma(x, y, z)
  Computes x*y+z without rounding the intermediate result x*y. On some systems this is
  significantly more expensive than x*y+z. fma is used to improve accuracy in certain algorithms.
muladd(x, y, z)
  Combined multiply-add: computes x*y+z, but allowing the add and multiply to be merged with each
  other or with surrounding operations for performance.

Edit: It seems fma isn’t even defined for complex numbers, while muladd is. So you can try

z = muladd(z, z, _c)
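
A quick check that `muladd` accepts `Complex` arguments and agrees with the plain expression:

```julia
# muladd(z, z, c) computes z*z + c, allowing the compiler to fuse the
# component-wise multiplies and adds where profitable.
z = 0.3 + 0.4im
c = -0.7 + 0.2im
plain = z * z + c
fused = muladd(z, z, c)
```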

The Python code does allocate this, so isn’t it reasonable to allocate it in Julia as well when comparing performance?

Then it should be fixed in both places. Pointless allocations are inexcusable.