# Hard to beat Numba / JAX loop for generating Mandelbrot

The Numba code is in mandelbrot-on-all-accelerators.ipynb · GitHub, under the heading “Numba: imperative in a compiled subset of Python”. On my laptop, the Numba code reports:

``````
In : %%timeit
...: run_numba(2000, 3000)
144 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In : %%timeit
...: run_numba(4000, 6000)
579 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````
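For readers without the notebook open: its exact source isn't reproduced in this thread, but a minimal sketch of the usual imperative Mandelbrot kernel that Numba compiles looks like the following. It is pure Python/NumPy so it runs as-is; in the notebook the kernel would carry an `@numba.njit` decorator, and `run_numba_sketch` is my own name, not the notebook's.

```python
import numpy as np
# import numba  # in the notebook the kernel would be decorated with @numba.njit

def run_numba_sketch(height, width, maxiter=20):
    # Same region and escape test as the Julia version in this thread:
    # x in [-1.5, 0], y in [-1, 0], up to 20 iterations, default value 20.
    y = np.linspace(-1.0, 0.0, height)
    x = np.linspace(-1.5, 0.0, width)
    fractal = np.full((height, width), maxiter, dtype=np.int32)
    for h in range(height):
        for w in range(width):
            c = complex(x[w], y[h])
            z = c
            for i in range(1, maxiter + 1):
                z = z * z + c
                if z.real * z.real + z.imag * z.imag > 4.0:
                    fractal[h, w] = i
                    break
    return fractal
```

Without the JIT decorator this is of course orders of magnitude slower; the point is only to show the loop structure being benchmarked.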

My Julia attempt in 1.8-rc1:

``````
function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    @fastmath @inbounds for h in 1:height
        for w in 1:width
            _c = c[h, w]
            z = _c
            for i in 1:20
                z = z^2 + _c
                if abs2(z) > 4
                    fractal[h, w] = i
                    break
                end
            end
        end
    end
    return fractal
end

julia> @btime run_julia(2000, 3000);
  174.944 ms (2 allocations: 22.89 MiB)

julia> @btime run_julia(4000, 6000);
  704.037 ms (2 allocations: 91.55 MiB)
``````
3 Likes

Invert the h/w loop order maybe?

3 Likes

In this case it doesn’t seem to matter at all:

``````
# with correct w/h order
julia> @btime run_julia(2000, 3000);
175.864 ms (4 allocations: 68.66 MiB)
``````


1 Like

It seems that 1.7.3 is significantly faster:

``````
julia> @btime run_julia(2000, 3000);
166.743 ms (4 allocations: 68.66 MiB)
``````

Matters for me:

Inverted: `433.209 ms (4 allocations: 274.66 MiB)`
As written: `706.969 ms (4 allocations: 274.66 MiB)`

(using 1.7.3)
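The reason loop order matters is memory layout: a Julia `Matrix` is column-major, so the first index (`h` here) should vary fastest to walk memory contiguously. NumPy defaults to the opposite (row-major) convention, which makes for a concrete illustration via strides; the array names below are arbitrary.

```python
import numpy as np

# A 2000x3000 grid of complex64 (8 bytes per element), like the ComplexF32 grid `c`.
a = np.zeros((2000, 3000), dtype=np.complex64)   # C order: row-major, the NumPy default
b = np.asfortranarray(a)                         # Fortran order: column-major, like Julia

# Strides are in bytes: the axis with the 8-byte stride is the contiguous one,
# i.e. the index that should vary fastest in the inner loop.
print(a.strides)  # (24000, 8) -> inner loop over the *last* index
print(b.strides)  # (8, 16000) -> inner loop over the *first* index, as in Julia
```

So "inverting" the loops in the Julia code is just restoring the column-major-friendly order.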

1 Like

Can confirm: the 166.743 ms on 1.7.3 slows down to

``````
julia> @btime run_julia(2000, 3000);
181.135 ms (4 allocations: 68.66 MiB)
``````
``````
julia> include("mandelbrot.jl");
435.391 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.8.0-beta3"
``````
``````
julia> include("mandelbrot.jl");
427.554 ms (4 allocations: 274.66 MiB)

julia> VERSION
v"1.7.3"
``````
`mandelbrot.jl`:

``````
using BenchmarkTools

@inline function mandel_step(c, h, w, fractal)
    @inbounds _c = c[h, w]
    z = _c
    @fastmath @inbounds for i in Int32(1):Int32(20)
        z = z^2 + _c
        if abs2(z) > 4
            fractal[h, w] = i
            break
        end
    end
end

function run_julia(height, width)
    y = range(-1.0f0, 0.0f0; length = height)
    x = range(-1.5f0, 0.0f0; length = width)
    c = x' .+ y*im
    fractal = fill(Int32(20), height, width)
    for w in 1:width
        for h in 1:height
            mandel_step(c, h, w, fractal)
        end
    end
    return fractal
end

@btime run_julia(4000, 6000);
``````
``````
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
Threads: 4 on 8 virtual cores
Environment:
JULIA_EDITOR = nvim
``````

Is that faster than Numba on your local computer?

``````
julia> @btime run_julia(2000, 3000); # as written
206.204 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # inverted loops
175.691 ms (4 allocations: 68.66 MiB)

julia> @btime run_julia(2000, 3000); # threads
56.702 ms (47 allocations: 68.67 MiB)

julia> @btime run_julia(2000, 3000); # Polyester.@batch
41.661 ms (6 allocations: 68.66 MiB)
``````
1 Like

Yeah, but that’s cheating compared to Numba / JAX, no?

``````
In : %%timeit
...:  run_numba(4000, 6000)
...:
430 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````

So pretty much equal.

Why wouldn’t they use threads? The goal is to make the computation fast right?
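In Julia the threaded variants above are an annotation on the outer loop (`Threads.@threads` or `Polyester.@batch`; that code isn't shown in this thread). For comparison, a sketch of the same row-partitioning idea in Python, with the per-chunk work done in vectorized NumPy so the threads spend most of their time outside the interpreter lock. This is my own sketch, not the notebook's code, and all names are mine.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def mandel_block(y, x, maxiter=20):
    # Vectorized escape-time computation for one block of rows.
    c = x[np.newaxis, :] + 1j * y[:, np.newaxis]
    z = c.copy()
    out = np.full(c.shape, maxiter, dtype=np.int32)
    live = np.ones(c.shape, dtype=bool)          # points that have not escaped yet
    for i in range(1, maxiter + 1):
        z[live] = z[live] ** 2 + c[live]
        escaped = live & (z.real ** 2 + z.imag ** 2 > 4.0)
        out[escaped] = i
        live &= ~escaped
    return out

def run_threaded(height, width, nchunks=4):
    y = np.linspace(-1.0, 0.0, height)
    x = np.linspace(-1.5, 0.0, width)
    # Split the rows into contiguous chunks and hand each to a worker thread.
    chunks = np.array_split(np.arange(height), nchunks)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda idx: mandel_block(y[idx], x), chunks)
    return np.vstack(list(parts))
```

Each pixel is independent, so the chunked result is bit-identical to a serial run over the whole grid.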

3 Likes

So Numba uses 64-bit floats here; you should use them in Julia as well.

I get ~560 ms for Numba and ~805 ms for Julia (column major ordered). They generate different code. Inspecting the assembly, I think the difference may be that Julia doesn’t use `fma` for squaring the complex numbers in the inner loop, whereas Numba does. I’m not confident that is correct, though.

Edit: Replacing the squaring operation with this:

``````
@inline function f(x)
    @fastmath begin
        r = real(x)
        i = imag(x)
        Complex(fma(r, r, -i * i), fma(r, i, i * r))
    end
end
``````

I get 610 ms for Julia. So not quite all the difference explained, but most of it.
Edit2: Fixed typo in fma computation, thanks to @JM_Beckers
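For anyone following along: this is just the componentwise expansion (r + i·im)² = (r² − i²) + (2ri)·im, with each `fma` fusing one multiply into the adjacent add so the intermediate product isn't rounded separately. A plain-Python check of the algebra (no fused rounding here, only the identity; `square_by_parts` is my own name):

```python
# Squaring a complex number by components: (r + i*1j)**2 = (r*r - i*i) + (r*i + i*r)*1j.
# These are exactly the products the Julia `fma` version fuses.
def square_by_parts(z):
    r, i = z.real, z.imag
    return complex(r * r - i * i, r * i + i * r)

z = complex(0.3, -0.7)
assert square_by_parts(z) == z * z   # same operations CPython's complex multiply performs
```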

12 Likes

The square of the complex number is just lowered to `x*x`. Could that be an alternative default implementation, or is there a tradeoff?

1 Like
``````
Complex(fma(r, i, -i * i), fma(r, i, i * r))
``````

Should be

``````
Complex(fma(r, r, -i * i), fma(r, i, i * r))
``````
1 Like

I thought so too, and indeed, on my machine using `z*z` in conjunction with `@fastmath` shaves off significant time (~15%), but it also becomes incorrect:

``````
julia> function run_julia_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
               _c = c[p]
               z = _c
               for i in 1:20
                   z = z^2 + _c
                   if real(z)*real(z)+imag(z)*imag(z) > 4
                       fractal[p] = i
                       break
                   end
               end
           end
           return fractal
       end
run_julia_fast (generic function with 1 method)

julia> function run_julia2_fast(height, width)
           y = range(-1.0f0, 0.0f0; length = height)
           x = range(-1.5f0, 0.0f0; length = width)
           c = x' .+ y*im
           fractal = fill(Int32(20), height, width)
           @fastmath @inbounds for p in eachindex(c)
               _c = c[p]
               z = _c
               for i in 1:20
                   z = z*z + _c
                   if real(z)*real(z)+imag(z)*imag(z) > 4
                       fractal[p] = i
                       break
                   end
               end
           end
           return fractal
       end
run_julia2_fast (generic function with 1 method)

julia> frac1_fast = run_julia_fast(2000,3000);

julia> frac2_fast = run_julia2_fast(2000,3000);

julia> findall(!=(0), frac1_fast - frac2_fast)
11-element Vector{CartesianIndex{2}}:
CartesianIndex(1715, 105)
CartesianIndex(1900, 153)
CartesianIndex(1763, 206)
CartesianIndex(1911, 223)
CartesianIndex(1781, 356)
CartesianIndex(1607, 451)
CartesianIndex(1238, 466)
CartesianIndex(1676, 507)
CartesianIndex(1238, 512)
CartesianIndex(1529, 1335)
CartesianIndex(294, 2485)
``````

The `z^2` version returns the same result with and without fastmath.
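This kind of drift is the usual `@fastmath` story: it licenses reassociation and fusion, which change rounding, and the handful of pixels sitting exactly at the escape threshold flip. A one-line illustration of why reordering floating-point arithmetic changes results (Python doubles, but the principle is language-independent):

```python
# Floating-point addition is not associative; fastmath-style reassociation
# can therefore change results near a threshold.
a, b, c = 1e16, -1e16, 1.0
assert (a + b) + c == 1.0   # evaluated left to right
assert a + (b + c) == 0.0   # reassociated: the 1.0 is absorbed by -1e16
```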

It is not necessary to create `c`; it’s an unnecessary allocation. Instead, you can just iterate over the ranges:

``````
@inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0, width))
    for (h, y) in pairs(range(-1.0f0, 0.0f0, height))
        z = _c = complex(x, y)
``````

Also, maybe `Int32(1):Int32(20)` can help avoid a conversion when writing into `fractal`.

I thought `fma` was for accuracy, and `muladd` was for performance:

``````
fma(x, y, z)
Computes x*y+z without rounding the intermediate result x*y. On some systems this is
significantly more expensive than x*y+z. fma is used to improve accuracy in certain algorithms.
``````
``````
muladd(x, y, z)
Combined multiply-add: computes x*y+z, but allowing the add and multiply to be merged with each
other or with surrounding operations for performance.
``````

Edit: It seems `fma` isn’t even defined for complex numbers, while `muladd` is. So you can try

``````
z = muladd(z, z, _c)
``````
2 Likes

The Python code does allocate this, so isn’t it reasonable to allocate it in Julia too when comparing performance?

Then it should be fixed in both places. Pointless allocations are inexcusable.

5 Likes