In [30]: %%timeit
...: run_numba(2000, 3000)
144 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [31]: %%timeit
...: run_numba(4000, 6000)
579 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Julia attempt in 1.8-rc1:

function run_julia(height, width)
y = range(-1.0f0, 0.0f0; length = height)
x = range(-1.5f0, 0.0f0; length = width)
c = x' .+ y*im
fractal = fill(Int32(20), height, width)
@fastmath @inbounds for h in 1:height
for w in 1:width
_c = c[h, w]
z = _c
for i in 1:20
z = z^2 + _c
if abs2(z) > 4
fractal[h, w] = i
break
end
end
end
end
return fractal
end
julia> @btime run_julia(2000, 3000);
174.944 ms (2 allocations: 22.89 MiB)
julia> @btime run_julia(4000, 6000);
704.037 ms (2 allocations: 91.55 MiB)

julia> include("mandelbrot.jl");
435.391 ms (4 allocations: 274.66 MiB)
julia> VERSION
v"1.8.0-beta3"

julia> include("mandelbrot.jl");
427.554 ms (4 allocations: 274.66 MiB)
julia> VERSION
v"1.7.3"

`mandelbrot.jl`

using BenchmarkTools
@inline function mandel_step(c, h, w, fractal)
@inbounds _c = c[h, w]
z = _c
@fastmath @inbounds for i in Int32(1):Int32(20)
z = z^2 + _c
if abs2(z) > 4
fractal[h, w] = i
break
end
end
end
function run_julia(height, width)
y = range(-1.0f0, 0.0f0; length = height)
x = range(-1.5f0, 0.0f0; length = width)
c = x' .+ y*im
fractal = fill(Int32(20), height, width)
for w in 1:width
for h in 1:height
mandel_step(c, h, w, fractal)
end
end
return fractal
end
@btime run_julia(4000, 6000);

So Numba uses 64-bit floats here, you should also use that in Julia.

I get ~560 ms for Numba and ~805 ms for Julia (column major ordered). They generate different code. Inspecting the assembly, I think the difference may be that Julia doesn’t use fma for squaring the complex numbers in the inner loop, whereas Numba does. I’m not confident that is correct, though.

Edit: Replacing the squaring operation with this:

@inline function f(x)
@fastmath begin
r = real(x)
i = imag(x)
Complex(fma(r, r, -i * i), fma(r, i, i * r))
end
end

I get 610 ms for Julia. So not quite all the difference explained, but most of it.
Edit2: Fixed typo in fma computation, thanks to @JM_Beckers

I thought so too, and indeed, on my machine using z*z in conjunction with @fastmath shaves off significant time (~15%), but it also becomes incorrect

julia> function run_julia_fast(height, width)
y = range(-1.0f0, 0.0f0; length = height)
x = range(-1.5f0, 0.0f0; length = width)
c = x' .+ y*im
fractal = fill(Int32(20), height, width)
@fastmath @inbounds for p in eachindex(c)
_c = c[p]
z = _c
for i in 1:20
z = z^2 + _c
if real(z)*real(z)+imag(z)*imag(z) > 4
fractal[p] = i
break
end
end
end
return fractal
end
run_julia_fast (generic function with 1 method)
julia> function run_julia2_fast(height, width)
y = range(-1.0f0, 0.0f0; length = height)
x = range(-1.5f0, 0.0f0; length = width)
c = x' .+ y*im
fractal = fill(Int32(20), height, width)
@fastmath @inbounds for p in eachindex(c)
_c = c[p]
z = _c
for i in 1:20
z = z*z + _c
if real(z)*real(z)+imag(z)*imag(z) > 4
fractal[p] = i
break
end
end
end
return fractal
end
run_julia2_fast (generic function with 1 method)
julia> frac1_fast = run_julia_fast(2000,3000);
julia> frac2_fast = run_julia2_fast(2000,3000);
julia> findall(!=(0), frac1_fast - frac2_fast)
11-element Vector{CartesianIndex{2}}:
CartesianIndex(1715, 105)
CartesianIndex(1900, 153)
CartesianIndex(1763, 206)
CartesianIndex(1911, 223)
CartesianIndex(1781, 356)
CartesianIndex(1607, 451)
CartesianIndex(1238, 466)
CartesianIndex(1676, 507)
CartesianIndex(1238, 512)
CartesianIndex(1529, 1335)
CartesianIndex(294, 2485)

The z^2 version returns the same result with and without fastmath.

It is not necessary to create c, it’s an unnecessary allocation. Instead, you can just iterate over the ranges:

@inbounds for (w, x) in pairs(range(-1.5f0, 0.0f0, width))
for (h, y) in pairs(range(-1.0f0, 0.0f0, height))
z = _c = complex(x, y)

Also,

Maybe Int32(1):Int32(20) can help avoid conversion when writing into fractal.

I though fma was for accuracy, and muladd was for performance:

fma(x, y, z)
Computes x*y+z without rounding the intermediate result x*y. On some systems this is
significantly more expensive than x*y+z. fma is used to improve accuracy in certain algorithms.

muladd(x, y, z)
Combined multiply-add: computes x*y+z, but allowing the add and multiply to be merged with each
other or with surrounding operations for performance.

Edit: It seems fma isn’t even defined for complex numbers, while muladd is. So you can try