Julia & Mojo Mandelbrot Benchmark

For those of you who aren’t aware, the Mojo SDK was recently released, so I thought I would take the opportunity to start benchmarking some Julia code against Mojo. As a first test, I am calculating the Mandelbrot set using the code provided by Modular.

This is my Julia implementation:

``````using Plots

const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200

function mandelbrot_kernel(c)
z = c
for i = 1:MAX_ITERS
z = z * z + c
if abs2(z) > 4
return i
end
end
return MAX_ITERS
end

function compute_mandelbrot()
result = zeros(yn, xn)

x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, xn)

for i = 1:xn
x = x_range[i]
y = y_range[j]
result[j, i] = mandelbrot_kernel(complex(x, y))
end
end
return result
end

result = compute_mandelbrot()

x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, yn)
heatmap(x_range, y_range, result)
``````

I then benchmarked the Julia code

``````julia> @btime compute_mandelbrot()
7.452 ms (341 allocations: 7.07 MiB)
``````

For completeness, this is the Mojo code I used

``````from benchmark import Benchmark
from complex import ComplexSIMD, ComplexFloat64
from math import iota
from python import Python
from runtime.llcl import num_cores, Runtime
from algorithm import parallelize, vectorize
from tensor import Tensor
from utils.index import Index

alias width = 960
alias height = 960
alias MAX_ITERS = 200

alias min_x = -2.0
alias max_x = 0.6
alias min_y = -1.5
alias max_y = 1.5

alias float_type = DType.float64
alias simd_width = simdwidthof[float_type]()

def show_plot(tensor: Tensor[float_type]):
alias scale = 10
alias dpi = 64

np = Python.import_module("numpy")
plt = Python.import_module("matplotlib.pyplot")
colors = Python.import_module("matplotlib.colors")

numpy_array = np.zeros((height, width), np.float64)

for row in range(height):
for col in range(width):
numpy_array.itemset((col, row), tensor[col, row])

fig = plt.figure(1, [scale, scale * height // width], dpi)
ax = fig.add_axes([0.0, 0.0, 1.0, 1.0], False, 1)
light = colors.LightSource(315, 10, 0, 1, 1, 0)

numpy_array, plt.cm.hot, colors.PowerNorm(0.3), "hsv", 0, 0, 1.5
)
plt.imshow(image)
plt.axis("off")
plt.show()

fn mandelbrot_kernel_SIMD[
simd_width: Int
](c: ComplexSIMD[float_type, simd_width]) -> SIMD[float_type, simd_width]:
"""A vectorized implementation of the inner mandelbrot computation."""
var z = ComplexSIMD[float_type, simd_width](0, 0)
var iters = SIMD[float_type, simd_width](0)

var in_set_mask: SIMD[DType.bool, simd_width] = True
for i in range(MAX_ITERS):
break
iters = in_set_mask.select(iters + 1, iters)

return iters

fn parallelized():
let t = Tensor[float_type](height, width)

@parameter
fn worker(row: Int):
let scale_x = (max_x - min_x) / width
let scale_y = (max_y - min_y) / height

@parameter
fn compute_vector[simd_width: Int](col: Int):
"""Each time we oeprate on a `simd_width` vector of pixels."""
let cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
let cy = min_y + row * scale_y
let c = ComplexSIMD[float_type, simd_width](cx, cy)
t.data().simd_store[simd_width](
row * width + col, mandelbrot_kernel_SIMD[simd_width](c)
)

# Vectorize the call to compute_vector where call gets a chunk of pixels.
vectorize[simd_width, compute_vector](width)

with Runtime() as rt:

@parameter
fn bench_parallel[simd_width: Int]():
parallelize[worker](rt, height, 5 * num_cores())

alias simd_width = simdwidthof[DType.float64]()
let parallelized = Benchmark().run[bench_parallel[simd_width]]() / 1e6
print("Parallelized:", parallelized, "ms")

try:
_ = show_plot(t)
except e:
print("failed to show plot:", e.value)

def main():
parallelized()

``````

This gave me the result:

``````Parallelized: 2.1398090000000001 ms
``````

On my machine, the Mojo code will run in 2.14 ms. The Julia code takes 7.45 ms. I’m using 32 threads for Julia, but I admit that I haven’t fine tuned that number. I was wondering if anyone would be willing to offer up ways to improve the Julia code further? I don’t really know what else I can do and I think this would be an excellent way to learn more about squeezing every last drop of performance out of Julia. If we could end up beating the Mojo code that’d be fantastic! Doubly so because of how much more elegant the Julia code is.

Thank you all for your time.

13 Likes

We probably should use SIMD on the Julia side as well.

6 Likes

The difference here appears to be whether you get vectorization. Getting vectorized complex numbers isn’t entirely trivial in julia but at the very least you should be able to pretty easily decompose it for something this simple.

3 Likes

I know I shouldn’t say so but I can’t help: I dislike the looking of the Mojo code. Those `[ ]` seem to appear everywhere and it hurts my eyes.

5 Likes

changing `z = z * z + c` to `z = muladd(z, z, c)` gets about 25% faster right off the bat

9 Likes

I would check if Mojo is automatically unrolling the loop. This is also what Jax does. Although in this case I see max iteration is quite big (200) so maybe not, but worth checking

2 Likes

``````function compute_mandelbrot()
result = zeros(yn, xn)

x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, yn)

for j = 1:yn
x = x_range[i]
y = y_range[j]
# Note Julia arrays contiguous in columns
result[j, i] = mandelbrot_kernel(complex(x, y))
end
#You could also do without second loop
#result[:,i] .= mandelbrot_kernel.(complex.(x, y_range)
end
return result
end
``````
2 Likes

This was one of the questions I had when writing the code. When I was reading through the Julia performance tips, I found this section

• Write `@simd` in front of `for` loops to promise that the iterations are independent and may be reordered. Note that in many cases, Julia can automatically vectorize code without the `@simd` macro

I tried adding `@simd`, but I didn’t see a change in performance, so I figured the code was already vectorized, am I mistaken? What are the situations where Julia can automatically vectorize and how can I know that my code is properly vectorized?

2 Likes

The `@simd` macro is very limited and specifically, it can’t see through the function call.

Use loopvectorization and decompose the complex number into a pair of numbers. Try that.

1 Like

There’s a bunch of differences between your Julia code and your Mojo code that could perhaps matter for performance. For example, in Mojo you just do

… for the linear interpolation, while in Julia you use the FP ranges that do extra work in an attempt of preserving accuracy. The Mojo Mandelbrot kernel also seems to be written somewhat differently than your Julia version.

2 Likes

while in Julia you use the FP ranges that do extra work in an attempt of preserving accuracy.

Basically none of this matters because all the time is in the Mandelbrot kernel

1 Like

The kernel function contains a conditional return statement, I would be very impressed if Loopvectorization can handle this.

To me this seems like a classic case for manual vectorization using SIMD.jl. One can read the Mojo implementation for tips on how to do it.

1 Like

I wrote one with a fixed iteration number. That’s how I do it.

If I read the Mojo code correctly, the tensor allocation `let t = Tensor[...]` seems to not be part of the benchmark time (it is outside the `worker` function), whereas the julia function `compute_mandelbrot` allocates the `result` array which is timed. I don’t know if it makes a big difference however but perhaps it should be passed as a parameter instead as `compute_mandelbrot!(result)`.

3 Likes

This one runs 5x 8x faster then the original julia example posted here, and illustrates nicely how something like ComplexSIMD can be defined quite easily in Julia:

``````const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200

struct ComplexSIMD{N,T}
re::NTuple{N,T}
im::NTuple{N,T}
end

Base.abs2(z::ComplexSIMD) = z.re .* z.re .+ z.im .* z.im
Base.:(*)(a::ComplexSIMD, b::ComplexSIMD) = ComplexSIMD(
a.re .* b.re .- a.im .* b.im,
a.re .* b.im .+ a.im .* b.re
)
Base.:(+)(a::ComplexSIMD, b::ComplexSIMD) = ComplexSIMD(a.re .+ b.re, a.im .+ b.im)

@inline function mandelbrot_kernel(c::ComplexSIMD{N,T}) where {N,T}
z = c
mask = ntuple(k -> true, N)
iters = ntuple(k -> T(0), N)
for i in 1:MAX_ITERS
iters = iters .+ (mask .* 1)
z = z * z + c
end
return iters
end

@inbounds function inner(result, x_range, y_range, y, ::Val{N}) where N
ci = y_range[y]
for x in 1:N:length(x_range)
# Calculate whether the (x:x+7)-th real coordinates with
# the y-th imaginary coordinate belong to the
# mandelbrot set.
c = ComplexSIMD(ntuple(k -> x_range[x+k-1], N), ntuple(k -> ci, N))
iters = mandelbrot_kernel(c)
foreach(enumerate(iters)) do (k, v)
# Write the result to the output array
result[y, x+k-1] = v
end
end
end

function mandelbrot(result, x_range, y_range, N)
@sync for y in eachindex(y_range)
# See <https://github.com/JuliaLang/julia/issues/21017>.
inner(result, x_range, y_range, y, N)
end
end
return result
end

using BenchmarkTools
result = zeros(yn, xn)
x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, xn)
@btime mandelbrot(result, x_range, y_range, Val(4));
``````

Edit:
Also allocated outside the function like Mojo, now its even a bit faster!

(Adapted from mandelbrot Julia #7 program (Benchmarks Game))

55 Likes
``````using Plots
using SharedArrays
using Distributed

# Add worker processes. If you started Julia without `-p`, then this will add 3 worker processes.
# Otherwise, it's redundant and can be skipped.

# Constants
const xmin = -2.0
const xmax = 0.6
const xn = 960
const ymin = -1.5
const ymax = 1.5
const yn = 960
const max_iter = 200

function mandelbrot_kernel(c::Complex{Float64})
z = c
for i in 1:max_iter
if abs2(z) > 4.0
return i
end
z = z * z + c
end
return max_iter
end

@everywhere begin
# Ensuring all processes have access to the constants and the kernel function
const xmin = -2.25
const xmax = 0.75
const xn = 450
const ymin = -1.25
const ymax = 1.25
const yn = 375
const max_iter = 200

function mandelbrot_kernel(c::Complex{Float64})
z = c
for i in 1:max_iter
if abs2(z) > 4.0
return i
end
z = z * z + c
end
return max_iter
end
end

function mandelbrot_parallel()
x_vals = LinRange(xmin, xmax, xn)
y_vals = LinRange(ymin, ymax, yn)

result = SharedArray{Int}(yn, xn)

@distributed for y in 1:yn
for x in 1:xn
result[y, x] = mandelbrot_kernel(complex(x_vals[x], y_vals[y]))
end
end

return result
end

function make_plot_julia(m)
heatmap(m, color=:hot, axis=false)
end

function main()
start_time = time()
mandelbrot_set = mandelbrot_parallel()
end_time = time()
execution_time = (end_time - start_time) * 1000  # Make it milliseconds

make_plot_julia(mandelbrot_set)
println("Execution time for Parallelized Julia Mandelbrot: ", execution_time, " ms")
end

main()
``````

I used this code for paralleized version with the same params as mandelbrot
Execution time for Parallelized Julia Mandelbrot: 0.141143798828125 ms

Actually this seems to benefit from 24 threads on my 12 core machine, so now I’m at 8x faster.

@Radu_Mihai_Diaconu, are you sure this is correct?
I would be quite surprised if using Distributed here could get into the range of using Threads.
You also seem to have added a smaller `xn = 450` in the `@everywhere` part, resulting in a much smaller array.
Also, the plot seems to be empty:

2 Likes

my plot is not empty. Yeah you are right to 450 part

Remarkable! Much cleaner than Mojo. I love it!