Julia & Mojo Mandelbrot Benchmark

For those of you who aren’t aware, the Mojo SDK was recently released, so I thought I would take the opportunity to start benchmarking some Julia code against Mojo. As a first test, I am calculating the Mandelbrot set using the code provided by Modular.

This is my Julia implementation:

using Plots 

const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200

function mandelbrot_kernel(c)
    z = c
    for i = 1:MAX_ITERS
        z = z * z + c
        if abs2(z) > 4
            return i
        end
    end
    return MAX_ITERS
end


function compute_mandelbrot()
    result = zeros(yn, xn)

    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, xn)

    Threads.@threads for j = 1:yn
        for i = 1:xn
            x = x_range[i]
            y = y_range[j]
            result[j, i] = mandelbrot_kernel(complex(x, y))
        end
    end
    return result
end

result = compute_mandelbrot()

x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, yn)
heatmap(x_range, y_range, result)

I then benchmarked the Julia code

julia> @btime compute_mandelbrot()
  7.452 ms (341 allocations: 7.07 MiB)

For completeness, this is the Mojo code I used

from benchmark import Benchmark
from complex import ComplexSIMD, ComplexFloat64
from math import iota
from python import Python
from runtime.llcl import num_cores, Runtime
from algorithm import parallelize, vectorize
from tensor import Tensor
from utils.index import Index

alias width = 960
alias height = 960
alias MAX_ITERS = 200

alias min_x = -2.0
alias max_x = 0.6
alias min_y = -1.5
alias max_y = 1.5

alias float_type = DType.float64
alias simd_width = simdwidthof[float_type]()


def show_plot(tensor: Tensor[float_type]):
    alias scale = 10
    alias dpi = 64

    np = Python.import_module("numpy")
    plt = Python.import_module("matplotlib.pyplot")
    colors = Python.import_module("matplotlib.colors")

    numpy_array = np.zeros((height, width), np.float64)

    for row in range(height):
        for col in range(width):
            numpy_array.itemset((col, row), tensor[col, row])

    fig = plt.figure(1, [scale, scale * height // width], dpi)
    ax = fig.add_axes([0.0, 0.0, 1.0, 1.0], False, 1)
    light = colors.LightSource(315, 10, 0, 1, 1, 0)

    image = light.shade(
        numpy_array, plt.cm.hot, colors.PowerNorm(0.3), "hsv", 0, 0, 1.5
    )
    plt.imshow(image)
    plt.axis("off")
    plt.show()


fn mandelbrot_kernel_SIMD[
    simd_width: Int
](c: ComplexSIMD[float_type, simd_width]) -> SIMD[float_type, simd_width]:
    """A vectorized implementation of the inner mandelbrot computation."""
    var z = ComplexSIMD[float_type, simd_width](0, 0)
    var iters = SIMD[float_type, simd_width](0)

    var in_set_mask: SIMD[DType.bool, simd_width] = True
    for i in range(MAX_ITERS):
        if not in_set_mask.reduce_or():
            break
        in_set_mask = z.squared_norm() <= 4
        iters = in_set_mask.select(iters + 1, iters)
        z = z.squared_add(c)

    return iters


fn parallelized():
    let t = Tensor[float_type](height, width)

    @parameter
    fn worker(row: Int):
        let scale_x = (max_x - min_x) / width
        let scale_y = (max_y - min_y) / height

        @parameter
        fn compute_vector[simd_width: Int](col: Int):
            """Each time we oeprate on a `simd_width` vector of pixels."""
            let cx = min_x + (col + iota[float_type, simd_width]()) * scale_x
            let cy = min_y + row * scale_y
            let c = ComplexSIMD[float_type, simd_width](cx, cy)
            t.data().simd_store[simd_width](
                row * width + col, mandelbrot_kernel_SIMD[simd_width](c)
            )

        # Vectorize the call to compute_vector where call gets a chunk of pixels.
        vectorize[simd_width, compute_vector](width)

    with Runtime() as rt:

        @parameter
        fn bench_parallel[simd_width: Int]():
            parallelize[worker](rt, height, 5 * num_cores())

        alias simd_width = simdwidthof[DType.float64]()
        let parallelized = Benchmark().run[bench_parallel[simd_width]]() / 1e6
        print("Parallelized:", parallelized, "ms")

    try:
        _ = show_plot(t)
    except e:
        print("failed to show plot:", e.value)


def main():
    parallelized()

This gave me the result:

Parallelized: 2.1398090000000001 ms

On my machine, the Mojo code will run in 2.14 ms. The Julia code takes 7.45 ms. I’m using 32 threads for Julia, but I admit that I haven’t fine tuned that number. I was wondering if anyone would be willing to offer up ways to improve the Julia code further? I don’t really know what else I can do and I think this would be an excellent way to learn more about squeezing every last drop of performance out of Julia. If we could end up beating the Mojo code that’d be fantastic! Doubly so because of how much more elegant the Julia code is.

Thank you all for your time.

13 Likes

We probably should use SIMD on the Julia side as well.

6 Likes

The difference here appears to be whether you get vectorization. Getting vectorized complex numbers isn’t entirely trivial in julia but at the very least you should be able to pretty easily decompose it for something this simple.

3 Likes

I know I shouldn’t say so but I can’t help: I dislike the looking of the Mojo code. Those [ ] seem to appear everywhere and it hurts my eyes.

5 Likes

changing z = z * z + c to z = muladd(z, z, c) gets about 25% faster right off the bat

9 Likes

for reference: Why is JAX so fast? · google/jax · Discussion #11078 · GitHub

I would check if Mojo is automatically unrolling the loop. This is also what Jax does. Although in this case I see max iteration is quite big (200) so maybe not, but worth checking

2 Likes

Instead, I would try

function compute_mandelbrot()
    result = zeros(yn, xn)

    x_range = range(xmin, xmax, xn)
    y_range = range(ymin, ymax, yn)

    Threads.@threads for i = 1:xn
        for j = 1:yn
            x = x_range[i]
            y = y_range[j]
            # Note Julia arrays contiguous in columns
            result[j, i] = mandelbrot_kernel(complex(x, y))
        end
        #You could also do without second loop
        #result[:,i] .= mandelbrot_kernel.(complex.(x, y_range)
    end
    return result
end
2 Likes

This was one of the questions I had when writing the code. When I was reading through the Julia performance tips, I found this section

  • Write @simd in front of for loops to promise that the iterations are independent and may be reordered. Note that in many cases, Julia can automatically vectorize code without the @simd macro

I tried adding @simd, but I didn’t see a change in performance, so I figured the code was already vectorized, am I mistaken? What are the situations where Julia can automatically vectorize and how can I know that my code is properly vectorized?

2 Likes

The @simd macro is very limited and specifically, it can’t see through the function call.

Use loopvectorization and decompose the complex number into a pair of numbers. Try that.

1 Like

There’s a bunch of differences between your Julia code and your Mojo code that could perhaps matter for performance. For example, in Mojo you just do

… for the linear interpolation, while in Julia you use the FP ranges that do extra work in an attempt of preserving accuracy. The Mojo Mandelbrot kernel also seems to be written somewhat differently than your Julia version.

2 Likes

while in Julia you use the FP ranges that do extra work in an attempt of preserving accuracy.

Basically none of this matters because all the time is in the Mandelbrot kernel

1 Like

The kernel function contains a conditional return statement, I would be very impressed if Loopvectorization can handle this.

To me this seems like a classic case for manual vectorization using SIMD.jl. One can read the Mojo implementation for tips on how to do it.

1 Like

I wrote one with a fixed iteration number. That’s how I do it.

If I read the Mojo code correctly, the tensor allocation let t = Tensor[...] seems to not be part of the benchmark time (it is outside the worker function), whereas the julia function compute_mandelbrot allocates the result array which is timed. I don’t know if it makes a big difference however but perhaps it should be passed as a parameter instead as compute_mandelbrot!(result).

3 Likes

This one runs 5x 8x faster then the original julia example posted here, and illustrates nicely how something like ComplexSIMD can be defined quite easily in Julia:

const xn = 960
const yn = 960
const xmin = -2.0
const xmax = 0.6
const ymin = -1.5
const ymax = 1.5
const MAX_ITERS = 200

struct ComplexSIMD{N,T}
    re::NTuple{N,T}
    im::NTuple{N,T}
end

Base.abs2(z::ComplexSIMD) = z.re .* z.re .+ z.im .* z.im
Base.:(*)(a::ComplexSIMD, b::ComplexSIMD) = ComplexSIMD(
    a.re .* b.re .- a.im .* b.im,
    a.re .* b.im .+ a.im .* b.re
)
Base.:(+)(a::ComplexSIMD, b::ComplexSIMD) = ComplexSIMD(a.re .+ b.re, a.im .+ b.im)

@inline function mandelbrot_kernel(c::ComplexSIMD{N,T}) where {N,T}
    z = c
    mask = ntuple(k -> true, N)
    iters = ntuple(k -> T(0), N)
    for i in 1:MAX_ITERS
        !any(mask) && return iters
        mask = abs2(z) .<= 4.0
        iters = iters .+ (mask .* 1)
        z = z * z + c
    end
    return iters
end

@inbounds function inner(result, x_range, y_range, y, ::Val{N}) where N
    ci = y_range[y]
    for x in 1:N:length(x_range)
        # Calculate whether the (x:x+7)-th real coordinates with
        # the y-th imaginary coordinate belong to the
        # mandelbrot set.
        c = ComplexSIMD(ntuple(k -> x_range[x+k-1], N), ntuple(k -> ci, N))
        iters = mandelbrot_kernel(c)
        foreach(enumerate(iters)) do (k, v)
            # Write the result to the output array
            result[y, x+k-1] = v
        end
    end
end

function mandelbrot(result, x_range, y_range, N)
    @sync for y in eachindex(y_range)
        # Threads.@spawn allows dynamic scheduling instead of static scheduling
        # of Threads.@threads macro.
        # See <https://github.com/JuliaLang/julia/issues/21017>.
        Threads.@spawn @inbounds begin
            inner(result, x_range, y_range, y, N)
        end
    end
    return result
end

using BenchmarkTools
result = zeros(yn, xn)
x_range = range(xmin, xmax, xn)
y_range = range(ymin, ymax, xn)
@btime mandelbrot(result, x_range, y_range, Val(4));

Edit:
Also allocated outside the function like Mojo, now its even a bit faster!

(Adapted from mandelbrot Julia #7 program (Benchmarks Game))

56 Likes
using Plots
using SharedArrays
using Distributed

# Add worker processes. If you started Julia without `-p`, then this will add 3 worker processes.
# Otherwise, it's redundant and can be skipped.
# addprocs(3)

# Constants
const xmin = -2.0
const xmax = 0.6
const xn = 960
const ymin = -1.5
const ymax = 1.5
const yn = 960
const max_iter = 200

function mandelbrot_kernel(c::Complex{Float64})
    z = c
    for i in 1:max_iter
        if abs2(z) > 4.0
            return i
        end
        z = z * z + c
    end
    return max_iter
end

@everywhere begin
    # Ensuring all processes have access to the constants and the kernel function
    const xmin = -2.25
    const xmax = 0.75
    const xn = 450
    const ymin = -1.25
    const ymax = 1.25
    const yn = 375
    const max_iter = 200

    function mandelbrot_kernel(c::Complex{Float64})
        z = c
        for i in 1:max_iter
            if abs2(z) > 4.0
                return i
            end
            z = z * z + c
        end
        return max_iter
    end
end

function mandelbrot_parallel()
    x_vals = LinRange(xmin, xmax, xn)
    y_vals = LinRange(ymin, ymax, yn)
    
    result = SharedArray{Int}(yn, xn)

    @distributed for y in 1:yn
        for x in 1:xn
            result[y, x] = mandelbrot_kernel(complex(x_vals[x], y_vals[y]))
        end
    end
    
    return result
end

function make_plot_julia(m)
    heatmap(m, color=:hot, axis=false)
end

function main()
    start_time = time()
    mandelbrot_set = mandelbrot_parallel()
    end_time = time()
    execution_time = (end_time - start_time) * 1000  # Make it milliseconds

    make_plot_julia(mandelbrot_set)
    println("Execution time for Parallelized Julia Mandelbrot: ", execution_time, " ms")
end

main()

I used this code for paralleized version with the same params as mandelbrot
Execution time for Parallelized Julia Mandelbrot: 0.141143798828125 ms

Actually this seems to benefit from 24 threads on my 12 core machine, so now I’m at 8x faster.

@Radu_Mihai_Diaconu, are you sure this is correct?
I would be quite surprised if using Distributed here could get into the range of using Threads.
You also seem to have added a smaller xn = 450 in the @everywhere part, resulting in a much smaller array.
Also, the plot seems to be empty:

2 Likes

my plot is not empty. Yeah you are right to 450 part

Remarkable! Much cleaner than Mojo. I love it!