Code review: Mandelbrot set with CUDA.jl

Hello there! I’ve been experimenting with CUDA.jl and it has been a fantastic experience so far. Recently, I made two simple implementations of the Mandelbrot fractal using CUDA.jl (repository: https://github.com/vini-fda/Mandelbrot-julia/blob/main/Mandelbrot.jl). In the first implementation, I use broadcasting and CuArrays; the results are reproducible, but the calculation times are a bit on the “slow” side: it takes ~13-19 seconds to render a 4K image with 1024 iterations. The implementation code is below.

function gpu_mandelbrot(Z, index)
	(i, j) = index.I
	# Affine transformation
	# Screen space to world space
	x = ax*j + bx
	y = ay*i + by
	# Mandelbrot
	c = x + y*im
	z = c
	iterations = 0
	while CUDA.abs2(z) < 400.0 && iterations < max_iter
		z = z^2 + c
		iterations += 1
	end
	Z = iterations
	return Z
end

For reference, I use the built-in Pluto.jl timer on the following cell:

begin
	Z = CuArray{Float64}(undef, height, width);
	CUDA.@sync begin
		Z .= gpu_mandelbrot.(Z, CartesianIndices(Z));
	end
end

I’m posting this thread because I ran into a problem with indexing 2D arrays when trying to write my own kernel, which is my second implementation. The code is as follows:

function kernel_mandelbrot!(Z)
	id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	stride = blockDim().x * gridDim().x

	Nx, Ny = size(Z)

	cind = CartesianIndices((Nx, Ny))
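
	# Grid-stride loop: each thread starts at its global index and steps by the total thread count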
	for k = id:stride:Nx*Ny
		i = cind[k][1]
		j = cind[k][2]
		# Affine transformation
		# Screen space to world space
		x = ax*j + bx
		y = ay*i + by
		# Mandelbrot
		c = x + y*im
		z = c
		iterations = 0
		while CUDA.abs2(z) < 400.0 && iterations < max_iter
			z = z^2 + c
			iterations += 1
		end
		@inbounds Z[i, j] = iterations
	end
	return nothing
end

I then run it using the following code:

begin
	Z2 = CuArray{Float64}(undef, height, width);
	CUDA.@sync @cuda kernel_mandelbrot!(Z2);
end

The problem is that the kernel takes way too long to execute, often blocking my GPU and freezing my screen for a few seconds (I’ve never been able to see it finish at 4K resolution, only at full HD with a maximum of 64 iterations, and even that took about a minute). Any idea what I might be doing wrong?

PS: CUDA.version() outputs v"11.4.0". My hardware is an NVIDIA GeForce RTX 2060.

Edit: For posterity, I’ve updated my GitHub repository with @maleadt’s suggestions and added a runnable Binder instance for experimentation. On my machine, after proper benchmarking, both implementations run in an average of about 330 ms (1024 iterations, 3840 × 2160 pixels, i.e. 4K resolution).
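
A benchmark along these lines should reproduce that number (just a sketch, not the exact code from the repository; it assumes BenchmarkTools.jl and uses a fresh buffer Zb, and the key point is wrapping the GPU work in CUDA.@sync so kernel execution is included in the timing):

using BenchmarkTools

Zb = CuArray{Float64}(undef, height, width)
# CUDA.@sync waits for the GPU, so the broadcast’s kernel execution is included in the timing
@btime CUDA.@sync $Zb .= gpu_mandelbrot.($Zb, CartesianIndices($Zb))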

In your second version, you’re only launching a single thread and having it calculate all values. That’s using the GPU really badly. You should spawn multiple threads and blocks, and calculate i and j from the thread and block indices.
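
For example, something along these lines (untested sketch; 256 threads per block is just a placeholder, and cld rounds up so the whole image is covered):

Z2 = CuArray{Float64}(undef, height, width)
threads = 256                      # placeholder per-block thread count
blocks = cld(length(Z2), threads)  # enough blocks to cover all pixels
CUDA.@sync @cuda threads=threads blocks=blocks kernel_mandelbrot!(Z2)

Since the kernel already uses a grid-stride loop, any launch configuration gives correct results; more threads and blocks just spread the work across the GPU.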


Oh, it’s because I forgot the threads argument, correct? I guess that’s a silly mistake, haha. I’ve just now run it with threads=1024 and it finished in about 13 seconds. Thank you.

Another question: I’ve seen the post on @cuda threads and the new occupancy API, but I’m not sure I understood how to use it. Could you expand on that?

There’s an example in the source code: https://github.com/JuliaGPU/CUDA.jl/blob/cb4a6b03ff443a062212c7a8aad260cfbf134410/src/indexing.jl#L36-L40. Basically, compile the kernel but don’t launch it, and then use the occupancy API to query how many threads to launch at most, and how many blocks you should ideally launch. Finally, you can launch the kernel by calling the returned kernel object.
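
For your kernel that would look roughly like this (a sketch following the linked example):

# Compile the kernel without launching it
kernel = @cuda launch=false kernel_mandelbrot!(Z2)
# Query the occupancy API for a suitable launch configuration
config = launch_configuration(kernel.fun)
threads = min(length(Z2), config.threads)  # at most this many threads per block
blocks = cld(length(Z2), threads)          # enough blocks to cover the array
# Launch by calling the returned kernel object
CUDA.@sync kernel(Z2; threads=threads, blocks=blocks)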


Thanks!