Code review: Mandelbrot set with CUDA.jl

vini-fda · September 24, 2021, 5:21am

Hello there! I’ve been experimenting with CUDA.jl and it has been a fantastic experience so far. Recently, I’ve made two simple implementations of the Mandelbrot Fractal using CUDA.jl(repository: https://github.com/vini-fda/Mandelbrot-julia/blob/main/Mandelbrot.jl). In the first implementation, I use broadcasting and CuArrays, and I often get reproducible results, with calculation times often being a bit on the “slow” side: it takes ~13-19 seconds to render a 4K image with 1024 iterations. The implementation code is below.

function gpu_mandelbrot(Z, index)
	(i, j) = index.I
	# Affine transformation
	# Screen space to world space
	x = ax*j + bx
	y = ay*i + by
	# Mandelbrot
	c = x + y*im
	z = c
	iterations = 0
	while CUDA.abs2(z) < 400.0 && iterations < max_iter
	   z = z^2 + c 
	   iterations += 1
	end
	Z = iterations
	return Z
end

For reference, I use the built-in Pluto.jl timer on the following cell:

begin
	Z = CuArray{Float64}(undef, height, width);
	CUDA.@sync begin
		Z .= gpu_mandelbrot.(Z, CartesianIndices(Z));
	end
end

I’m posting this thread because I ran into a problem with indexing 2D arrays when trying to build a kernel - my second implementation. The code is as follows:

function kernel_mandelbrot!(Z)
	id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    Nx, Ny = size(Z)

    cind = CartesianIndices((Nx, Ny))
	
	
	for k=id:stride:Nx*Ny
		i = cind[k][1]
        j = cind[k][2]
		# Affine transformation
		# Screen space to world space
		x = ax*j + bx
		y = ay*i + by
		# Mandelbrot
		c = x + y*im
		z = c
		iterations = 0
		while CUDA.abs2(z) < 400.0 && iterations < max_iter
		   z = z^2 + c 
		   iterations += 1
		end
		@inbounds Z[i, j] = iterations

    end
	return nothing
end

I then run it using the following code:

begin
	Z2 = CuArray{Float64}(undef, height, width);
	CUDA.@sync @cuda kernel_mandelbrot!(Z2);
end

The problem is that the kernel takes way too long to execute, often blocking my GPU and freezing my screen for a few seconds(I’ve never been able to see it finish with 4K resolution, only with full HD and a max of 64 iterations, and it took about a minute to finish). Any idea of what I might be doing wrong?

PS: CUDA.version() outputs v"11.4.0". My hardware is an NVIDIA GeForce RTX 2060.

Edit: For posterity, I’ve updated my github repository with @maleadt 's suggestions, and added a runnable binder instance for experimentation. On my machine, after proper benchmarking, both implementations run with an avg time of about 330 ms(1024 iterations, 3840 x 2160 pixels - 4K res).

maleadt · September 24, 2021, 5:33am

In your second version, you’re only launching a single thread and having it calculate all values. That’s using the GPU really badly. You should spawn multiple threads and blocks, and calculate i and j from the thread and block indices.

vini-fda · September 24, 2021, 5:52am

Oh it’s because I forgot the threads argument, correct? I guess that’s a silly mistake, haha. I’ve just now run with threads=1024 and it ran in about 13 seconds. Thank you.

Another question: I’ve seen the post on @cuda threads and the new occupancy API, but I’m not sure I understood how to use it. Could you expand on that?

maleadt · September 24, 2021, 6:54am

There’s an example in the source code: https://github.com/JuliaGPU/CUDA.jl/blob/cb4a6b03ff443a062212c7a8aad260cfbf134410/src/indexing.jl#L36-L40. Basically, compile the kernel but don’t launch it, and then use the occupancy API to query how many threads to launch at most, and how many blocks you should ideally launch. Finally, you can launch the kernel by calling the returned kernel object.

vini-fda · September 24, 2021, 3:40pm

Thanks!

Topic		Replies	Views
Mandelbrot with Plots.jl New to Julia plotting	7	1930	December 27, 2020
Benchmark function that uses CUDA.jl Performance cuda , benchmark , precompilation	2	701	November 7, 2021
Optimizing CUDA.jl performance for small array operations GPU performance , cuda	5	2448	February 8, 2021
CUDA.jl v3.0 Package Announcements	10	1349	April 16, 2021
Parallel prefix sum with CUDA.jl GPU	2	1282	June 3, 2021

Code review: Mandelbrot set with CUDA.jl

Related topics