Hello there! I’ve been experimenting with CUDA.jl and it has been a fantastic experience so far. Recently, I’ve made two simple implementations of the Mandelbrot Fractal using CUDA.jl(repository: https://github.com/vini-fda/Mandelbrot-julia/blob/main/Mandelbrot.jl). In the first implementation, I use broadcasting and CuArrays, and I often get reproducible results, with calculation times often being a bit on the “slow” side: it takes ~13-19 seconds to render a 4K image with 1024 iterations. The implementation code is below.

```
function gpu_mandelbrot(Z, index)
(i, j) = index.I
# Affine transformation
# Screen space to world space
x = ax*j + bx
y = ay*i + by
# Mandelbrot
c = x + y*im
z = c
iterations = 0
while CUDA.abs2(z) < 400.0 && iterations < max_iter
z = z^2 + c
iterations += 1
end
Z = iterations
return Z
end
```

For reference, I use the built-in Pluto.jl timer on the following cell:

```
begin
Z = CuArray{Float64}(undef, height, width);
CUDA.@sync begin
Z .= gpu_mandelbrot.(Z, CartesianIndices(Z));
end
end
```

I’m posting this thread because I ran into a problem with indexing 2D arrays when trying to build a kernel - my second implementation. The code is as follows:

```
function kernel_mandelbrot!(Z)
id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = blockDim().x * gridDim().x
Nx, Ny = size(Z)
cind = CartesianIndices((Nx, Ny))
for k=id:stride:Nx*Ny
i = cind[k][1]
j = cind[k][2]
# Affine transformation
# Screen space to world space
x = ax*j + bx
y = ay*i + by
# Mandelbrot
c = x + y*im
z = c
iterations = 0
while CUDA.abs2(z) < 400.0 && iterations < max_iter
z = z^2 + c
iterations += 1
end
@inbounds Z[i, j] = iterations
end
return nothing
end
```

I then run it using the following code:

```
begin
Z2 = CuArray{Float64}(undef, height, width);
CUDA.@sync @cuda kernel_mandelbrot!(Z2);
end
```

The problem is that the kernel takes way too long to execute, often blocking my GPU and freezing my screen for a few seconds(I’ve never been able to see it finish with 4K resolution, only with full HD and a max of 64 iterations, and it took about a minute to finish). Any idea of what I might be doing wrong?

PS: CUDA.version() outputs v"11.4.0". My hardware is an NVIDIA GeForce RTX 2060.

Edit: For posterity, I’ve updated my github repository with @maleadt 's suggestions, and added a runnable binder instance for experimentation. On my machine, after proper benchmarking, both implementations run with an avg time of about 330 ms(1024 iterations, 3840 x 2160 pixels - 4K res).