Hi

I have a question about organizing threads and blocks when calling a custom CUDA kernel. My workstation runs a GTX Titan; here are the corresponding specs:

```
Device 0: "GeForce GTX TITAN"
CUDA Driver Version / Runtime Version 5.5 / 4.2
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6144 MBytes (6442254336 bytes)
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
(14) Multiprocessors x ( -1) CUDA Cores/MP: -14 CUDA Cores
GPU Clock rate: 876 MHz (0.88 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
```

The underlying task is to generate a complex-valued 3D (64x64x64) array, for example a simple plane wave. For that I'm using the following code:

```
using CuArrays
using CUDAnative
using CUDAdrv

# Complex exponential built from the real-valued device intrinsics
@inline function CUDAnative.exp(x::ComplexF64)
    scale = CUDAnative.exp(x.re)
    return ComplexF64(scale * CUDAnative.cos(x.im), scale * CUDAnative.sin(x.im))
end

# Fill one 2D slice (fixed i) with the plane wave
function kernel_initial_psi(a, N, k_in, dx, i)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if j <= size(a, 1)
        a[i, j, k] = CUDAnative.exp(-im * dx * ((i - N/2) * k_in[1] + (j - N/2) * k_in[2] + (k - N/2) * k_in[3]))
    end
    return nothing
end

# Copy the 2D slice a into slice i of the 3D array b
function kernel_twotothree(b, i, a)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if j <= length(a)
        b[i, j, k] = a[j, k]
    end
    return nothing
end

#Parameters---------------------------------------------------------------------
N = 64                          # size of spatial grid
k = CuArray([1.0, 1.0, 1.0])    # initial wave direction
dx = 0.1
#Initializations----------------------------------------------------------------
psi_3D = CuArrays.cuzeros(ComplexF64, (N, N, N))
psi_2D = CuArrays.cuzeros(ComplexF64, (N, N))
#Threads and Blocks-------------------------------------------------------------
blocks = (2, 2)
threads = (32, 32)
#initial data-------------------------------------------------------------------
for i = 1:N
    @cuda blocks=blocks threads=threads kernel_initial_psi(psi_2D, N, k, dx, i)
    @cuda blocks=blocks threads=threads kernel_twotothree(psi_3D, i, psi_2D)
end
```
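For comparison, the same formula can be evaluated on the CPU to get a reference array that a GPU result could be checked against. This is only a minimal sketch in plain Julia with no GPU packages; the function name `plane_wave_reference` is a hypothetical helper I made up for illustration, and it assumes the same `N`, `dx` and wave vector as above.

```julia
# CPU-only reference for the plane wave; no GPU packages required.
# `plane_wave_reference` is a hypothetical helper name, not part of any package.
function plane_wave_reference(N::Integer, k_in::AbstractVector{<:Real}, dx::Real)
    psi = Array{ComplexF64}(undef, N, N, N)
    # i is the innermost loop so memory is traversed in column-major order
    for k in 1:N, j in 1:N, i in 1:N
        phase = dx * ((i - N/2) * k_in[1] + (j - N/2) * k_in[2] + (k - N/2) * k_in[3])
        psi[i, j, k] = exp(-im * phase)
    end
    return psi
end

psi_ref = plane_wave_reference(64, [1.0, 1.0, 1.0], 0.1)
```

If the GPU array is copied back to the host with `Array(psi_3D)`, it could then be compared against `psi_ref` using `≈`.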

In theory, the strategy is the following. To produce the 3D array I'm going to divide it into 2D arrays, such that each 2D array has `64 x 64` entries. In my opinion, it is reasonable to subdivide each 2D array into `4` sub-arrays of size `32 x 32` (i.e. use `4` blocks of `1024` threads each). To produce the initial data of one `32 x 32` sub-array I need exactly `1024` threads, which is, by the way, the maximum number of threads per block for this GPU. The maximum number of threads per multiprocessor (MP) is `2048`, so I will need two MPs to generate the full `64 x 64` array. The `deviceQuery` output above says that I have a total of 14 MPs on this GPU, but I still get error messages like this:

```
ERROR: LoadError: CUDA error: an illegal instruction was encountered (code #715, ERROR_ILLEGAL_INSTRUCTION)
Stacktrace:
[1] #upload!#10(::Bool, ::Function, ::CUDAdrv.Mem.Buffer, ::Ptr{Float64}, ::Int64, ::CuStream) at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\base.jl:147
[2] upload! at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\memory.jl:229 [inlined] (repeats 2 times)
[3] unsafe_copyto! at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:161 [inlined]
[4] copyto!(::CuArray{Float64,1}, ::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\GPUArrays\t8tJB\src\abstractarray.jl:110
[5] CuArray{Float64,1}(::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:85
[6] CuArray(::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:95
[7] top-level scope at none:0
in expression starting at C:\Users\Noobie\Documents\TESTJULIAGPU\inital_data.jl:31
```
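To make the block/thread arithmetic described above concrete, here is a plain-Julia sketch that runs on the CPU only. It recomputes the launch configuration with ceiling division and emulates the index formulas from the kernels for every (block, thread) pair. The variable names `covered` and `max_threads_per_block` are introduced just for this illustration; the limit of 1024 is taken from the `deviceQuery` output.

```julia
# Plain-Julia check of the launch arithmetic; runs on the CPU.
N = 64                                # edge length of one 2D slice
threads = (32, 32)                    # 32 * 32 = 1024 threads per block
blocks  = cld.(N, threads)            # ceiling division per dimension

max_threads_per_block = 1024          # taken from the deviceQuery output
@assert prod(threads) <= max_threads_per_block

# Emulate the kernel index formulas for every (block, thread) pair.
# Block and thread indices are 1-based here, hence the `- 1` on the block index.
covered = zeros(Int, N, N)
for by in 1:blocks[2], bx in 1:blocks[1], ty in 1:threads[2], tx in 1:threads[1]
    j = (bx - 1) * threads[1] + tx
    k = (by - 1) * threads[2] + ty
    if j <= N && k <= N               # guard both dimensions
        covered[j, k] += 1
    end
end
```

With this configuration every entry of the `64 x 64` slice is visited by exactly one emulated thread. Note that the sketch guards both `j` and `k`, whereas the kernels in my code only check `j`.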

Is it possible that I missed something? Or did I mix up my thread/block organization? I'm relatively new to GPU programming, but I thought I understood the underlying basics… which is obviously not the case. It would be great if someone could tell me what my mistake is here.

I know this is maybe not the most efficient way to generate this data (if you have tips for improving performance, I'm open to and thankful for your comments), but I think it should at least work.

Furthermore, I want to use the full potential of my GPU and run the kernel on all MPs. With 2 MPs per slice and 14 MPs in total, it should be possible to calculate 7 slices of my 3D array at once. Is this possible at all?
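One way I could imagine covering several slices per launch is a 3D grid, where the third block index selects the slice instead of the host-side loop over `i`. The following is only a CPU emulation of that index mapping in plain Julia, not a tested GPU kernel. The configuration `threads3 = (32, 32, 1)` with `blocks3 = (2, 2, 64)` and the helper `global_index` are assumptions for illustration; both stay within the block and grid dimension limits listed in the `deviceQuery` output.

```julia
# CPU emulation of a 3D launch; no GPU packages required.
N = 64
threads3 = (32, 32, 1)                # still 1024 threads per block
blocks3  = cld.(N, threads3)          # ceiling division per dimension

# Hypothetical helper: maps 1-based block and thread indices to array indices.
global_index(b, t, dim) = (b .- 1) .* dim .+ t

hits = zeros(Int, N, N, N)
for b in CartesianIndices(blocks3), t in CartesianIndices(threads3)
    i, j, k = global_index(Tuple(b), Tuple(t), threads3)
    if i <= N && j <= N && k <= N     # guard all three dimensions
        hits[i, j, k] += 1
    end
end
```

In this emulation each element of the `64 x 64 x 64` array is touched exactly once. As far as I understand, how many of those blocks actually run at the same time would then be decided by the GPU's block scheduler rather than by my launch code.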