I have a working kernel for a basic matrix multiplication, based on the add example. It checks out fine for small two-dimensional matrices, but hits a problem when the output matrix is larger than 32x32.
Here is how I send up my matrices:
(rowsa,colsa) = dimsa = (32,648)
a = randn(Float32,dimsa)
(rowsb,colsb) = dimsb = (648,32)
b = randn(Float32,dimsb)
c = zeros(Float32,(rowsa,colsb))
When matrix a is multiplied by matrix b, the result should be 32x32, which it is, and the answer tests correct.
Note that I am using the zeros() function here; if I use the similar() function I get accumulation errors, since similar does not initialize and I am using a += operator. The resulting matrix is 32x32, so I reserve the memory for c. The output should fit easily into the card grid, which is 1024x1024.
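To illustrate the zeros() versus similar() point, here is a minimal CPU-only sketch of the same accumulation pattern. It is not my GPU kernel; the function name matmul_acc! is made up for this example, and the loop only stands in for what each GPU thread would accumulate.

```julia
using LinearAlgebra  # stdlib; provides the reference a * b and array isapprox

# Hypothetical CPU stand-in for the kernel's accumulation pattern.
function matmul_acc!(c, a, b)
    for j in axes(b, 2), i in axes(a, 1), k in axes(a, 2)
        c[i, j] += a[i, k] * b[k, j]   # accumulates, so c must start at zero
    end
    return c
end

a = randn(Float32, (32, 648))
b = randn(Float32, (648, 32))

# zeros() returns an initialized buffer, so += starts from 0.
c = zeros(Float32, (size(a, 1), size(b, 2)))
matmul_acc!(c, a, b)
println(isapprox(c, a * b; rtol = 1f-3))

# similar() leaves the contents arbitrary; zero it explicitly before accumulating.
c_uninit = similar(c)
fill!(c_uninit, 0)
matmul_acc!(c_uninit, a, b)
println(isapprox(c_uninit, c))
```

Without the fill!, whatever happened to be in the buffer returned by similar() is added into every element, which matches the accumulation errors I saw.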
The problem arises when I push the output to 33x32 or 32x33. CUDA errors out. Evidently I have hit a 1024 limit somewhere. The output matrix c should fit easily if it were indeed a multidimensional array, but perhaps it has been linearized and overflows a block. Am I missing a key point here?
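To make the suspicion concrete, here is the arithmetic as a small sketch. It assumes the kernel is launched with one thread per output element inside a single block, which is a guess since the launch line is not shown above. The 1024 figure is the per-block thread limit on current NVIDIA hardware; the tile side of 32 and the helper launch_dims are hypothetical names chosen for illustration, not part of any library.

```julia
const MAX_THREADS_PER_BLOCK = 1024   # per-block thread limit on current NVIDIA GPUs
const TILE = 32                      # hypothetical tile side; 32 * 32 == 1024

# Hypothetical helper: (threads per block, blocks per grid) for a rows x cols output.
function launch_dims(rows, cols; tile = TILE)
    threads = (min(rows, tile), min(cols, tile))
    blocks = (cld(rows, tile), cld(cols, tile))   # round up to cover the edges
    return threads, blocks
end

for (rows, cols) in ((32, 32), (33, 32), (32, 33))
    single_block = rows * cols       # threads needed if everything sits in one block
    threads, blocks = launch_dims(rows, cols)
    println("$(rows)x$(cols): one block would need $single_block threads ",
            "(limit $MAX_THREADS_PER_BLOCK); tiled: threads=$threads blocks=$blocks")
end
```

If that assumption holds, 32x32 is exactly 1024 threads and 33x32 is 1056, which would explain the failure. Splitting the output across several blocks keeps each block within the limit, but the kernel would then need a bounds check, because the rounded-up grid covers more elements than the output has.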