I have a working kernel for basic matrix multiplication, based on the add example. It checks out fine for small two-dimensional matrices, but fails once the output matrix is larger than 32x32.

Here is how I send up my matrices:

```
(rowsa,colsa) = dimsa = (32,648)
a = randn(Float32,dimsa)
(rowsb,colsb) = dimsb = (648,32)
b = randn(Float32,dimsb)
c = zeros(Float32,(rowsa,colsb))
```

When matrix a is multiplied by matrix b, the result should be 32x32, which it is, and the answer tests correct.

Note that I am using the `zeros()` function here; if I use the `similar()` function I get accumulation errors, since `similar` does not initialize its memory and I am using a `+=` operator. The resulting matrix is 32x32, so I reserve the memory for c. The output should fit easily into the card grid, which is 1024x1024.
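To illustrate why the zero-initialized buffer matters, here is a minimal CPU-only sketch. The function `accumulate_matmul!` is a hypothetical stand-in for my kernel (it is not the kernel itself); it only mimics the `+=` accumulation pattern so the result can be compared against Julia's built-in `a * b`.

```julia
# Hypothetical CPU stand-in for a kernel that accumulates into c with +=.
function accumulate_matmul!(c, a, b)
    for j in axes(b, 2), i in axes(a, 1), k in axes(a, 2)
        c[i, j] += a[i, k] * b[k, j]   # relies on c starting at zero
    end
    return c
end

a = randn(Float32, (32, 648))
b = randn(Float32, (648, 32))

# zeros() gives a defined starting value, so the accumulated sum is the product.
c = zeros(Float32, (size(a, 1), size(b, 2)))
accumulate_matmul!(c, a, b)
matches_reference = isapprox(c, a * b; rtol = 1f-3)
```

With `similar(a, (32, 32))` in place of `zeros`, the starting contents of `c` are undefined, so whatever happens to be in that memory gets added to every entry.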

The problem arises when I push the output to 33x32 or 32x33: CUDA errors out. Evidently I have hit a limit of 1024 somewhere. The output matrix c should fit easily if it were indeed treated as a multidimensional array, but perhaps it has been linearized and overflows a block. Am I missing a key point here?
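A quick way to test that suspicion is to do the thread-count arithmetic by hand. The sketch below is plain Julia with no GPU calls. It assumes the kernel is launched with one thread per output element in a single block, and it assumes a limit of 1024 threads per block (the usual value on current NVIDIA hardware, but worth confirming by querying the device). The names `MAX_THREADS_PER_BLOCK`, `fits_in_one_block`, and `launch_config` are made up for illustration.

```julia
# Assumed per-block thread limit; confirm against the actual device.
const MAX_THREADS_PER_BLOCK = 1024

# One thread per output element, all in one block.
threads_needed(rows, cols) = rows * cols
fits_in_one_block(rows, cols) = threads_needed(rows, cols) <= MAX_THREADS_PER_BLOCK

# Hypothetical helper: tile the output over several blocks instead.
# Returns (threads per block, blocks in grid) as 2D tuples.
function launch_config(rows, cols; tile = (32, 32))
    threads = (min(rows, tile[1]), min(cols, tile[2]))
    blocks = (cld(rows, threads[1]), cld(cols, threads[2]))
    return threads, blocks
end

fits_in_one_block(32, 32)   # true:  32 * 32 = 1024
fits_in_one_block(33, 32)   # false: 33 * 32 = 1056
launch_config(33, 32)       # ((32, 32), (2, 1))
```

If this is the cause, a multi-block launch would also need a bounds check inside the kernel, since the last block along each dimension covers indices past the edge of c.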