With that PR, you’ll end up executing the generic GEMM method from GPUArrays, which works but is slow. The second operation, which you said ‘works fine’, actually doesn’t: it triggers scalar iteration, which is extremely slow and should be avoided. If you want really fast GPU execution, you need to make sure your arrays are recognized as strided GPU arrays so that we can dispatch to the CUBLAS library. That means making sure the memory is contiguous and that you’re not stacking too many array wrappers (because of how Julia’s array hierarchy is currently designed, it’s hard to recognize a GPU array once it is wrapped several layers deep).
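As a rough sketch of how you could check this on your side, assuming CUDA.jl and a working GPU (the array names, sizes, and the strided view are made up for illustration and are not taken from your code):

```julia
using CUDA

# Make scalar indexing of GPU arrays an error rather than a slow fallback,
# so accidental scalar iteration shows up immediately.
CUDA.allowscalar(false)

A = CUDA.rand(Float32, 256, 256)
B = CUDA.rand(Float32, 256, 256)

# Hypothetical wrapped operand: a view that takes every other row of A.
V = view(A, 1:2:256, :)

# Inspect whether each array matches CUDA.jl's strided GPU array type.
@show typeof(V)
@show A isa CUDA.StridedCuArray
@show V isa CUDA.StridedCuArray

# Indexing with ranges allocates a new, contiguous CuArray (no wrapper).
Vd = A[1:2:256, :]
@show typeof(Vd)

# Both operands are now plain dense Float32 CuArrays, which is the case
# that should be eligible for the CUBLAS path described above.
C = Vd * B
```

Materializing a copy costs an extra allocation, so it only pays off if the wrapped array would otherwise fall back to the generic or scalar path.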