Hi everyone. This is my first time posting on this site. Let me start by saying that I am not a professional software developer (at least, not yet). For now I'm more or less a tinkerer or hobbyist, both in programming and in mathematics. But I WAS able to implement in code the math (which I also mostly worked out myself) for the "standard" machine learning algorithm. (I hope this is clear. Sometimes I forget the names of things.)

In any event, for the most part I've enjoyed learning and experimenting with Julia, especially considering that it has such a robust CUDA implementation. That is, until recently, when I came across a Julia/CUDA implementation detail that is both frustrating and puzzling. I am, of course, speaking of the `Transpose` or `Adjoint` wrapper that is applied to a `CuArray` when taking the transpose or adjoint of a `CuArray`. (From now on, I will only speak of the transpose of a matrix or vector, since I never deal with complex matrices or vectors in what I'm working on.) I would really love to know why this is done, because it DRAMATICALLY slows down (at least in my experience) any GPU kernel that gets a transposed matrix or vector passed to it.
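
To make it concrete (and in case it helps anyone reproduce what I'm seeing), here is a minimal sketch of the behavior I'm describing, assuming CUDA.jl and the stock `LinearAlgebra` standard library; the variable names are just for illustration:

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 1024, 512)   # a plain CuMatrix{Float32}

# `transpose` does NOT copy any data: it returns a lazy
# Transpose{Float32, <:CuMatrix} wrapper, and every index into it
# goes through the wrapper's swapped (row, col) indexing.
At_lazy = transpose(A)

# Materializing the transpose instead produces a plain CuMatrix with
# a contiguous, physically transposed memory layout (a one-time copy
# on the GPU), which a custom kernel can then read with its natural
# column-major access pattern.
At_dense = copy(transpose(A))
# equivalently: permutedims(A, (2, 1))
```

In my own kernels, the wrapped `At_lazy` is the slow case. (I assume the slowdown has to do with strided/uncoalesced memory access through the wrapper rather than through a contiguous array, but that's part of what I'm asking.)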