Adjoint/transpose wrapper question

Hi everyone. This is my first time posting on this site. Let me start off by saying that I am not a professional software developer (at least, not yet). I am more or less just a tinkerer or hobbyist at the moment, both in terms of programming and mathematics. But I WAS able to implement in code the math (most of which I also worked out myself) for the “standard” machine learning algorithm. (I hope this is clear. Sometimes I forget the names of things.)

In any event, for the most part I’ve enjoyed learning and experimenting with Julia, especially considering that it has such a robust CUDA implementation. That is, until recently, when I came across a Julia/CUDA implementation detail that is both frustrating and puzzling. I am, of course, speaking of the Transpose or Adjoint wrapper that is applied when taking the transpose or adjoint of a CuArray. (From now on, I will only speak of the transpose of a matrix or vector, since I never deal with complex matrices or vectors in what I’m working on.) I would really love to know why this is done, because it DRAMATICALLY slows down (at least in my experience) any GPU kernel that gets a transposed matrix or vector passed to it.

For example, I recently tried my hand at writing custom GPU functions and kernels for things like repeat and broadcast in an attempt to make my code a little faster. While my horizontal repeat function was quite often faster than the built-in repeat function (depending on the matrix size), my vertical repeat function was always slower than the built-in repeat function (sometimes dramatically so), due to the necessity of taking the transpose of the input matrix before passing it to the requisite GPU kernel. So my question is: Why is this done? Why is it necessary to wrap a CuArray matrix or vector in a Transpose wrapper when taking its transpose? Without that wrapper, my vertical repeat function would be faster than the built-in function, just as my horizontal repeat function is, thereby speeding up my ML code a bit. I can make the functions a little faster with various methods such as A'[:,:] or A' * I, where I is the LinearAlgebra identity matrix (but not convert(CuArray, A') or CuArray(A'), which are WAY too slow), but each of these methods has a cost, and so they prevent the code from being as fast as it could be were the wrapper not present. This is very frustrating.

Anyway, I’d appreciate any feedback I can get from anyone who is much more knowledgeable of these matters than I am. Thanks in advance.

Hi and welcome!
I am not familiar with GPU programming, but the rationale behind the Adjoint / Transpose wrapper is to avoid creating a new matrix object, and thus allocating memory (which can be expensive). It works by just changing the way the old object is accessed, kinda like this:

# not the true implementation, just the intuition

struct Transpose{T, M<:AbstractMatrix{T}} <: AbstractMatrix{T}
    mat::M
end

Base.size(t::Transpose) = reverse(size(t.mat))
Base.getindex(t::Transpose, i, j) = t.mat[j, i]
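
For example, with the real wrapper from LinearAlgebra you can check in the REPL that no copy is made (a quick sketch):

using LinearAlgebra

A  = rand(Float32, 3, 4)
At = transpose(A)        # no copy: At is a Transpose wrapper around A

At[2, 1] == A[1, 2]      # true: indices are swapped on access
parent(At) === A         # true: the wrapper still references A's memory
size(At) == (4, 3)       # true: the dimensions are reported swapped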

If you share some reproducible example for your specific problem, along with the approaches you have tried, people here might be able to help you speed things up!


Yeah, we would need to see the kernel in question to see why the wrapper is slowing things down. If you need to materialize the transpose, you can use permutedims.
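
Something along these lines (assuming A is the CuMatrix from your example):

Aᵗ = permutedims(A)   # same as permutedims(A, (2, 1)); materializes the transpose as a plain CuArray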

Also make sure you are not triggering scalar-indexing fallback functionality: either don’t ignore the warning it prints, or set CUDA.allowscalar(false) so it becomes an error. It can happen that array wrappers break method dispatch, and you end up not executing GPU kernels at all. In any case, I generally wouldn’t expect Transposed CuArrays to result in much slower execution (except for specific kinds of kernels that depend a lot on cache behavior, of course).

I greatly appreciate all of the replies and will provide some code examples when I am able. I just ask that you don’t laugh at my code. Remember, I’m just a hobbyist at present. :slight_smile:


So I’ve done a lot of testing of various functions over the past few months (probably more than I should have), and I noticed certain trends in my GPU kernels. Specifically, the kernels that are one-dimensional, i.e., that operate on vectors, are in general faster than the two-dimensional kernels that I write. Consequently, I decided that when the input to my repeat (and other) functions is a matrix, rather than writing a two-dimensional GPU kernel I would first vectorize the matrix before sending it to a one-dimensional kernel (I hope this makes sense), and then reshape the result to its intended repeated-matrix size after the kernel was finished with it. This worked well for the case of repeating a matrix horizontally, but it did not for the case of a vertical repeat of the matrix input. The real purpose of this post is to identify why that is the case. So, on to the code. First, the `hrepeat` function:

The input matrix:

A = CUDA.rand(Float32, 64,128) 

… and the kernel and function:

# ** GPU kernel: copy the vectorized input `A` into `Aʳ`, `n` times in a row
function GPUrepeat!(Aʳ, A, n, nels)

    # global thread index and grid stride
    idx  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    strd = gridDim().x * blockDim().x

    ii = 1
    @inbounds while ii <= n              # one pass per repetition
        i = idx
        @inbounds while i <= nels        # grid-stride loop over the elements of A
            Aʳ[(ii-1)*nels + i] = A[i]

            i += strd
        end
        ii += 1
    end

    nothing
end
# **

# ** Repeat a CuArray matrix horizontally `n` times
function hrepeat(A::CuArray{Float32,2}, n) 

    nels, dims = length(A), size(A) 
    # Initialize the repeat vector; will be reshaped
    Aʳ = CuArray{Float32}(undef, nels*n) 
    # Vectorize input matrix before passing to kernel
    Aᵗ = vec(A) 

    threads = 128
    blocks  = (Int ∘ ceil)(nels / threads) 

    @cuda threads=threads blocks=blocks GPUrepeat!(Aʳ, Aᵗ, n, nels) 

    reshape(Aʳ, dims[1], dims[2]*n) 
end
# **

This produced the following results (repeat included for comparison):

@btime CUDA.@sync repeat(A, 1,128) 
@btime CUDA.@sync hrepeat(A, 128) 
;

34.500 μs (12 allocations: 448 bytes)
28.300 μs (13 allocations: 544 bytes)

And now for the vrepeat function. I included some commented-out code so that viewers can see the various things I tried to make the code faster. While all of these worked to one degree or another, they did not produce the results I was looking for, which was performance similar to that of my hrepeat function. (Note that hrepeat and vrepeat call the same GPU kernel.)

# ** Repeat a CuArray matrix vertically `n` times
function vrepeat(A::CuArray{Float32,2}, n, 𝟙s=nothing) 

    nels, dims = length(A), size(A) 
    Aʳ = CuArray{Float32}(undef, nels*n) 
    Aᵗ = vec(A') 
    #Aᵗ = A'[:] 
    #Aᵗ = vec(A'[:,:]) 
    #Aᵗ = vec(A' * I) 
    #Aᵗ = vec(A') .* 𝟙s
    #Aᵗ = (vec ∘ permutedims)(A, (2,1)) 

    threads = 128
    blocks  = (Int ∘ ceil)(nels / threads) 

    @cuda threads=threads blocks=blocks GPUrepeat!(Aʳ, Aᵗ, n, nels) 

    reshape(Aʳ, dims[2], dims[1]*n)'
end
# **

Results:

@btime CUDA.@sync repeat(A, 128) 
@btime CUDA.@sync vrepeat(A, 128) 
;

34.000 μs (12 allocations: 448 bytes)
68.600 μs (12 allocations: 480 bytes)

Finally, here are the results with the transpose operation removed, for speed-comparison purposes. Yes, this code no longer produces correct results, but its only purpose is to show how removing the transpose operation affects the speed of the kernel.

function vrepeat(A::CuArray{Float32,2}, n, 𝟙s=nothing) 

    ...

    Aᵗ = vec(A) # I.e., the `'` operation is removed; all else the same

    ...
end

Results:

@btime CUDA.@sync repeat(A, 128) 
@btime CUDA.@sync vrepeat(A, 128) 
;

34.100 μs (12 allocations: 448 bytes)
27.900 μs (14 allocations: 560 bytes)

Quite similar to the hrepeat results.

So the processing time drops from around 70 μs to around 30 μs when removing the transpose operation. That is quite a difference.

Oh yeah. One other thing–for now. Here is a comparison of vec(A) and vec(A') by themselves:

@btime CUDA.@sync vec(A) 
@btime CUDA.@sync vec(A') 
;

543.085 ns (2 allocations: 80 bytes)
566.304 ns (2 allocations: 64 bytes)

Not much of a difference. Good night, for now.

Just thought of something else. Here’s my environment:

CUDA.versioninfo() 

CUDA runtime 12.3, artifact installation
CUDA driver 12.4
NVIDIA driver 551.23.0

CUDA libraries:

  • CUBLAS: 12.3.4
  • CURAND: 10.3.4
  • CUFFT: 11.0.12
  • CUSOLVER: 11.5.4
  • CUSPARSE: 12.2.0
  • CUPTI: 21.0.0
  • NVML: 12.0.0+551.23

Julia packages:

  • CUDA: 5.2.0
  • CUDA_Driver_jll: 0.7.0+1
  • CUDA_Runtime_jll: 0.11.1+0

Toolchain:

  • Julia: 1.10.1
  • LLVM: 15.0.7

Environment:

  • JULIA_CUDA_HARD_MEMORY_LIMIT: 89%

1 device:
0: NVIDIA GeForce GTX 1660 Ti with Max-Q Design (sm_75, 663.422 MiB / 6.000 GiB available)

With such simple kernels, which are completely memory-bound, memory access patterns can indeed greatly influence execution time. Without profiling in depth (there are tools for that; see the CUDA.jl documentation on kernel profiling), I would guess that in the fast case memory accesses can be coalesced, which breaks when transposing the input. Basically, you want consecutive threads (by thread index) to be accessing consecutive memory locations, which allows the GPU to merge these memory accesses into a single bus request. This is different from programming on the CPU, where you want consecutive iterations of a loop to be accessing consecutive elements.
If bad memory coalescing is the culprit, you can consider writing a specialized kernel that changes the iteration order to take this into account. Alternatively, if you increase the arithmetic intensity of your kernel (i.e. make it compute stuff instead of just copying memory), those operations may be able to hide the added latency of the suboptimal memory accesses.
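
To give a rough idea (untested on my end, and with purely illustrative names vrepeat_kernel! and vrepeat2), here is a sketch of a kernel that builds the vertical repeat directly from the untransposed matrix, so that consecutive threads read and write consecutive column-major elements:

using CUDA

# Sketch: vertical repeat without ever transposing A.
# Each thread walks the output with a grid-stride loop over linear indices,
# so consecutive threads touch consecutive column-major elements of Aʳ and A.
function vrepeat_kernel!(Aʳ, A, m)
    idx  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    strd = gridDim().x * blockDim().x
    mn   = size(Aʳ, 1)                 # rows of the output = m * n

    i = idx
    @inbounds while i <= length(Aʳ)
        row = mod1(i, mn)              # row in the (m*n) × k output
        col = cld(i, mn)               # column in the output
        Aʳ[i] = A[mod1(row, m), col]   # the source row wraps every m rows
        i += strd
    end
    nothing
end

function vrepeat2(A::CuMatrix{Float32}, n)
    m, k = size(A)
    Aʳ = CuArray{Float32}(undef, m * n, k)
    threads = 128
    blocks  = cld(length(Aʳ), threads)
    @cuda threads=threads blocks=blocks vrepeat_kernel!(Aʳ, A, m)
    Aʳ
end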

Also, FYI, for micro-optimizing kernel execution times I would recommend using CUDA.@profile instead of @btime.
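
For example, something like:

CUDA.@profile vrepeat(A, 128)   # prints a summary of host/device time and per-kernel statistics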

Hi Guillaume. Thank you for your reply. Sorry I didn’t reply sooner. As I mentioned in my intro, I am not a software developer, just a hobbyist at the moment. As such, I don’t have a great deal of experience programming, and I have never used a struct in anything that I’ve worked on. No joke. Consequently, I don’t understand the code you provided. Arrays have served my purpose for almost everything I’ve done in Julia, so I never felt a need to learn a new data type, with one exception: named tuples, and I just use those to reduce the number of variables that I pass to functions. It was getting out of hand, so that’s the solution I came up with. However, I think it’s time for me to leave my comfort zone and learn something new, so with your permission I’d like to copy your code into VS Code and examine it a little bit to see if I can figure out what it’s doing. Thanks.

Hi Simone. Thank you for your reply. I did try permutedims, but it did not have the speed I was looking for. In fact, for larger arrays (1024x1024, for example) it performed the worst of all the methods I tried. But for smaller arrays, it was fastest of all the methods. Hmm. Regardless, none of the methods I tried had the speed I was looking for, but I think Mr. Besard’s most recent reply will point me in the right direction. Thanks again.

Thanks so much for the reply. Whoo. That’s a lot to digest. It looks like I have my homework assignment for the week: figuring out what coalescing means and how to take advantage of it. Again, I am far from an expert at any of this, so there is still a lot that I have to learn. But I realize that if I am going to write kernels that are high-performance, I have to do more than just follow a recipe. I have to increase my understanding of what goes on “under the hood.” Thanks again for your reply. I am grateful that you took the time to answer the question of a mere hobbyist. All the best.

And, I have used CUDA.@profile in the past. In fact, I’ve also used NVIDIA’s Nsight Compute. But since I’m just learning about the intricacies of how GPUs work, I did not understand most of the information it provided or how to implement its suggestions. Anyway, one step at a time.

I just found something interesting.

CUDA.allowscalar(false) 

vec(A') 

Outputs:

Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.

Stacktrace:
...

That’s okay! A struct in Julia, like a class in Python, is just a way to combine several objects into a more complex one. You can read about it here:
https://docs.julialang.org/en/v1/manual/types/#Composite-Types
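
For instance, a tiny composite type (the names here are just for illustration):

struct Point
    x::Float64
    y::Float64
end

p = Point(1.0, 2.0)   # the default constructor takes the fields in order
p.x + p.y             # field access with dot syntax -> 3.0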

Of course you can copy my code into your REPL and experiment. The whole point of this forum is to solve people’s problems; it would be weird if you weren’t allowed to do that :wink: If you wanted to publish, license, or sell some work that contains sizeable code taken from Discourse, then there would be a discussion, but for such small fragments it’s basically never an issue.

Thanks!