⸘How do I do "tensor slicing" in Flux and CUDA‽

If I have a tensor T ∈ ℝ^(i₁ × ⋯ × iₘ) and I want to drop some elements along a single dimension p ∈ {1, …, m} to get T̂ ∈ ℝ^(n₁ × ⋯ × nₘ), where nₖ = iₖ for all k ≠ p and 1 ≤ nₚ ≤ iₚ, how should I then include that transformation in a Flux.Chain?

In a less symbol-laden sentence: how can I achieve the equivalent of logical indexing along one of the array dimensions in a Flux model?

Apart from that I also want:

  1. it to be possible to train the model
  2. it to be possible to transfer the model to a CUDA-compatible card
  3. it to be possible to train the model on a CUDA-compatible card

It is entirely possible to achieve 1 and 2 without 3, by the way.

The particular selection in the relevant dimension, while it can't be hard-coded, is static and known when the model is being built. Put another way: while the selection in the relevant dimension may change over the process's lifetime, any such change will require the construction of a new model.
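
To make "static but not hard-coded" concrete, here is a minimal sketch of what I mean (make_slicer is just an illustrative name, not anything from Flux):

make_slicer(idx) = x -> x[:, :, idx, :]  # capture the fixed selection in a closure
slicer = make_slicer([2, 3, 4, 5, 7])    # chosen once, when the model is built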

I have tried:

  • Logical indexing (x -> x[:, :, [false, true, true, …, false], :])
  • Making a CUDA kernel (with previous as fallback for cpu mode)
  • Using an incidence matrix (an identity matrix with some columns missing) and OMEinsum
  • cat on contiguous blocks extracted from the array

While I cannot explain everything about what I have tried, suffice it to say that I haven't found a way to accomplish this seemingly very simple operation.

I have put together a prototype to illustrate the problem and post it below, though I'm uncertain whether there are aspects of my original code that could also be relevant.

import Pkg
Pkg.activate(".")
import CUDA
import Flux
import InteractiveUtils

InteractiveUtils.versioninfo()
println(CUDA.device!(0))
println("Driver version: ", CUDA.version())
println("System driver version: ", CUDA.system_version())

# The data is k×l×m×n and logic_idx is of length m.
# logic_idx could be chosen arbitrarily, except that the all-false case need
# not be supported
logic_idx = [false, true, true, true, true, false, true, false] 

# These are two other ways to express the same information (except the total
# length, which can be accessed via some global constant, let's say m)
channel_idx = [2, 3, 4, 5, 7]
range_idx = [2:5, 7:7]
const m = 8
# Here is another way to convey what should be done
transform_M = Float32[0 0 0 0 0 
                      1 0 0 0 0
                      0 1 0 0 0
                      0 0 1 0 0
                      0 0 0 1 0
                      0 0 0 0 0
                      0 0 0 0 1
                      0 0 0 0 0]

# The question is: how should slicer be defined?
# Example 1
slicer(x) = x[:, :, logic_idx, :]

# Another example; comment out the previous one to test
# Example 2
# f_expr = Expr(:->, :x, Expr(:call, :cat,
#         Any[Base.Meta.parse("x[:,:,$cr,:]") for cr ∈ range_idx]...,
#         :($(Expr(:kw, :dims, :((3,)))))))
# slicer = eval( f_expr )
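# An eval-free sketch of the same strategy (untested here, but it only uses
# range slicing and cat, so it should behave the same):
# slicer = x -> cat((x[:, :, r, :] for r ∈ range_idx)...; dims = 3)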

# Yet another:
# Example 3
# import OMEinsum
# slicer(x) = OMEinsum.ein"ijkl, km -> ijml"(x, transform_M)
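#
# Example 3b: an OMEinsum-free sketch of the same "ijkl, km -> ijml"
# contraction via permutedims, reshape, and matmul (untested here):
# function slicer(x)
#     i, j, k, l = size(x)
#     xp = permutedims(x, (1, 2, 4, 3))             # i×j×l×k
#     y = reshape(reshape(xp, :, k) * transform_M,  # contract over k
#                 i, j, l, size(transform_M, 2))
#     permutedims(y, (1, 2, 4, 3))                  # back to i×j×m̂×l
# end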

my_model = Flux.Chain(
    slicer,
    Flux.Conv((3,3), sum(logic_idx) => 4, Flux.relu; pad = 1),
    Flux.Conv((7,7), 4 => 4, Flux.relu; stride = 4, pad = 3),
    Flux.Conv((7,7), 4 => 4, Flux.relu; stride = 4, pad = 3),
    Flux.Conv((7,7), 4 => 4, Flux.relu; stride = 4, pad = 3),
    Flux.Conv((7,7), 4 => 4, Flux.relu; stride = 4, pad = 3),
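    # After the four stride-4 convolutions the 256×256 input is 1×1 spatially
    # (256 → 64 → 16 → 4 → 1), so dropdims below leaves a 4×32 matrix for Dense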
    x -> dropdims(x; dims = (1, 2)),
    Flux.Dense(4, 1, Flux.relu),
)

y = exp2.(rand(Float32, (1, 32)))
a_batch = rand(Float32, (256, 256, 8, 32));
gy = Flux.gpu(y)
a_gbatch = Flux.gpu(a_batch) # Save a gpu version for later
println("a_batch: $(ndims(a_batch)) dimensional tensor of size ",
    "$(size(a_batch)) and element type $(eltype(a_batch)) ($(typeof(a_batch)))")

ŷ = my_model(a_batch);
println("ŷ: $(ndims(ŷ)) dimensional tensor of size $(size(ŷ)) and element ",
    "type $(eltype(ŷ)) ($(typeof(ŷ)))")

println("\nTrying to train on cpu...")
opt = Flux.AdaBelief()
state = Flux.setup(opt, my_model)
loss = Flux.Losses.mse
for _ ∈ 1:8
    l::Float32 = 0.0
    grads = Flux.gradient(my_model) do m
        ŷ = m(a_batch)
        l = loss(ŷ, y)
    end
    println(l)
    Flux.update!(state, my_model, grads[1])
end

my_gmodel = Flux.gpu(my_model)
println("\n\e[1;38;2;208;160;0mModel could be successfully transfered to ",
    "gpu!\e[0m")

println("\nTrying to train on gpu...")
opt = Flux.AdaBelief()
state = Flux.setup(opt, my_gmodel)
loss = Flux.Losses.mse
for _ ∈ 1:8
    l::Float32 = 0.0
    grads = Flux.gradient(my_gmodel) do m
        ŷ = m(a_gbatch)
        l = loss(ŷ, gy)
    end
    println(l)
    Flux.update!(state, my_gmodel, grads[1])
end

println("\e[1;38;2;0;224;0mModel successfully trained on gpu!\e[0m")

Anyway, what is the best slicer? I imagine that there is a simple and known solution to the problem.

Additional information:

Julia 1.8.5
Flux.jl 0.32.11
CUDA.jl 3.12.1

The CUDA environment, when queried, responds with:
Driver version: 12.0.0
System driver version: 12.0.0

I’m not sure I understand the question. It looks like slicer works for you, so is this just about how to optimize it?


How do you infer that it works? My point is that it errors somewhere along the line, however I try to make my slicer. Of course, if anyone would like to outline some principles for how this should be accomplished in the most time-efficient manner possible, I would be chuffed; but I'd be happy with any solution that lets me train on the GPU without being multiple orders of magnitude slower than the optimal solution.

I have started to suspect that the issue is being muddled by there being more than one thing going on. Unlike older experiments, some recent slicer implementations result in a huge number of error messages like the following (repeating blocks edited out):

┌ Error: CuDNN (v8302) function cudnnGetConvolutionForwardAlgorithmMaxCount() called:
│     Info: Traceback contains 44 message(s)
│         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: false == cudnn::cnn::isForwardSupported(handle, xDesc, wDesc, cDesc, yDesc, algo)
│         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: T_ENGINEMAP::isLegacyAlgoSupported(handle, xDesc, wDesc, cDesc, yDesc, algo)
│         Error: CUDNN_STATUS_BAD_PARAM; Reason: dimA[i] <= 0
│         Error: CUDNN_STATUS_BAD_PARAM; Reason: cudnn::ops::setTensorNdDescriptor(desc, dtype, nbDims, dimA, strideA, true)
│         Error: CUDNN_STATUS_BAD_PARAM; Reason: initStatus = getXDescriptor(conv, &xDescCompat)
│ Time: 2023-01-29T16:51:48.755533 (0d+0h+2m+18s since start)
│ Process=9257; Thread=9257; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/Ey3w2/lib/cudnn/CUDNN.jl:140

That looks unrelated to me? As long as the output of slicer is a CuArray when you pass it a CuArray, then I don’t see why it would cause issues. If it isn’t, that’s another discussion.
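
For example, a quick check of that property (a sketch, using sizes like in your prototype):

x_test = CUDA.rand(Float32, 16, 16, 8, 2)
slicer(x_test) isa CUDA.CuArray  # should be true for a GPU-friendly slicer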

Then I suppose I have solved it.

Example 2 shows the strategy of taking some simple (using a unit range) slices of a CuArray and then concatenating them.

Both taking slices with fixed ranges and concatenation are, as I understand it, allowed operations on CuArrays and return CuArrays. Right?
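
A minimal check of that understanding (assuming the same range_idx as in my prototype):

x = CUDA.rand(Float32, 4, 4, 8, 2)
y = cat((x[:, :, r, :] for r ∈ range_idx)...; dims = 3)
typeof(y)  # expected: a CuArray of size (4, 4, 5, 2)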

Now I need to understand what all the other stuff is about.
