Row and column major order for arrays of different shape

OK, the actual code looks like this:

for ik = id:stride:Nk
    for m = 1:Nb
        for n = 1:Nb
            xsum = zero(eltype(u))
            for j = 1:Nb
                xsum += u[n,j,ik] * xi[j,m,ik] - xi[n,j,ik] * u[j,m,ik]
            end
            xsum += (ueq[n,n,ik] - ueq[m,m,ik]) * xi[n,m,ik]

            du[n,m,ik] = -1im * G1 * (W[n,ik] - W[m,ik]) * u[n,m,ik]
            du[n,m,ik] += 1im * G2 * E * xsum
            du[n,m,ik] += -gamma[n,m] * u[n,m,ik]

            # probably some other terms which depend on the specific ik
            # if W[m,ik] >= Wcrit
            #     do something...
            # end
        end
    end
end

This is part of a CUDA kernel which is parallel over the ik index. In this specific example, I guess, I can emulate eachindex with some CartesianIndices approach. However, if I have additional terms which explicitly depend on ik, this will not work. In any case, since Nk is much larger than Nb, I thought there is little point in making the corresponding loops over Nb parallel as well.
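As a side note, iterating with CartesianIndices does not actually hide ik: each index can be unpacked back into its components. A minimal CPU-side sketch (plain arrays rather than an actual CUDA kernel; Wcrit and the extra term are placeholders):

```julia
# Sketch: CartesianIndices iteration still exposes ik by name,
# so terms that depend explicitly on ik remain possible.
Nb, Nk = 3, 5
u  = rand(Nb, Nb, Nk)
du = similar(u)
W  = rand(Nb, Nk)
Wcrit = 0.5                       # hypothetical threshold

for I in CartesianIndices(u)
    n, m, ik = Tuple(I)           # recover the individual indices
    du[I] = u[I]
    if W[m, ik] >= Wcrit          # an ik-dependent extra term
        du[I] += 1.0
    end
end
```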

Looks like you could set this up as a vector of matrices, and then use CartesianIndices as you say. Why can’t you keep ik around if that is required?

This:

for j=1:Nb
    xsum += u[n,j,ik] * xi[j,m,ik] - xi[n,j,ik] * u[j,m,ik]
end

might be a nice expression for Tullio, or otherwise just use matrix multiplication (minus its own transpose) outside the m,n loop.

Looks like you could set this up as a vector of matrices, and then use CartesianIndices as you say. Why can’t you keep ik around if that is required?

This is one of the possibilities that I have to try.

My original question was more academic: I was just very surprised that

A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly. Essentially this means that looping will be faster if the inner-most loop index is the first to appear in a slice expression.

is not a universal rule. Moreover, it depends on array size and shape. Shaking the foundations.
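For reference, the memory layout behind that rule of thumb can be checked directly: in a column-major array the first index has unit stride, so walking the first index in the innermost loop touches adjacent memory. A small sketch:

```julia
# Sketch: column-major layout means the first index varies fastest,
# i.e. A[1,1], A[2,1], A[3,1], A[1,2], ... are adjacent in memory.
A = collect(reshape(1:6, 3, 2))

@assert strides(A) == (1, 3)                            # unit stride on index 1
@assert vec(A) == [A[i,j] for j in 1:2 for i in 1:3]    # memory order
```

Whether that locality dominates in practice still depends on sizes, caches, and (on a GPU) coalescing, which is presumably why the rule is not universal.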

or otherwise just use matrix multiplication (minus its own transpose) outside the m,n loop

Unfortunately, LinearAlgebra functions do not work inside CUDA kernels (I guess because BLAS is not GPU-friendly).

I thought there was some CuBLAS library. I’m not familiar with CUDA, but I would be surprised if there isn’t something like that.
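For what it's worth, CUDA.jl does wrap cuBLAS, but it is invoked from the host on whole CuArrays, not from inside a @cuda kernel. A hedged sketch (requires a CUDA-capable GPU, so it is not runnable everywhere):

```julia
# Sketch (assumes CUDA.jl and a working GPU): host-side matrix
# multiplication of CuArrays dispatches to cuBLAS automatically.
using CUDA

A = CUDA.rand(Float32, 64, 64)
B = CUDA.rand(Float32, 64, 64)
C = A * B      # runs a cuBLAS gemm on the device

# Inside a @cuda kernel, by contrast, only scalar code is allowed,
# which is why the hand-written j-loop is used there instead.
```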