Help with Custom Struct for High Dimensional COO Arrays

Hello,

I have an application that needs sparse arrays with more than two dimensions, and I would like to try running it on the GPU because the application is embarrassingly parallel. My plan was to build a CuArray with element type TensorVal using the function below. The function runs, but only if I use @allowscalar. From my understanding, it's OK that this part effectively runs on the CPU, because I'm just initializing the array. However, everything breaks when I try to use this array in the cuda_sparse function: I get an index-out-of-bounds error and the REPL crashes. I'm not sure whether the issue is in the initialization or in the execution. Any help would be appreciated! If there are better ways to do this, I'd love to hear them.

struct TensorVal
    i::Int32
    j::Int32
    k::Int32
    val::Float32
end

function get_non_zero_gpu(F3::SparseArray)
    num_nonzero = length(nonzero_values(F3))
    F3_non_zero = CuArray{TensorVal}(undef,(num_nonzero,))
    for (idx, val) in nonzero_pairs(F3)
        F3_non_zero[idx] = TensorVal(idx[1], idx[2], idx[3], val)
    end
    return F3_non_zero
end

For reference this is how I am trying to use the array (basically a high dimensional dot product):

function cuda_sparse(cuF3_sparse, cuPhi1, cuPhi2, cuPhi3)
    f = (f3_data) -> f3_data.val * cuPhi1[f3_data.i] * cuPhi2[f3_data.j] * cuPhi3[f3_data.k]
    return mapreduce(f, +, cuF3_sparse)
end
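
For checking the GPU result, a plain CPU version of the same contraction (the sum over nonzero entries of val * Phi1[i] * Phi2[j] * Phi3[k]) can be useful. The following is only a minimal sketch: the name sparse_contract_cpu and the small test vectors are made up for illustration, and it uses ordinary Arrays rather than CuArrays.

```julia
struct TensorVal
    i::Int32
    j::Int32
    k::Int32
    val::Float32
end

# Hypothetical CPU reference: sum val * phi1[i] * phi2[j] * phi3[k]
# over all stored entries.
function sparse_contract_cpu(entries, phi1, phi2, phi3)
    return sum(e.val * phi1[e.i] * phi2[e.j] * phi3[e.k] for e in entries)
end

# Two nonzero entries: F3[1,2,3] = 2 and F3[2,1,1] = -1.
entries = [TensorVal(1, 2, 3, 2.0f0), TensorVal(2, 1, 1, -1.0f0)]
phi1 = Float32[1, 2]
phi2 = Float32[3, 4]
phi3 = Float32[5, 6, 7]

# 2*1*4*7 + (-1)*2*3*5 = 56 - 30 = 26
result = sparse_contract_cpu(entries, phi1, phi2, phi3)
```

Comparing this value against the output of cuda_sparse on the same data (with isapprox, since the summation order differs on the GPU) is one way to tell an initialization bug from a kernel bug.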

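As it turns out, I was being stupid: the get_non_zero_gpu() function should have been the code below. In the first version, idx is the 3-D index of the entry inside F3, not a position in the 1-D output array, so using it to index F3_non_zero goes out of bounds; a separate running position is needed. I would still love to hear whether people have recommendations on how I should go about this, or whether this approach seems reasonable.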

function get_non_zero_gpu(F3::SparseArray)
    num_nonzero = length(nonzero_values(F3))
    F3_non_zero = CuArray{TensorVal}(undef,(num_nonzero,))
    # n is the position in the 1-D output; idx is the 3-D index within F3
    for (n, (idx, val)) in enumerate(nonzero_pairs(F3))
        F3_non_zero[n] = TensorVal(idx[1], idx[2], idx[3], val)
    end
    return F3_non_zero
end
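
One possible alternative, sketched under some assumptions: build the entries in an ordinary host Vector first and copy them to the GPU in a single transfer, which avoids scalar indexing into the CuArray and therefore the need for @allowscalar. The sketch below uses a plain Dict of CartesianIndex => value as a stand-in for the SparseArray, and collect_nonzeros is a hypothetical helper name. The final upload via CuArray(host) from CUDA.jl is left as a comment so the snippet runs without a GPU.

```julia
struct TensorVal
    i::Int32
    j::Int32
    k::Int32
    val::Float32
end

# Hypothetical helper: gather (index => value) pairs into a host-side Vector.
# `pairs_iter` is assumed to yield 3-D index => value pairs, as
# nonzero_pairs(F3) appears to do in the post above.
function collect_nonzeros(pairs_iter)
    return [TensorVal(idx[1], idx[2], idx[3], val) for (idx, val) in pairs_iter]
end

# Stand-in for the sparse tensor: two nonzero entries.
F3_demo = Dict(CartesianIndex(1, 2, 3) => 2.0f0,
               CartesianIndex(2, 1, 1) => -1.0f0)

host = collect_nonzeros(F3_demo)

# CuArray element types need to be plain bits types; TensorVal qualifies.
is_plain_bits = isbitstype(TensorVal)

# With CUDA.jl loaded, one host-to-device copy would then be:
# cuF3_sparse = CuArray(host)
```

Because all the per-element work happens on the CPU array, the only GPU operation is one bulk copy, which is usually much cheaper than writing elements one at a time.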