CUDA.jl kernel is half as fast as C++ kernel

Thanks for the suggestions. I’ll rework it and see what I can do. Regarding Int64 types, when writing loops such as

for tr_yi = 1:size(tr_array, 3)

tr_yi will be a 64-bit integer. For loops like this, is there a straightforward way to control the index type, including when iterating with eachindex() and CartesianIndices()?
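
For concreteness, here is a minimal CPU-only sketch of the behavior in question. The `zeros` array is a hypothetical stand-in for `tr_array` (in the real kernel it would be a device array), and the `Int32` range is just one possible way to narrow the index, not necessarily the recommended approach for CUDA.jl kernels.

```julia
# Hypothetical stand-in for tr_array; the sizes are arbitrary.
tr_array = zeros(Float32, 4, 5, 6)

# Default loop: the range is built from an Int literal and size(),
# so the loop index is Int (Int64 on a 64-bit system).
default_types = DataType[]
for tr_yi = 1:size(tr_array, 3)
    push!(default_types, typeof(tr_yi))
end

# Converting both endpoints gives a UnitRange{Int32},
# so the loop index is Int32.
n3 = Int32(size(tr_array, 3))
narrow_types = DataType[]
for tr_yi = Int32(1):n3
    push!(narrow_types, typeof(tr_yi))
end

# eachindex() and CartesianIndices() on a plain Array are built on Int,
# which is why controlling their index type is less obvious.
linear_eltype    = eltype(eachindex(tr_array))
cartesian_eltype = eltype(CartesianIndices(tr_array))
```

Whether the narrower index type survives into the generated GPU code depends on how the index is used afterwards, for example whether it is mixed with `Int` literals that promote it back to 64 bits.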

Thank you.