Using Tullio#master, I get the following error from the code above:
julia> let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
           @tullio A[I[i]] += v[i]
       end
ERROR: AssertionError: length(__workgroupsize) <= length(ndrange)
Stacktrace:
[1] partition at /home/user/.julia/packages/KernelAbstractions/jAutM/src/nditeration.jl:103 [inlined]
[2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/KernelAbstractions.jl:385
[3] launch_config(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/backends/cuda.jl:156
[4] (::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"})(::CuArray{Float32,1}, ::Vararg{Any,N} where N; ndrange::Tuple{}, dependencies::KernelAbstractions.CudaEvent, workgroupsize::Nothing, progress::Function) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/backends/cuda.jl:163
[5] 𝒜𝒸𝓉! at /home/user/.julia/packages/Tullio/RAkkV/src/macro.jl:1166 [inlined]
[6] 𝒜𝒸𝓉! at /home/user/.julia/packages/Tullio/RAkkV/src/macro.jl:1163 [inlined]
[7] threader(::var"#𝒜𝒸𝓉!#1", ::Type{CuArray{T,1} where T}, ::CuArray{Float32,1}, ::Tuple{CuArray{Int64,1},CuArray{Int64,1}}, ::Tuple{}, ::Tuple{Base.OneTo{Int64}}, ::Function, ::Int64, ::Bool) at /home/user/.julia/packages/Tullio/RAkkV/src/eval.jl:86
[8] top-level scope at /home/user/.julia/packages/Tullio/RAkkV/src/macro.jl:1002
[9] top-level scope at REPL[2]:2
On the CPU, however, the same expression does give the correct result.
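For reference, this is the result I'd expect. It is only a sketch with a plain loop on ordinary `Array`s instead of `@tullio` (so it needs no packages); the variable names mirror the snippet above:

```julia
# Plain-loop scatter-add on the CPU (assumption: this is equivalent to
# what `@tullio A[I[i]] += v[i]` computes for ordinary Arrays).
A = zeros(Float32, 4)
I = [1, 3, 1]        # index 1 appears twice
v = [1, 2, 3]

for i in eachindex(I)
    A[I[i]] += v[i]  # repeated indices accumulate sequentially
end

A  # Float32[4.0, 0.0, 2.0, 0.0]
```

The repeated index `1` collects both `v[1]` and `v[3]`, which is exactly the part that is hard to do safely in parallel.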
Another approach I thought about was grouping indices and summing corresponding values before adding them to A, but it seems to boil down to the same issue.
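To make the grouping idea concrete, here is a minimal CPU-only sketch. The function name `scatter_add_grouped!` is made up for illustration, and nothing here has been tried on a `CuArray`:

```julia
# Hypothetical helper: sum the values belonging to each distinct index first,
# then add the per-index totals to A in a single step.
function scatter_add_grouped!(A, I, v)
    sums = Dict{eltype(I), eltype(A)}()
    for (idx, val) in zip(I, v)
        # Grouping step: accumulate all values that share the same index.
        sums[idx] = get(sums, idx, zero(eltype(A))) + val
    end
    ks = collect(keys(sums))
    # The indices in `ks` are unique, so this update has no collisions.
    A[ks] .+= [sums[k] for k in ks]
    return A
end

A = scatter_add_grouped!(zeros(Float32, 4), [1, 3, 1], [1, 2, 3])
```

The final update is collision-free, but the grouping loop is itself a reduction over repeated keys, which is why it looks like the same problem moved one step earlier.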