Hello,
I wonder what the correct way is to implement element-wise operations with Tullio.jl on the GPU. The MWE is as follows:
using CUDA, Tullio
using BenchmarkTools
mul(A, B, C) = @tullio C[k] = A[k] * B[k]
a = rand(80000);
b = rand(80000);
c = similar(b);
@btime mul($a, $b, $c); # works fine on CPU
d_a, d_b, d_c = cu.((a, b, c)); # copy the arrays to the GPU
@btime mul($d_a, $d_b, $d_c); # fails: Scalar indexing is disallowed
I’m aware that this can be done without Tullio, for example with @. d_c = d_a * d_b; but I just wonder if and how it can be done with Tullio.
Thanks!
That's not the friendliest error message, but what you need to do is load some extra packages before the macro runs:
julia> using KernelAbstractions, CUDAKernels
julia> CUDA.allowscalar(false);
julia> mul(A, B, C) = @tullio C[k] = A[k] * B[k]; # run macro after loading packages
julia> d_a, d_b, d_c = cu.((a, b, c));
julia> @btime mul($d_a, $d_b, $d_c);
min 34.842 μs, mean 48.070 μs (86 allocations, 3.33 KiB)
julia> @btime CUDA.@sync mul($d_a, $d_b, $d_c);
min 43.031 μs, mean 108.234 μs (86 allocations, 3.33 KiB)
julia> @btime CUDA.@sync $d_c .= $d_a .* $d_b;
min 17.036 μs, mean 176.596 μs (7 allocations, 480 bytes)
julia> @btime mul($a, $b, $c); # CPU
min 122.099 μs, mean 123.461 μs (2 allocations, 32 bytes)
julia> c ≈ collect(d_c)
true
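If you want to sanity-check the kernel itself without a GPU, here is a minimal CPU-only sketch. It assumes only that Tullio is installed. The array length and the Float32 element type are my own choices for illustration (as far as I know `cu` converts Float64 arrays to Float32 by default, which is also why the comparison above uses `≈` rather than `==`).

```julia
using Tullio

# Same definition as in the thread: `=` writes into the existing array C,
# whereas `:=` would allocate a new output array instead.
mul(A, B, C) = @tullio C[k] = A[k] * B[k]

# Illustrative sizes and element type, not taken from the benchmark above.
a = rand(Float32, 1000)
b = rand(Float32, 1000)
c = similar(a)

mul(a, b, c)

# The Tullio kernel should agree with plain broadcasting.
@assert c ≈ a .* b
```

The same `mul` definition should then work on `CuArray`s, provided the macro is evaluated after the GPU packages are loaded, as in the session above.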
I see. I noticed them while reading the README but didn’t connect them with the error.