Element-wise operations in Tullio.jl on GPU

Hello,

I wonder what the correct way is to implement element-wise operations with Tullio.jl on the GPU. My MWE is as follows:

using CUDA, Tullio
using BenchmarkTools

mul(A, B, C) = @tullio C[k] = A[k] * B[k]

a = rand(80000);
b = rand(80000);
c = similar(b);

@btime mul($a, $b, $c);  # works fine on CPU

d_a, d_b, d_c = cu.((a, b, c));  # copy the arrays to the GPU
@btime mul($d_a, $d_b, $d_c);  # fails: Scalar indexing is disallowed

I’m aware that this can be done without Tullio, e.g. with @. d_c = d_a * d_b;, but I wonder whether and how it can be done with Tullio.

Thanks!

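Not the most friendly error message, but what you need to do is to load some extra packages before the @tullio macro is run, so that it can generate a GPU kernel instead of falling back to scalar loops: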

julia> using KernelAbstractions, CUDAKernels

julia> CUDA.allowscalar(false);

julia> mul(A, B, C) = @tullio C[k] = A[k] * B[k];  # run macro after loading packages

julia> d_a, d_b, d_c = cu.((a, b, c));

julia> @btime mul($d_a, $d_b, $d_c);
  min 34.842 μs, mean 48.070 μs (86 allocations, 3.33 KiB)

julia> @btime CUDA.@sync mul($d_a, $d_b, $d_c);
  min 43.031 μs, mean 108.234 μs (86 allocations, 3.33 KiB)

julia> @btime CUDA.@sync $d_c .= $d_a .* $d_b; 
  min 17.036 μs, mean 176.596 μs (7 allocations, 480 bytes)

julia> @btime mul($a, $b, $c);  # CPU
  min 122.099 μs, mean 123.461 μs (2 allocations, 32 bytes)

julia> c ≈ collect(d_c)
true
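
If you want to check the expression itself without a GPU, here is a minimal CPU-only sketch. The name mul! and the array sizes are just my choices for illustration; it only shows that the Tullio expression with = writes into the existing array and agrees with the broadcast version, and says nothing about GPU performance.

```julia
using Tullio

# Element-wise product written with Tullio.
# `=` writes into the existing array C (`:=` would allocate a new one).
# `mul!` is a name chosen for this sketch, not something Tullio provides.
mul!(C, A, B) = @tullio C[k] = A[k] * B[k]

a = rand(Float32, 1000)
b = rand(Float32, 1000)
c_tullio = similar(a)
c_bcast = similar(a)

mul!(c_tullio, a, b)   # Tullio version, fills c_tullio in place
c_bcast .= a .* b      # plain broadcasting, fills c_bcast in place

# Compare the two results up to floating-point tolerance
println(c_tullio ≈ c_bcast)
```

The same mul! definition should work on CuArrays once KernelAbstractions and CUDAKernels are loaded before the function is defined, as in the session above.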

I see. I noticed them while reading the README but didn’t connect them with the error.