ERROR: LoadError: Scalar indexing is disallowed in Tullio

For the following code:

using CUDA
using Tullio
using BenchmarkTools

A = rand(500, 500)
B = rand(500, 500)

A_d = CUDA.CuArray(A)
B_d = CUDA.CuArray(B)

@btime @tullio C_d[i,k] := A_d[i,j] * B_d[j,k]

C = Array(C_d)

I tried to run this on Julia 1.9.2 and got:

ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] assertscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:103
  [3] getindex(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:9
  [4] π’œπ’Έπ“‰!
    @ ~/.julia/packages/Tullio/NGyNM/src/macro.jl:1211 [inlined]
  [5] π’œπ’Έπ“‰!
    @ ~/.julia/packages/Tullio/NGyNM/src/macro.jl:1041 [inlined]
  [6] threader(fun!::var"#π’œπ’Έπ“‰!#3", ::Type{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Z::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, As::Tuple{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Is::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Js::Tuple{Base.OneTo{Int64}}, redfun::Function, block::Int64, keep::Nothing)
    @ Tullio ~/.julia/packages/Tullio/NGyNM/src/eval.jl:104
  [7] (::var"#ℳ𝒢𝓀ℯ#6"{var"#π’œπ’Έπ“‰!#3"})(A_d::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, B_d::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ Main ~/.julia/packages/Tullio/NGyNM/src/macro.jl:807
  [8] (::Tullio.Eval{var"#ℳ𝒢𝓀ℯ#6"{var"#π’œπ’Έπ“‰!#3"}, var"#14#βˆ‡β„³π’Άπ“€β„―#5"{var"#βˆ‡π’œπ’Έπ“‰!#4"}})(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::Vararg{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}})
    @ Tullio ~/.julia/packages/Tullio/NGyNM/src/eval.jl:20
  [9] macro expansion
    @ ~/.julia/packages/Tullio/NGyNM/src/macro.jl:976 [inlined]
 [10] var"##core#297"()
    @ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:489
 [11] var"##sample#298"(::Tuple{}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:495
 [12] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:99
 [13] #invokelatest#2
    @ ./essentials.jl:818 [inlined]
 [14] invokelatest
    @ ./essentials.jl:813 [inlined]
 [15] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
 [16] run_result
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
 [17] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117
 [18] run (repeats 2 times)
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117 [inlined]
 [19] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:169 [inlined]
 [20] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:168
 [21] top-level scope
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:575

What is the reason for this, and how do I fix it? Thank you very much!

I can get rid of the error message with the following code. But if I don't use scalar indexing, how can I use the GPU for a more complicated tensor contraction that is not in the form of a matrix multiplication, e.g. A[i,j,k]*B[k,i]*C[l,j]?

using CUDA
using BenchmarkTools

A = rand(500, 500)
B = rand(500, 500)

A_d = CUDA.CuArray(A)
B_d = CUDA.CuArray(B)

C_d = @btime $A_d * $B_d

C = Array(C_d)

You need to load some other packages; at present that means using CUDA, CUDAKernels, KernelAbstractions, as here. Sadly this will only work with CUDA v3; work to support the current v4 is in progress, e.g. this PR, and will remove the need for the CUDAKernels package.
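For concreteness, here is a minimal sketch of that GPU path. It assumes a CUDA v3 environment where CUDAKernels is installable, as described above; the 500×500 matrices are just the sizes from your example.

# Sketch only: needs CUDA v3 plus CUDAKernels and KernelAbstractions.
# These packages must be loaded before the @tullio expression is written,
# so that the macro can also generate a KernelAbstractions GPU kernel.
using CUDA, CUDAKernels, KernelAbstractions
using Tullio

A_d = CuArray(rand(500, 500))
B_d = CuArray(rand(500, 500))

@tullio C_d[i,k] := A_d[i,j] * B_d[j,k]  # runs as a GPU kernel, no scalar indexing

C = Array(C_d)  # copy the result back to the CPU if needed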

Note also that Tullio will not be especially quick at operations that look like matrix multiplication on CUDA arrays. There are many tricks for efficient memory access which it does not know. (On ordinary CPU arrays the situation is better.)
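The same mechanism covers the contraction you asked about; nothing restricts it to matrix multiplication. A sketch, with made-up sizes chosen only so the shared indices match:

# Contraction from the question: D[l] = sum over i,j,k of A[i,j,k] * B[k,i] * C[l,j].
# With CUDA, CUDAKernels and KernelAbstractions loaded, @tullio dispatches this
# to a GPU kernel whenever the arguments are CuArrays.
contract(A, B, C) = @tullio D[l] := A[i,j,k] * B[k,i] * C[l,j]

A_d = CuArray(rand(20, 30, 40))
B_d = CuArray(rand(40, 20))
C_d = CuArray(rand(50, 30))

D_d = contract(A_d, B_d, C_d)  # 50-element CuVector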


Thank you so much! For the link Gradients & GPU
it works. But, if I add @btime on mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k], the code also gave me

ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
...

May I know why @btime leads to this issue, and how can I fix it?
Here is the code with @btime:

using Tullio
using BenchmarkTools

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

A = rand(3,40); B = rand(40,500);
@btime A * B β‰ˆ mul(A, B) # true

using Tracker # or Zygote
Ξ”A = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
Ξ”A β‰ˆ ones(3,500) * B' # true

using CUDA, CUDAKernels, KernelAbstractions # Now defined with a GPU version:
@btime mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

cu(A * B) β‰ˆ mul(cu(A), cu(B)) # true

cu(Ξ”A) β‰ˆ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1] # true

# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]

When you put @btime in front of that method definition expression, it moved the expression into a local scope (the benchmark loop), so the global mul was never redefined. By the time your code called mul on CuArrays, it was still using the first definition, which fails as expected.

julia> @btime f() = 0  # #-filled name indicates locally scoped function
  1.549 ns (0 allocations: 0 bytes)
(::var"#f#1") (generic function with 1 method)

julia> let # block makes a local scope
         f() = 0
       end
(::var"#f#3") (generic function with 1 method)

julia> f  # no global f was made
ERROR: UndefVarError: f not defined

There's also no point in trying to benchmark a method definition; it does not run the function body, which is probably what you actually wanted to time. If so, do this:

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]
@btime mul( $(cu(A)) , $(cu(B)) )
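
One extra caveat when you do time the GPU version: kernel launches are asynchronous, so it is usual to synchronize inside the benchmark, otherwise you mostly measure launch overhead. A sketch, reusing the same mul and arrays:

A_d, B_d = cu(A), cu(B)
@btime CUDA.@sync mul($A_d, $B_d)  # CUDA.@sync waits for the kernel to finish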