ERROR: LoadError: Scalar indexing is disallowed in Tullio

For the following code:

using CUDA
using Tullio
using BenchmarkTools

A = rand(500, 500)
B = rand(500, 500)

A_d = CUDA.CuArray(A)
B_d = CUDA.CuArray(B)

@btime @tullio C_d[i,k] := A_d[i,j] * B_d[j,k]

C = Array(C_d)

I tried to run this on Julia 1.9.2 and got:

ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] assertscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:103
  [3] getindex(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:9
  [4] π’œπ’Έπ“‰!
    @ ~/.julia/packages/Tullio/NGyNM/src/macro.jl:1211 [inlined]
  [5] π’œπ’Έπ“‰!
    @ ~/.julia/packages/Tullio/NGyNM/src/macro.jl:1041 [inlined]
  [6] threader(fun!::var"#π’œπ’Έπ“‰!#3", ::Type{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Z::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, As::Tuple{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}, Is::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Js::Tuple{Base.OneTo{Int64}}, redfun::Function, block::Int64, keep::Nothing)
    @ Tullio ~/.julia/packages/Tullio/NGyNM/src/eval.jl:104
  [7] (::var"#ℳ𝒢𝓀ℯ#6"{var"#π’œπ’Έπ“‰!#3"})(A_d::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, B_d::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ Main ~/.julia/packages/Tullio/NGyNM/src/macro.jl:807
  [8] (::Tullio.Eval{var"#ℳ𝒢𝓀ℯ#6"{var"#π’œπ’Έπ“‰!#3"}, var"#14#βˆ‡β„³π’Άπ“€β„―#5"{var"#βˆ‡π’œπ’Έπ“‰!#4"}})(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::Vararg{CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}})
    @ Tullio ~/.julia/packages/Tullio/NGyNM/src/eval.jl:20
  [9] macro expansion
    @ ~/.julia/packages/Tullio/NGyNM/src/macro.jl:976 [inlined]
 [10] var"##core#297"()
    @ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:489
 [11] var"##sample#298"(::Tuple{}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:495
 [12] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:99
 [13] #invokelatest#2
    @ ./essentials.jl:818 [inlined]
 [14] invokelatest
    @ ./essentials.jl:813 [inlined]
 [15] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
 [16] run_result
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
 [17] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117
 [18] run (repeats 2 times)
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117 [inlined]
 [19] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:169 [inlined]
 [20] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:168
 [21] top-level scope
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:575

What is the reason for this, and how do I fix it? Thank you very much!

I can get rid of the error message with the following code. But if I don't use scalar indexing, how can I use the GPU for a more complicated tensor contraction that is not in the form of a matrix multiplication, e.g. A[i,j,k]*B[k,i]*C[l,j]?

using CUDA
using BenchmarkTools

A = rand(500, 500)
B = rand(500, 500)

A_d = CUDA.CuArray(A)
B_d = CUDA.CuArray(B)

C_d = @btime $A_d * $B_d

C = Array(C_d)

You need to load some other packages; at present that means using CUDA, CUDAKernels, KernelAbstractions, as here. Sadly this will only work with CUDA v3; work to support the current v4 is in progress, e.g. this PR, and will remove the need for the CUDAKernels package.
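For concreteness, here is a minimal sketch of that GPU path. It assumes a CUDA v3 environment where CUDAKernels is installable, as described above; the 500×500 matrices are just the sizes from your example.

# Sketch only: needs CUDA v3 plus CUDAKernels and KernelAbstractions.
# These packages must be loaded before the @tullio expression is written,
# so that the macro can also generate a KernelAbstractions GPU kernel.
using CUDA, CUDAKernels, KernelAbstractions
using Tullio

A_d = CuArray(rand(500, 500))
B_d = CuArray(rand(500, 500))

@tullio C_d[i,k] := A_d[i,j] * B_d[j,k]  # runs as a GPU kernel, no scalar indexing

C = Array(C_d)  # copy the result back to the CPU if needed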

Note also that Tullio will not be especially quick at operations that look like matrix multiplication on CUDA arrays. There are many tricks for efficient memory access which it does not know. (On ordinary CPU arrays the situation is better.)
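The same mechanism covers the contraction you asked about; nothing restricts it to matrix multiplication. A sketch, with made-up sizes chosen only so the shared indices match:

# Contraction from the question: D[l] = sum over i,j,k of A[i,j,k] * B[k,i] * C[l,j].
# With CUDA, CUDAKernels and KernelAbstractions loaded, @tullio dispatches this
# to a GPU kernel whenever the arguments are CuArrays.
contract(A, B, C) = @tullio D[l] := A[i,j,k] * B[k,i] * C[l,j]

A_d = CuArray(rand(20, 30, 40))
B_d = CuArray(rand(40, 20))
C_d = CuArray(rand(50, 30))

D_d = contract(A_d, B_d, C_d)  # 50-element CuVector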


Thank you so much! For the link Gradients & GPU
it works. But, if I add @btime on mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k], the code also gave me

ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
...

May I know why @btime leads to this issue, and how can I fix it?
Here is the code with @btime:

using Tullio
using BenchmarkTools

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

A = rand(3,40); B = rand(40,500);
@btime A * B β‰ˆ mul(A, B) # true

using Tracker # or Zygote
Ξ”A = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
Ξ”A β‰ˆ ones(3,500) * B' # true

using CUDA, CUDAKernels, KernelAbstractions # Now defined with a GPU version:
@btime mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

cu(A * B) β‰ˆ mul(cu(A), cu(B)) # true

cu(Ξ”A) β‰ˆ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1] # true

# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]

When you put @btime in front of that method definition expression, it moved the expression into a local scope (the benchmark loop), so the global mul was never redefined. By the time your code called mul on CuArrays, it was still using the first definition, which fails as expected.

julia> @btime f() = 0  # #-filled name indicates locally scoped function
  1.549 ns (0 allocations: 0 bytes)
(::var"#f#1") (generic function with 1 method)

julia> let # block makes a local scope
         f() = 0
       end
(::var"#f#3") (generic function with 1 method)

julia> f  # no global f was made
ERROR: UndefVarError: f not defined

There's also no point in trying to benchmark a method definition; it does not run the function body, which is probably what you actually wanted to time. If so, do this:

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]
@btime mul( $(cu(A)) , $(cu(B)) )
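
One extra caveat when you do time the GPU version: kernel launches are asynchronous, so it is usual to synchronize inside the benchmark, otherwise you mostly measure launch overhead. A sketch, reusing the same mul and arrays:

A_d, B_d = cu(A), cu(B)
@btime CUDA.@sync mul($A_d, $B_d)  # CUDA.@sync waits for the kernel to finish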