Converting the result of round or floor to an Int in Metal

I’m getting into GPU programming and am trying to write a simple kernel that calculates the correlation function of an image. I’ve run into a problem: I cannot convert the result of the round or floor functions to an Int.

As a very simple prototype, I tried the following:

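```julia
using Metal

result = MtlArray(zeros(Int32, 10))

function test(result)
    i = thread_position_in_grid_1d()
    i1 = 1.0f0
    i2 = 3.3f0

    # Round in Float32, then convert to Int32: this line triggers the error below
    result[i] = Int32(round(i1+i2))

    return
end

@metal threads=10 groups=1 test(result)
```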

When I run it, I get the following error. Why is this not possible?

ERROR: InvalidIRError: compiling MethodInstance for test(::MtlDeviceVector{Int32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_malloc)
Stacktrace:
 [1] malloc
   @ ~/.julia/packages/GPUCompiler/BxfIW/src/runtime.jl:89
 [2] macro expansion
   @ ~/.julia/packages/GPUCompiler/BxfIW/src/runtime.jl:184
 [3] macro expansion
   @ ./none:0
 [4] box
   @ ./none:0
 [5] box_float32
   @ ~/.julia/packages/GPUCompiler/BxfIW/src/runtime.jl:213
 [6] Int32
   @ ./float.jl:900
 [7] test
   @ ~/Library/Mobile Documents/com~apple~CloudDocs/Documents/PhD/JuliaProjects/InteractiveIsing.jl/Tests/MetalSampling.jl:53
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/validation.jl:145
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:407 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:406 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/utils.jl:89
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/utils.jl:83 [inlined]
  [7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:120
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:94 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:90
 [10] compile
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:81 [inlined]
 [11] compile(job::GPUCompiler.CompilerJob, ctx::LLVM.ThreadSafeContext)
    @ Metal ~/.julia/packages/Metal/TtPHW/src/compiler/compilation.jl:59
 [12] #59
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/compilation.jl:55 [inlined]
 [13] LLVM.ThreadSafeContext(f::Metal.var"#59#60"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}})
    @ LLVM ~/.julia/packages/LLVM/TLGyi/src/executionengine/ts_module.jl:14
 [14] JuliaContext(f::Metal.var"#59#60"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:35
 [15] compile
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/compilation.jl:54 [inlined]
 [16] actual_compilation(cache::Dict{UInt64, Any}, key::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, ft::Type, tt::Type, compiler::typeof(Metal.compile), linker::typeof(Metal.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/cache.jl:53
 [17] cached_compilation(cache::Dict{UInt64, Any}, cfg::GPUCompiler.CompilerConfig{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, ft::Type, tt::Type, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/cache.jl:37
 [18] macro expansion
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:161 [inlined]
 [19] macro expansion
    @ ./lock.jl:267 [inlined]
 [20] mtlfunction(f::typeof(test), tt::Type{Tuple{MtlDeviceVector{Int32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metal ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:157
 [21] mtlfunction(f::typeof(test), tt::Type{Tuple{MtlDeviceVector{Int32, 1}}})
    @ Metal ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:155
 [22] top-level scope
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:77

Does round(Int32, i1+i2) work?

Wow, yes, this works! But why does the other code error? I assumed there were no type instabilities in the code: I assume round(x) returns a ::typeof(x), and Int32 obviously returns an Int32.
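A quick CPU-side check supports the type reasoning above. This is a minimal sketch assuming only plain Julia Base on the host; `to_int32` is a hypothetical helper that mirrors the failing kernel line, and `Base.return_types` only reports what CPU inference concludes, not what the Metal compiler accepts.

```julia
# Hypothetical helper mirroring the failing kernel line: round, then convert.
to_int32(x) = Int32(round(x))

# round(x) keeps the argument type: Float32 in, Float32 out.
@assert round(1.0f0 + 3.3f0) isa Float32

# CPU inference concludes the whole expression returns a concrete Int32,
# so the failure is not caused by a type instability.
rts = Base.return_types(to_int32, (Float32,))
@assert rts == [Int32]
println(rts)
```

The types are fine, so the failure has to come from something other than inference.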

Okay, I seem to be really misunderstanding something. Your suggestion works, but if I now try round(Int32, sqrt(i1+i2))

I get:

ERROR: InvalidIRError: compiling MethodInstance for test(::MtlDeviceVector{Int32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_malloc)
Stacktrace:
 [1] malloc
   @ ~/.julia/packages/GPUCompiler/BxfIW/src/runtime.jl:89
 [2] macro expansion
   @ ~/.julia/packages/GPUCompiler/BxfIW/src/runtime.jl:184
 [3] macro expansion
   @ ./none:0
 [4] box
   @ ./none:0
 [5] box_float32
   @ ~/.julia/packages/GPUCompiler/BxfIW/src/runtime.jl:213
 [6] trunc
   @ ./float.jl:893
 [7] round
   @ ./float.jl:384
 [8] test
   @ ~/Library/Mobile Documents/com~apple~CloudDocs/Documents/PhD/JuliaProjects/InteractiveIsing.jl/Tests/MetalSampling.jl:57
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/validation.jl:145
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:407 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:406 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/utils.jl:89
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/utils.jl:83 [inlined]
  [7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:120
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:94 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:90
 [10] compile
    @ ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:81 [inlined]
 [11] compile(job::GPUCompiler.CompilerJob, ctx::LLVM.ThreadSafeContext)
    @ Metal ~/.julia/packages/Metal/TtPHW/src/compiler/compilation.jl:59
 [12] #59
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/compilation.jl:55 [inlined]
 [13] LLVM.ThreadSafeContext(f::Metal.var"#59#60"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}})
    @ LLVM ~/.julia/packages/LLVM/TLGyi/src/executionengine/ts_module.jl:14
 [14] JuliaContext(f::Metal.var"#59#60"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/driver.jl:35
 [15] compile
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/compilation.jl:54 [inlined]
 [16] actual_compilation(cache::Dict{UInt64, Any}, key::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, ft::Type, tt::Type, compiler::typeof(Metal.compile), linker::typeof(Metal.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/cache.jl:53
 [17] cached_compilation(cache::Dict{UInt64, Any}, cfg::GPUCompiler.CompilerConfig{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, ft::Type, tt::Type, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/BxfIW/src/cache.jl:37
 [18] macro expansion
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:161 [inlined]
 [19] macro expansion
    @ ./lock.jl:267 [inlined]
 [20] mtlfunction(f::typeof(test), tt::Type{Tuple{MtlDeviceVector{Int32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metal ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:157
 [21] mtlfunction(f::typeof(test), tt::Type{Tuple{MtlDeviceVector{Int32, 1}}})
    @ Metal ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:155
 [22] top-level scope
    @ ~/.julia/packages/Metal/TtPHW/src/compiler/execution.jl:77

I think in all cases the problem is that, even if the return value has the correct type, it is important that the functions themselves do not use 64-bit types internally. For example, sqrt probably uses a Float64 if you provide integers as inputs…

(Maybe round(Float32(i1 + i2)) works. But I’m also guessing.)
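The types involved can be checked on the CPU. A minimal sketch, assuming plain Julia Base semantics (what the Metal compiler does internally may still differ): sqrt promotes integer arguments to Float64, but a Float32 argument stays Float32, and round keeps the floating-point type, so round(Float32(...)) still yields a float that needs converting.

```julia
i1, i2 = 1.0f0, 3.3f0   # same Float32 values as in the kernel

@assert sqrt(4) isa Float64                  # integer argument: promoted to Float64
@assert sqrt(i1 + i2) isa Float32            # Float32 argument: stays Float32
@assert round(sqrt(i1 + i2)) isa Float32     # round keeps the floating-point type
@assert round(Float32(i1 + i2)) isa Float32  # still a float, not an integer
```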

So you mean to say that there is probably some intermediate step in sqrt which uses Float64? I’m passing the sum of two Float32s to sqrt, though. Also, why does this problem not come up if I don’t use the result as an input to the round function? If I just set result[i] = sqrt(i1+i2), it works fine.

Edit:

Of course, when I do result[i] = round(Int32, somefloat), I pass result = MtlArray(zeros(Int32, 10)) to the test function (and for floats I use the corresponding MtlArray(zeros(Float32, 10))). I had a typo in my opening post, which I have now fixed.

Oh, maybe someone else here can help better. I tried to look into the source code of round and got confused. (I will look at it later, since I want to know the solution as well 😉)

Not a type instability, but a checked conversion: round still returns a Float32, so the conversion to Int32 may throw an InexactError (throwing such an exception is currently unsupported by Metal.jl). The typical workaround is to use unsafe_trunc(Int32, ...) or ... % Int32.
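The difference between the checked and the unchecked conversion can be seen on the CPU. This is a minimal sketch in plain Julia; the value 3.0f10 is just an arbitrary out-of-range example, and `throws_inexact` is a hypothetical helper. Judging by the stacktrace above (Int32 → box_float32 → malloc), it is the construction of the exception on the throwing path that the Metal compiler cannot handle.

```julia
x_ok  = round(1.0f0 + 3.3f0)   # 4.0f0, representable as Int32
x_big = round(3.0f10)          # 3.0f10, far above typemax(Int32)

# Hypothetical helper: true if calling f() throws an InexactError.
throws_inexact(f) = try
    f()
    false
catch err
    err isa InexactError
end

# Checked conversion: correct for in-range values, but it has a throwing path.
@assert Int32(x_ok) === Int32(4)
@assert throws_inexact(() -> Int32(x_big))

# Unchecked conversion: no throwing path, same result for in-range values.
@assert unsafe_trunc(Int32, x_ok) === Int32(4)
```

Inside the kernel, the analogous (untested) line would be result[i] = unsafe_trunc(Int32, round(i1 + i2)).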

Apologies for reviving this thread, but I’ve run into exactly the same issue, which still seems to be around. I also don’t see an issue on Metal.jl (there seems to be a similar but not identical issue on KernelAbstractions.jl). I’d just like to understand whether this is actually a “bug”, an inherent limitation of Metal, or something else.

And how “safe” is unsafe_trunc? What are the potential pitfalls? Should I go through the extra work of using different round or floor implementations on different backends?

Would it be worthwhile to have “official” workarounds in KernelAbstractions.jl that can be centrally updated as the backends improve?

Presumably exactly as unsafe as the docstring says:

Return the nearest integral value of type T whose absolute value is less
than or equal to the absolute value of x. If the value is not representable
by T, an arbitrary value will be returned. See also trunc.
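In practice, the pitfalls are the inputs the docstring excludes: values outside the Int32 range, ±Inf, and NaN. If a defined result is needed for those too, one option is to clamp before converting. The helper below is hypothetical (not part of Base, Metal.jl or KernelAbstractions.jl) and is a minimal CPU-side sketch; that it compiles cleanly for Metal is an assumption.

```julia
# Hypothetical helper: a Float32 -> Int32 truncation with no throwing path
# that is also well-defined for out-of-range values, ±Inf and NaN.
function saturating_trunc_int32(x::Float32)
    isnan(x) && return Int32(0)      # NaN has no integer value; choose 0
    lo = -2.0f0^31                   # exactly typemin(Int32) as a Float32
    hi = prevfloat(2.0f0^31)         # largest Float32 below 2^31 (2147483520)
    # After clamping, unsafe_trunc only ever sees representable values.
    return unsafe_trunc(Int32, clamp(x, lo, hi))
end

@assert saturating_trunc_int32(4.7f0) === Int32(4)     # truncates toward zero
@assert saturating_trunc_int32(-4.7f0) === Int32(-4)
@assert saturating_trunc_int32(3.0f10) === Int32(2147483520)
@assert saturating_trunc_int32(-Inf32) === typemin(Int32)
@assert saturating_trunc_int32(NaN32) === Int32(0)
```

The upper bound saturates at 2147483520 rather than typemax(Int32), because typemax(Int32) itself is not representable as a Float32.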

Well, that doesn’t sound so unsafe to me, so I guess the only difference is that it doesn’t throw an exception?

Would I be right in assuming that this potential for an exception is what prevents using round, floor, ceil, trunc, and so forth?
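On the CPU, floor, ceil, trunc and round with an integer target type all check the range and throw an InexactError for values that do not fit, so they share the throwing path discussed above. The `unchecked_*` helpers below are hypothetical names: a minimal sketch of the workaround pattern from this thread (round in floating point, then unsafe_trunc) applied to each rounding mode.

```julia
# Hypothetical helper: true if calling f() throws an InexactError.
throws_inexact(f) = try
    f()
    false
catch err
    err isa InexactError
end

# The checked forms all have a throwing path for out-of-range inputs.
for rnd in (floor, ceil, trunc, round)
    @assert throws_inexact(() -> rnd(Int32, 3.0f10))
end

# Hypothetical unchecked counterparts: round in floating point, then convert unchecked.
unchecked_floor(::Type{T}, x) where {T<:Integer} = unsafe_trunc(T, floor(x))
unchecked_ceil(::Type{T}, x) where {T<:Integer}  = unsafe_trunc(T, ceil(x))
unchecked_round(::Type{T}, x) where {T<:Integer} = unsafe_trunc(T, round(x))

@assert unchecked_floor(Int32, -1.5f0) === Int32(-2)
@assert unchecked_ceil(Int32, -1.5f0) === Int32(-1)
@assert unchecked_round(Int32, 2.5f0) === Int32(2)   # RoundNearest: ties go to even
```

unsafe_trunc itself already plays the role of an unchecked trunc. The same caveat as for unsafe_trunc applies: out-of-range inputs silently give an arbitrary value instead of an error.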