Calling Julia from PyTorch

Hello

I have written a GPU kernel in Julia to accelerate a custom layer that I use with PyTorch. Everything works great, except when I try to use DDP (DistributedDataParallel) from PyTorch. Here is a simple MWE that reproduces the issue.

main.py

from julia import Julia
jpath = "~path to julia~"
jl = Julia(runtime=jpath, compiled_modules=False)
from julia import Pkg
Pkg.activate("./matmul")
from julia import matmul
import torch

dev = '0'
x = torch.randn(1024, 1024).to(f"cuda:{dev}")
w = torch.randn(1024, 1024).to(f"cuda:{dev}")
y = torch.zeros(1024, 1024).to(f"cuda:{dev}")
matmul.MatMul(y, x, w, int(dev))

yGolden = torch.matmul(w, x)

print(torch.allclose(y, yGolden, atol=1e-2))

matmul/src/matmul.jl

__precompile__()
module matmul
using DLPack
using PyCall
using CUDA
const torch = PyNULL()
const dl = PyNULL()

function __init__()
    copy!(torch, pyimport("torch"))
    copy!(dl, pyimport("torch.utils.dlpack"))
end

function matrixMul(y, x, w)
    tx = threadIdx().x
    ty = threadIdx().y
    bx = blockIdx().x
    by = blockIdx().y
    row = tx + (bx - 1) * blockDim().x
    col = ty + (by - 1) * blockDim().y
    for i in 1:1024
        y[row, col] += x[row, i] * w[i, col]
    end
    return
end
function MatMul(yTensor, xInTensor, wTensor, dev)

    device!(dev)
    y = DLPack.wrap(yTensor, o -> @pycall dl.to_dlpack(o)::PyObject)::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
    xIN = DLPack.wrap(xInTensor, o -> @pycall dl.to_dlpack(o)::PyObject)::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
    w = DLPack.wrap(wTensor, o -> @pycall dl.to_dlpack(o)::PyObject)::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

    ySize = size(y)
    tthreads = (16, 16)
    bblocks = (ceil(Int, ySize[1] / tthreads[1]), ceil(Int, ySize[2] / tthreads[2]))
    @cuda threads=tthreads blocks=bblocks matrixMul(y, xIN, w)
end

export MatMul
end # module matmul

When I run this as python-jl main.py, it works. But when I run it through the distributed launcher as python-jl -m torch.distributed.launch --nproc_per_node=1 main.py, I get the following error:

/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
  Activating project at `~/ddpJulia/matmul`
Traceback (most recent call last):
  File "/ddpJulia/main.py", line 13, in <module>
    matmul.MatMul(y, x, w, int(dev))
RuntimeError: <PyCall.jlwrap (in a Julia function called from Python)
JULIA: InvalidIRError: compiling MethodInstance for matmul.matrixMul(::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to pointerref(ptr::Core.LLVMPtr{T, A}, i::I, ::Val{align}) where {T, A, I, align} in LLVM.Interop at /.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:9)
Stacktrace:
 [1] unsafe_load
   @ ~/.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:85
 [2] arrayref_bits
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:91
 [3] #arrayref
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:85
 [4] getindex
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:164
 [5] getindex
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:176
 [6] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to *)
Stacktrace:
 [1] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to +)
Stacktrace:
 [1] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:166
 [2] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:179
 [3] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to pointerset(ptr::Core.LLVMPtr{T, A}, x::T, i::I, ::Val{align}) where {T, A, I, align} in LLVM.Interop at /.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:46)
Stacktrace:
 [1] unsafe_store!
   @ ~/.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:88
 [2] arrayset_bits
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:134
 [3] #arrayset
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:127
 [4] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:166
 [5] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:179
 [6] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/validation.jl:149
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:415 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:414 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/utils.jl:89
  [6] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:129
  [7] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:106
  [8] compile
    @ ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:98 [inlined]
  [9] #1037
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/compilation.jl:104 [inlined]
 [10] JuliaContext(f::CUDA.var"#1037#1040"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:58
 [11] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/compiler/compilation.jl:103
 [12] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/execution.jl:125
 [13] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/execution.jl:103
 [14] macro expansion
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:318 [inlined]
 [15] macro expansion
    @ ./lock.jl:223 [inlined]
 [16] cufunction(f::typeof(matmul.matrixMul), tt::Type{Tuple{CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:313
 [17] cufunction
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:310 [inlined]
 [18] macro expansion
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:104 [inlined]
 [19] MatMul(yTensor::PyObject, xInTensor::PyObject, wTensor::PyObject, dev::Int64)
    @ matmul ~/ddpJulia/matmul/src/matmul.jl:38
 [20] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./essentials.jl:729
 [21] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:726
 [22] _pyjlwrap_call(f::Function, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/SBNSg/src/callback.jl:28
 [23] pyjlwrap_call(self_::Ptr{PyCall.PyObject_struct}, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/SBNSg/src/callback.jl:44>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 49569) of binary: /opt/miniconda3/envs/torch/bin/python
Traceback (most recent call last):
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/julia/pseudo_python_cli.py", line 308, in main
    python(**vars(ns))
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/julia/pseudo_python_cli.py", line 54, in python
    scope = runpy.run_module(module, run_name="__main__", alter_sys=True)
  File "/opt/miniconda3/envs/torch/lib/python3.10/runpy.py", line 224, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/opt/miniconda3/envs/torch/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/miniconda3/envs/torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED

This MWE might look too simplistic, but the same setup makes more sense in my larger project.

As I intend to train and run inference with large neural networks, I need DDP for multi-GPU and multi-node runs. Any help in understanding what is causing this InvalidIRError would be greatly appreciated.

Thanks!

When I decorate the kernel launch in matmul.jl with @device_code_warntype interactive=false, I get the output below.
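For reference, the decoration only wraps the @cuda call inside MatMul, roughly like this (a sketch; nothing else in the module changes):

@device_code_warntype interactive=false @cuda threads=tthreads blocks=bblocks matrixMul(y, xIN, w)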

/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
  Activating project at `~/ddpJulia/matmul`
PTX CompilerJob of MethodInstance for matmul.matrixMul(::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}) for sm_80

MethodInstance for matmul.matrixMul(::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1})
  from matrixMul(y, x, w) in matmul at /ddpJulia/matmul/src/matmul.jl:15
Arguments
  #self#::Core.Const(matmul.matrixMul)
  y::CUDA.CuDeviceMatrix{Float32, 1}
  x::CUDA.CuDeviceMatrix{Float32, 1}
  w::CUDA.CuDeviceMatrix{Float32, 1}
Locals
  @_5::UNION{NOTHING, TUPLE{INT64, INT64}}
  col::Int64
  row::Int64
  by::Int32
  bx::Int32
  ty::Int32
  tx::Int32
  i::Int64
Body::Nothing
1 ─ %1  = matmul.threadIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│         (tx = Base.getproperty(%1, :x))
│   %3  = matmul.threadIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│         (ty = Base.getproperty(%3, :y))
│   %5  = matmul.blockIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│         (bx = Base.getproperty(%5, :x))
│   %7  = matmul.blockIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│         (by = Base.getproperty(%7, :y))
│   %9  = tx::Int32
│   %10 = (bx - 1)::Int64
│   %11 = matmul.blockDim()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %12 = Base.getproperty(%11, :x)::Int32
│   %13 = (%10 * %12)::Int64
│         (row = %9 + %13)
│   %15 = ty::Int32
│   %16 = (by - 1)::Int64
│   %17 = matmul.blockDim()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│   %18 = Base.getproperty(%17, :y)::Int32
│   %19 = (%16 * %18)::Int64
│         (col = %15 + %19)
│   %21 = (1:1024)::Core.Const(1:1024)
│         (@_5 = Base.iterate(%21))
│   %23 = (@_5::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %24 = Base.not_int(%23)::Core.Const(true)
└──       goto #4 if not %24
2 ┄ %26 = @_5::Tuple{Int64, Int64}
│         (i = Core.getfield(%26, 1))
│   %28 = Core.getfield(%26, 2)::Int64
│   %29 = Base.getindex(y, row, col)::ANY
│   %30 = Base.getindex(x, row, i)::ANY
│   %31 = Base.getindex(w, i, col)::ANY
│   %32 = (%30 * %31)::ANY
│   %33 = (%29 + %32)::ANY
│         Base.setindex!(y, %33, row, col)
│         (@_5 = Base.iterate(%21, %28))
│   %36 = (@_5 === nothing)::Bool
│   %37 = Base.not_int(%36)::Bool
└──       goto #4 if not %37
3 ─       goto #2
4 ┄       return nothing

Traceback (most recent call last):
  File "/ddpJulia/main.py", line 13, in <module>
    matmul.MatMul(y, x, w, int(dev))
RuntimeError: <PyCall.jlwrap (in a Julia function called from Python)
JULIA: InvalidIRError: compiling MethodInstance for matmul.matrixMul(::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to pointerref(ptr::Core.LLVMPtr{T, A}, i::I, ::Val{align}) where {T, A, I, align} in LLVM.Interop at /.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:9)
Stacktrace:
 [1] unsafe_load
   @ ~/.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:85
 [2] arrayref_bits
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:91
 [3] #arrayref
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:85
 [4] getindex
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:164
 [5] getindex
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:176
 [6] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to *)
Stacktrace:
 [1] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to +)
Stacktrace:
 [1] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:166
 [2] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:179
 [3] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Reason: unsupported dynamic function invocation (call to pointerset(ptr::Core.LLVMPtr{T, A}, x::T, i::I, ::Val{align}) where {T, A, I, align} in LLVM.Interop at /.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:46)
Stacktrace:
 [1] unsafe_store!
   @ ~/.julia/packages/LLVM/Od0DH/src/interop/pointer.jl:88
 [2] arrayset_bits
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:134
 [3] #arrayset
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:127
 [4] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:166
 [5] setindex!
   @ ~/.julia/packages/CUDA/tVtYo/src/device/array.jl:179
 [6] matrixMul
   @ ~/ddpJulia/matmul/src/matmul.jl:23
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/validation.jl:149
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:415 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:414 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/utils.jl:89
  [6] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:129
  [7] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:106
  [8] compile
    @ ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:98 [inlined]
  [9] #1037
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/compilation.jl:104 [inlined]
 [10] JuliaContext(f::CUDA.var"#1037#1040"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/driver.jl:58
 [11] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/compiler/compilation.jl:103
 [12] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/execution.jl:125
 [13] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/l8TxP/src/execution.jl:103
 [14] macro expansion
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:318 [inlined]
 [15] macro expansion
    @ ./lock.jl:223 [inlined]
 [16] cufunction(f::typeof(matmul.matrixMul), tt::Type{Tuple{CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:313
 [17] cufunction
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:310 [inlined]
 [18] macro expansion
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:104 [inlined]
 [19] macro expansion
    @ ~/.julia/packages/GPUCompiler/l8TxP/src/reflection.jl:213 [inlined]
 [20] MatMul(yTensor::PyObject, xInTensor::PyObject, wTensor::PyObject, dev::Int64)
    @ matmul ~/ddpJulia/matmul/src/matmul.jl:37
 [21] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./essentials.jl:729
 [22] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:726
 [23] _pyjlwrap_call(f::Function, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/SBNSg/src/callback.jl:28
 [24] pyjlwrap_call(self_::Ptr{PyCall.PyObject_struct}, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/SBNSg/src/callback.jl:44>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51142) of binary: /opt/miniconda3/envs/torch/bin/python
Traceback (most recent call last):
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/julia/pseudo_python_cli.py", line 308, in main
    python(**vars(ns))
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/julia/pseudo_python_cli.py", line 54, in python
    scope = runpy.run_module(module, run_name="__main__", alter_sys=True)
  File "/opt/miniconda3/envs/torch/lib/python3.10/runpy.py", line 224, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/opt/miniconda3/envs/torch/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/miniconda3/envs/torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/miniconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED

Can you inspect how exactly Julia is launched in this mode (e.g. by looking at ps, or by showing Base.julia_cmd()) to see if any unexpected flags are introduced?
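One minimal way to check from the embedded Julia is to print it from the module's __init__ (just a sketch; any point after Julia has started would do):

function __init__()
    copy!(torch, pyimport("torch"))
    copy!(dl, pyimport("torch.utils.dlpack"))
    @show Base.julia_cmd()  # shows the flags the current Julia session was started with
end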

Thanks for responding!

Here are the outputs.

python-jl main.py:
Base.julia_cmd() = /opt/julia-1.8.5/bin/julia -Cnative -J/opt/julia-1.8.5/lib/julia/sys.so -g1

python-jl -m torch.distributed.launch...:
Base.julia_cmd() = /opt/julia-1.8.5/bin/julia -Cnative -J/opt/julia-1.8.5/lib/julia/sys.so --compiled-modules=no -g1

Interesting, as both cases create the Julia runtime with compiled_modules=False. Any thoughts?

I was hoping for a --compile=min option or something similar, which could have explained the codegen difference. Not sure what's up then, sorry. It's possible that loading packages or defining methods regresses compilation of other code, e.g. Method definitions break native rand! kernel · Issue #1508 · JuliaGPU/CUDA.jl · GitHub. Maybe try replicating the distributed environment outside of python-jl, e.g. by inspecting Base.loaded_modules and loading those in Base.active_project().
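A rough sketch of what that reproduction could look like (the exact set of packages is whatever the dump shows; the steps are an assumption about the suggested workflow, not something verified here):

# In the failing torchrun/python-jl process, dump the environment:
@show Base.active_project()
for pkg in keys(Base.loaded_modules)
    println(pkg)
end

# Then start a plain Julia session with the same project,
#   julia --project=<the active project printed above>
# load the same packages there, and launch the kernel again to see
# whether the InvalidIRError reproduces outside of python-jl.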