I am trying to train a neural network and make inference using this neural network with Lux (1.22.1)
AMDGPU (2.1.1) and julia (1.12.0). The AMDGPU version kernel driver is 6.3.6 (as reported by /sys/module/amdgpu/version)
I tried several versions of ROCM (6.0.3, 6.2.2 and 7.0.2). With the ROCM version 6.0.2 and 6.2.2 I have a CPU memory leak which seems to be fixed in version 7.0.2.
However, I get the following errors from time to time:
:0:rocdevice.cpp :3672: 11096368621572 us: Callback: Queue 0x1552e2400000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB
error in running finalizer: AMDGPU.HIP.HIPError(code=AMDGPU.HIP.hipError_t(0x000003e7), msg="unknown error (999)")
check at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/hip/error.jl:145 [inlined]
macro expansion at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/utils.jl:123 [inlined]
hipCtxGetCurrent at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/hip/libhip.jl:4939
prepare_state at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/tls.jl:194
prepare_state at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/tls.jl:193 [inlined]
miopenDestroyConvolutionDescriptor at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/libMIOpen.jl:352 [inlined]
#ConvolutionDescriptor##0 at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/descriptors.jl:98
unknown function (ip: 0x1552e23b5a01) at (unknown file)
run_finalizer at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gc-common.c:180
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gc-common.c:270
run_finalizers at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gc-common.c:316
jl_promote_ci_to_current at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gf.c:1804
finish_nocycle at ./../usr/share/julia/Compiler/src/typeinfer.jl:210
jfptr_finish_nocycle_118954.1 at /appl/local/csc/soft/math/julia/1.12.0/lib/julia/sys.so (unknown line)
typeinf at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:4502
typeinf_ext at ./../usr/share/julia/Compiler/src/typeinfer.jl:1259
Or this:
ERROR: LoadError: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 1.313 GiB.
- HIP pool reserved: 1.313 GiB.
- Hard memory limit: 51.188 GiB.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:44
[2] alloc_or_retry!(f::AMDGPU.Runtime.Mem.var"#5#6"{AMDGPU.HIP.HIPStream, Int64, Base.RefValue{Ptr{Nothing}}}, isfailed::typeof(isnothing); stream::AMDGPU.HIP.HIPStream)
@ AMDGPU.Runtime.Mem /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:34
[3] alloc_or_retry!
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:1 [inlined]
[4] AMDGPU.Runtime.Mem.HIPBuffer(bytesize::Int64; stream::AMDGPU.HIP.HIPStream)
@ AMDGPU.Runtime.Mem /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/hip.jl:46
[5] HIPBuffer
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/hip.jl:38 [inlined]
[6] pool_alloc
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/memory.jl:283 [inlined]
[7] (::AMDGPU.var"#26#27"{AMDGPU.Runtime.Mem.HIPBuffer, Float32, NTuple{4, Int64}, Int64})()
@ AMDGPU /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/array.jl:11
[8] cached_alloc(f::AMDGPU.var"#26#27"{AMDGPU.Runtime.Mem.HIPBuffer, Float32, NTuple{4, Int64}, Int64}, key::Tuple{UnionAll, AMDGPU.HIP.HIPDevice, DataType, Int64})
@ GPUArrays /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/GPUArrays/w335n/src/host/alloc_cache.jl:36
[9] AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}(::UndefInitializer, dims::NTuple{4, Int64})
@ AMDGPU /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/array.jl:9
This error message is also reported here:
These two errors also occurs with other tested ROCM versions.
Or I have this error:
MIOpen Error: HIP runtime error: out of memory. hip_check_error.hpp: 17in function: hip_check_error
ERROR: LoadError: MIOpenException:
- status: miopenStatusUnknownError
- description: Unknown error
Stacktrace:
[1] check
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/MIOpen.jl:54 [inlined]
[2] macro expansion
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/utils.jl:123 [inlined]
[3] miopenConvolutionBackwardData(handle::Ptr{AMDGPU.MIOpen.miopenHandle}, alpha::Base.RefValue{Float32}, dyDesc::Ptr{AMDGPU.MIOpen.miopenTensorDescriptor}, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, wDesc::Ptr{AMDGPU.MIOpen.miopenTensorDescriptor}, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, convDesc::Ptr{AMDGPU.MIOpen.miopenConvolutionDescriptor}, algo::AMDGPU.MIOpen.miopenConvBwdDataAlgorithm_t, beta::Base.RefValue{Float32}, dxDesc::Ptr{AMDGPU.MIOpen.miopenTensorDescriptor}, dx::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, workSpace::AMDGPU.ROCArray{UInt8, 1, AMDGPU.Runtime.Mem.HIPBuffer}, workSpaceSize::UInt64)
@ AMDGPU.MIOpen /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/libMIOpen.jl:536
[4] ∇convolution_data!(∇x::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, ∇xdesc::AMDGPU.MIOpen.TensorDescriptor, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, dydesc::AMDGPU.MIOpen.TensorDescriptor, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, wdesc::AMDGPU.MIOpen.TensorDescriptor, cdesc::AMDGPU.MIOpen.ConvolutionDescriptor, conv_args::AMDGPU.MIOpen.ConvolutionArgs)
@ AMDGPU.MIOpen /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/convolution.jl:221
[5] ∇convolution_data!(∇x::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}; padding::Tuple{Int64, Int64}, stride::Tuple{Int64, Int64}, dilation::Tuple{Int64, Int64}, groups::Int64)
@ AMDGPU.MIOpen /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/convolution.jl:238
[6] ∇conv_data!(dx::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2})
@ NNlibAMDGPUExt /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/NNlib/1TYHL/ext/NNlibAMDGPUExt/conv.jl:48
[7] #∇conv_data#124
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/NNlib/1TYHL/src/conv.jl:99 [inlined]
[8] ∇conv_data
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/NNlib/1TYHL/src/conv.jl:95 [inlined]
[9] ∇conv_data
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/LuxLib/R8Czx/src/impl/conv.jl:130 [inlined]
[10] conv_transpose
@ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/Lux/NsWoa/src/layers/conv.jl:55 [inlined]
[11] var"##c::ConvTransposeinternal#465"(c::ConvTranspose{typeof(selu), Int64, Int64, Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64, typeof(glorot_uniform), typeof(zeros32), True, True}, x::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, ps::@NamedTuple{weight::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, bias::AMDGPU.ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}, st::@NamedTuple{})
@ Lux /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/Lux/NsWoa/src/layers/conv.jl:437
[12] macro expansion
As far as I can tell, this error occurs only with ROCM 7.0.2.
The GPU is a MI250x with 64 GB of vRAM per GPU die (the HPC cluster LUMI).
Currently I am only using a single GPU (or rather a single Graphics Compute Die) for these tests.
Unfortunately, I was not able so far create a short reproducer. The error occurs between 1 and 11 hours after executions at different location in the program execution.
I already tried the options hard_memory_limit = "80 %" and eager_gc = true in my LocalPreferences.toml but without success.
This issue seems to be a memory leak of vRAM (or fragmentation).
Has somebody had a similar problem? For reference, this runs fine on CUDA with a GPU with only 24 GB of vRAM.
output of rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD EPYC 7A53 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7A53 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131343028(0x7d422b4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131343028(0x7d422b4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131343028(0x7d422b4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: AMD EPYC 7A53 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7A53 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 0
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 132111288(0x7dfdbb8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 132111288(0x7dfdbb8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 132111288(0x7dfdbb8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 3
*******
Name: AMD EPYC 7A53 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7A53 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 2
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 0
Internal Node ID: 2
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 132063856(0x7df2270) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 132063856(0x7df2270) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 132063856(0x7df2270) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 4
*******
Name: AMD EPYC 7A53 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7A53 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 3
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 0
Internal Node ID: 3
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 132056244(0x7df04b4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 132056244(0x7df04b4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 132056244(0x7df04b4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 5
*******
Name: gfx90a
Uuid: GPU-8d86b408b832e3b2
Marketing Name: AMD Instinct MI250X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 4
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29704(0x7408)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 51456
Internal Node ID: 4
Compute Unit: 110
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: TRUE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 78
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***