GPU memory issue on AMDGPU

I am trying to train a neural network and make inference using this neural network with Lux (1.22.1)
AMDGPU (2.1.1) and julia (1.12.0). The AMDGPU version kernel driver is 6.3.6 (as reported by /sys/module/amdgpu/version)

I tried several versions of ROCM (6.0.3, 6.2.2 and 7.0.2). With the ROCM version 6.0.2 and 6.2.2 I have a CPU memory leak which seems to be fixed in version 7.0.2.

However, I get the following errors from time to time:

:0:rocdevice.cpp            :3672: 11096368621572 us:  Callback: Queue 0x1552e2400000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB
error in running finalizer: AMDGPU.HIP.HIPError(code=AMDGPU.HIP.hipError_t(0x000003e7), msg="unknown error (999)")
check at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/hip/error.jl:145 [inlined]
macro expansion at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/utils.jl:123 [inlined]
hipCtxGetCurrent at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/hip/libhip.jl:4939
prepare_state at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/tls.jl:194
prepare_state at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/tls.jl:193 [inlined]
miopenDestroyConvolutionDescriptor at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/libMIOpen.jl:352 [inlined]
#ConvolutionDescriptor##0 at /tmp/mkdepot-barthale-14060769/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/descriptors.jl:98
unknown function (ip: 0x1552e23b5a01) at (unknown file)
run_finalizer at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gc-common.c:180
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gc-common.c:270
run_finalizers at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gc-common.c:316
jl_promote_ci_to_current at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-12/src/gf.c:1804
finish_nocycle at ./../usr/share/julia/Compiler/src/typeinfer.jl:210
jfptr_finish_nocycle_118954.1 at /appl/local/csc/soft/math/julia/1.12.0/lib/julia/sys.so (unknown line)
typeinf at ./../usr/share/julia/Compiler/src/abstractinterpretation.jl:4502
typeinf_ext at ./../usr/share/julia/Compiler/src/typeinfer.jl:1259

Or this:

ERROR: LoadError: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 1.313 GiB.
- HIP pool reserved: 1.313 GiB.
- Hard memory limit: 51.188 GiB.

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:44
  [2] alloc_or_retry!(f::AMDGPU.Runtime.Mem.var"#5#6"{AMDGPU.HIP.HIPStream, Int64, Base.RefValue{Ptr{Nothing}}}, isfailed::typeof(isnothing); stream::AMDGPU.HIP.HIPStream)
    @ AMDGPU.Runtime.Mem /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:34
  [3] alloc_or_retry!
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:1 [inlined]
  [4] AMDGPU.Runtime.Mem.HIPBuffer(bytesize::Int64; stream::AMDGPU.HIP.HIPStream)
    @ AMDGPU.Runtime.Mem /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/hip.jl:46
  [5] HIPBuffer
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/runtime/memory/hip.jl:38 [inlined]
  [6] pool_alloc
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/memory.jl:283 [inlined]
  [7] (::AMDGPU.var"#26#27"{AMDGPU.Runtime.Mem.HIPBuffer, Float32, NTuple{4, Int64}, Int64})()
    @ AMDGPU /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/array.jl:11
  [8] cached_alloc(f::AMDGPU.var"#26#27"{AMDGPU.Runtime.Mem.HIPBuffer, Float32, NTuple{4, Int64}, Int64}, key::Tuple{UnionAll, AMDGPU.HIP.HIPDevice, DataType, Int64})
    @ GPUArrays /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/GPUArrays/w335n/src/host/alloc_cache.jl:36
  [9] AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}(::UndefInitializer, dims::NTuple{4, Int64})
    @ AMDGPU /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/array.jl:9

This error message is also reported here:

These two errors also occurs with other tested ROCM versions.

Or I have this error:

MIOpen Error: HIP runtime error: out of memory. hip_check_error.hpp: 17in function: hip_check_error
ERROR: LoadError: MIOpenException:
- status: miopenStatusUnknownError
- description: Unknown error

Stacktrace:
  [1] check
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/MIOpen.jl:54 [inlined]
  [2] macro expansion
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/utils.jl:123 [inlined]
  [3] miopenConvolutionBackwardData(handle::Ptr{AMDGPU.MIOpen.miopenHandle}, alpha::Base.RefValue{Float32}, dyDesc::Ptr{AMDGPU.MIOpen.miopenTensorDescriptor}, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, wDesc::Ptr{AMDGPU.MIOpen.miopenTensorDescriptor}, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, convDesc::Ptr{AMDGPU.MIOpen.miopenConvolutionDescriptor}, algo::AMDGPU.MIOpen.miopenConvBwdDataAlgorithm_t, beta::Base.RefValue{Float32}, dxDesc::Ptr{AMDGPU.MIOpen.miopenTensorDescriptor}, dx::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, workSpace::AMDGPU.ROCArray{UInt8, 1, AMDGPU.Runtime.Mem.HIPBuffer}, workSpaceSize::UInt64)
    @ AMDGPU.MIOpen /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/libMIOpen.jl:536
  [4] ∇convolution_data!(∇x::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, ∇xdesc::AMDGPU.MIOpen.TensorDescriptor, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, dydesc::AMDGPU.MIOpen.TensorDescriptor, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, wdesc::AMDGPU.MIOpen.TensorDescriptor, cdesc::AMDGPU.MIOpen.ConvolutionDescriptor, conv_args::AMDGPU.MIOpen.ConvolutionArgs)
    @ AMDGPU.MIOpen /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/convolution.jl:221
  [5] ∇convolution_data!(∇x::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}; padding::Tuple{Int64, Int64}, stride::Tuple{Int64, Int64}, dilation::Tuple{Int64, Int64}, groups::Int64)
    @ AMDGPU.MIOpen /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/AMDGPU/np0dr/src/dnn/convolution.jl:238
  [6] ∇conv_data!(dx::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, dy::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, w::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2})
    @ NNlibAMDGPUExt /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/NNlib/1TYHL/ext/NNlibAMDGPUExt/conv.jl:48
  [7] #∇conv_data#124
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/NNlib/1TYHL/src/conv.jl:99 [inlined]
  [8] ∇conv_data
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/NNlib/1TYHL/src/conv.jl:95 [inlined]
  [9] ∇conv_data
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/LuxLib/R8Czx/src/impl/conv.jl:130 [inlined]
 [10] conv_transpose
    @ /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/Lux/NsWoa/src/layers/conv.jl:55 [inlined]
 [11] var"##c::ConvTransposeinternal#465"(c::ConvTranspose{typeof(selu), Int64, Int64, Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64, typeof(glorot_uniform), typeof(zeros32), True, True}, x::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, ps::@NamedTuple{weight::AMDGPU.ROCArray{Float32, 4, AMDGPU.Runtime.Mem.HIPBuffer}, bias::AMDGPU.ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}, st::@NamedTuple{})
    @ Lux /tmp/mkdepot-barthale-13865763/julia-depot-sif-FlowMatching-julia-1.12.0/packages/Lux/NsWoa/src/layers/conv.jl:437
 [12] macro expansion

As far as I can tell, this error occurs only with ROCM 7.0.2.

The GPU is a MI250x with 64 GB of vRAM per GPU die (the HPC cluster LUMI).
Currently I am only using a single GPU (or rather a single Graphics Compute Die) for these tests.

Unfortunately, I was not able so far create a short reproducer. The error occurs between 1 and 11 hours after executions at different location in the program execution.

I already tried the options hard_memory_limit = "80 %" and eager_gc = true in my LocalPreferences.toml but without success.

This issue seems to be a memory leak of vRAM (or fragmentation).

Has somebody had a similar problem? For reference, this runs fine on CUDA with a GPU with only 24 GB of vRAM.

output of rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131343028(0x7d422b4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131343028(0x7d422b4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131343028(0x7d422b4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    132111288(0x7dfdbb8) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132111288(0x7dfdbb8) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    132111288(0x7dfdbb8) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        2                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    132063856(0x7df2270) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132063856(0x7df2270) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    132063856(0x7df2270) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 4                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2000                               
  BDFID:                   0                                  
  Internal Node ID:        3                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    132056244(0x7df04b4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    132056244(0x7df04b4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    132056244(0x7df04b4) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 5                  
*******                  
  Name:                    gfx90a                             
  Uuid:                    GPU-8d86b408b832e3b2               
  Marketing Name:          AMD Instinct MI250X                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    4                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29704(0x7408)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1700                               
  BDFID:                   51456                              
  Internal Node ID:        4                                  
  Compute Unit:            110                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    TRUE                               
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 78                                 
  SDMA engine uCode::      8                                  
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    67092480(0x3ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    67092480(0x3ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 4                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***   

Hi, thanks for sharing.

Unsure if it’s AMDGPU or actually underlying ROCm / HIP / MIOPEN which are the root of the memory errors. Maybe trying to reproduce the example from [Feature]: Allow specifying maximum size for memory pool · Issue #3422 · ROCm/hip · GitHub and if it still fails, re-opening the issue on ROCm? I could give it a shot on a RX 7800 XT as well…

(note that on LUMI, it may be that ROCm != 6.0.3 may not be fully operational iirc)