Memory error while using shared arrays


#1

Hi all,

I have a fairly complicated bit of code involving shared arrays. I’m having problems with a memory error, but I can’t seem to reduce it to a minimal working example. Pseudo-code of what I am doing follows:

# Set some parameter size_of_x
# Create object x here which is of type Vector{Vector{Float64}} with size controlled by size_of_x
for n = 1:N
    # Do some work here to update x
    # Create y which is of type Vector{SharedVector{Float64}} by transforming x
    # Note, y is much smaller than x and the size of y is independent of size_of_x
    @sync @distributed for k = 1:K
        # Call some functions on y
    end
end
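For concreteness, a minimal runnable sketch of the pattern above (all sizes, the work done on x, and the functions called on y are placeholders, not the actual code):

```julia
# Minimal sketch of the pattern described above; sizes and work are placeholders.
using Distributed
addprocs(2)
@everywhere using SharedArrays

size_of_x = 10^5
N, K = 3, 4

x = [rand(size_of_x) for _ in 1:5]              # Vector{Vector{Float64}}
for n in 1:N
    # ... update x here ...
    # y is rebuilt on every outer iteration; each SharedVector is backed by
    # a fresh shared-memory segment that the workers mmap.
    y = [SharedArray{Float64}(100) for _ in 1:3]   # Vector{SharedVector{Float64}}
    @sync @distributed for k in 1:K
        # ... call functions on y here ...
        sum(y[mod1(k, length(y))])
    end
end
```

If mmapping one of those fresh segments fails on a worker, it would surface as a SystemError like the one below.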

When size_of_x is small, this code works fine and produces sensible results, and watching my system resources, appears to be using all available CPUs on the distributed loop. But when size_of_x is large, the first iteration of the outer loop works, but on the second iteration, I get the following error:

ERROR: On worker 3:
SystemError: memory mapping failed: Cannot allocate memory
#parse#338 at ./parse.jl:217
parse at ./parse.jl:217
print_shmem_limits at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/SharedArrays/src/SharedArrays.jl:614
shm_mmap_array at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/SharedArrays/src/SharedArrays.jl:641
#6 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/SharedArrays/src/SharedArrays.jl:128
#109 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:265
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:56
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:65
#102 at ./task.jl:262

While the loop is running, I keep an eye on my system resources using Ubuntu’s System Monitor, and RAM appears to go nowhere near the limits (I apparently don’t consume more than 25% of what is available).

I realize this is not a whole lot to go off, but I’m struggling to get a reproducible example, and was just mostly wondering if anyone knows what could be causing an error like this? Is the problem really something to do with system RAM, or is it some other type of memory limit here?

Note, I’ve read the section in the docs on parallel work (several times) 🙂

Cheers,

Colin


#2

Colin, sticking my neck out here.
What does ipcs -l give you?

Also, I like to run this command to keep an eye on memory usage: watch cat /proc/meminfo


#3

My reading of the code is that it should print what the limits are.
I think you may have hit a bug!
Ah, though: you are using version 0.7. We are told 0.7 and 1.0 should be equivalent.
Could you try version 1.0.3 maybe?

On a Debian 9 system with 64 gigs of RAM:

sysctl -a | grep shm

kernel.shm_rmid_forced = 0
kernel.shmall = 4294967296
kernel.shmmax = 68719476736
kernel.shmmni = 4096

Could you run this sysctl command on your system please?

From the 1.0.3 stdlib, SharedArrays.jl:

    shmmax_MB = div(parse(Int, split(read(`sysctl $(pfx).shmmax`, String))[end]), 1024*1024)
    page_size = parse(Int, split(read(`getconf PAGE_SIZE`, String))[end])
    shmall_MB = div(parse(Int, split(read(`sysctl $(pfx).shmall`, String))[end]) * page_size, 1024*1024)

    println("System max size of single shmem segment(MB) : ", shmmax_MB,
        "\nSystem max size of all shmem segments(MB) : ", shmall_MB,
        "\nRequested size(MB) : ", div(slen, 1024*1024),
        "\nPlease ensure requested size is within system limits.",
        "\nIf not, increase system limits and try again.")
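One thing worth noting about the snippet above: `parse(Int, ...)` throws if the sysctl value does not fit in a 64-bit signed integer, which would replace the limits printout with `#parse#...` frames like the ones at the top of the traceback. A minimal sketch of that failure mode (whether this is what is happening here is an assumption; the shmmax string below is hypothetical):

```julia
# Sketch: how print_shmem_limits could itself fail (assumption, not confirmed).
# typemax(Int64) is 9223372036854775807; a kernel.shmmax set near 2^64,
# as some distributions do, cannot be parsed into an Int64.
shmmax_str = "18446744073692774399"             # hypothetical sysctl output value
@assert tryparse(Int, shmmax_str) === nothing   # parse(Int, shmmax_str) throws OverflowError
```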

#4

Hi John,

Thanks for responding.

colin@colin-Z270-HD3:~$ ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767
colin@colin-Z270-HD3:~$ sudo sysctl -a | grep shm
kernel.shm_next_id = -1
kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.enp0s31f6.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
vm.hugetlb_shm_group = 0

Sorry, I should have mentioned I was still using v0.7 (I’m still getting used to the new global scope rules in the REPL). Using v1.0.3 I’m getting the same error message (i.e. without the additional info that it looks like I should be getting). For the sake of completeness, the error message on v1.0.3 is:

ERROR: On worker 4:
SystemError: memory mapping failed: Cannot allocate memory
#parse#332 at ./parse.jl:228
parse at ./parse.jl:228
print_shmem_limits at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/SharedArrays/src/SharedArrays.jl:614
shm_mmap_array at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/SharedArrays/src/SharedArrays.jl:641
#6 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/SharedArrays/src/SharedArrays.jl:128
#109 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:265
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:56
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:65
#102 at ./task.jl:259

I messed around with cat /proc/meminfo during a run, and it seemed to agree with what I was seeing on the System Monitor GUI, i.e. I wasn’t anywhere near a RAM limit. This is what it looks like just after the error is thrown:

colin@colin-Z270-HD3:~$ cat /proc/meminfo
MemTotal:       32825740 kB
MemFree:        17563020 kB
MemAvailable:   22087464 kB
Buffers:          582972 kB
Cached:          5296624 kB
SwapCached:            0 kB
Active:         11078900 kB
Inactive:        3402388 kB
Active(anon):    8619208 kB
Inactive(anon):  1208156 kB
Active(file):    2459692 kB
Inactive(file):  2194232 kB
Unevictable:          16 kB
Mlocked:              16 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:              1524 kB
Writeback:             0 kB
AnonPages:       8601748 kB
Mapped:          1406180 kB
Shmem:           1225676 kB
Slab:             568396 kB
SReclaimable:     335676 kB
SUnreclaim:       232720 kB
KernelStack:       16320 kB
PageTables:        80188 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    18510016 kB
Committed_AS:   17742636 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      272424 kB
DirectMap2M:     9050112 kB
DirectMap1G:    24117248 kB

Thanks again John. Any advice you can offer is greatly appreciated.

Colin


#5

Can you tell me which version of Ubuntu and which kernel you have (uname -r)?
I have access to different Debian versions at work. I don’t think the kernel is relevant, though; just asking for completeness. I can try to reproduce this.


#6

Will check the kernel for you tomorrow when I get to work. In the meantime, it’s a fresh install of Ubuntu 18.04 LTS (I only made the bootable USB a week ago).

I’ll also have another go at making a MWE tomorrow too.

Cheers,

Colin


#7
colin@colin-Z270-HD3:~$ uname -r
4.15.0-43-generic