Crashing shared arrays

I tried to fill a SharedArray using the @distributed macro but it keeps crashing. The MCE (Minimum Crashing Example) is shown below. Am I doing something wrong?

using Distributed, SharedArrays

const s = 10_000
const a = 3
const n = 120_000
const A = SharedArray{Float64}(s, a, n)

addprocs(8)

@everywhere function fill_me!(A, i, s, a)
    A[:, :, i] = rand(s, a)
end

function process()
    @sync @distributed for i in 1:n
        fill_me!(A, i, s, a)
    end
end

@time process()

Crash trace:

signal (7): Bus error
in expression starting at no file:0
getindex at ./array.jl:729 [inlined]
getindex at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/SharedArrays/src/SharedArrays.jl:498 [inlined]
_getindex at ./abstractarray.jl:950 [inlined]
getindex at ./abstractarray.jl:927 [inlined]
findprev at ./array.jl:1866
hash at ./abstractarray.jl:2159
hash at ./hashing.jl:18
unknown function (ip: 0x7fe980b62f60)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
serialize_global_from_main at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/clusterserialize.jl:166
#8 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/clusterserialize.jl:101 [inlined]
foreach at ./abstractarray.jl:1866 [inlined]
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/clusterserialize.jl:101
serialize_type_data at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:520
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:557
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
serialize_type_data at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:540
serialize_type at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:564
serialize_any at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:634
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:617 [inlined]
serialize_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/messages.jl:90
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1571 [inlined]
jl_f__apply at /buildworker/worker/package_linux64/build/src/builtins.c:556
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:594
#invokelatest#1 at ./essentials.jl:742 [inlined]
invokelatest at ./essentials.jl:741 [inlined]
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/messages.jl:185
#remotecall#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/messages.jl:134 [inlined]
remotecall at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/remotecall.jl:349
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
#remotecall#147 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/remotecall.jl:361
unknown function (ip: 0x7fe980b38e0e)
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
remotecall at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/remotecall.jl:361
unknown function (ip: 0x7fe980b38d37)
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
spawnat at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/macros.jl:15
unknown function (ip: 0x7fe980b601c7)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
spawn_somewhere at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/macros.jl:17
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/macros.jl:46 [inlined]
#167 at ./task.jl:244
unknown function (ip: 0x7fe980b5f461)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1571 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:572
unknown function (ip: 0xffffffffffffffff)
Allocations: 13669819 (Pool: 13667722; Big: 2097); GC: 28
Bus error (core dumped)

Shouldn’t you be adding procs before creating the shared array?
The following works for me on Julia 1.1.1, Win10:

julia> using Distributed, SharedArrays

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> A = SharedArray{Float64}(10_000, 3, 120_000);

julia> @everywhere function fill_me!(A, i, s, a)
           A[:, :, i] = rand(s, a)
       end

julia> @everywhere begin
       a = 3
       s = 10_000
       end

julia> function process()
           @sync @distributed for i in 1:120_000
               fill_me!(A, i, s, a)
           end
       end
process (generic function with 1 method)

julia> process()
Task (done) @0x0000000010719430

julia> A
10000×3×120000 SharedArray{Float64,3}:
[:, :, 1] =
 0.709636   0.241925   0.148104
 0.923282   0.360563   0.0580337
 0.918791   0.0355779  0.923521
 0.605844   0.690527   0.690818
 0.535784   0.79656    0.938486
 0.616607   0.860386   0.881078

(Although for some reason only one process seems to be doing any work!?)

(EDIT: Maybe this is memory-bound because rand() is too cheap? An even simpler version on 8 processes, just @sync @distributed for i in eachindex(A); A[i] = rand(); end, also maxes out only one core, while my memory utilisation is at 98%.)
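One way to check whether the iterations are really being spread across the workers, independent of what the CPU monitor shows, is to have each iteration record the id of the process that ran it. The snippet below is only a rough sketch: record_workers and the ids array are names made up for illustration, and it uses 2 workers and a small array so it runs quickly.

```julia
using Distributed, SharedArrays

addprocs(2)                      # add workers BEFORE creating the SharedArray
@everywhere using SharedArrays   # make sure the workers have the package loaded

# Hypothetical helper: every iteration stores the id of the process that ran it.
function record_workers(n)
    ids = SharedArray{Int}(n)
    @sync @distributed for i in 1:n
        ids[i] = myid()
    end
    return sort(unique(ids))
end

used = record_workers(1_000)
println("processes that ran iterations: ", used)
```

If more than one worker id shows up here, the loop is being distributed and the single busy core is more likely a symptom of the workers waiting on memory than of @distributed not splitting the range.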

Yes, that solves one issue.

However, it’s still crashing with a bus error on Linux. Apparently the problem is the array size: if I reduce n by 10x (from 120_000 to 12_000), it finishes without any problem. I wonder if I’m hitting some kind of system limit…

I also observed that only one process is doing all the work. That seems strange…

I have htop running, and it seems to be failing at around the 16 GB mark…

Sounds reasonable. I’ll try testing with something more computationally intensive.

For the record, the physical limit is the size of /dev/shm, which is 16 GiB. That explains why it failed at that point.
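To make the numbers concrete, here is a quick back-of-the-envelope check of how much shared memory the two array sizes from this thread need. It is plain arithmetic; array_bytes is just a helper name made up for this sketch, and the 16 GiB figure is the /dev/shm size reported above, not something the code measures.

```julia
# Bytes needed by a dense Float64 array with the given dimensions.
array_bytes(dims...) = sizeof(Float64) * prod(dims)

const GiB = 2^30

full      = array_bytes(10_000, 3, 120_000)  # original size, crashes
reduced   = array_bytes(10_000, 3, 12_000)   # 10x smaller, works
shm_limit = 16 * GiB                         # /dev/shm size reported in this thread

println("full:    ", round(full / GiB, digits=2), " GiB, fits: ", full <= shm_limit)
println("reduced: ", round(reduced / GiB, digits=2), " GiB, fits: ", reduced <= shm_limit)
```

The full array comes to roughly 26.8 GiB, well over the 16 GiB limit, while the reduced one needs under 3 GiB, which matches the crash showing up once about 16 GB had been filled.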