BusError: Are SharedArrays on virtual machines possible?

question

#1

I have code that allocates SharedArrays on a machine. This works well on the computer in front of me, i.e. the physical machine in my office, but I am having trouble running the same code on a virtual machine inside an OpenNebula cluster. I know that I have access to 16GB of RAM on that VM, which is the same as my computer has, so I don’t think I’m running out of memory. Is there anything that prevents me from using SharedArrays on a VM? Could it be that the memory of that VM is not inside a single physical machine?

Here is the error I get:

julia> sm = bk.runSim();

signal (7): Bus error

signal (7): Bus error

signal (7): Bus error

signal (7): Bus error
while loading no file, in expression starting on line 0
while loading no file, in expression starting on line 0
while loading no file, in expression starting on line 0
while loading no file, in expression starting on line 0
macro expansion at ./cartesian.jl:62 [inlined]
macro expansion at ./cartesian.jl:62 [inlined]
macro expansion at ./cartesian.jl:62 [inlined]
macro expansion at ./cartesian.jl:62 [inlined]
macro expansion at ./multidimensional.jl:431 [inlined]
macro expansion at ./multidimensional.jl:431 [inlined]
_unsafe_batchsetindex! at ./multidimensional.jl:423
macro expansion at ./multidimensional.jl:431 [inlined]
macro expansion at ./multidimensional.jl:431 [inlined]
_unsafe_batchsetindex! at ./multidimensional.jl:423
_unsafe_batchsetindex! at ./multidimensional.jl:423
_unsafe_batchsetindex! at ./multidimensional.jl:423
_setindex! at ./multidimensional.jl:372 [inlined]
_setindex! at ./multidimensional.jl:372 [inlined]
setindex! at ./abstractarray.jl:840 [inlined]
setindex! at ./abstractarray.jl:840 [inlined]
#15 at /root/git/bk/bk.jl/src/model.jl:133
_setindex! at ./multidimensional.jl:372 [inlined]
#15 at /root/git/bk/bk.jl/src/model.jl:133
_setindex! at ./multidimensional.jl:372 [inlined]
unknown function (ip: 0x7fd9226016e4)
setindex! at ./abstractarray.jl:840 [inlined]
unknown function (ip: 0x7f6b466d1c84)
setindex! at ./abstractarray.jl:840 [inlined]
#15 at /root/git/bk/bk.jl/src/model.jl:133
#15 at /root/git/bk/bk.jl/src/model.jl:133
unknown function (ip: 0x7fc8f67906b4)
unknown function (ip: 0x7f7377514964)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
jl_f__apply at /home/centos/buildbot/slave/package_tarball64/build/src/builtins.c:547
jl_f__apply at /home/centos/buildbot/slave/package_tarball64/build/src/builtins.c:547
#649 at ./multi.jl:1428
#649 at ./multi.jl:1428
run_work_thunk at ./multi.jl:1001
run_work_thunk at ./multi.jl:1001
run_work_thunk at ./multi.jl:1010 [inlined]
run_work_thunk at ./multi.jl:1010 [inlined]
#617 at ./event.jl:68
#617 at ./event.jl:68
unknown function (ip: 0x7f6b466c3eef)
unknown function (ip: 0x7fc8f67828ff)
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
jl_f__apply at /home/centos/buildbot/slave/package_tarball64/build/src/builtins.c:547
#649 at ./multi.jl:1428
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
run_work_thunk at ./multi.jl:1001
run_work_thunk at ./multi.jl:1010 [inlined]
#617 at ./event.jl:68
unknown function (ip: 0x7fd9225f394f)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:254
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:254
jl_f__apply at /home/centos/buildbot/slave/package_tarball64/build/src/builtins.c:547
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
unknown function (ip: 0xffffffffffffffff)
unknown function (ip: 0xffffffffffffffff)
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:254
Allocations: 13295343 (Pool: 13293960; Big: 1383); GC: 21
Allocations: 13317134 (Pool: 13315764; Big: 1370); GC: 21
unknown function (ip: 0xffffffffffffffff)
Allocations: 13317228 (Pool: 13315857; Big: 1371); GC: 21
#649 at ./multi.jl:1428
run_work_thunk at ./multi.jl:1001
run_work_thunk at ./multi.jl:1010 [inlined]
#617 at ./event.jl:68
unknown function (ip: 0x7f7377506bbf)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:210 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1950
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:254
unknown function (ip: 0xffffffffffffffff)
Allocations: 13183002 (Pool: 13181624; Big: 1378); GC: 21
Worker 7 terminated.
Worker 2 terminated.ERROR (unhandled task failure): EOFError: read end of file
ERROR: ProcessExitedException()
 in yieldto(::Task, ::ANY) at ./event.jl:136
 in wait() at ./event.jl:169
 in wait(::Condition) at ./event.jl:27
 in wait(::Channel{Any}) at ./channels.jl:92
 in fetch(::Channel{Any}) at ./channels.jl:63
 in #remotecall_wait#631(::Array{Any,1}, ::Function, ::Function, ::Base.Worker, ::SharedArray{Float64,8}, ::Vararg{SharedArray{Float64,8},N}) at ./multi.jl:1091
 in remotecall_wait(::Function, ::Base.Worker, ::SharedArray{Float64,8}, ::Vararg{SharedArray{Float64,8},N}) at ./multi.jl:1086
 in #remotecall_wait#634(::Array{Any,1}, ::Function, ::Function, ::Int64, ::SharedArray{Float64,8}, ::Vararg{SharedArray{Float64,8},N}) at ./multi.jl:1105
 in remotecall_wait(::Function, ::Int64, ::SharedArray{Float64,8}, ::Vararg{SharedArray{Float64,8},N}) at ./multi.jl:1105
 in (::Base.##828#830{SharedArray{Float64,8},bk.##15#33})() at ./task.jl:360

...and 3 other exceptions.

 in sync_end() at ./task.jl:311
 in macro expansion at ./task.jl:327 [inlined]
 in initialize_shared_array(::SharedArray{Float64,8}, ::Bool, ::Function, ::Array{Int64,1}) at ./sharedarray.jl:209
 in #SharedArray#806(::Function, ::Array{Int64,1}, ::Type{T}, ::Type{Float64}, ::Tuple{Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64}) at ./sharedarray.jl:102
 in (::Core.#kw#Type)(::Array{Any,1}, ::Type{SharedArray}, ::Type{Float64}, ::Tuple{Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64}) at ./<missing>:0
 in bk.Vfun(::Tuple{Int64,Int64,Int64,Int64,Int64,Int64}, ::Int64, ::bk.Param) at /root/git/bk/bk.jl/src/model.jl:133
 in bk.Model(::bk.Param) at /root/git/bk/bk.jl/src/model.jl:517
 in runSim(::Dict{Any,Any}) at /root/git/bk/bk.jl/src/simulation.jl:322
 in runSim() at /root/git/bk/bk.jl/src/simulation.jl:314

#2

There is probably not enough information here to respond. Is this an actual virtualized system, or provisioned hardware with a real CPU? Break the problem down and debug it in small pieces rather than trying to run the whole codebase. For example, try just allocating a SharedArray of the appropriate size from Julia; then try smaller ones and find the limit.

could it be that the memory of that VM is not inside a single physical machine?

I think that’s unlikely on a normal system (memory spanning physical machines is mostly found on high-end HPC setups).
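One way to find that limit systematically (a sketch; the starting size and worker count are placeholders) is a small probe script. Note that a SIGBUS kills the whole Julia session, so a try/catch inside one process won't survive it; run each size in a fresh process instead:

```julia
# probe.jl -- sketch: allocate and fill a SharedArray of a given byte size.
# A bus error crashes the session, so invoke each size separately, e.g.:
#   julia -p 4 probe.jl 1000000000
# and halve the size on failure until it succeeds.
n = parse(Int, ARGS[1])
a = SharedArray{UInt8,1}(n)
fill!(a, 1)          # touch every page; reserving alone may not fault
println("OK at $n bytes")
```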


#3

Thanks. Yes, I found a limit in terms of addprocs now: 4 workers work, 5 don’t. So in the end it is indeed running out of memory.
I asked the sysadmin. He got back to me with two links: one about the hypervisor, and one noting that they are using VMware. Looking at this, it does sound like there are resources elsewhere that get mapped to my VM, but I am not totally sure about that.


#4

Another piece of information that might help you: a SharedArray is backed by mmap. According to http://man7.org/linux/man-pages/man2/mmap.2.html, signal 7, i.e. SIGBUS, means:


       SIGBUS Attempted access to a portion of the buffer that does not
              correspond to the file (for example, beyond the end of the
              file, including the case where another process has truncated
              the file).

You could use strace to try to see what is happening. What kernel version are you running, and what VMware version? What is the version of open-vm-tools? The only thing related to VMware and SIGBUS that I could find is: https://bugs.launchpad.net/ubuntu/+source/open-vm-tools/+bug/1579544
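On Linux, the segments behind a SharedArray typically live on the tmpfs mounted at /dev/shm, so checking its size is a quick sanity check before reaching for strace. The strace invocation below is a sketch: it assumes strace is installed and the script name is a placeholder:

```shell
# How big is the shared-memory tmpfs? A SIGBUS on write often means the
# backing file could not grow to the mapped size (e.g. the tmpfs is full).
df -h /dev/shm

# Sketch: trace the mmap-related syscalls of a failing Julia run
# (uncomment once strace is available; myscript.jl is hypothetical).
# strace -f -e trace=mmap,openat,ftruncate julia -p 4 myscript.jl 2> trace.log
```

In the trace, look for an mmap of roughly the array's size followed by the faulting write, and check whether the preceding ftruncate on the backing file succeeded.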


#5

I’ve run into this same signal (7): Bus error using Docker. With the official Julia Docker image (v0.6.1), I can get that error just by trying to fill a ~1GB SharedArray.

addprocs(8);
a = SharedArray{UInt8,1}(Int(10e8));
@parallel for i in 1:length(a)
  a[i] = UInt8(1)
end

This is on a workstation with 32GB of RAM and 8 CPUs. When I run that script outside of Docker, I don’t get the error; I can increase the size of that array to near my entire RAM’s capacity without a problem.

Ideally, I want to create 10+GB SharedArrays inside a Docker container. Can anyone give me pointers on what I should try next? I’m a real novice at digging through strace output. What should I be looking for?


#6

Hmm, for me this was clearly caused by running out of memory on that compute node. Is Docker imposing any limits on your memory consumption?


#7

Oh wow, that’s exactly it, @floswald! A Docker container defaults to 64MB of shared memory (/dev/shm). You need to adjust it with the --shm-size parameter, e.g. docker run -it --shm-size=8G julia.
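For reference, the 10e8-element UInt8 array from the example above needs roughly a gigabyte of /dev/shm, far beyond the 64 MiB default; a quick back-of-the-envelope check:

```shell
# 1e9 elements x 1 byte each, converted to MiB:
bytes=1000000000
echo "$((bytes / 1024 / 1024)) MiB needed"   # ~953 MiB, vs the 64 MiB default

# So size --shm-size to the largest SharedArray you plan to fill, e.g.:
# docker run -it --shm-size=8G julia
```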

Thanks a ton for the help!


#8

Awesome! Good stuff.