Weird error: `SystemError: memory mapping failed: Too many open files in system`

I am trying to run something in parallel. I am using ClusterManagers.jl to connect to a cluster and launch the workers, roughly along these lines (SlurmManager here is just an assumption about my scheduler):
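
using Distributed, ClusterManagers
# launch 32 workers on the allocated compute nodes (hypothetical call; the
# actual manager and options depend on the scheduler)
addprocs(SlurmManager(32))

With the workers up, suppose I have the following module loaded on all of them.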

@everywhere module ldf
    using CSV, DataFrames

    function read_df()
        validstates = ("...")   # the 50 valid state names (elided here)
        # create a return vector with one DataFrame per state
        df_of_states = Array{DataFrame, 1}(undef, 50)
        for (i, vs) in enumerate(validstates)
            fn = "..."          # CSV filename built from `vs` (elided here)
            df_of_states[i] = CSV.File(fn, header=false) |> DataFrame
        end
        return df_of_states
    end
    export read_df

    const dfs = read_df()
    export dfs
end

The line const dfs = read_df() makes it so that when the @everywhere is run on all the workers, the CSV files are read and stored in a const variable. I want to make use of these dataframes in other functions. The size of each dataframe is 20000 x 141.
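
For context, this is roughly how the per-worker copies get used; summarize_state is a hypothetical stand-in for the real work, and the point is just that each worker indexes into its own ldf.dfs:

using Distributed

# hypothetical consumer: every worker reads from its local copy in ldf.dfs
@everywhere function summarize_state(i::Int)
    df = ldf.dfs[i]
    return size(df, 1)    # placeholder computation
end

results = pmap(summarize_state, 1:50)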

So now suppose that on node001, where 32 workers are launched, each worker has a separate copy of ldf, each creating a variable called dfs and reading those files… I get the following error:

 On worker 2:
│    SystemError: memory mapping failed: Too many open files in system
│    #systemerror#51 at ./error.jl:168
│    #systemerror#50 at ./error.jl:167 [inlined]
│    systemerror at ./error.jl:167 [inlined]
│    #mmap#1 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Mmap/src/Mmap.jl:209
│    #mmap#14 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Mmap/src/Mmap.jl:251 [inlined]
│    mmap at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Mmap/src/Mmap.jl:251 [inlined]
│    parsetape at /home/affans/.julia/packages/CSV/vyG0T/src/file.jl:474
│    #File#28 at /home/affans/.julia/packages/CSV/vyG0T/src/file.jl:252
│    read_df at /home/affans/lancetid_actualinfection_post.jl:33
│    top-level scope at /home/affans/lancetid_actualinfection_post.jl:39
│    eval at ./boot.jl:331

It’s a fairly obvious error… I am opening too many files, but I was wondering what the root cause of this is. It’s not an out-of-memory error (each worker has almost 4 GB of RAM to work with, out of 128 GB total).

The solution for me here is to move read_df() back to the head node and just pass the relevant data to the workers… but I just wanted to get some insight on this.
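
A minimal sketch of that workaround (process_state is a hypothetical placeholder for whatever the workers actually compute, defined with @everywhere as before):

using Distributed, CSV, DataFrames

dfs = read_df()    # read all 50 files once, on the head node only

# each pmap task ships exactly one dataframe to a worker, instead of every
# worker opening all of the files itself
results = pmap(process_state, dfs)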


I’m assuming you are on Linux. You might want to check /proc/sys/fs/file-max. That should be the maximum number of file descriptors that can be open (in the system) at once. (On my system that is 9223372036854775807, but maybe it’s smaller on yours.)

You should also check /proc/sys/fs/file-nr; the first number is the current number of file descriptors in use.
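
If it’s easier, the same numbers can be read from within Julia (Linux only; /proc/sys/fs/file-nr holds three fields: allocated, unused, maximum):

used  = first(split(read("/proc/sys/fs/file-nr", String)))
limit = strip(read("/proc/sys/fs/file-max", String))
println("file handles in use: $used / $limit")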


Can you log into one of your compute nodes, or submit a simple batch job and run

ulimit -a

I just did this on my Fedora laptop:
open files (-n) 1024
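
To see the limit each Julia worker actually inherits, something like this from the driver session should print it on every process (untested sketch):

using Distributed
@everywhere println("worker ", myid(), ": open files limit = ",
                    readchomp(`sh -c "ulimit -n"`))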
