How to debug a killed Julia session?

I have a script I’m trying to run that keeps getting killed after running for a bit and generating some output. I have a list of identifiers for groups of files, and the program does the following (a rough sketch in code follows the list):

  1. Read the list of identifiers
  2. Loop over each identifier:
    a. Load each (gzipped) file that matches the identifier, storing the records in memory
    b. Take a random subset of the records and write it to a new (gzipped) file
    c. Repeat (b) a few times with different subset sizes
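
In code, the structure is roughly the sketch below. This isn’t the actual script; the names samples, fastqs, rawfastq_path, and outdir, plus the subset sizes, are placeholders:

    using FASTX, CodecZlib, Random

    for sample in samples                       # (2) loop over identifiers
        @info "Subsampling for $sample"
        # (a) load every gzipped FASTQ matching this identifier into memory
        records = FASTQ.Record[]
        for fastq in filter(f -> occursin(sample, f), fastqs)
            reader = FASTQ.Reader(GzipDecompressorStream(open(joinpath(rawfastq_path, fastq))))
            append!(records, reader)
            close(reader)
        end
        # (b)/(c) write random subsets of a few different sizes
        for n in (1_000, 10_000, 100_000)
            k = min(n, length(records))
            subset = records[randperm(length(records))[1:k]]
            writer = FASTQ.Writer(GzipCompressorStream(open(joinpath(outdir, "$(sample)_$(n).fastq.gz"), "w")))
            foreach(rec -> write(writer, rec), subset)
            close(writer)                       # flush and close the output
        end
        @info "Done!" sample
    end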

When I run this, either as a script or from the REPL, it completes for some number of identifiers, and then

┌ Info: Done!
└   sample = "G78597"
[ Info: Subsampling for G78598
[1]    3712669 killed     julia --project --threads 16
[1]    3712669 killed     julia --project --threads 16

I had the loop in step 2 running with Threads.@threads, but removing that and/or running with a single thread doesn’t seem to make a difference. I can share the code if that would be helpful, but I’m more curious about what I can do to debug this. The fact that the loop completes for at least some of the files means that my usual sources of bugs aren’t the problem, and the lack of an error message has me stuck.

Also, it’s not a problem with the particular files being read in (as far as I can tell) - e.g. if I re-run the script including only “G78597” above, it runs fine.

What system are you on? Linux? Smells like an oomkill. You can check dmesg.

Yeah

[12101132.369544] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/ssh.service,task=julia,pid=3712669,uid=1000
[12101132.369568] Out of memory: Killed process 3712669 (julia) total-vm:134998224kB, anon-rss:125802208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:247740kB oom_score_adj:0

Yup - that was easy :laughing:

So now the question is why it’s not deterministic - it should be. But this is at least a place to start, thanks!

So, watching htop, the memory use creeps up and up but never seems to get cleared. I’d expect GC to remove the previously stored sets of records; is there something that would be preventing this?

I have

        records = FASTQ.Record[]
        for fastq in filter(f-> occursin(sample, f), fastqs)
            file = joinpath(rawfastq_path, fastq)
            append!(records, FASTQ.Reader(GzipDecompressorStream(open(file))))
        end

inside the loop, but that records vector should be unique to each thread, and GC’able after that iteration of the loop is done, right?

Alright, some of my previous statements must be false - I’m now running on just 6 threads, and I’m seeing memory almost max out, then drop back down, so GC must be happening after all. :man_shrugging:
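
One way to keep an eye on this from inside Julia rather than htop is to log memory stats once per loop iteration - a rough sketch, with the fields and placement here picked arbitrarily:

    for sample in samples
        # bytes the GC currently considers live, and free system memory, in GiB
        @info "Memory check" sample gc_live_GiB = Base.gc_live_bytes() / 2^30 sys_free_GiB = Sys.free_memory() / 2^30
        # ... subsampling work for this sample ...
    end
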

Sadly, OOM isn’t deterministic in Linux. I mean in all systems amount of memory is fixed, but depends on what else is running how much you can use before you run out. In Linux it’s worse than that non-determinism, because some random process can get killed, not the last “one” to fill up the total memory. It’s not totally random, since Linux tried to make some educated guess on what to kill, but it feels like it. I’m not sure what’s the best solution is; here is one: