How to debug a killed Julia session?

I have a script I’m trying to run that keeps getting killed after running for a bit and generating some output. I have a list of identifiers for groups of files, and the program does the following (a rough sketch in code follows the list):

  1. Read the list of identifiers
  2. Loop over each identifier:
    a. Load each (gzipped) file that matches the identifier, storing the records in memory
    b. Take a random subset of the records and write it to a new (gzipped) file
    c. Repeat (b) a few times with different subset sizes
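
In code, the structure is roughly the sketch below. This isn’t the actual script; the names samples, fastqs, rawfastq_path, and outdir, plus the subset sizes, are placeholders:

    using FASTX, CodecZlib, Random

    for sample in samples                       # (2) loop over identifiers
        @info "Subsampling for $sample"
        # (a) load every gzipped FASTQ matching this identifier into memory
        records = FASTQ.Record[]
        for fastq in filter(f -> occursin(sample, f), fastqs)
            reader = FASTQ.Reader(GzipDecompressorStream(open(joinpath(rawfastq_path, fastq))))
            append!(records, reader)
            close(reader)
        end
        # (b)/(c) write random subsets of a few different sizes
        for n in (1_000, 10_000, 100_000)
            k = min(n, length(records))
            subset = records[randperm(length(records))[1:k]]
            writer = FASTQ.Writer(GzipCompressorStream(open(joinpath(outdir, "$(sample)_$(n).fastq.gz"), "w")))
            foreach(rec -> write(writer, rec), subset)
            close(writer)                       # flush and close the output
        end
        @info "Done!" sample
    end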

When I run this, either as a script or from the REPL, it completes for some number of identifiers, and then

┌ Info: Done!
└   sample = "G78597"
[ Info: Subsampling for G78598
[1]    3712669 killed     julia --project --threads 16
[1]    3712669 killed     julia --project --threads 16

I had the loop in step 2 running with Threads.@threads, but removing that and/or running with a single thread doesn’t seem to make a difference. I can share the code if that would be helpful, but I’m more curious about what I can do to debug this. The fact that the loop completes for at least some of the files means that my usual sources of bugs aren’t the problem, and the lack of an error message has me stuck.

Also, it’s not a problem with the particular files being read in (as far as I can tell) - e.g. if I re-run the script including only “G78597” above, it runs fine.

What system are you on? Linux? Smells like an oomkill. You can check dmesg.

Yeah

[12101132.369544] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/ssh.service,task=julia,pid=3712669,uid=1000
[12101132.369568] Out of memory: Killed process 3712669 (julia) total-vm:134998224kB, anon-rss:125802208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:247740kB oom_score_adj:0

Yup - that was easy :laughing:

So now the question is why it’s not deterministic - it should be. But this is at least a place to start, thanks!

So, watching htop, the memory use creeps up and up but never seems to get cleared. I’d expect GC to remove the previously stored sets of records; is there something that would be preventing this?

I have

        records = FASTQ.Record[]
        for fastq in filter(f-> occursin(sample, f), fastqs)
            file = joinpath(rawfastq_path, fastq)
            append!(records, FASTQ.Reader(GzipDecompressorStream(open(file))))
        end

inside the loop, but that records vector should be unique to each thread, and GC’able after that iteration of the loop is done, right?

Alright, some of my previous statements must be false - I’m now running on just 6 threads, and I’m seeing memory almost max out, then drop back down, so GC must be happening after all. :man_shrugging:
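
One way to keep an eye on this from inside Julia rather than htop is to log memory stats once per loop iteration - a rough sketch, with the fields and placement here picked arbitrarily:

    for sample in samples
        # bytes the GC currently considers live, and free system memory, in GiB
        @info "Memory check" sample gc_live_GiB = Base.gc_live_bytes() / 2^30 sys_free_GiB = Sys.free_memory() / 2^30
        # ... subsampling work for this sample ...
    end
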

Sadly, OOM isn’t deterministic in Linux. I mean in all systems amount of memory is fixed, but depends on what else is running how much you can use before you run out. In Linux it’s worse than that non-determinism, because some random process can get killed, not the last “one” to fill up the total memory. It’s not totally random, since Linux tried to make some educated guess on what to kill, but it feels like it. I’m not sure what’s the best solution is; here is one: