I have a script that keeps getting killed after running for a while and producing some output. I have a list of identifiers for groups of files, and the program does the following (a rough code sketch follows the list):
1. read the list of identifiers
2. loop, for each identifier:
   a. load each (gzipped) file that matches the identifier, store the records in memory
   b. take a random subset of the records, write it to a new (gzipped) file
   c. repeat (b) a few times with different subset sizes
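For concreteness, here is roughly the shape of the loop. This is a simplified sketch, not my actual code; `samples`, `subset_sizes`, and `write_subsample` are placeholder names for the identifier list, the subset sizes, and the gzip-writing step:

```julia
using FASTX, CodecZlib, Random

# (1) samples is the list of identifiers, read elsewhere
for sample in samples                        # (2) loop over identifiers
    # (2a) load every gzipped FASTQ matching the identifier into memory
    records = FASTQ.Record[]
    for fastq in filter(f -> occursin(sample, f), fastqs)
        reader = FASTQ.Reader(GzipDecompressorStream(open(joinpath(rawfastq_path, fastq))))
        append!(records, reader)
    end
    # (2b)/(2c) take random subsets of a few sizes, write each to a new gzipped file
    for n in subset_sizes
        subset = shuffle(records)[1:n]
        write_subsample(sample, n, subset)   # placeholder for the gzip-writing step
    end
end
```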
When I run this, either as a script or from the REPL, it completes for some number of identifiers, and then:

```
┌ Info: Done!
└ sample = "G78597"
[ Info: Subsampling for G78598
[1] 3712669 killed julia --project --threads 16
[1] 3712669 killed julia --project --threads 16
```
I had the (2) loop running with `Threads.@threads`, but removing that and/or running with a single thread doesn't seem to make a difference. I can share the code if that would be helpful, but I'm more curious about what I can do to debug this. The fact that the loop completes for at least some of the files means that my usual sources of bugs aren't the problem, and the lack of any error message has me stuck.
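For reference, the toggle I mean is just this, with `subsample_one` standing in for the per-identifier work sketched above:

```julia
# what I had originally
Threads.@threads for sample in samples
    subsample_one(sample)
end

# removing the macro (and/or running with -t 1) still eventually gets killed
for sample in samples
    subsample_one(sample)
end
```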
Also, it doesn't seem to be a problem with the particular files being read in (as far as I can tell) - e.g., if I re-run the script including only "G78597" above, it runs fine.
So, watching htop, memory use creeps up and up but never seems to get cleared. I'd expect GC to remove the previously stored sets of records; is there something that would be preventing this?
I have

```julia
# FASTQ.Reader comes from FASTX.jl; GzipDecompressorStream comes from CodecZlib.jl
records = FASTQ.Record[]
for fastq in filter(f -> occursin(sample, f), fastqs)
    file = joinpath(rawfastq_path, fastq)
    append!(records, FASTQ.Reader(GzipDecompressorStream(open(file))))
end
```

inside the loop, but that `records` vector should be unique to each thread, and GC-able after that iteration of the loop is done, right?
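In case the never-closed file handles matter, a variant I could try closes each decompressor stream explicitly and keeps `records` local to a function, so nothing should outlive one iteration. A sketch (not tested):

```julia
function load_records(sample, fastqs, rawfastq_path)
    records = FASTQ.Record[]
    for fastq in filter(f -> occursin(sample, f), fastqs)
        io = GzipDecompressorStream(open(joinpath(rawfastq_path, fastq)))
        try
            append!(records, FASTQ.Reader(io))
        finally
            close(io)  # release the file handle and decompression buffers immediately
        end
    end
    return records
end
```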
Alright, at least some of my previous statements must be false - I'm now running on just 6 threads, and I'm seeing memory almost max out and then drop back down, so GC must be happening after all.
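To get actual numbers rather than eyeballing htop, my plan is to log memory after each identifier, something like the sketch below (as I understand it, `Sys.maxrss()` reports the process's peak resident set size and `Base.gc_live_bytes()` the GC's view of live memory; the explicit `GC.gc()` is only there to test whether forcing a collection changes anything):

```julia
for sample in samples
    subsample_one(sample)   # placeholder for the per-identifier work
    GC.gc()                 # force a full collection, just to see whether it helps
    @info "memory after $sample" maxrss_GB = Sys.maxrss() / 2^30 live_GB = Base.gc_live_bytes() / 2^30
end
```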
Sadly, OOM killing isn't deterministic on Linux. On any system the amount of memory is fixed, but how much you can use before running out depends on what else is running. On Linux it's worse than that kind of non-determinism, because a seemingly random process can get killed - not necessarily the last one to push memory over the limit. It's not totally random, since Linux tries to make an educated guess about what to kill, but it can feel that way. I'm not sure what the best solution is; here is one: