Memory issues using command line argument

Atasattari · June 20, 2023, 6:03am

Hi,
I’m trying to run a script on a cluster for different input parameters. Instead of submitting scripts with hard-coded values, I’m trying to use command-line arguments. My code works fine when the values are hard-coded, but when I use command line arguments the code works for a while and runs out of memory. Any suggestion on what the issue could be?

Hard-coded script:

include("/mcmc_fits/mcmc_bat_julia.jl")

global i=1
while i<=500
    global i 
    nominal = (offset1=99., lambda =1.4066, resolution=5.)
    try
        energy_range = (low=60. ,high=500., )
        posterior, arr = prepare_posterior(nominal,energy_range,5000)
        result = bat_sample(posterior,MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^4, nchains = 8))
        file_address = "run_5k_1resolution_"*string(i)*".h5"
        bat_write(file_address, result.result)
	h5open("test.h5", "cw") do file
        	write(file, "run_"*string(i), arr)
        end
        i+=1
    catch e
	println(e)
    end
end

The one that uses ARGS to get command line inputs:

include("/mcmc_fits/mcmc_bat_julia.jl")
function make_data(str_N_events, N_data_sets)
	N_events = parse(Int,str_N_events)
	i=1
	while i<=N_data_sets
    		nominal = (offset1=99., lambda =1.4066, resolution=5.)
    		try
        		energy_range = (low=60. ,high=500., )
 			event_count = rand(Distributions.Poisson(N_events::Integer))
        		posterior, arr = prepare_posterior(nominal, energy_range, event_count)
			result = bat_sample(posterior,MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^4, nchains = 8))
			file_address = "run_"*string(N_events::Integer)*"_1resolution_"*string(i)*".h5"
        		bat_write(file_address, result.result)
        		h5open("test.h5", "cw") do file
        		        write(file, "run_"*string(i), arr)
			end
        
			i+=1
    		catch e
			println(e)
    		end
		thr=0.1
		rand(Distributions.Uniform(0,1)) < thr && GC.gc();
	end
end

make_data(ARGS[1],500)

Submission scripts:

#!/bin/bash
#SBATCH --time=11:59:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1024MB
export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load StdEnv/2020 julia/1.8.5

julia mcmc_set_count_1resolution_constrained.jl 5000

Sukera · June 20, 2023, 7:21am

Hello!

Consider using functions, as well as avoiding global variables. For more information, see the Performance Tips section in the manual:

https://docs.julialang.org/en/v1/manual/performance-tips/

Sevi · June 20, 2023, 7:22am

Welcome to the forum @Atasattari

My first guess would be that there are some repeated allocations which cause the required memory to grow. This is not a problem per se (and it’s impossible to tell without knowing the code that the functions from your include call internally), but I encountered some issues with Slurm and memory management before, similar to what is described in this thread:

I guess this is also why you have this line?

rand(Distributions.Uniform(0,1)) < thr && GC.gc();

Since you report the script is running fine in the first version, here are a few things that might help to limit the memory consumption (although I doubt that the impact will be that large):

Your function make_data is compiled only for a string, but you could just do the parse(Int, ...) when you call the function, such that it is already an Int when passed.
Remove the N_events::Integer annotations. Not sure what they are intended to do, but they are not necessary – in the worst case, they will convert some variables to a different type in every iteration of the loop.
Avoid try catch if possible?

Atasattari · June 20, 2023, 3:06pm

I saw the Garbage collection thread before, but the suggestion didn’t fix my issue.
The hard-coded version has the try catch too. I’m running an MC process, and there is a small chance for the argument in the loop to fail. I don’t want that to break the code. I tried the other suggestions, but the job still dies after O(60) iterations of the loop.

-Error:

[ Info: Initializing new RNG of type Random123.Philox4x{UInt64, 10}
[ Info: MCMCChainPoolInit: trying to generate 8 viable MCMC chain(s).
[ Info: Selected 8 MCMC chain(s).
[ Info: Begin tuning of 8 MCMC chain(s).
/var/spool/slurmd/job5623421/slurm_script: line 16:  5193 Killed                  julia /mcmc_fits/mcmc_set_count_1resolution_constrained.jl 5000
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5623421.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Adjusted code:

include("/mcmc_bat_julia.jl")
function run_pseudo_experiments(N_events, N_data_sets)
        i=1
        while i<=N_data_sets
                nominal = (offset1=99., lambda =1.4066, resolution=5.)
                try
                        energy_range = (low=60. ,high=500., )
                        event_count = rand(Distributions.Poisson(N_events))
                        posterior, arr = prepare_posterior(nominal, energy_range, event_count)

                        result = bat_sample(posterior,MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^4, nchains = 8))
                        file_address = "run_"*string(N_events)*"_1resolution_"*string(i)*".h5"
                        bat_write("/"*file_address, result.result)
                        h5open("samples_run_"*string(N_events)*"_1resolution_constrained.h5", "cw") do file
                                write(file, "run_"*string(i), arr)
                        end

                        i+=1
                catch e
                        println(e)
                end
                thr=0.1
                rand(Distributions.Uniform(0,1)) < thr && GC.gc();
        end
end

run_pseudo_experiments(parse(Int,ARGS[1]),500)

The code imported is pretty simple. I did some performance optimization checking the allocations in an interactive session and everything looked fine. However, I’m not a Julia expert. Is there something obvious that I missed?
I put the code below. Thanks for the help.

using Random, LinearAlgebra, Statistics, Distributions, StatsBase
using BAT, IntervalSets, ValueShapes, TypedTables
import SpecialFunctions
using  QuadGK
using HDF5

function my_erf(x,coeff,base)
    return coeff / 2 * (1 + SpecialFunctions.erf(x)) + base
end

function my_step(x,coeff,base)
    return coeff/2 * (1 + sign(x)) + base
end

function f(offset1,offset2,resolution, k,base , x)
    step1_coeff = 6+k
    step2_coeff = 2-k
    if resolution> 0.0000001
        step1_x = ((x - offset1)/(sqrt(2) * resolution))
        step1 = my_erf(step1_x,step1_coeff,base)

        step2_x = (x - offset2)/(sqrt(2)*resolution)
        step2 = my_erf(step2_x,step2_coeff,0.)
    else
        step1_x = (x - offset1)
        step1 = my_step(step1_x,step1_coeff,base)

        step2_x = (x - offset2)
        step2 = my_step(step2_x,step2_coeff,0)
    end
    return step1+step2
end

function vectorized_f(p::NamedTuple{(:offset1, :lambda, :resolution)}, x)
    n  = length(x)
    a = Array{Float64}(undef, length(x))
    offset2 = p.offset1*p.lambda
    @simd for i in 1:n
        @inbounds a[i] = f(p.offset1,offset2,p.resolution, 0., 5.6, x[i])
    end
    return a
end

function get_integral(p::NamedTuple{(:offset1, :lambda, :resolution)},low, high)
    offset2 = p.offset1*p.lambda
    total,error = quadgk(x -> f(p.offset1, offset2, p.resolution, 0., 5.6, x),low,high,rtol=1e-8)
    return total
end

function safe_log(x)
    return log.(x .+ 1e-323)
end

function sampler(p::NamedTuple{(:offset1, :lambda, :resolution)}, n, low, high)
    offset2 = p.offset1*p.lambda
    max = f(p.offset1,offset2,p.resolution,0,5.6, high)
    #undef = just get the memory, don't fill anything.
    arr = Array{Float64}(undef, n)
    i = 1
    while i <= n
        y = rand() * max
        x = rand() * (high-low) + low
        if y <= f(p.offset1,offset2,p.resolution,0,5.6, x)
            arr[i] = x
            i+=1
        end
    end
    return arr
end

function prepare_posterior(true_par_values::NamedTuple{(:offset1, :lambda, :resolution)},energy_range,expected_n)
    arr = sampler(true_par_values,expected_n,energy_range[:low],energy_range[:high])
    n_lower = expected_n - 5*sqrt(expected_n)
    n_upper = expected_n + 5*sqrt(expected_n)
    function likelihood(params)
        integral_value = get_integral(params, energy_range[:low]::Float64,energy_range[:high]::Float64)
        f_values = vectorized_f(params, arr::Vector{Float64})
        function_contribution = reduce(+,safe_log(f_values ./integral_value))
        return LogDVal(function_contribution)
    end
    prior = NamedTupleDist(
                offset1 = Uniform(50, 150),
                lambda = Normal(1.4066, 0.0022),
                resolution = Uniform(0.1,50))
    return PosteriorDensity(likelihood, prior),arr
end

Sevi · June 20, 2023, 6:29pm

Yeah, that’s what I thought… But did you also try with --heap-size-hint as mentioned in the thread? (setting it to something lower than the hard memory limit from Slurm, otherwise it doesn’t seem to help, in my experience)

Atasattari:

function vectorized_f(p::NamedTuple{(:offset1, :lambda, :resolution)}, x)
    n  = length(x)
    a = Array{Float64}(undef, length(x))
    offset2 = p.offset1*p.lambda
    @simd for i in 1:n
        @inbounds a[i] = f(p.offset1,offset2,p.resolution, 0., 5.6, x[i])
    end
    return a
end

The likelihood function creates a new array every time it is called, which I assume is very often if you use some kind of Monte Carlo scheme? Does the code still work if you create a non-allocating version of that function? (I am not very familiar with the internals ot BAT.jl.)

E.g. vectorized_f!(result, ...) where you create the result array once and then just pass it to the function.

Topic		Replies	Views
Memory issues Performance	3	484	April 15, 2020
Julia Execution get out of memory error General Usage	3	4705	August 5, 2017
Accumulating memory in nested loops General Usage performance	10	2229	April 17, 2017
Unexpected OutOfMemory error on HPC General Usage hpc , memory , clustering	17	1731	April 8, 2020
Memory Error Performance	9	1230	October 24, 2019

Memory issues using command line argument

Related topics