Memory issues using command line argument

Hi,
I’m trying to run a script on a cluster for different input parameters. Instead of submitting scripts with hard-coded values, I’m trying to use command-line arguments. My code works fine when the values are hard-coded, but when I use command line arguments the code works for a while and runs out of memory. Any suggestion on what the issue could be?

Hard-coded script:

include("/mcmc_fits/mcmc_bat_julia.jl")

global i=1
while i<=500
    global i 
    nominal = (offset1=99., lambda =1.4066, resolution=5.)
    try
        energy_range = (low=60. ,high=500., )
        posterior, arr = prepare_posterior(nominal,energy_range,5000)
        result = bat_sample(posterior,MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^4, nchains = 8))
        file_address = "run_5k_1resolution_"*string(i)*".h5"
        bat_write(file_address, result.result)
	h5open("test.h5", "cw") do file
        	write(file, "run_"*string(i), arr)
        end
        i+=1
    catch e
	println(e)
    end
end

The one that uses ARGS to get command line inputs:

include("/mcmc_fits/mcmc_bat_julia.jl")
function make_data(str_N_events, N_data_sets)
	N_events = parse(Int,str_N_events)
	i=1
	while i<=N_data_sets
    		nominal = (offset1=99., lambda =1.4066, resolution=5.)
    		try
        		energy_range = (low=60. ,high=500., )
 			event_count = rand(Distributions.Poisson(N_events::Integer))
        		posterior, arr = prepare_posterior(nominal, energy_range, event_count)
			result = bat_sample(posterior,MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^4, nchains = 8))
			file_address = "run_"*string(N_events::Integer)*"_1resolution_"*string(i)*".h5"
        		bat_write(file_address, result.result)
        		h5open("test.h5", "cw") do file
        		        write(file, "run_"*string(i), arr)
			end
        
			i+=1
    		catch e
			println(e)
    		end
		thr=0.1
		rand(Distributions.Uniform(0,1)) < thr && GC.gc();
	end
end

make_data(ARGS[1],500)

Submission scripts:

#!/bin/bash
#SBATCH --time=11:59:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1024MB
export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load StdEnv/2020 julia/1.8.5

julia mcmc_set_count_1resolution_constrained.jl 5000

Hello!

Consider using functions, as well as avoiding global variables. For more information, see the Performance Tips section in the manual:

https://docs.julialang.org/en/v1/manual/performance-tips/

Welcome to the forum @Atasattari :slight_smile:

My first guess would be that there are some repeated allocations which cause the required memory to grow. This is not a problem per se (and it’s impossible to tell without knowing the code that the functions from your include call internally), but I encountered some issues with Slurm and memory management before, similar to what is described in this thread:

I guess this is also why you have this line?

rand(Distributions.Uniform(0,1)) < thr && GC.gc();

Since you report the script is running fine in the first version, here are a few things that might help to limit the memory consumption (although I doubt that the impact will be that large):

  • Your function make_data is compiled only for a string, but you could just do the parse(Int, ...) when you call the function, such that it is already an Int when passed.
  • Remove the N_events::Integer annotations. Not sure what they are intended to do, but they are not necessary – in the worst case, they will convert some variables to a different type in every iteration of the loop.
  • Avoid try catch if possible?
  • I saw the Garbage collection thread before, but the suggestion didn’t fix my issue.

  • The hard-coded version has the try catch too. I’m running an MC process, and there is a small chance for the argument in the loop to fail. I don’t want that to break the code. I tried the other suggestions, but the job still dies after O(60) iterations of the loop.

-Error:

[ Info: Initializing new RNG of type Random123.Philox4x{UInt64, 10}
[ Info: MCMCChainPoolInit: trying to generate 8 viable MCMC chain(s).
[ Info: Selected 8 MCMC chain(s).
[ Info: Begin tuning of 8 MCMC chain(s).
/var/spool/slurmd/job5623421/slurm_script: line 16:  5193 Killed                  julia /mcmc_fits/mcmc_set_count_1resolution_constrained.jl 5000
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=5623421.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Adjusted code:

include("/mcmc_bat_julia.jl")
function run_pseudo_experiments(N_events, N_data_sets)
        i=1
        while i<=N_data_sets
                nominal = (offset1=99., lambda =1.4066, resolution=5.)
                try
                        energy_range = (low=60. ,high=500., )
                        event_count = rand(Distributions.Poisson(N_events))
                        posterior, arr = prepare_posterior(nominal, energy_range, event_count)

                        result = bat_sample(posterior,MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^4, nchains = 8))
                        file_address = "run_"*string(N_events)*"_1resolution_"*string(i)*".h5"
                        bat_write("/"*file_address, result.result)
                        h5open("samples_run_"*string(N_events)*"_1resolution_constrained.h5", "cw") do file
                                write(file, "run_"*string(i), arr)
                        end

                        i+=1
                catch e
                        println(e)
                end
                thr=0.1
                rand(Distributions.Uniform(0,1)) < thr && GC.gc();
        end
end

run_pseudo_experiments(parse(Int,ARGS[1]),500)
  • The code imported is pretty simple. I did some performance optimization checking the allocations in an interactive session and everything looked fine. However, I’m not a Julia expert. Is there something obvious that I missed?
    I put the code below. Thanks for the help.
using Random, LinearAlgebra, Statistics, Distributions, StatsBase
using BAT, IntervalSets, ValueShapes, TypedTables
import SpecialFunctions
using  QuadGK
using HDF5

function my_erf(x,coeff,base)
    return coeff / 2 * (1 + SpecialFunctions.erf(x)) + base
end

function my_step(x,coeff,base)
    return coeff/2 * (1 + sign(x)) + base
end

function f(offset1,offset2,resolution, k,base , x)
    step1_coeff = 6+k
    step2_coeff = 2-k
    if resolution> 0.0000001
        step1_x = ((x - offset1)/(sqrt(2) * resolution))
        step1 = my_erf(step1_x,step1_coeff,base)

        step2_x = (x - offset2)/(sqrt(2)*resolution)
        step2 = my_erf(step2_x,step2_coeff,0.)
    else
        step1_x = (x - offset1)
        step1 = my_step(step1_x,step1_coeff,base)

        step2_x = (x - offset2)
        step2 = my_step(step2_x,step2_coeff,0)
    end
    return step1+step2
end

function vectorized_f(p::NamedTuple{(:offset1, :lambda, :resolution)}, x)
    n  = length(x)
    a = Array{Float64}(undef, length(x))
    offset2 = p.offset1*p.lambda
    @simd for i in 1:n
        @inbounds a[i] = f(p.offset1,offset2,p.resolution, 0., 5.6, x[i])
    end
    return a
end

function get_integral(p::NamedTuple{(:offset1, :lambda, :resolution)},low, high)
    offset2 = p.offset1*p.lambda
    total,error = quadgk(x -> f(p.offset1, offset2, p.resolution, 0., 5.6, x),low,high,rtol=1e-8)
    return total
end

function safe_log(x)
    return log.(x .+ 1e-323)
end

function sampler(p::NamedTuple{(:offset1, :lambda, :resolution)}, n, low, high)
    offset2 = p.offset1*p.lambda
    max = f(p.offset1,offset2,p.resolution,0,5.6, high)
    #undef = just get the memory, don't fill anything.
    arr = Array{Float64}(undef, n)
    i = 1
    while i <= n
        y = rand() * max
        x = rand() * (high-low) + low
        if y <= f(p.offset1,offset2,p.resolution,0,5.6, x)
            arr[i] = x
            i+=1
        end
    end
    return arr
end

function prepare_posterior(true_par_values::NamedTuple{(:offset1, :lambda, :resolution)},energy_range,expected_n)
    arr = sampler(true_par_values,expected_n,energy_range[:low],energy_range[:high])
    n_lower = expected_n - 5*sqrt(expected_n)
    n_upper = expected_n + 5*sqrt(expected_n)
    function likelihood(params)
        integral_value = get_integral(params, energy_range[:low]::Float64,energy_range[:high]::Float64)
        f_values = vectorized_f(params, arr::Vector{Float64})
        function_contribution = reduce(+,safe_log(f_values ./integral_value))
        return LogDVal(function_contribution)
    end
    prior = NamedTupleDist(
                offset1 = Uniform(50, 150),
                lambda = Normal(1.4066, 0.0022),
                resolution = Uniform(0.1,50))
    return PosteriorDensity(likelihood, prior),arr
end

Yeah, that’s what I thought… But did you also try with --heap-size-hint as mentioned in the thread? (setting it to something lower than the hard memory limit from Slurm, otherwise it doesn’t seem to help, in my experience)

The likelihood function creates a new array every time it is called, which I assume is very often if you use some kind of Monte Carlo scheme? Does the code still work if you create a non-allocating version of that function? (I am not very familiar with the internals ot BAT.jl.)

E.g. vectorized_f!(result, ...) where you create the result array once and then just pass it to the function.