HPC Slurm Logging Peer Check

Hello,

I am looking for help with what is probably a familiar problem to anyone who runs Julia regularly on an HPC cluster with Slurm. I'll explain the issue and then share what I've found; hopefully someone can help me interpret it.

I have a long-running script instrumented with @info calls so I can see things like a summary of the run echoed at the start and then regular progress updates (say, once every 1000 iterations). I'd like this output written to a file during the run.

My Slurm script contains the following directives:

#SBATCH --job-name=my_run
#SBATCH --output=my_run.o-%j
#SBATCH --error=my_run.e-%j

I understand that the @info messages go to stderr, and indeed when the run finally finishes I see the output in the my_run.e- file. But this doesn't solve my problem, as I'd like to see the output during the run to check how it is going.

From what I have read here and here, it seems this is an output-buffering issue: the buffer never fills, so nothing gets flushed until the end of the run, something like that (please correct me if I am wrong).
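
As I understand it, the situation is something like this sketch:

println(stderr, "progress update")  # sits in a buffer; nothing appears in my_run.e-%j yet
flush(stderr)                       # pushes the buffered output through to the file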

My first attempt to fix this followed the approach described here. I added the following code to my package and call it before starting the main function:

using Logging, LoggingExtras

# Debug and above go to the file; Info and above also go to the console.
logger = TeeLogger(
    MinLevelLogger(FileLogger(logfile), Logging.Debug),
    MinLevelLogger(ConsoleLogger(stderr, Logging.BelowMinLevel), Logging.Info),
)
global_logger(logger)

I understand that FileLogger is supposed to flush every time a message is written, but that doesn't appear to be the case: the log file gets created, yet nothing shows up in it until the end of the run. It also seems that ConsoleLogger does not flush after each message (I didn't see any kwargs that might enable that; did I miss one?).

I also tried what was described here and submitted my run as follows:

stdbuf -i0 -o0 -e0 sbatch myrun.slurm

But that didn’t make a difference either.

Finally, what did work was adding a flush(stderr) after every @info, and now it writes to the my_run.e- file in real time. I'm a bit disappointed, though, as adding that after every single @info is cumbersome. Is that really the best way to do this, or is there a smarter way to make every @info flush automatically?

Another alternative, I guess, would be to define my own function that does the following and replace all the @info calls with it:

function print_info(msg)
    @info "My run info: $msg"  # @info doesn't take printf-style format strings, so interpolate
    flush(stderr)
end
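
Or I suppose I could wrap whatever logger is installed in a thin wrapper that flushes after each record. Something like this sketch using the standard Logging interface (FlushLogger is just a name I made up, not anything from a package):

using Logging

# Hypothetical wrapper: delegate everything to an inner logger,
# then flush stderr after each record is handled.
struct FlushLogger{L<:AbstractLogger} <: AbstractLogger
    inner::L
end

Logging.min_enabled_level(l::FlushLogger) = Logging.min_enabled_level(l.inner)
Logging.shouldlog(l::FlushLogger, args...) = Logging.shouldlog(l.inner, args...)

function Logging.handle_message(l::FlushLogger, args...; kwargs...)
    Logging.handle_message(l.inner, args...; kwargs...)
    flush(stderr)  # force every message through immediately
end

global_logger(FlushLogger(global_logger()))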

I think I got this resolved.

Instead of ConsoleLogger I switched to FormatLogger from LoggingExtras.jl. Once I did that, I started getting real-time updates during the Slurm job run.

Also, regarding FileLogger, I must have just been impatient, because I ran it again a bit ago and started getting messages sent to the log file in real time as well. A rough sketch of the combined setup is below.
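
For anyone landing here later, the working setup is essentially my original snippet with ConsoleLogger swapped for FormatLogger. The format function here is just an example layout, not anything prescribed:

using Logging, LoggingExtras

logger = TeeLogger(
    # FileLogger flushes after every record by default (always_flush=true)
    MinLevelLogger(FileLogger(logfile), Logging.Debug),
    # FormatLogger also flushes each record; the do-block controls the layout
    MinLevelLogger(
        FormatLogger(stderr) do io, args
            println(io, "[", args.level, "] ", args.message)
        end,
        Logging.Info,
    ),
)
global_logger(logger)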


That is a good tip. Maybe it could be documented somewhere in ClusterManagers.jl.

As an aside, this is a pretty common issue in HPC. If you use PBSPro, I think the feature to look for is job 'peeking'.

I use a separate log file that I open and write to with println, and the output appears immediately (rough sketch below). It is cumbersome to use, though; maybe I should look into a logging package.
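
Roughly this pattern (the path is a placeholder; the flush guarantees the write is visible right away):

io = open("my_run.log", "a")  # keep this handle open for the whole run
println(io, "iteration 1000 done")
flush(io)  # make sure the line hits the file immediately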

@tbeason This is a function of your cluster filesystem setup. The file is being written by the first compute host allocated to you. I guess you are writing to a shared filesystem.

With PBS you can define rules for how the output files are copied back to main/shared storage. In the jargon this is called 'staging'.