I guess at the moment one way to cope with this would be to add a step to your script before the main task. Take the list of compute nodes, which is in $SLURM_JOB_NODELIST (in compact form; `scontrol show hostnames` expands it to one name per line).
Loop through these nodes and make sure each one responds.
Then only use those which are up. That is a bit of a hack.
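
Something along these lines, as a rough sketch rather than a tested recipe. It assumes that "responds" means "answers a single ping", that a Linux-style `ping` (with `-c` and `-W`) is available on the node running the script, and that `scontrol` is on the PATH. The function names `expand_nodelist`, `responds` and `live_nodes` are made up for illustration. A ping only shows the node is reachable, not that it can actually run your task, so you may want a stronger check.

```python
import os
import subprocess


def expand_nodelist(nodelist):
    """Expand a compact Slurm node list (e.g. 'node[01-03]') into host names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def responds(host, timeout_s=2):
    """Assumption: a node counts as 'up' if it answers one ping in time."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def live_nodes(hosts, is_up=responds):
    """Return only the hosts that pass the is_up check, keeping their order."""
    return [h for h in hosts if is_up(h)]


if __name__ == "__main__":
    # Outside a Slurm job the variable is unset, so do nothing in that case.
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if nodelist:
        print(",".join(live_nodes(expand_nodelist(nodelist))))
```

Run at the top of the batch script, this prints a comma-separated list of the nodes that answered, which the main task can then be restricted to. The `is_up` argument is there so the ping can be swapped for whatever check suits your site.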
You of course mention careful exception handling, so you have clearly thought about this.