Meaning no serial portion (or a very small one?). The serial portion of your program always limits parallel speedup, per Amdahl's law (Gustafson's law gets around it in a way, by growing the problem size, and that also applied to Celeste).
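For reference, with serial fraction $s$ and $N$ processors, Amdahl's law gives

$$\text{speedup}(N) = \frac{1}{s + (1 - s)/N} \le \frac{1}{s}$$

so even with infinitely many cores the speedup is capped at $1/s$.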
But you need a fast language for the parallel part too; otherwise you're throwing that much more hardware at the problem, which you do not want to do.
I believe even Python (and bash) is used on supercomputers, but I guess in a limited capacity, e.g. as a glue language, with the heavy lifting actually done by e.g. the C libraries it calls, so "using Python" is misleading, also for supercomputers. Since a good rule of thumb is that 10% of the code runs 90% of the time, only a small portion needs to be implemented in something other than Python. But the remaining Python can still be a problem if it sits in the serial portion: with e.g. a 10% serial fraction, you're limited to a 10x parallel speedup. Which isn't great, because a 10x speedup is something you could get just by switching from Python to Julia on a single-core machine.
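To make that arithmetic concrete (illustrative only):

```julia
# Amdahl's law: s = serial fraction, N = number of processors.
amdahl(s, N) = 1 / (s + (1 - s) / N)

amdahl(0.10, 64)    # ≈ 8.8: 64 cores barely help with a 10% serial part
amdahl(0.10, Inf)   # = 10.0: the hard ceiling from the formula above
```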
Having a good garbage collector (or some way around it) matters too. I recall reading something about Celeste and GC, that the GC wasn't optimal then, so improving it would have helped even more. I don't think they used GPUs with Julia back then; if you did, you'd also have to think about GC for the GPU code (GPUCompiler.jl supports compiling Julia for GPUs, e.g. for CUDA).
Julia’s GC isn’t parallel, or at least wasn’t back then. There are some recent PRs regarding the Julia GC and I haven’t kept up, but I believe parallelism is coming. Otherwise it’s stop-the-world when the GC kicks in. Note that this means the memory in your process; if you do distributed/MPI then it’s only for one core and its memory space. Celeste was before Julia’s threading work, if I recall correctly [EDIT: experimental thread support came in Julia 0.5; I’m not sure if it was actually used in Celeste.jl, though I do see “enable pre-allocated thread-safe pool” in the code], and with threads a GC pause would apply to all the threads running in your process’s address space.
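That “pre-allocated pool” comment hints at the standard workaround. It might look roughly like this (my own sketch, not Celeste’s actual code):

```julia
using Base.Threads

# Pre-allocated per-thread buffer pool: scratch space is allocated once up
# front, so the hot parallel loop allocates nothing and never triggers a
# stop-the-world GC pause.
function normalize_columns!(out::Matrix{Float64}, A::Matrix{Float64})
    n = size(A, 1)
    bufs = [Vector{Float64}(undef, n) for _ in 1:nthreads()]
    @threads :static for j in axes(A, 2)   # :static pins iterations to threads,
        buf = bufs[threadid()]             # so indexing the pool by threadid() is safe
        @inbounds for i in 1:n
            buf[i] = A[i, j]^2             # intermediate values go into the pool buffer
        end
        s = sum(buf)
        @inbounds for i in 1:n
            out[i, j] = buf[i] / s
        end
    end
    return out
end
```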
Interesting, I see there:
> we followed Julia’s documented best practices for performance programming. This included making code typestable, eliminating use of any global variables, eliding bounds checks for known-length arrays (@inbounds), and carefully tuning core HPCG math kernels for performance bottlenecks, e.g. by instrumenting the garbage collector
How is the GC instrumented?
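For context, current Julia does expose some GC introspection; a minimal sketch of how one might instrument it (the paper may well have done something more involved, e.g. at the C level):

```julia
# Allocation counters and pause counts, before vs. after a workload:
before = Base.gc_num()
x = [rand(1_000) for _ in 1:100_000]          # toy allocation-heavy workload
d = Base.GC_Diff(Base.gc_num(), before)
println("pauses: ", d.pause, "  full sweeps: ", d.full_sweep,
        "  GC time: ", d.total_time / 1e9, " s")

# Or simply let @time report the GC share of runtime:
@time [rand(1_000) for _ in 1:100_000];

# On Julia ≥ 1.8 you can also log every collection as it happens:
GC.enable_logging(true)
```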
I also see there a new language I hadn’t heard of, Regent, and some others mentioned for HPC, including Erlang.