Blog: Using Julia on the HPC

Hi all!

Recently, I gave a talk at an HPC conference on using Julia to take advantage of its low barrier to entry for high-performance parallel code at every level (threading, multiprocessing, GPU, etc.). I was asked to write a follow-up blog post on the same topic. While it does not cover everything, it may be of interest; you can view the blog post here.

There is also an accompanying GitHub repo with all the code here.

P.S. In the blog post I do a performance comparison with C++, but I am no C++ programmer, so it would be useful if someone could comment on whether the comparison is fair.


I really wish ClusterManagers.jl were more robust and self-contained, and had multiple ways of connecting worker and main nodes under different network conditions.


I am a complete rookie in Julia but know a bit more about C++. I have skimmed through the blog post and the related GitHub project and found them quite interesting. Thanks for your effort!
My only question is about optimisation levels, at least for CPU execution: did you use the defaults or the maximum levels in both cases? IMHO, the latter would make for a really fair comparison of performance capabilities between C++ and Julia.

A curiosity:

Why do you need to fill with zeros inside the function if you are already initializing to all zeros before the call? (Later you do the same when initializing the CUDA array and the DArray.)

Am I missing something?

Anyway very nice article, I enjoyed the read!

Thanks for the interest.

For the C++ compilation I just used g++ -O3 main.cpp. Is there anything that I missed?

Ah yes! That’s a good point, you haven’t missed anything. In my original talk, I moved all allocations outside the benchmarked functions, but I chose to include the allocation line in the blog post, as I thought it was easier to understand. So having the input set to zero was a relic from the first iteration. I suppose I wanted random_walk! to work with “unzeroed” memory.

Hopefully, resetting to zero doesn’t bias the results too much (especially when T=100 as in the benchmarks), but a keen observation nonetheless!
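For readers following along, here is a minimal sketch of the in-place pattern being discussed. The blog's actual random_walk! may differ; the signature and the rng keyword below are assumptions made for illustration. The fill! call is what lets the function accept a reused, "unzeroed" buffer:

```julia
using Random

# Hypothetical reconstruction for illustration, not the blog's exact code.
function random_walk!(x::AbstractVector, T::Integer; rng::AbstractRNG = Random.default_rng())
    fill!(x, zero(eltype(x)))      # reset, so a reused buffer starts a fresh walk
    for i in eachindex(x)
        for t in 1:T
            x[i] += randn(rng)     # one standard-normal step per time index
        end
    end
    return x
end

buf = Vector{Float64}(undef, 1_000)   # allocated once, contents arbitrary
random_walk!(buf, 100)                # safe despite the uninitialised memory
random_walk!(buf, 100)                # reuses the same buffer, no stale values
```

If the caller always passes a freshly zeroed array, the fill! is redundant, and its cost is one pass over x compared with T passes for the walk itself.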


Yes, it seems like it wasn’t straightforward for some of my colleagues to use it.

I have a script that handles the cluster manager setup using the environment variables set by SLURM. It takes a main script and an “include” file, which is run on each worker before the main script. It would be nice if I didn’t need this, though.
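The actual script is not shown in this thread, so the following is only a rough sketch of what such a wrapper might look like. The helper names slurm_worker_count and run_with_workers are hypothetical, and the sketch starts plain local workers with Distributed.addprocs where a real cluster setup would pass a manager from ClusterManagers.jl instead:

```julia
using Distributed

# Hypothetical helper: read the task count that SLURM exports in SLURM_NTASKS,
# falling back to `default` when the variable is missing or not an integer.
function slurm_worker_count(env = ENV; default::Int = 1)
    n = tryparse(Int, get(env, "SLURM_NTASKS", ""))
    return n === nothing ? default : n
end

# Hypothetical driver: load `include_file` on every process, then run `main_script`.
function run_with_workers(main_script, include_file; nworkers = slurm_worker_count())
    addprocs(nworkers)   # assumption: local workers; a cluster manager would go here
    # Interpolate the path so each worker receives the value, not the variable name.
    @everywhere include($include_file)
    include(main_script)
end
```

A call such as run_with_workers("main.jl", "setup.jl") (both file names are placeholders) would then mirror the "include file first, main script second" order described above.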


You are simply comparing the speed of RNG between two languages. Based on my personal experience, Julia threading is never as fast as OpenMP.

Fast in what sense? You may want Polyester.jl (JuliaSIMD/Polyester.jl on GitHub: “The cheapest threads you can find!”) if you’re looking for the low-overhead kind of “fast”.

There is no benchmark against OpenMP there. Also, if it is so good, then why not replace Threads.@threads with it?

What do YOU mean by “fast” then? What benchmark are you even looking for?

if OpenMP is so good why are you using Julia /jk


Base stuff (the Threads module) needs to strike some kind of balance. Polyester is good for low-overhead work, when you have particularly small tasks.


I no longer use Julia. I have migrated back to R/Python and C++. OpenMP is really a masterpiece. Not sure if Julia threading has caught up.


No, at least not for the C++ part, which really sets the baseline.

In my experience, my workloads are usually big enough that the overhead of Julia multithreading becomes negligible. If it is a problem, there are alternatives to the base implementation which are faster and more suitable (as mentioned by @jling).
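As a rough illustration of that point, here is a minimal sketch of a threaded Monte Carlo loop using only Base Julia. The function names are made up for this example and it is not the code from the blog post. With n and T large, the per-iteration work dominates the cost of scheduling the loop:

```julia
# Assumes Julia was started with several threads, e.g. `julia --threads=auto`.
function threaded_monte_carlo(n, T)
    x = zeros(n)
    Threads.@threads for i in eachindex(x)
        acc = 0.0
        for t in 1:T
            acc += randn()   # the default RNG is safe to call from multiple threads
        end
        x[i] = acc           # each iteration writes only its own slot
    end
    return x
end

# Serial reference with the same structure, for a like-for-like timing.
function serial_monte_carlo(n, T)
    x = zeros(n)
    for i in eachindex(x)
        acc = 0.0
        for t in 1:T
            acc += randn()
        end
        x[i] = acc
    end
    return x
end

threaded_monte_carlo(10, 10); serial_monte_carlo(10, 10)   # warm-up (compilation)
t_serial   = @elapsed serial_monte_carlo(100_000, 100)
t_threaded = @elapsed threaded_monte_carlo(100_000, 100)
println("threads: ", Threads.nthreads(), "  serial: ", t_serial, " s  threaded: ", t_threaded, " s")
```

A single @elapsed call is noisy; a benchmarking package would give more trustworthy numbers, but this keeps the sketch dependency-free.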

I don’t know when you last used Julia, but the Threads library has been improving with every Julia release, so maybe it’s worth another try. I would be interested to see a benchmark comparing its overheads with OpenMP.

I introduce threading here because threading is not even available in languages like Python and MATLAB. I only include the C++ comparison to “hook” people who may be interested only if Julia is fast like C, not slow like Python.


But this comparison is incorrect. Julia uses the Xoshiro256++ algorithm for its RNG.


A valid point. However, as I said, I am simply comparing the defaults: what a new PhD student would likely write when just trying to get some simulations working. My main aim is not to show that Julia is “faster”, just that it is easy to reach very high levels of performance in Julia for very little effort.

I think this is of interest to programmers and researchers that have only used Python or MATLAB, and would find learning Julia more appealing than C/C++ etc, knowing that they aren’t forced into using those languages if they want parallelism or high performance.


If I have threads which also create threads, is OpenMP as easy as Julia if you want the threads to be scheduled so that you never run more threads than the number of CPU cores? (Parallel computing noob here. Would love to read more about comparisons of the various options like OpenMP, IntelTBB, Polyester.jl etc.)
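For the Julia half of this question, here is a minimal sketch of nested parallelism. Tasks created with Threads.@spawn are scheduled onto the fixed set of threads Julia was started with, so a task that spawns further tasks does not add OS threads. The function nested_sum and its cutoff value are made up for illustration:

```julia
# Recursive divide-and-conquer sum: every level may spawn another task.
function nested_sum(xs, lo = firstindex(xs), hi = lastindex(xs); cutoff = 1_000)
    if hi - lo < cutoff
        return sum(@view xs[lo:hi])          # small enough: do the work serially
    end
    mid = (lo + hi) >>> 1
    # The left half runs as a new task, which may itself spawn more tasks.
    left = Threads.@spawn nested_sum(xs, lo, mid; cutoff = cutoff)
    right = nested_sum(xs, mid + 1, hi; cutoff = cutoff)
    return fetch(left) + right
end

xs = collect(1:100_000)
println(nested_sum(xs) == sum(xs))
println("tasks were multiplexed over ", Threads.nthreads(), " thread(s)")
```

How the equivalent nesting behaves in OpenMP or Intel TBB is not covered by this sketch.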

I’m sorry, but this is plain false. MATLAB automatically exploits threading in many, many built-in functions (FFTs, matrix products, generic array operations…): you may want to take a look at their dedicated page.

You could argue that a serious user would want fine control over what’s actually happening. I’d respond that a big part of MATLAB’s success has always been its very good default handling of many things, from algorithm choice to threading to wisdom for FFTW and so on; so yes, but you’d better be a true expert to actually beat their careful default setup. In that case, you just need to invoke one of the many constructs provided by the Parallel Computing Toolbox, where you can go from a naive parfor loop to fine control over the number of workers and how distributed memory is managed. Similarly, you can easily initialize an array on the GPU and let MATLAB dispatch all your array operations to it.


I’m not familiar with matlab internals, but mentioning only linear algebra operations and fft makes me think threading comes only from the underlying libraries used and isn’t a core feature of the language. Is that the case? The documentation page you linked doesn’t seem to be very clear about this.

Yes, the comparison with C++ is about the random number generator.

Actually, I think that if one forces Julia to use Mersenne Twister it ends up being slower than the C++ code.

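using Random

# `const` keeps the global RNG type-stable; a non-const global would slow the loop.
const rng = MersenneTwister(1234)

function simple_monte_carlo(n, T)
    x = zeros(n)
    for i in eachindex(x)
        for t in 1:T
            # Draw from the Mersenne Twister instead of the default Xoshiro256++.
            x[i] += randn(rng)
        end
    end
    return x
end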

Alternatively, one could swap MersenneTwister for Xoshiro256++ in C++ and report the results in the GitHub repo.
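On the Julia side, the two generators can be compared directly by passing the RNG as an argument. This is only a sketch: the function name monte_carlo is made up, Xoshiro requires Julia 1.7 or later, and a single @elapsed measurement is noisy, so treat any numbers as indicative at best:

```julia
using Random

# Same random-walk kernel, parameterised on the generator.
function monte_carlo(rng::AbstractRNG, n, T)
    x = zeros(n)
    for i in eachindex(x)
        for t in 1:T
            x[i] += randn(rng)
        end
    end
    return x
end

# Warm-up calls so compilation time is not included in the measurements.
monte_carlo(MersenneTwister(1), 10, 10)
monte_carlo(Xoshiro(1), 10, 10)

t_mt = @elapsed monte_carlo(MersenneTwister(1234), 10_000, 100)
t_xo = @elapsed monte_carlo(Xoshiro(1234), 10_000, 100)
println("MersenneTwister: ", t_mt, " s")
println("Xoshiro:         ", t_xo, " s")
```

Passing the generator as a typed argument also avoids any dependence on a global variable inside the hot loop.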

As for a threading efficiency benchmark against OpenMP: I ran one a while ago and was very positively surprised that Julia was actually able to benefit from hyperthreading, obtaining an almost 5X speedup over the single-threaded version on a 4-core PC. I had never seen that with OpenMP, where at most I used to get ~3.6X. See the Testing Julia blog post.

What has been your experience? Do you have an MWE of where Julia threading falls short? Maybe that could be related to the fact that OpenMP has partial support for SIMD, and in Julia that is implemented in a separate library (LoopVectorization)?
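On the SIMD point, it may help to note that Base Julia also ships a @simd macro, with LoopVectorization.jl being an additional package on top of that. A minimal sketch using only Base (the function name simd_sum is made up for this example):

```julia
# @simd allows the compiler to reorder the floating-point reduction so the loop
# can be vectorised; @inbounds removes bounds checks inside the loop body.
function simd_sum(xs::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(xs)
        s += xs[i]
    end
    return s
end

println(simd_sum(rand(1_000)))
```

Because the reduction order may change, results can differ from a strictly sequential sum in the last few bits.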