Blog: Using Julia on the HPC

the reason this is slower is that rng is an untyped global variable. what are the results of you make rng an argument to the function?

Ah, good point.

function simple_monte_carlo(n, T, rng)
	x = zeros(n)	
	for i in eachindex(x)
		for t in 1:T	
			x[i] += Random.randn(rng, Float64)	
		end	
	end
	return x
end

rng = MersenneTwister(1234)
@btime simple_monte_carlo(262144, 100, rng)

Wow, this is faster than C++ againā€¦ !

Anyway, this was still the RNG, with C++ std::random being slow indeed.

Replacing that with boost yields a much better performance for the C++ code.

Hereā€™s a simplified Cpp code (without repeats or CSV output):

#include <iostream>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real.hpp>
#include <chrono>
#include <vector>

using namespace std::chrono;

int main()
{
    boost::mt19937 rng;
    boost::uniform_real<> randn(0,1);

    int n = 262144;
    int T = 100;

    auto start = high_resolution_clock::now();
    
    std::vector<double> x(n);
    for (size_t i = 0; i < n; i++)
    {
        for (size_t t = 0; t < T; t++)
        {
            x[i] += randn(rng);
        }
    }

    auto stop = high_resolution_clock::now();
    double duration = (double)duration_cast<milliseconds>(stop - start).count();
    std::cout << duration << std::endl;

    return 0;
}

Compiled with -O3, this code is about 50% faster than Julia. But we are just benchmarking RNG libraries here so this is rather unconclusive.

2 Likes

Thanks for the core here! I have a quick question - is uniform_real generating a number between 0 and 1? In the problem, we are generating Gaussian numbers, which I imagine has to be slower.

The reason that I stick to the inbuilt RNG is because it is now thread safe by default, making it much easier to move into threading.

I tried to see if I could find the better rng implementation in C++ easily, but I wasnā€™t up for implementing it myself.

If anyone would like to add an additional implementation then please feel free to make a PR.

1 Like

Hi, thanks for your input on this! I should have been clear, that I mean threading on the user level, with a for loop, not in the internal operations.

The parfor always seemed like multiprocessing, not lightweight threading, as it spins up other processes and has huge startup latency. Once you have spun up the other processes it works well, but on my machine in the past (32 threads), it takes around 12 minutes to start up and uses 20GB of RAM before any code is run, which is why I donā€™t include it in the comparison.

Iā€™ll give a look at the blog post. To be honest, I didnā€™t have a good experience with Julia for HPC. Iā€™ve tried Distributed.jl and DistributedArrays.jl to make a Monte Carlo simulation for a physics project: it was incredibly slow. It could be that I didnā€™t know what I was doing, and that my data structure was bad, but I really didnā€™t know what to do, and I just ended up switching to Rust. I was able to make the code parallel just fine (even on different nodes of a cluster) and it was really easy, which is cool. But it was just too slow. Iā€™m not sure if it is my algorithmā€™s fault (but again, I donā€™t know what to improve), or if it Distributed/DistributedArrays acting as bottleneck. If anyone want to see my code, here is the repo:

  • Lattice.jl contains the code regarding the data structure
  • CabibboMarinari.jl contains the code regarding the actual algorithm
  • here is the script launching the simulation
1 Like

In my experience, going to distributed code straight away can be quite difficult. I usually first optimise the serial implementation parts as much as possible (remove allocations etc - all the tips in Performance Tips Ā· The Julia Language), and then work out which parts can be parallelised effectively. Throughout, I tend to benchmark a lot to see where the overhead is.

If you ever want to go back to Julia and donā€™t get the performance you would like, post a MWE and ask for optimisation help. Youā€™ll get a lot of pointers and help and it should be looking a lot better very quickly, and hopefully, youā€™ll get a lot of valuable tips throughout.

This is basically what I did. Iā€™ve followed all the instructions about performance, and benchmarked the code. In fact, most of the time was spent on the most expensive function, which is what I expected, and I found no other way to make it faster. The only thing Iā€™ve a couple of doubts on is using a DistributedArray of StaticArrays. Iā€™m not sure having nested array is a good idea, but then again, they are StaticArrays and should be handled quite efficiently, so Iā€™m not sure.

Iā€™m sure there are great parallel/simd,etc RNG libraries in C++, which are also probably harder to find and use.

You made a good point in this thread that Julia has a better default library in this area.

Have you tried SlurmClusterManager? I really enjoyed working with it.

1 Like

Thanks for this! It looks quite good. I have a script which I use to do what this package does, but itā€™s definitely better in a package.

Well, you can leave out ffts (they are just wrappers of FFTW) and maybe other domain specific things (signal and image processing tools, etc). But I do not really see a difference about array operarions. Matlab stands for matrix lab and users are basically expected to write everything in matrix operation form (loops are discouraged everywhere in the docs and the forum). You may like it or not (I assume the Julia community loves loops, generally, and respect that) but itā€™s just how matlab idiomatic code is supposed to be. Of course it would call some BLAS and LAPACK implementation internally (a tweaked MKL I believe) but why the user would care? The core language is based on matrix operations and those are automatically multi threaded, this is ideal for user experience (probably not so much for library developer experience, which is not really a target for Mathworks).

Yeah they like to keep their secrets (and this is surely the greatest flaw in matlab experience for manyā€¦ at least in the view of a passionate FOSS advocate).

Ok, but again, in a language that discourages the use of loops (in favor of vectorized operations) and is very clear about the intent in the docs this becomes much less relevant. Surely if one wants to code with loops (or at least be able to choose) being able to get those multithreaded is great, but if we are speaking about general user experience with multithreading support I think matlab does just fine. Then of course many many algorithms are more readable if written with loops and thatā€™s where the ā€œvectorize allā€ mantra really fails, about that I agree.

Anyway, I do not have time for that right now but if someone could try compare Julia parallelism options with Modern Fortranā€™s do concurrent, coarrays & co. (maybe even with the older !$openmp directive) it could be interesting to me. The nice similarity to Juliaā€™s apprach therein is that you just define the semantics for the algorithm in the source code (like you do in your function), but then the specifics of the parallel protocol and the hardware offload are managed by the compiler choice (and possibly some compilation flag). Namely Nvidia compiler would offload coarray code to CUDA, while Intel compiler to their CPUs/GPUs (here flags matter I think), or GCC to openMP / MPI (again flags, or maybe they even try to infer what do to).

I still never really explored myself the possibilities (itā€™s quite new and in the codebase I navigate daily we still use manual MPI directives) but it seems quite a lot interesting and somehow similar in spirit to what you described in your blog post. :slight_smile:

1 Like

I ran benchmarks before. Polyester.jl and LoopVectorization.jl were both much faster in loops where low overhead mattered.

LIbraries using PolyesterWeave will not over-subscribe threads when theyā€™re used together.
These include Polyester.jl, LoopVectorization.jl, Octavian.jl, TriangularSolve.jl, and RecursiveFactorization.jl.

Itā€™s limited by its simplicity, though, in that it doesnā€™t do any load balancing.
Iā€™ve noticed that this means Octavian.jl suffers a much bigger performance drop than MKL.jl on a system that is otherwise busy. But, if youā€™re calling it from Polyester.@batch threads, Octavianā€™s threads will be aware of this unlike MKLā€™s, so the former may do better/automatically run single threaded instead of oversubscribing.

A past version of Polyeter did support reserving threads per thread it started, so that each of those could start a certain number of threads of their own. E.g., with 16 threads total, you could run 4 threads that each have up to 3 extra threads of their own.
That sort of thing couldā€™ve been useful for MCMC, where you run some number of parallel chains, and then want each thread to have parallelism to accelerate the chains. But no one used that functionality AFAIK, and I didnā€™t feel like maintaining it at the time ā€“ figured I could add it back if there was a need.

You or anyone else are welcome to run benchmarks.

My point is that as far as I understand the Matlab language itself does not have any support whatsoever for multithreading (perhaps Iā€™m wrong, but documentation is ambiguous on the topic). The external libraries Matlab uses for some operations do. This is a pretty clear distinction to me. You canā€™t write atbitrary multi-threaded code.

2 Likes

Is w = A*v a core language expression or a library call to you? That is multithreaded.

Library call clearly, there is nothing language-specific in what you wrote, you just rely on whatever MKL does under the hood. There is nothing in the syntax which signals the code has to run on multiple threads, thatā€™d be a language feature. Again, it looks to me like in matlab you canā€™t manage how arbitrary code uses multiple threads, you can only hope itā€™ll end up calling some external multi-threaded libraries.

1 Like

Ok then we have different views about what qualifies as multithreading support in the language. I get what limitation you see, as a developer, though.
My point was that no matter how they achieve it, a matrix multiplication is a native language feature in matlab as much as a for loop. Itā€™s part of their language specification and of the documented semantics. Itā€™s not like using Eigen3, Armadillo or whatever in C++. So I would agree on the part where you say that the user has no fine control about how the multithreading is done, but Iā€™ll stay there: saying that Matlab has no language support for multithreading is misleading imho, it communicates the message that you cannot write native code that exploits threads, but again, a matrix product is native code in Matlab, itā€™s in the name.

Anyway here I found a much clearer list (though outdated now, itā€™s from 2007) list of what is automatically multithreaded in matlab (just to substantiate my statement in the previous post):

I guess we have different ideas of what constitutes the syntax of a language, and Iā€™m not going to continue a semantic discussion.

1 Like

Fine for me, let me just remark that I never brought syntax in, Iā€™m just talking about supported features in the language. Surely there is no syntax that Iā€™m aware of to invoke multithreading, itā€™s just ā€œonā€.
Now that you are asking I googled around and found that there actually is was some syntax to determine how much threads would be used, or to call a script / function in single thread mode (this apparently is still supported). It is documented here:

I have to say that their clear ā€œoutsourcingā€ to OS control is in line with your viewpoint.

You keep conflating ā€œlanguageā€ with ā€œthe matlab application which bundles the interpreter of a language and some other external librariesā€. Much like there is often confusion between the Julia language and its only implementation, which comes with a compiler and a bunch of other libraries.