Multi-threading on a 2 CPU system

Hi all,
As a new Julia programmer, I’m fascinated by its syntax, the availability of code examples, its community, and, most importantly, its performance!
Now that I’m having my first experience with thread parallelism, I’m finding it harder to locate clear examples and information about this topic.
As a newcomer, I was very interested in using @threads to quickly turn a sequential loop into a parallel loop “without many changes to the original code”, but the more I read about this subject, the more I realize that it is not the “silver bullet” I initially thought it was.
I have implemented 11 algorithms to solve 14 math problems with 5 different dimensions (numbers of variables), each combination executed 51 times, as implemented in the 4 nested loops shown below:

#Main loop
dt = @elapsed begin
    @threads for p in problems
        @threads for v in num_vars
            _p = Problem( p.f, p.min, p.max, v ) # copy problem struct
            @threads for algo in algorithms
                population = pop_mult * _p.num_variables
                println( "Alg.: ", algo |> string, " -> ", _p |> string, " | Pop. Size: ", population, " | Num. Iter.: " , iter_mult * v, " (", num_runs , " runs)", " | Start: ", now() )
                @threads for _ in 1:num_runs 
                    res = algo( _p, population, iter_mult * v, 1.0e-12 )                       
                    push!( results, res )
                end
            end
        end
    end
end

As there are 770 different combinations of algorithms, problems, and dimensions, I decided to parallelize the execution by adding the @threads macro to all loops.

I found out that this was the most performant combination by running a smaller subset with 3 algorithms, 3 problems, and 3 dimensions (with smaller dimensions), and got the following results:

NOTE: The "notation" [a-b-c-d] refers to the positions of the "for" loops in the code above, and the value 1 means loop with the @threads command while 0 represents a simple sequential loop.

Then I ran the program on a 2-CPU EPYC Linux system with 64 cores / 128 threads per CPU (256 threads total) and 256 GB of RAM, but after 5 days of execution I got a “Killed” message from the system.

Now I have started a new run limited to 64 threads to reduce the amount of memory used, and also because I found that by not using all cores in multithreaded operations I gain some additional performance (maybe due to multithreading overhead?).

My questions are:

  • Is the @threads loop a possible cause of the execution being killed? (As additional information, I have successfully run a smaller example of the same code on a 6-core system; it took almost 4 days, and I believe the @threads combination was [1-0-0-0].)
  • What other, more efficient ways are there to parallelize this type of execution?
  • Any recommendations on code examples for correctly implementing parallel code in Julia?
  • What special considerations (if any) should I have when using multithreaded execution across multi-CPU systems?

Thank you in advance!


There seems to be a lot going on, so I’m not sure what exactly the issue is, but here are a few things I noticed:
First of all, just to make sure, you are getting the exact same result for your parallelized code as for your single-threaded code, right?
The push!( results, res ) call in the innermost loop looks like each thread is pushing to the same vector of results, which would cause a race condition, unless I am misunderstanding your code. So the first thing you should do is find out whether your parallel code truly produces correct results, before you go for further optimizations.
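
If you do want to keep collecting results from inside the loop, one common thread-safe pattern is to guard the push! with a lock. A minimal sketch, reusing the names from your posted loop (the lock variable is introduced here only for illustration):

results_lock = ReentrantLock()   # not in the original code; introduced for this sketch

@threads for _ in 1:num_runs
    res = algo( _p, population, iter_mult * v, 1.0e-12 )
    # only one thread at a time may mutate the shared `results` vector
    lock(results_lock) do
        push!( results, res )
    end
end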

Then, your code with 8 CPU threads is only about 2 times as fast as the single-threaded version.
I would not expect too much improvement from going to a higher number of total cores, which leads to my first guess about the issue:

  • Is it possible that the code on 64 cores simply runs a lot longer than you thought it would?

Is it possible to try a significantly smaller example run on the machine with 64 cores? Basically, just to see whether the loop actually hangs?
My guess is that the program simply takes longer than you think it will. If the smaller job already takes 4 days, then your bigger job may just take too long. As mentioned before, it doesn’t look like your code gets significantly faster than it is with two threads.


Regarding tips for parallelization:

As I understand it, you have 11*14*5*51 jobs which are completely independent? By that I mean they do not need to access the same memory? In this case, the function pmap from the Distributed standard library might be just what you want.
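
For reference, here is a rough sketch of what a pmap version could look like, assuming the problem definitions are made available on all workers; the file name "setup.jl" and the worker count are only illustrative:

using Distributed
addprocs(64)                      # illustrative worker count
@everywhere include("setup.jl")   # hypothetical file defining Problem, algorithms, pop_mult, iter_mult, ...

jobs = [ (p, v, algo) for p in problems for v in num_vars
                      for algo in algorithms for _ in 1:num_runs ]

results = pmap(jobs) do (p, v, algo)
    _p = Problem( p.f, p.min, p.max, v )
    algo( _p, pop_mult * _p.num_variables, iter_mult * v, 1.0e-12 )
end
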
It should of course also be possible to do this with threads, but then it would probably be best to benchmark the code further by checking whether it allocates a lot, and which part in particular takes long.

The way you have your loops structured might be problematic.

@threads for ..
    @threads for..

Nested @threads loops are probably a little hard to reason about. If you restructured your code to have one top-level parallel loop, you would likely get better performance and a more intuitive understanding of the parallel execution.

I have not checked the results thoroughly, but they are within the expected range (as they are based on random values)

Yes, I’m “bringing back the results” of each execution to the main thread into the “results” vector so I can save them all at once to a file at the end of the run.

I did change the code so it no longer collects the results in the main thread, and also removed all the intermediate variables to avoid any kind of possible race condition (modified main loop below), but the performance was basically the same (within the margin of error) – see the table under the code below.

#Main loop
dt = @elapsed begin
    @threads for p in problems
        @threads for v in num_vars
            @threads for algo in algorithms
                @threads for _ in 1:num_runs                        
                    algo( Problem( p.f, p.min, p.max, v ), pop_mult * v, iter_mult * v, 1.0e-12 )
                end
            end
        end
    end
end


NOTE: Run with simplified main loop above

Is this low scalability related to the nature of the algorithms I’m running (optimization algorithms for the minimization of a function which, simply put, are loops with operations on vectors of random numbers), or to my implementation of multi-threading?

I do not have direct access to the machine that is running the algorithms; I send the source code and they run it on the machine.

But I did previously finish a limited complete run successfully (limited to a maximum of 1000 cycles per combination of algorithm, problem, and dimension) on the 6-core AMD machine; it took about 3.5 days using the @threads combination [0-0-0-1] (only the innermost loop was parallelized).

As the intended number of cycles per combination is variable and goes up to 200 000 cycles, we decided to run it on the 2x 64-core CPU machine, hoping that by using all 256 threads the execution time would be similar. Since the variable num_runs in the innermost loop of the code (for _ in 1:num_runs) was limited to 51, we decided to parallelize all the other loops so we could take advantage of all the available cores.

Yes, each job is completely independent, only reporting its result back to the main thread to be pushed to a vector. But as I said in the previous reply, removing this last part had practically no effect on the performance.

I have not tested the pmap function; I’ll give it a try and then report back.

One aspect that often limits the scalability of multithreading is garbage collection, so please check how much of the execution time of each of your problems is spent on it. If one thread spends 10% of its time on garbage collection, 5 parallel threads might already spend 50% of the CPU time on garbage collection, because all threads are stopped while any one of them collects garbage.

In such a situation you have two options: either reduce the memory allocations or use multi-processing (multiple Julia processes, e.g. via the Distributed standard library) instead of multithreading.
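
One quick way to check the GC share of a single run (a sketch; the call just reuses the names from the loops above) is to time it with @timed, which also reports the time spent in the garbage collector:

stats = @timed algo( _p, population, iter_mult * v, 1.0e-12 )
println( "total: ", round(stats.time; digits=2), " s, GC: ",
         round(100 * stats.gctime / stats.time; digits=1), " %" )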


The reason for creating nested @threads loops was to try to make the most of the 256 available threads; as I said in a reply to @Salmon, the longest loop was limited to 51 iterations.

If you make one big top-level flattened loop that is parallel and then deconstruct your flattened index inside the loop, you should be able to do it.
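
A minimal sketch of that idea, assuming the same variable names as in your original loop; CartesianIndices is used here to map the flat index back to the four loop indices:

using Base.Threads

dims = ( length(problems), length(num_vars), length(algorithms), num_runs )
idx  = CartesianIndices(dims)

@threads for k in 1:length(idx)
    (i, j, a, _) = Tuple( idx[k] )          # deconstruct the flat index
    p    = problems[i]
    v    = num_vars[j]
    algo = algorithms[a]
    _p   = Problem( p.f, p.min, p.max, v )
    algo( _p, pop_mult * _p.num_variables, iter_mult * v, 1.0e-12 )
end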

That makes sense, and it explains the fact that by freeing up 1 or 2 threads from the run I got better performance!

Can you point me to some information on how to reduce memory allocations in Julia?
I will also give multi-processing a try and see if there is any difference.

  1. if you have loops, pre-allocate arrays
  2. use views instead of copies of arrays
  3. use the .= operator and the other dot operators for element-wise assignments (tips 1–3 are sketched below)
  4. for arrays with less than 100 elements use static arrays (StaticArrays.jl: https://github.com/JuliaArrays/StaticArrays.jl)

This document is very long, but useful, and it also has a section about pre-allocating output: the Performance tips page of the Julia manual (https://docs.julialang.org/en/v1/manual/performance-tips/).
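
A tiny illustration of tips 1–3 (all names and sizes here are made up):

n    = 1_000
buf  = zeros(n)                 # 1. pre-allocate once, outside the loop
data = rand(n, 100)
for j in 1:100
    col  = @view data[:, j]     # 2. a view avoids copying the column
    buf .= 2 .* col .- 1        # 3. dot operators update `buf` in place
end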

I did a run with “one big top-level flattened loop” and with the simplified main loop above (with no push of the results), but there is no significant effect…


NOTE: Run with simplified main loop

Did you wrap your loop in a function to benchmark?


The tips that @ufechner7 gave are all good. I would like to add that you should definitely check for type instabilities, i.e. whether the compiler cannot figure out the type of some object in one of your often-called functions and needs to resolve this at runtime (runtime dispatch). These usually lead to many allocations and poor scaling.
An excellent way to check for runtime dispatches is the @profview macro, in case you’re using VS Code, which highlights runtime dispatches in red. If there are any red parts in the bottom part of your flame graph, you can gain significant improvements by fixing them.
Another great option is the macro @code_warntype, which gives you an overview of the types the compiler inferred for a given function call.
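
As a sketch, a check on one of the algorithm calls could look like this (the call itself just reuses the names from the earlier posts):

using InteractiveUtils          # provides @code_warntype outside the REPL
@code_warntype algo( _p, population, iter_mult * v, 1.0e-12 )
# look for `::Any` in the output (shown in red in the REPL): those are the unstable spots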

Using @time or BenchmarkTools @btime instead of @elapsed is also nice since it tells you the percentage of time spent in the garbage collector.
I have seen multithreaded performance increasing quite a bit even going from 5% GC time to 1%.
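
For example (a sketch; BenchmarkTools is a registered package, and the call reuses the names from the earlier posts):

using BenchmarkTools
@time algo( _p, population, iter_mult * v, 1.0e-12 )               # prints "% gc time" when the GC ran
t = @benchmark algo( $_p, $population, $(iter_mult * v), 1.0e-12 )
# t.gctimes holds the GC time of every sample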

It is not about performance, but it is nonetheless important that you make sure there are no race conditions. My guess is that you are not observing any corruption since each individual calculation takes a while, so it’s unlikely that two finish at the same time. However, if that ever happens, it will probably result in data loss, i.e. the results of one of the runs will be missing. My advice is to pre-allocate the results vector with length length(problems)*length(numvars)*length(algorithms)*numruns and simply write the result of each calculation to the correct entry.
An easy way to do that is, as proposed by @fft, to flatten the loop.

let 
    probs = ["a","b"]
    numvars = [1,2]
    algorithms = [sqrt,log10,exp]
    runs = 1:10
    
    individualRuns = collect((p,v,a,r) for p in probs for v in numvars for a in algorithms for r in runs)

    results = zeros(length(individualRuns))

    Threads.@threads for i in eachindex(individualRuns)
        (p,v,algo,_) = individualRuns[i]
        results[i] = algo(v)
    end
    results
end

Edit: This has nothing to do with performance, by the way. First you need to eliminate allocations in your algo function.


Another thing: are the execution times of individual elements very different? Then using FLoops.jl with something like WorkStealingEx from FoldsThreads.jl on a single loop could be worth trying.
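
A rough sketch of what that could look like, reusing the flattened individualRuns vector from the example above (FLoops.jl and FoldsThreads.jl are registered packages):

using FLoops, FoldsThreads

results = zeros(length(individualRuns))
@floop WorkStealingEx() for i in eachindex(individualRuns)
    (p, v, algo, _) = individualRuns[i]
    results[i] = algo(v)        # placeholder computation, as in the example above
end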
