Why is the execution time worse with @threads?

I am running the code below on a computer with 6 physical cores:

using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools

dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;

NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 

@btime begin
global iter = 0;
  for t in timeSim
    global iter += 1;

    global V_Mat = A_Mat \ I_Mat;
    
    for i in 1:3:6
      j = Int((i-1)/3)+1;
      global NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    global NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
    
  end #for t in timeSim
end #@btime begin
7.382 ms (153499 allocations: 8.45 MiB)

But when I use @threads, the execution time becomes worse than without it. Any ideas?

using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools

dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;

NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 


@btime begin
global iter = 0;
  for t in timeSim
    global iter += 1;

    global V_Mat = A_Mat \ I_Mat;
    
    @threads for i in 1:3:6
      j = Int((i-1)/3)+1;
      global NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    global NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
    
  end #for t in timeSim
end #@btime begin

julia> Threads.nthreads()
6
23.504 ms (308554 allocations: 23.71 MiB)

For one thing, I would always put things in a function before jumping to any conclusions. Performance-sensitive code should never be in global scope.

4 Likes

My original code is much more complicated, and when I move it into a function I run into world-age issues.
Anyway, I moved this example into a function, with the same results:

function solverMANATestT()

global iter = 0;
  for t in timeSim
    global iter += 1;

    global V_Mat = A_Mat \ I_Mat;
    
    for i in 1:3:6
      j = Int((i-1)/3)+1;
      global NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    global NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
    
  end #for t in timeSim

end #function solverMANATestT()
...
using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools
include("./../3Phase/solverMANATestT.jl");

dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;

NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 
@btime solverMANATestT()
7.234 ms (153499 allocations: 8.45 MiB)
function solverMANATestTP()

global iter = 0;
  for t in timeSim
    global iter += 1;

    global V_Mat = A_Mat \ I_Mat;
    
    @threads for i in 1:3:6
      j = Int((i-1)/3)+1;
      global NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    global NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
    
  end #for t in timeSim

end #function solverMANATestTP()
...
using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools
include("./../3Phase/solverMANATestTP.jl");

dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;

NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 
@btime solverMANATestTP()
julia> Threads.nthreads()
6
24.297 ms (308558 allocations: 23.71 MiB)

I think you have to get rid of all the globals to know what is going on. They are the devil. If you need to update things in place, pass them in as function arguments instead.
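
For example, a minimal sketch of what that could look like here (same loop body as yours, but everything passed in; the solver! name is made up):

function solver!(NodeVoltageRecord, NodeVoltage, A_Mat, I_Mat, timeSim)
  iter = 0
  for t in timeSim
    iter += 1
    V_Mat = A_Mat \ I_Mat
    for i in 1:3:6
      j = Int((i-1)/3) + 1
      NodeVoltage[:,j] = [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]]
    end
    NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage
  end
  return NodeVoltageRecord
end

You would then allocate the arrays once, call solver!(NodeVoltageRecord, NodeVoltage, A_Mat, I_Mat, timeSim), and benchmark it with interpolated arguments, e.g. @btime solver!($NodeVoltageRecord, $NodeVoltage, $A_Mat, $I_Mat, $timeSim).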

2 Likes

I got rid of all the global variables (as below) and the trend of the results is still the same:

function solverMANATestT(tmin)
  iter = 0;
  for t in timeSim
      iter += 1;

    V_Mat = A_Mat \ I_Mat;
    
    for i in 1:3:6
      j = Int((i-1)/3)+1;
      NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
    
  end #for t in timeSim

end #function solverMANATestT()

using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools
include("./../3Phase/solverMANATestT.jl");

dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;

NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 
@btime solverMANATestT(tmin)
6.925 ms (134517 allocations: 8.16 MiB)

function solverMANATestTP(tmin)
  iter = 0;
  for t in timeSim
      iter += 1;

    V_Mat = A_Mat \ I_Mat;
    
    @threads for i in 1:3:6
      j = Int((i-1)/3)+1;
      NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
    
  end #for t in timeSim

end #function solverMANATestTP()

using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools
include("./../3Phase/solverMANATestTP.jl");

dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;

NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 
@btime solverMANATestTP(tmin)
julia> Threads.nthreads()
6
22.710 ms (259562 allocations: 23.04 MiB)

Note that timeSim is still a global, as are A_Mat and I_Mat. These are just captured in a closure.

A few last things for getting rid of globals:

  • Your solverMANATestT must be accessing I_Mat, A_Mat, etc. as globals (i.e. it is a closure). I would ensure that everything is self-contained so there are no accidental closures.
  • I think it is essential that you make sure your functions work in a functional form without any accidentally captured variables if you care about performance.
  • The call to @btime solverMANATestTP(tmin) will actually use tmin as a global, so you have to do @btime solverMANATestTP($tmin) to know what is going on.

You will know you are done with all of this when you can check that there are no type instabilities.
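
For example (assuming the definitions above are loaded), a quick way to check is:

using InteractiveUtils   # provides @code_warntype outside the REPL (it is loaded by default in the REPL)
@code_warntype solverMANATestTP(tmin)
# Anything reported as Any (e.g. the non-const globals A_Mat, I_Mat, timeSim) is a type instability.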

… now, just to be clear, I am not saying that this stuff will magically solve the performance differences, since threading doesn’t always help, but these are things you need to do for performant code anyway.

In all likelihood, when you get your code clean enough and get rid of the globals, you may find that packages like LoopVectorization.jl (https://github.com/JuliaSIMD/LoopVectorization.jl) or Tullio.jl (https://github.com/mcabbott/Tullio.jl) perform magic that you wouldn’t be able to figure out yourself. Or maybe not. They certainly were designed for these sorts of tight kernels, though.

Thanks for your explanations. I made the changes you suggested, but the results are still the same.

function solverMANATestT()
dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;
NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 
  iter = 0;
  for t in timeSim
      iter += 1;
    V_Mat = A_Mat \ I_Mat;
    for i in 1:3:6
      j = Int((i-1)/3)+1;
      NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
  end #for t in timeSim
end #function solverMANATestT()
using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools
include("./../3Phase/solverMANATestT.jl");
@btime solverMANATestT()
2.348 ms (60025 allocations: 6.87 MiB)
function solverMANATestTP()
dt = 0.001;     
tmin = 0;  
tmax = 5;
timeSim = tmin:dt:tmax;
NodeVoltage = zeros(3,2); 
NodeVoltageRecord = zeros(3*size(timeSim,1),2);
A_Mat = spzeros(9,9); 
for i in 1:9
  A_Mat[i,i] = 1;
end
I_Mat = ones(9,3); 
  iter = 0;
  for t in timeSim
      iter += 1;
    V_Mat = A_Mat \ I_Mat;
    @threads for i in 1:3:6
      j = Int((i-1)/3)+1;
      NodeVoltage[:,j]= [V_Mat[i,1]; V_Mat[i+1,2]; V_Mat[i+2,3]];#V_Mat[i:i+2]
    end #for i in 1:3:6
    NodeVoltageRecord[(3*iter-2):3*iter,:] = NodeVoltage;
  end #for t in timeSim
end #function solverMANATestTP()
using SparseArrays;
using LinearAlgebra;
using .Threads
using BenchmarkTools
include("./../3Phase/solverMANATestTP.jl");
@btime solverMANATestTP()
julia> Threads.nthreads()
6
18.122 ms (215086 allocations: 22.21 MiB)

Perfect! I think now you are at the point where you can start to analyze things more carefully. But my guess is that you have more to do (e.g. doing things in place, double-checking type stability, moving the kernel into its own function, etc.). You might end up with a significant improvement.
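
For illustration, a hypothetical sketch (the copy_voltages! name and the dense identity stand-in are made up, not your model) of what "doing things in place" and "moving the kernel into its own function" could look like:

using LinearAlgebra

# Kernel moved into its own small function: it compiles once and is easy to profile on its own.
function copy_voltages!(NodeVoltage, V_Mat)
  for j in 1:2
    i = 3j - 2
    NodeVoltage[:, j] .= (V_Mat[i, 1], V_Mat[i+1, 2], V_Mat[i+2, 3])
  end
  return NodeVoltage
end

A = Matrix{Float64}(I, 9, 9)   # dense identity stand-in for A_Mat
F = lu(A)                      # factorize once if A does not change
I_Mat = ones(9, 3)
V_Mat = similar(I_Mat)         # preallocated result buffer
NodeVoltage = zeros(3, 2)

ldiv!(V_Mat, F, I_Mat)         # in-place solve: fills V_Mat, no new matrix allocated
copy_voltages!(NodeVoltage, V_Mat)

The point is that ldiv! writes into a preallocated buffer instead of allocating a new result every iteration, and a small self-contained kernel function is easy to benchmark and check for type stability separately.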

With all of that said, if your loop really is 1:3:6 then it is entirely possible that this just doesn’t work very well for threading because of the overhead. This doesn’t look like a natural place for parallel computations to me, but I could be wrong. Regardless, if you ensure that everything is type stable you can try out the LoopVectorization- and Tullio-style features, and they may let you toggle on SIMD/threads/etc. if the kernels support it. The Julia Slack performance channel is pretty helpful once you have things in a clean state, but at this point I think more competent people than me can probably provide guidance.

I think you won’t gain anything from sparse matrices this small, and that essentially all the time is spent computing \. Yet neither A_Mat nor I_Mat changes in the loop, so you only need to do this once. I presume that your real case is not like that?

function solverMANATestT_1()
  dt = 0.001     
  tmin = 0  
  tmax = 5
  timeSim = tmin:dt:tmax
  NodeVoltage = zeros(3,2) 
  NodeVoltageRecord = zeros(3*size(timeSim,1),2)
  A_Mat = zeros(9,9);  # at this size, sparse won't help anything
  for i in 1:9
    A_Mat[i,i] = 1
  end
  I_Mat = ones(9,3) 
  iter = 0
  for t in timeSim  # thousands of loops
    iter += 1
    V_Mat = A_Mat \ I_Mat  # here neither A_Mat nor I_Mat changes
    for i in 1:3:6  # if this loop was the bottleneck, it should be over j...
      j = Int((i-1)/3)+1  # ... to avoid division. Index at  3j-2, 3j-1, 3j
      NodeVoltage[:,j] .= (V_Mat[i,1], V_Mat[i+1,2], V_Mat[i+2,3]) # tuple not vector
    end #for i in 1:3:6
    NodeVoltageRecord[(3*iter-2):3*iter,:] .= NodeVoltage
  end #for t in timeSim
  NodeVoltageRecord
end

# 1.662 ms (50021 allocations: 7.48 MiB)  original
@btime solverMANATestT_1();
# 1.002 ms (20009 allocations: 2.90 MiB) just dense not sparse
#   782.375 μs (10007 allocations: 2.14 MiB) with tuple not vector RHS
#    82.333 μs (5 allocations: 235.58 KiB) without \
1 Like

Thanks for your advice. I didn’t get the meaning of (1) doing things in place, (2) double-checking type stability, and (3) moving the kernel into its own function. Could you please give me examples or links for them?

Yes, my real code is partially similar to this code, in that A_Mat and I_Mat do change over time (values, not dimensions). However, the number of iterations is the same as in this case, so when I ran it I was surprised to get a lower speed with threads.
I have to say that, when using @threads with very large iteration counts, there is a gain. So I am wondering how to decide whether to use @threads based on the number of iterations, which may change during my main program.

At this point you should listen to @mcabbott who may also be able to tell you how to use his mind-blowing packages for performance if there is any hope of it being useful here.

1 Like

You can arrange to only use @threads sometimes, but I think the major point is that the loop you’re threading is just copying 3 numbers. It needs a lot more work than that inside to benefit.

If you can thread an outer loop instead, that’s more likely to help. But if each iteration depends on the last time step, then you can’t.
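
For example, a rough sketch of threading an outer loop over independent solves (the solve_all name is made up, and this is only valid if the iterations really are independent):

using Base.Threads

function solve_all(A_Mat, I_Mat, nsteps)
  results = Vector{Matrix{Float64}}(undef, nsteps)
  @threads for k in 1:nsteps
    results[k] = A_Mat \ I_Mat   # each task does a full linear solve (real work), not just a copy
  end
  return results
end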

On large enough matrices \ itself is multi-threaded. If 9x9 is the true size, this is too little work, but other tricks may help:

julia> using SparseArrays, StaticArrays

julia> @btime x\y  setup=(x=sprand(9,9,0.9); y=randn(9));
  6.567 μs (65 allocations: 31.28 KiB)

julia> @btime x\y  setup=(x=rand(9,9); y=randn(9));
  755.800 ns (3 allocations: 992 bytes)

julia> @btime x\y  setup=(x=@SMatrix rand(9,9); y=@SVector randn(9));
  356.635 ns (0 allocations: 0 bytes)

So, I can decide based on the number of iterations and the amount of work inside the loop, right?
In conclusion, in my case @threads is lowering the speed because of the threading overhead?
Another issue: I know that @time reports the execution time plus compilation time of a piece of code, while @btime reports only the execution time. So if I used @time instead of @btime in my previous code, the reported time would be (execution time for a single thread + compilation time for the code) × number of threads, right?

Yes. Well, you can write an if statement. I can’t find it now, but someone (possibly me) had a drop-in replacement @threads min_iter for i in ...; it’s not so hard. Otherwise, packages like ThreadsX.jl, KissThreading.jl, and Tullio.jl all have ways to set the number of iterations you think is worth dividing up.
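
For instance, a minimal sketch of the if-statement approach (the fill_squares! name and the 10_000 cutoff are placeholders, not a recommendation):

using Base.Threads

function fill_squares!(out)
  n = length(out)
  if n >= 10_000              # enough work to amortize the thread launch
    @threads for i in 1:n
      out[i] = i^2
    end
  else                        # small loop: plain serial avoids the overhead
    for i in 1:n
      out[i] = i^2
    end
  end
  return out
end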

There’s extra compilation no doubt, but no idea if it’s linear in Nthreads. There’s also time just launching threads. If the step you’re threading is just copying numbers, it may never pay off, it’s just memory; it needs real work, something that takes at least a few microseconds, every time.
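
To see the launch overhead directly, you can compare a trivial serial loop against its threaded version (a rough sketch; the exact numbers depend on your machine):

using Base.Threads, BenchmarkTools

function serial_fill!(x)
  for i in eachindex(x)
    x[i] = i
  end
  return x
end

function threaded_fill!(x)
  @threads for i in eachindex(x)
    x[i] = i
  end
  return x
end

x = zeros(6)
@btime serial_fill!($x)     # essentially free
@btime threaded_fill!($x)   # dominated by launching and synchronizing the tasks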

1 Like

Thank you very much for your explanations. It seems that this drop-in replacement would be really helpful. Could you please point me to it, or show me how to build it?

Sounds like you want Polyester.jl, which has a minimum batch size argument, e.g.

@batch minbatch=20 for i = 1:3:6
    ...
end

Polyester currently has issues with non-unit-step ranges, but you can work around that with StaticArrays.jl:

@batch minbatch=20 for i in SA[1:3:6...] 
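
For completeness, a self-contained sketch of minbatch on an ordinary unit-step loop (hypothetical scale! workload, not your solver):

using Polyester

function scale!(out, x)
  @batch minbatch=1000 for i in eachindex(out, x)
    out[i] = 2 * x[i]
  end
  return out
end

x = rand(10_000)
out = similar(x)
scale!(out, x)   # loops shorter than minbatch run serially, skipping the threading overhead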

Does @batch have the same function as @threads?

It’s generally similar, but I’ll note three key differences:

  • @batch has a much lower overhead and allocates less memory
  • @threads for-loops can be nested, @batch cannot
  • @batch is on the bleeding edge of threading performance, @threads is stable and mature
1 Like

Thank you @stillyslalom for your explanation. I have compared @batch with the method here: https://discourse.julialang.org/t/optional-macro-invocation/18588/21

function ss()
  dt = 0.001;
  tmin = 0;
  tmax = 5;
  timeSim = tmin:dt:tmax;
  A_Mat = spzeros(9,9);
  for i in 1:9
    A_Mat[i,i] = 1;
  end
  I_Mat = ones(9,3);
  iter = 0;
  for t in timeSim
    iter += 1;
    n = 101
    #@threadsif n>100 for i in 1:n    # by using it the time is 182 ms
    @batch minbatch=15 for i in 1:n   # by using it the time is 879.872 ms
      V_Mat = A_Mat \ I_Mat;
    end
  end
end
using SparseArrays;
using LinearAlgebra;
using .Threads
using Polyester
using BenchmarkTools
@btime ss()

Any reason why @batch results in a much bigger number compared with the conditional @threadsif?