Parallelization on the CPU isn't effective

Hello, everyone. I’m following the tutorial introduction to GPU programming in Julia (Introduction · CUDA.jl), and the first example in the tutorial discusses parallelization on the CPU. I didn’t find a specific section for CPU, so I thought General Usage would be adequate. The code is:

using Test
using BenchmarkTools

N=2^20
x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@btime sequential_add!($y, $x)
@btime parallel_add!($y, $x)

The results given by the tutorial for the benchmark with 4 threads are 487.303 μs (0 allocations: 0 bytes) for sequential_add! and 259.587 μs (13 allocations: 1.48 KiB) for parallel_add!, showing that the parallel computation is faster.

When I run the code on my PC, the execution times are roughly the same (387.601 μs (0 allocations: 0 bytes) for sequential and 386.501 μs (20 allocations: 3.39 KiB) for parallel). Threads.nthreads() returns 4, so I am also using 4 threads.
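
To put the two sets of numbers side by side, here is a small sketch that computes the speedup (sequential time divided by parallel time) from the timings quoted above. The helper name `speedup` is made up for illustration, and it assumes the 0-allocation timing on my PC is the sequential run.

```julia
# Hypothetical helper: speedup = sequential time / parallel time (both in μs).
speedup(t_seq, t_par) = t_seq / t_par

tutorial = speedup(487.303, 259.587)  # timings reported in the tutorial
mine     = speedup(387.601, 386.501)  # timings measured on my PC

println("tutorial speedup: ", round(tutorial, digits=2))
println("my speedup:       ", round(mine, digits=2))
```

A value well above 1 means the threaded loop helped; a value close to 1 means it gained essentially nothing.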

This should be a problem, right? Am I doing something wrong while compiling, or misinterpreting the use of threads? Also, is there some package that I need to install before parallelizing?

That’s strange. Here is what I get:

julia> @btime sequential_add!($y, $x)
  250.849 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  81.248 μs (21 allocations: 1.83 KiB)

To answer your question: no, you don’t need to install a package or do anything special except for making sure to start Julia with 4 threads, i.e. julia -t4.

OP reports nthreads() being 4. So another explanation would be that some of these threads are not available ;)

Yeah, I guess in principle it could be a coincidence (e.g. caused by other OS threads keeping the cores busy). However, the timings are so similar that it’s way more likely that the benchmarks have been accidentally run with nthreads() == 1.
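
One way to check this directly, as a rough sketch rather than a definitive diagnostic, is to record which thread executes each iteration of a threaded loop. The helper name `threads_used` is hypothetical; it only relies on `Threads.@threads`, `Threads.threadid()` and `Threads.nthreads()`.

```julia
using Base.Threads

# Hypothetical helper: record which thread ran each iteration of a threaded loop.
function threads_used(n)
    ids = zeros(Int, n)
    @threads for i in 1:n
        ids[i] = threadid()  # each iteration writes only its own slot
    end
    return sort(unique(ids))
end

used = threads_used(2^20)
println("nthreads() = ", nthreads())
println("thread ids seen in the loop: ", used)
```

If only a single id shows up, that would suggest the loop effectively ran on one thread, whatever was intended at startup.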

Perhaps the number of threads was 4 at some point but the OP has restarted Julia while experimenting around and forgotten to start Julia with 4 threads again (just guessing of course). @leopdsf could you ensure that this was not the case? Perhaps you could post a screenshot of your terminal like this, i.e. including the command to start Julia:

Hello @carstenbauer and @goerch, thanks in advance for answering. Here are the screenshots:


Hm, can you post the output of Sys.CPU_THREADS please?

I don’t see a reference to Threads.nthreads()?

Yeah, I also missed it in my screenshot above. It would be good to explicitly confirm it, but he is clearly starting Julia with -t4 (and the command line argument has higher precedence than any potentially set env variable).
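
For an explicit confirmation, something like the following could be pasted into the running session. It is only a sketch: it prints Threads.nthreads(), Sys.CPU_THREADS and the JULIA_NUM_THREADS environment variable, and the "(not set)" fallback string is just a placeholder of mine.

```julia
# Collect the thread-related settings of the current Julia session.
n_julia_threads = Threads.nthreads()   # threads this session was started with
n_cpu_threads   = Sys.CPU_THREADS      # logical CPU threads Julia detects
env_setting     = get(ENV, "JULIA_NUM_THREADS", "(not set)")  # placeholder if unset

println("Threads.nthreads() = ", n_julia_threads)
println("Sys.CPU_THREADS    = ", n_cpu_threads)
println("JULIA_NUM_THREADS  = ", env_setting)
```

If the session was started with -t4, the first line should report 4 regardless of what the environment variable says.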

Here they are:
[screenshot: P3]

Wow, that looks really bad. Is your system doing something else? What does your system monitor show? What does versioninfo() show?

The first is in Portuguese; I’m trying to put it in English.


[screenshot: p5]

Thanks for your help. Here is my info running the example from an included file

  144.400 μs (0 allocations: 0 bytes)
  37.000 μs (30 allocations: 3.62 KiB)

and versioninfo() giving

Julia Version 1.7.0-rc3
Commit 3348de4ea6 (2021-11-15 08:22 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 6

So maybe the simplest option would be to rerun the test on a newer version of Julia?

It’s not the Julia version. (I would know, but to be sure I’ve explicitly run the example in 1.5, 1.6 and 1.7, all with the expected speedup.) But of course, it can’t hurt to update at least to 1.6.

So is it a package then?

If you go to Task Manager and check the per-CPU graph, do all cores get loaded or just one?

OP just showed it to us above.

But that’s a generalized graph; I’m not sure if 100% there means a single core loaded to the max or all cores loaded to the max. I get 100% and like 380% on my computer, indicating 4 cores working.

You could try to set up this example in a separate environment (see 4. Working with Environments · Pkg.jl) to exclude the influence of installed packages.
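
A minimal sketch of what that could look like, assuming a reasonably recent Julia where Pkg.activate accepts the temp keyword. It uses a throwaway environment so nothing from the default environment is loaded.

```julia
using Pkg

# Activate a fresh, throwaway environment with no packages installed.
Pkg.activate(; temp=true)

# The active project should now point into a temporary directory.
project = Base.active_project()
println("active project: ", project)

# BenchmarkTools is needed for @btime; installing it requires network access,
# so the line is left commented out in this sketch.
# Pkg.add("BenchmarkTools")
```

After adding BenchmarkTools there, the example from the first post can be rerun unchanged in the same session.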

The 4 cores are working, as shown in the screenshot. Updating to 1.6.3 also didn’t help. I will try using the environments now. For now, thank you very much, guys!

Grasping at straws here, but check Windows power settings. Maybe there is a setting that limits CPU usage, possibly “Power saver” in the image below.

from All Windows 10 Power Options Explained
