Parallelization on the CPU isn't effective

leopdsf · November 19, 2021, 7:48pm

Hello, everyone. I’m following the tutorial introduction to GPU programming in Julia (Introduction · CUDA.jl), and the first example in the tutorial discusses parallelization on the CPU. I didn’t find a specific section for CPU, so I thought General Usage would be adequate. The code is:

using Test
using BenchmarkTools

N=2^20
x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@btime sequential_add!($y, $x)
@btime parallel_add!($y, $x)

The results given by the tutorial for benchmarking with 4 threads are 487.303 μs (0 allocations: 0 bytes) for sequential_add! and 259.587 μs (13 allocations: 1.48 KiB) for parallel_add!, showing that the parallel computation is faster.

When I run the code on my PC, the execution times are roughly the same: 387.601 μs (0 allocations: 0 bytes) for sequential and 386.501 μs (20 allocations: 3.39 KiB) for parallel. Running Threads.nthreads() returns 4, so I am also using 4 threads.
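To put a number on the gap, here is a small sketch that turns the timings quoted above into a speedup and a parallel efficiency. It assumes the zero-allocation timing is the sequential run in both cases; the names `tutorial`, `mine`, `speedup` and `efficiency` are made up for this illustration.

```julia
# Timings in microseconds, copied from the post above.
# Assumption: the zero-allocation timing is the sequential run in both cases.
tutorial = (sequential = 487.303, parallel = 259.587)
mine     = (sequential = 387.601, parallel = 386.501)

# Speedup = sequential time / parallel time; 1.0 means no gain at all.
speedup(t) = t.sequential / t.parallel

# Parallel efficiency = speedup / number of threads; 1.0 would be ideal scaling.
efficiency(t, nthreads) = speedup(t) / nthreads

println("tutorial: speedup ", round(speedup(tutorial), digits = 2),
        ", efficiency ", round(efficiency(tutorial, 4), digits = 2))
println("my PC:    speedup ", round(speedup(mine), digits = 2),
        ", efficiency ", round(efficiency(mine, 4), digits = 2))
```

A speedup close to 1 with 4 threads is what one would expect if only a single thread were doing the work.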

This should be a problem, right? Am I doing something wrong while compiling, or misinterpreting the use of threads? Also, is there some package that I need to install before parallelizing?

carstenbauer · November 19, 2021, 7:54pm

That’s strange. Here is what I get:

julia> @btime sequential_add!($y, $x)
  250.849 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  81.248 μs (21 allocations: 1.83 KiB)

To answer your question: no, you don’t need to install a package or do anything special except for making sure to start Julia with 4 threads, i.e. julia -t4.

goerch · November 19, 2021, 7:58pm

OP reports nthreads() being 4. So another explanation would be that some of these threads are not available;)

carstenbauer · November 19, 2021, 8:08pm

Yeah, I guess in principle it could be a coincidence (e.g. caused by other OS threads keeping the cores busy). However, the timings are so similar that it’s way more likely that the benchmarks have been accidentally run with nthreads() == 1.
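One way to tell these two explanations apart from inside the session is to record which thread executes each iteration of a `Threads.@threads` loop. This is only a sketch; `threads_used` is a hypothetical helper, not something from the tutorial.

```julia
using Base.Threads: @threads, threadid, nthreads

# Hypothetical helper: record which thread ran each iteration of a @threads loop.
function threads_used(n = 1_000)
    ids = zeros(Int, n)
    @threads for i in 1:n
        ids[i] = threadid()   # each iteration writes only its own slot, so no data race
    end
    return sort(unique(ids))  # the distinct thread ids that took part
end

used = threads_used()
println("nthreads() = ", nthreads(), ", thread ids seen in the loop: ", used)
```

If only a single id shows up, the loop is effectively running on one thread. For such a short loop, seeing fewer ids than nthreads() is not conclusive on its own, since one thread may finish several chunks before the others start.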

Perhaps the number of threads was 4 at some point but the OP has restarted Julia while experimenting around and forgotten to start Julia with 4 threads again (just guessing of course). @leopdsf could you ensure that this was not the case? Perhaps you could post a screenshot of your terminal like this, i.e. including the command to start Julia:

leopdsf · November 19, 2021, 8:16pm

Hello @carstenbauer and @goerch, thanks for answering. Here are the screenshots:

carstenbauer · November 19, 2021, 8:20pm

Hm, can you post the output of Sys.CPU_THREADS please?
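For anyone following along, the relevant settings can be collected in one go. A minimal sketch, where `thread_report` is a made-up helper name; `Threads.nthreads()`, `Sys.CPU_THREADS` and the `JULIA_NUM_THREADS` environment variable are the only Julia facilities it relies on.

```julia
# Collect the thread-related settings that matter for this benchmark.
function thread_report()
    return (
        julia_threads = Threads.nthreads(),   # threads this session was started with
        cpu_threads   = Sys.CPU_THREADS,      # logical cores reported by the system
        env_setting   = get(ENV, "JULIA_NUM_THREADS", "(not set)"),
    )
end

report = thread_report()
println(report)

# More Julia threads than logical cores means the threads have to share cores.
if report.julia_threads > report.cpu_threads
    println("More Julia threads than logical cores; threads will compete for cores.")
end
```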

goerch · November 19, 2021, 8:21pm

I don’t see a reference to Threads.nthreads()?

carstenbauer · November 19, 2021, 8:22pm

Yeah, I also missed it in my screenshot above. It would be good to explicitly confirm it, but he is clearly starting Julia with -t4 (and the command line argument has higher precedence than any potentially set env variable).

leopdsf · November 19, 2021, 8:23pm

Here they are:

goerch · November 19, 2021, 8:24pm

Wow, that looks really bad. Is your system doing something else? What does your system monitor show? What does versioninfo() show?

leopdsf · November 19, 2021, 8:30pm

The first is in Portuguese; I’m trying to put it in English.

goerch · November 19, 2021, 8:35pm

Thanks for your help. Here is my info running the example from an included file

  144.400 μs (0 allocations: 0 bytes)
  37.000 μs (30 allocations: 3.62 KiB)

and versioninfo() giving

Julia Version 1.7.0-rc3
Commit 3348de4ea6 (2021-11-15 08:22 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 6

So maybe the simplest option would be to rerun the test on a newer version of Julia?

carstenbauer · November 19, 2021, 8:37pm

It’s not the Julia version. (I would know but to be sure I’ve explicitly run the example in 1.5, 1.6 and 1.7, all with the expected speedup.) But of course, it can’t hurt to update at least to 1.6.

goerch · November 19, 2021, 8:38pm

So is it a package then?

gbaraldi · November 19, 2021, 8:46pm

If you go to Task Manager and check the per-CPU graph, do all cores get loaded or just one?

goerch · November 19, 2021, 8:47pm

OP just showed it to us above.

gbaraldi · November 19, 2021, 8:49pm

But that’s a generalized graph, so I’m not sure if 100% there means a single core loaded to the max or all cores loaded to the max. I get 100% and like 380% on my computer, indicating 4 cores working.

goerch · November 19, 2021, 8:50pm

You could try to set up this example in a separate environment (see 4. Working with Environments · Pkg.jl) to exclude the influence of installed packages.
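A minimal sketch of what that could look like, assuming a throwaway directory is acceptable for the test; `env_dir` is just an illustrative name. The install step is left commented out because it needs network access.

```julia
using Pkg

# Create a throwaway directory and make it the active project, so the project
# itself starts with no dependencies of its own.
env_dir = mktempdir()
Pkg.activate(env_dir)

# The only package the benchmark script needs; uncomment to install it.
# Pkg.add("BenchmarkTools")

println("Active project: ", Base.active_project())
```

After installing BenchmarkTools there, the benchmark file can be included as before, still starting Julia with -t4.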

leopdsf · November 19, 2021, 10:21pm

The 4 cores are working, as shown in the screenshot. Updating to 1.6.3 also didn’t help. I will try using the environments now. For now, thank you very much, guys!

Jeff_Emanuel · November 19, 2021, 10:29pm

Grasping at straws here, but check Windows power settings. Maybe there is a setting that limits CPU usage, possibly “Power saver” in the image below.

from All Windows 10 Power Options Explained