Parallel computing with *

austin-putz · December 29, 2022, 3:35pm

So I was testing julia with different number of threads using $ julia -t 1 or $ julia -t 7. I run versioninfo() and it confirms that I do in fact have 1 or 7 cores on a basic desktop setup. I tested a decent sized matrix and ran a simple matrix multiplication, which I thought would already run in parallel naturally (based on a stack overflow answer). However, I get identical benchmarks with @benchmark or @time so I’m very confused now. The code was very simple, a matrix of 0,1, or 2’s (2,000 x 50,000) called M and I did G = M * M' and got 223 seconds almost exactly for both runs. I assumed it would be ~7 times faster with 7 cores.

Just confused and looking for general help on why this didn’t happen. Is -t 1 being ignored and using all cores anyway?

lmiq · December 29, 2022, 3:45pm

This probably answers your question: Frequently Asked Questions · DifferentialEquations.jl

But the short answer is that BLAS multi threading is controlled independently of the Julia multi threading.

austin-putz · December 29, 2022, 4:02pm

Thank you for your answer. So I did find this option, but I just tested again. I started up with 1 vs 7 threads. I then set with BLAS.set_num_threads(1) or (7) and get identical results again (~223 seconds). I still don’t get what julia is doing. Any more ideas?

lmiq · December 29, 2022, 4:22pm

This is what I get here:

julia> A = rand(10^4, 10^4);

julia> BLAS.set_num_threads(1)

julia> @time A * A';
 20.189948 seconds (3 allocations: 762.940 MiB)

julia> BLAS.set_num_threads(4)

julia> @time A * A';
  7.785689 seconds (3 allocations: 762.940 MiB, 1.11% gc time)

austin-putz · December 29, 2022, 4:33pm

hummm, I’ll try again. Thank you.

austin-putz · December 29, 2022, 4:54pm

Wow, so I simplified again, almost exactly your code and cannot find a difference in time… I don’t know what I’m doing wrong here. No matter what combination of julia -t I use or BLAS.set_num_threads() it comes up with the same time. I’m very baffled by this.

Next question is how can I determine by the run how many cores it used? Is there a macro for that? Maybe the Profile.jl package?

austin-putz · December 29, 2022, 5:06pm

@Imiq When I used your example it worked… yet again, I try my example and no difference. This is very frustrating. Not sure what is going on.
Screen Shot 2022-12-29 at 11.06.01 AM

austin-putz · December 29, 2022, 5:10pm

I commented below here with my example where I don’t see a difference… Any ideas? Maybe due to the type of matrix?

jling · December 29, 2022, 5:13pm

what version of Julia are you using? paste result of versioninfo() would be helpful

austin-putz · December 29, 2022, 5:15pm

Would appreciate any help. Thanks.
Screen Shot 2022-12-29 at 11.15.05 AM

lmiq · December 29, 2022, 5:20pm

What type of matrix is M? How exactly are you generating it?

austin-putz · December 29, 2022, 5:25pm

Matrix{Int64} (alias for Array{Int64, 2})

Screen Shot 2022-12-29 at 11.25.30 AM

lmiq · December 29, 2022, 5:34pm

Maybe BLAS doesn’t have methods for boolean matrices. Try using Octavian maybe? (It is sensible to the number of Julia threads).

DNF · December 29, 2022, 5:38pm

This is why minimal working examples are useful to get quick and helpful resolution of your issues:

austin-putz · December 29, 2022, 5:40pm

thanks…

stevengj · December 29, 2022, 5:48pm

Definitely not. You only get BLAS acceleration for floating-point types.

I would just convert it to Matrix{Float32} (which is exact for integers up to 16777216) or Matrix{Float64} (which is exact for integers up to 9007199254740992). These will benefit from optimized multithreaded BLAS, SIMD, etcetera.

For your matrix sizes, Float32 is almost 200x faster on my machine:

julia> M = rand(0:2, 2000,50000);

julia> using BenchmarkTools;

julia> @btime $M * $M';
  126.120 s (5 allocations: 30.55 MiB)

julia> A = Matrix{Float32}(M);

julia> M * M' == A * A'  # Float32 is exact for integers of this size
true

julia> @btime $A * $A';
  688.669 ms (2 allocations: 15.26 MiB)

austin-putz · December 29, 2022, 6:10pm

Wow, I had no idea. This doesn’t make any intuitive sense to me… There must be a package that implements multi-threading for integers?Thank you for your help.

austin-putz · December 29, 2022, 6:20pm

Yup, amazing. Thank you very much. Works great. Had no idea, I will look in the documentation and see if that is apparent, I maybe missed something.

austin-putz · December 29, 2022, 6:20pm

I appreciate your help! Solved below.

stevengj · December 29, 2022, 6:22pm

The basic reasons are:

linear algebra ultimately requires floating-point arithmetic in almost all practical cases, even if you are starting with matrices of integers, so most of the optimization effort has been devoted to the floating-point case
the most mature and popular linear-algebra library, LAPACK and BLAS (used by Julia, Matlab, Scipy, R, …), is originally based on Fortran 77, which doesn’t support type-generic code, so it made sense to only support floating-point types
even if you can write type-generic code, as in Julia (or using metaprogramming in C or Fortran), fully optimizing matrix multiplication requires a lot of low-level kernel specializations that tend to be type specific. Float32/64 and ComplexF32/64 are inevitably going to get the most attention.

Topic		Replies	Views
Performance issue with multithreaded computation with matrix operations at its heart (Threads.@threads vs. BLAS threads) Performance blas , parallel , multithreading , linearalgebra , threads	7	391	November 13, 2023
Julia seems to be running multithreaded when I don't want it to Julia at Scale	67	1357	March 2, 2023
Same multi-threaded code, scaling observed only on some machines Performance	2	67	August 14, 2024
Matrix vector multiplication Performance question	4	895	September 27, 2020
Julia Threads vs BLAS threads Internals & Design	16	10907	July 26, 2018

Parallel computing with *

Related topics