Parallel computing with *

So I was testing Julia with different numbers of threads using $ julia -t 1 or $ julia -t 7. I ran versioninfo() and it confirms that I do in fact have 1 or 7 threads on a basic desktop setup. I took a decent-sized matrix and ran a simple matrix multiplication, which I thought would already run in parallel naturally (based on a Stack Overflow answer). However, I get identical benchmarks with @benchmark or @time, so I'm very confused now. The code was very simple: a 2,000 x 50,000 matrix of 0s, 1s, and 2s called M, and I did G = M * M' and got 223 seconds almost exactly for both runs. I assumed it would be ~7 times faster with 7 threads.

Just confused and looking for general help on why this didn't happen. Is -t 1 being ignored, so that all cores get used anyway?

This probably answers your question: Frequently Asked Questions · DifferentialEquations.jl

But the short answer is that BLAS multithreading is controlled independently of Julia's multithreading.
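For example, something like this shows the two settings side by side (the numbers printed here are just illustrative, not from your machine):

julia> using LinearAlgebra

julia> Threads.nthreads()       # Julia threads, set by julia -t N
7

julia> BLAS.get_num_threads()   # BLAS threads, a separate setting
8

julia> BLAS.set_num_threads(1)  # only this one affects dense * for BLAS types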


Thank you for your answer. I did find this option, but I just tested again: I started Julia with 1 vs. 7 threads, then set BLAS.set_num_threads(1) or (7), and I get identical results again (~223 seconds). I still don't get what Julia is doing. Any more ideas?

This is what I get here:

julia> using LinearAlgebra

julia> A = rand(10^4, 10^4);

julia> BLAS.set_num_threads(1)

julia> @time A * A';
 20.189948 seconds (3 allocations: 762.940 MiB)

julia> BLAS.set_num_threads(4)

julia> @time A * A';
  7.785689 seconds (3 allocations: 762.940 MiB, 1.11% gc time)

Hmm, I'll try again. Thank you.

Wow, so I simplified again to almost exactly your code and cannot find a difference in time… I don't know what I'm doing wrong here. No matter what combination of julia -t and BLAS.set_num_threads() I use, it comes up with the same time. I'm very baffled by this.

Next question: how can I determine from a run how many cores it used? Is there a macro for that? Maybe the Profile.jl package?

@Imiq When I used your example it worked… yet when I try my example again, there's no difference. This is very frustrating. Not sure what is going on.
[screenshot]

I commented below here with my example where I don’t see a difference… Any ideas? Maybe due to the type of matrix?

What version of Julia are you using? Pasting the result of versioninfo() would be helpful.

Would appreciate any help. Thanks.
[screenshot]

What type of matrix is M? How exactly are you generating it?

Matrix{Int64} (alias for Array{Int64, 2})

[screenshot]

Maybe BLAS doesn't have methods for integer matrices. Try using Octavian maybe? (It is sensitive to the number of Julia threads.)
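Roughly like this (an untested sketch, assuming Octavian is installed via ] add Octavian; its matmul runs on Julia's threads, so it respects julia -t):

julia> using Octavian

julia> M = rand(0:2, 2000, 50000);   # stand-in integer matrix, as in your example

julia> G = Octavian.matmul(M, M');   # multithreaded multiply that works for Int element types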

This is why minimal working examples are useful for getting a quick and helpful resolution of your issues.


thanks…

Definitely not. You only get BLAS acceleration for floating-point types.

I would just convert it to Matrix{Float32} (which is exact for integers up to 16777216) or Matrix{Float64} (which is exact for integers up to 9007199254740992). These will benefit from optimized multithreaded BLAS, SIMD, etcetera.

For your matrix sizes, Float32 is almost 200x faster on my machine:

julia> M = rand(0:2, 2000,50000);

julia> using BenchmarkTools;

julia> @btime $M * $M';
  126.120 s (5 allocations: 30.55 MiB)

julia> A = Matrix{Float32}(M);

julia> M * M' == A * A'  # Float32 is exact for integers of this size
true

julia> @btime $A * $A';
  688.669 ms (2 allocations: 15.26 MiB)

Wow, I had no idea. This doesn't make any intuitive sense to me… Is there a package that implements multithreaded matrix multiplication for integers? Thank you for your help.

Yup, amazing. Thank you very much. Works great. I had no idea; I will look in the documentation and see if that is explained there. Maybe I missed something.

I appreciate your help! Solved below.

The basic reasons are:

  • linear algebra ultimately requires floating-point arithmetic in almost all practical cases, even if you are starting with matrices of integers, so most of the optimization effort has been devoted to the floating-point case
  • the most mature and popular linear-algebra libraries, LAPACK and BLAS (used by Julia, Matlab, SciPy, R, …), are originally based on Fortran 77, which doesn't support type-generic code, so it made sense to support only floating-point types
  • even if you can write type-generic code, as in Julia (or using metaprogramming in C or Fortran), fully optimizing matrix multiplication requires a lot of low-level kernel specializations that tend to be type-specific. Float32/64 and ComplexF32/64 are inevitably going to get the most attention (see the sketch after this list).
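To make that last point concrete, here is a naive type-generic multiply, parallelized over columns of the result with Threads.@threads (just a sketch: it handles integer matrices and scales with julia -t, but without blocking, packing, and SIMD kernels it won't come close to an optimized BLAS or Octavian):

function naive_threaded_mul(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where {T}
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions must match"))
    C = zeros(T, size(A, 1), size(B, 2))
    Threads.@threads for j in 1:size(B, 2)   # each Julia thread gets a block of columns of C
        for k in 1:size(A, 2)
            b = B[k, j]
            @inbounds for i in 1:size(A, 1)  # column-major inner loop over A and C
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end

With julia -t 7 this will use 7 threads, but don't expect it to match the Float32 BLAS timing shown above.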