So I was testing Julia with different numbers of threads using $ julia -t 1 or $ julia -t 7. I ran versioninfo() and it confirms that I do in fact have 1 or 7 threads on a basic desktop setup. I created a decent-sized matrix and ran a simple matrix multiplication, which I thought would already run in parallel naturally (based on a Stack Overflow answer). However, I get identical benchmarks with @benchmark or @time, so I'm very confused now. The code was very simple: a 2,000 x 50,000 matrix of 0s, 1s, and 2s called M, and I computed G = M * M' and got almost exactly 223 seconds for both runs. I assumed it would be ~7x faster with 7 cores.

Just confused and looking for general help on why this didn't happen. Is -t 1 being ignored, with all cores used anyway?

Thank you for your answer. I did find this option, but I just tested again. I started Julia with 1 vs. 7 threads, then called BLAS.set_num_threads(1) or BLAS.set_num_threads(7), and I get identical results again (~223 seconds). I still don't understand what Julia is doing. Any more ideas?
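For reference, here is a minimal sketch for sanity-checking what a session is actually running with; it assumes a reasonably recent Julia, where BLAS.get_num_threads is available in the LinearAlgebra standard library:

```julia
using LinearAlgebra

# Number of Julia threads, as set by the -t / --threads flag:
println("Julia threads: ", Threads.nthreads())

# Number of threads the underlying BLAS library uses.
# This is independent of -t and is what BLAS.set_num_threads changes:
println("BLAS threads:  ", BLAS.get_num_threads())
```

Note that these are two separate knobs: -t controls Julia's own task threads, while the BLAS thread count only affects operations that actually dispatch to BLAS.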

Wow, so I simplified again to almost exactly your code and cannot find a difference in time… I don't know what I'm doing wrong here. No matter what combination of julia -t or BLAS.set_num_threads() I use, it comes up with the same time. I'm very baffled by this.

Next question: how can I determine, for a given run, how many cores it used? Is there a macro for that? Maybe the Profile.jl package?

Definitely not. You only get BLAS acceleration for floating-point types.

I would just convert it to Matrix{Float32} (which is exact for integers up to 16777216) or Matrix{Float64} (which is exact for integers up to 9007199254740992). These will benefit from optimized multithreaded BLAS, SIMD, etcetera.
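To illustrate the exactness claim: Float32 has a 24-bit significand, so every integer up to 2^24 = 16777216 is represented exactly, and 2^24 + 1 is the first integer that gets rounded away (Float64, with a 53-bit significand, is exact up to 2^53). For the matrices in this thread, each entry of M * M' is at most 4 × 50000 = 200000, far below either limit:

```julia
# 2^24 is exactly representable in Float32 (24-bit significand):
@assert Float32(16777216) == 16777216
# 2^24 + 1 is not; it rounds to the nearest representable value, 2^24:
@assert Float32(16777217) == 16777216
# Float64 (53-bit significand) is exact for integers up to 2^53:
@assert Float64(9007199254740992) == 9007199254740992
```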

For your matrix sizes, Float32 is almost 200x faster on my machine:

julia> M = rand(0:2, 2000,50000);
julia> using BenchmarkTools;
julia> @btime $M * $M';
126.120 s (5 allocations: 30.55 MiB)
julia> A = Matrix{Float32}(M);
julia> M * M' == A * A' # Float32 is exact for integers of this size
true
julia> @btime $A * $A';
688.669 ms (2 allocations: 15.26 MiB)

Wow, I had no idea. This doesn't make any intuitive sense to me… Surely there must be a package that implements multi-threading for integers? Thank you for your help.

Linear algebra ultimately requires floating-point arithmetic in almost all practical cases, even if you start with matrices of integers, so most of the optimization effort has been devoted to the floating-point case.

The most mature and popular linear-algebra libraries, LAPACK and BLAS (used by Julia, Matlab, SciPy, R, …), are originally based on Fortran 77, which doesn't support type-generic code, so it made sense to support only floating-point types.

Even if you can write type-generic code, as in Julia (or using metaprogramming in C or Fortran), fully optimizing matrix multiplication requires a lot of low-level kernel specializations that tend to be type-specific. Float32/Float64 and ComplexF32/ComplexF64 are inevitably going to get the most attention.
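That said, you can multithread an integer matmul by hand with Julia's own threads. Here is a minimal sketch using Threads.@threads (no external packages assumed); it parallelizes over columns of the result, but it will not approach BLAS speed because it lacks the cache-blocked, SIMD-specialized kernels described above:

```julia
# Minimal threaded integer matrix multiply: returns C = A * B.
# Each Julia thread handles a subset of the columns of C.
function tmul(A::Matrix{Int}, B::Matrix{Int})
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2 "inner dimensions must match"
    C = zeros(Int, m, n)
    Threads.@threads for j in 1:n
        for p in 1:k
            b = B[p, j]
            @inbounds for i in 1:m  # column-major inner loop
                C[i, j] += A[i, p] * b
            end
        end
    end
    return C
end
```

With julia -t 7, the outer loop over columns is split across 7 threads, so for the thread's example you would call tmul(M, Matrix(M')) and get the same integer result as M * M'.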