Parallel computing with *

What version of Julia are you using? Pasting the result of `versioninfo()` would be helpful.

Would appreciate any help. Thanks.
[Screenshot: Screen Shot 2022-12-29 at 11.15.05 AM]

What type of matrix is M? How exactly are you generating it?

Matrix{Int64} (alias for Array{Int64, 2})

[Screenshot: Screen Shot 2022-12-29 at 11.25.30 AM]

Maybe BLAS doesn’t have methods for Boolean matrices. Try Octavian, maybe? (It is sensitive to the number of Julia threads.)
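For reference, a minimal sketch of calling Octavian (assuming Octavian.jl is installed; `Octavian.matmul` is the exported multiply, as used later in this thread):

```julia
using Octavian  # pure-Julia matmul; start Julia with threads, e.g. `julia -t auto`

A = rand(Float64, 200, 300)
B = rand(Float64, 300, 100)
C = Octavian.matmul(A, B)  # multithreaded, scales with Threads.nthreads()
@assert C ≈ A * B          # agrees with the built-in BLAS result
```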

This is why minimal working examples are useful for getting your issues resolved quickly and helpfully:



Definitely not. You only get BLAS acceleration for floating-point types.

I would just convert it to Matrix{Float32} (which is exact for integers up to 16777216) or Matrix{Float64} (which is exact for integers up to 9007199254740992). These will benefit from optimized multithreaded BLAS, SIMD, etcetera.
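As a quick sanity check, those two thresholds are exactly `maxintfloat` for each type (a minimal sketch, using only Base):

```julia
# The largest n such that all integers in 0:n are exactly representable
# in the given floating-point type:
@assert maxintfloat(Float32) == 16777216          # 2^24
@assert maxintfloat(Float64) == 9007199254740992  # 2^53

# Past that threshold, conversion starts rounding:
@assert Float32(16777217) == Float32(16777216)
```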

For your matrix sizes, Float32 is almost 200x faster on my machine:

julia> M = rand(0:2, 2000,50000);

julia> using BenchmarkTools;

julia> @btime $M * $M';
  126.120 s (5 allocations: 30.55 MiB)

julia> A = Matrix{Float32}(M);

julia> M * M' == A * A'  # Float32 is exact for integers of this size
true

julia> @btime $A * $A';
  688.669 ms (2 allocations: 15.26 MiB)

Wow, I had no idea. This doesn’t make any intuitive sense to me… Surely there must be a package that implements multithreading for integers? Thank you for your help.

Yup, amazing. Thank you very much, it works great. I had no idea; I will look in the documentation and see whether this is mentioned. Maybe I missed something.

I appreciate your help! Solved below.

The basic reasons are:

  • linear algebra ultimately requires floating-point arithmetic in almost all practical cases, even if you are starting with matrices of integers, so most of the optimization effort has been devoted to the floating-point case
  • the most mature and popular linear-algebra libraries, LAPACK and BLAS (used by Julia, Matlab, SciPy, R, …), were originally based on Fortran 77, which doesn’t support type-generic code, so it made sense to support only floating-point types
  • even if you can write type-generic code, as in Julia (or using metaprogramming in C or Fortran), fully optimizing matrix multiplication requires a lot of low-level kernel specializations that tend to be type specific. Float32/64 and ComplexF32/64 are inevitably going to get the most attention.
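To illustrate the last point: nothing about integers prevents multithreading; the gap is only in tuned kernels. A naive threaded integer matmul is easy to write in plain Julia (a sketch, nowhere near BLAS-level performance):

```julia
# Naive multithreaded integer matrix multiplication.
# Loop order is column-major friendly; @simd vectorizes the innermost
# loop over rows of C.
function threaded_matmul(A::Matrix{Int}, B::Matrix{Int})
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("inner dimensions must match"))
    C = zeros(Int, m, n)
    Threads.@threads for j in 1:n  # columns of C are independent
        for l in 1:k
            b = B[l, j]
            @inbounds @simd for i in 1:m
                C[i, j] += A[i, l] * b
            end
        end
    end
    return C
end

A = rand(0:9, 50, 40)
B = rand(0:9, 40, 30)
@assert threaded_matmul(A, B) == A * B
```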

Okay, thank you very much, I appreciate your help. Not a programmer here; I just assumed wrong on this one. Hopefully the docs include this information. I have probably been doing this wrong for some time now. Thanks!

Does Octavian.jl (or a similar package) not parallelize integer matmuls?

If you take a closer look at the post I linked, you will see that it recommends copying and pasting text instead of posting images of code. The reason is that text is easy to copy-paste (and thereby run and test), but images are not. You get proper formatting by wrapping the code in triple backticks, as demonstrated in that post.


I tested multiplying two Boolean matrices and it was very slow. I think it just falls back to some naive algorithm.


GitHub - MasonProtter/Gaius.jl: Divide and Conquer Linear Algebra says

Currently, fast, native matrix-multiplication is only implemented between matrices of types Matrix{<:Union{Float64, Float32, Int64, Int32, Int16}}, and StructArray{Complex}. Support for other commonly encountered numeric struct types such as Rational and Dual numbers is planned.


I have not tried Octavian yet; I’ll do that tonight. On your second post: interesting, it was an Int64 matrix. I just assumed it would work in parallel, but the reply above explains why not. Had no idea. Thanks for this link.

IIRC, Octavian will do integers. The optimizations are pretty much all the same (except for the lack of FMA).


Apparently not (ignore the compilation times; they don’t make a difference for the result):

julia> using Octavian

julia> A = rand(Int, 10^4, 10^3);

julia> @time A * A';
 36.394812 seconds (2.60 M allocations: 888.709 MiB, 0.43% gc time, 1.63% compilation time)

julia> A = rand(Float64, 10^4, 10^3);

julia> @time A * A';
  2.114822 seconds (2.73 M allocations: 895.197 MiB, 3.93% gc time, 31.80% compilation time)

just to be sure:

julia> A = rand(Int, 5*10^3, 5*10^3);

julia> @time Octavian.matmul(A,A');
  7.988792 seconds (3 allocations: 190.735 MiB, 0.08% gc time)

julia> A = rand(Float64, 5*10^3, 5*10^3);

julia> @time Octavian.matmul(A,A');
  2.022297 seconds (3 allocations: 190.735 MiB, 4.02% gc time)