Why is the matrix multiplication with integer matrices much slower than with float ones?

The following is a simple function for the test.

using LinearAlgebra, BenchmarkTools

function vec_prod(n)
    vec = collect(1:n)                 # integer vector 1..n
    mat = Array{Float64}(undef, n, n)

    mat .= vec .* vec'                 # outer product

    matt1 = mat .* mat                 # elementwise product
    matt2 = mat * mat                  # matrix product

    return dot(matt1, matt2)
end

@btime vec_prod(50)
14.800 μs (7 allocations: 59.31 KiB)
1.1343603629882813e17

If I change it to mat = Array{Int64}(undef, n, n), I get
59.800 μs (10 allocations: 90.06 KiB)
113436036298828125

From my investigation, I found that the bottleneck in the {Int64} case is matt2 = mat * mat. If I remove the matrix multiplication, e.g. by replacing that line with matt2 = matt1, I get:
{Float64}
4.571 μs (5 allocations: 39.70 KiB)
4.312110892222225e15
{Int64}
3.786 μs (5 allocations: 39.70 KiB)
4312110892222225
Now the {Int64} case is faster.

I’ve also tested pre-allocating matt2 and/or using mul! from LinearAlgebra; the results were the same.
Why is matrix multiplication with integer matrices so much slower than with float ones? It looks like some kind of bug.

The reason is that floating-point matmul dispatches to BLAS, while the integer one uses a generic pure-Julia fallback method. The fallback could always be made faster, though.
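To get a feel for the two code paths, here is a minimal benchmark sketch. The naive_mul! loop below is my own illustration, not the actual fallback (LinearAlgebra's generic fallback is tiled and somewhat faster), but both are pure Julia, unlike the BLAS path:

using LinearAlgebra, BenchmarkTools

# A naive triple-loop matmul for scale: pure Julia, no BLAS.
function naive_mul!(C, A, B)
    @inbounds for j in axes(B, 2), i in axes(A, 1)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end

n = 200
Af = rand(n, n);       Bf = rand(n, n);       Cf = similar(Af)
Ai = rand(1:10, n, n); Bi = rand(1:10, n, n); Ci = similar(Ai)

@btime mul!($Cf, $Af, $Bf)        # Float64: dispatches to BLAS gemm
@btime mul!($Ci, $Ai, $Bi)        # Int64: generic pure-Julia fallback
@btime naive_mul!($Ci, $Ai, $Bi)  # naive loop, for comparison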


Try Octavian or Tullio. I guess they’re faster than whatever’s in the stdlib.

https://juliahub.com/ui/Packages/Tullio/PIgzC/0.3.5

https://juliahub.com/ui/Packages/Octavian/RyOyv/0.3.24
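For reference, here's a minimal Tullio sketch of an integer matmul (this uses Tullio's standard einsum-style notation; loading LoopVectorization alongside it usually speeds up the generated kernels):

using Tullio

A = rand(1:10, 100, 100)
B = rand(1:10, 100, 100)

@tullio C[i, j] := A[i, k] * B[k, j]  # := allocates C; the repeated index k is summed over
C == A * B                            # true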


To add a bit more explanation: decades of work have gone into optimizing floating-point matrix multiplication, because this is the most important case in most applications, resulting in BLAS libraries that approach the theoretical peak performance of your CPU for large matrices. Julia and most other languages/libraries (e.g. NumPy in Python, R, Matlab) call one of these BLAS libraries for multiplying floating-point matrices (and as a consequence, operations on large dense floating-point matrices run at basically the same speed in all languages).

Matrix multiplication for other numeric types is not nearly so well optimized; this is true in both Julia and other languages (e.g. NumPy). There are native Julia libraries like Octavian (mentioned by @nsajko above) that can generate heavily optimized matrix products for a wider variety of numeric types.
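A minimal Octavian usage sketch (matmul allocates its result; the in-place matmul! shows up in the benchmarks further down):

using Octavian

A = rand(-1_000:1_000, 200, 200)
B = rand(-1_000:1_000, 200, 200)

C = Octavian.matmul(A, B)  # multithreaded, pure-Julia matmul; works for Int64
C == A * B                 # true: agrees with the generic fallback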

However, in most applications, you should probably simply use floating-point matrices for linear algebra. There are basically two cases:

  1. If all of the integers involved (including intermediate results) have magnitude < 2^53, then integer multiplications/additions are exact in Float64 arithmetic, so floating-point matrix multiplication will give exactly the same results as integer matrix multiplication, just more quickly.
  2. If you generate values > 2^53, then you risk integer overflow with Int and might get undesired results like (2^40)^2 == 0 (unless you switch to a wider integer type, which is even slower). Float64 quantities will incur roundoff errors in this case, but the errors are much smaller and better behaved than integer overflow. Both cases are illustrated in the snippet below.
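A quick REPL illustration of both regimes:

julia> 2.0^53 + 1 == 2.0^53  # beyond 2^53, Float64 cannot distinguish adjacent integers
true

julia> (2^40)^2              # Int64 silently wraps around: 2^80 mod 2^64 == 0
0

julia> (2.0^40)^2            # Float64 is exact here, since 2^80 is a power of two
1.2089258196146292e24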

I think integers are generally best for indexing, for when you want modular arithmetic, and for when you want to play with bits (like in a random number generator).

If you want to do arithmetic on integers, Float64 will often be a better choice. As said above, it is exact for all integer values of magnitude ≤ 2^53, so rounding isn’t a concern.

But what if you do want to check for it?
Credit to @c42f for much of this example:

# Floating-point exception flags from C's fenv.h
# (fegetexcept/feenableexcept are GNU extensions, so this is glibc/Linux-specific):
const FE_INVALID = 0x1
const FE_DIVBYZERO = 0x4
const FE_OVERFLOW = 0x8
const FE_UNDERFLOW = 0x10
const FE_INEXACT = 0x20

# Query which floating-point exceptions currently trap:
fpexceptions() = ccall(:fegetexcept, Cint, ())

# Run f with the given exceptions enabled, restoring the previous mask afterwards:
function setfpexceptions(f, mode)
  prev = ccall(:feenableexcept, Cint, (Cint,), mode)
  try
    f()
  finally
    ccall(:fedisableexcept, Cint, (Cint,), mode & ~prev)
  end
end

# Trap on any inexact (rounding) floating-point operation while running f:
fpinexactexceptions(f::F) where {F} = setfpexceptions(f, FE_INEXACT)

x::Float64 = 1e12
fpinexactexceptions() do
  (x + 65.0) - x
end
x = 1e18
fpinexactexceptions() do
  (x + 65.0) - x
end

I get

julia> x::Float64 = 1e12
1.0e12

julia> fpinexactexceptions() do
         (x + 65.0) - x
       end
65.0

julia> x = 1e18
1.0e18

julia> fpinexactexceptions() do
         (x + 65.0) - x
       end
ERROR: DivideError: integer division error
Stacktrace:
 [1] +(x::Float64, y::Float64)
   @ Base ./float.jl:409 [inlined]
 [2] (::var"#3#4")()
   @ Main ./REPL[12]:2 [inlined]
 [3] setfpexceptions(f::var"#3#4", mode::UInt8)
   @ Main ./REPL[7]:4
 [4] fpinexactexceptions(f::var"#3#4")
   @ Main ./REPL[8]:1
 [5] top-level scope
   @ REPL[12]:1

“DivideError: integer division error” isn’t exactly a good error message. I think integer division is the only exception enabled by default, so when any floating-point exception gets triggered (the kernel delivers SIGFPE), that’s the message we see.

But here, we know that the actual cause was

julia> (x + 65.0) - x # oops
128.0

which was cheaply and conveniently caught!

Turning on exceptions is cheap: you only have to do it once, around the computation that must be exact.
And that’s it; the computations themselves run at full speed. No exception means no overflows, and you can have some trust in the answers.
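For instance, here is a hypothetical guard around an integer-valued Float64 product like the ones discussed above (one caveat: the trap mask is per-thread, so a multithreaded BLAS call may not raise the exception in its worker threads):

A = Float64.(rand(-1_000:1_000, 50, 50))

C = fpinexactexceptions() do
    A * A  # throws if any intermediate result rounds, i.e. exceeds 2^53
end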


But as the author of Octavian, I may as well add a benchmark using this PR branch:

julia> @time using Octavian
  0.246463 seconds (398.32 k allocations: 20.927 MiB, 4.20% compilation time)

julia> A = rand(-1_000:1_000, 200, 200);

julia> B = rand(-1_000:1_000, 200, 200);

julia> C = similar(A);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 420 samples with 1 evaluation.
 Range (min … max):  2.340 ms …  2.555 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.394 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.378 ms ± 29.314 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃█▃▃▂                         ▂█▆▇▅▄
  ▄██████▅▅▅▃▂▂▂▃▁▂▃▁▁▁▁▁▁▁▁▁▂▂▃▃▅███████▅▄▄▂▁▃▂▁▁▄▃▁▁▁▁▁▁▁▃ ▄
  2.34 ms        Histogram: frequency by time        2.44 ms <

 Memory estimate: 30.77 KiB, allocs estimate: 4.

julia> @benchmark matmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  90.901 μs … 112.142 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.696 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.340 μs ±   1.490 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▁▆█▆▂
  ▂▂▂▃▄▅▅▄▃▂▂▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▃▅█████▇▄▃▂▂▂▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃ ▃
  90.9 μs         Histogram: frequency by time         97.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 2.369e-3 / 91.302e-6
25.946857681102276

julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);

julia> @benchmark mul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.839 μs … 34.865 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.278 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.407 μs ±  1.052 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂▆█▅▁
  ▂▂▃▅█████▆▅▃▂▂▂▂▂▁▁▁▂▂▁▁▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  14.8 μs         Histogram: frequency by time        18.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.794 μs …  33.747 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.623 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.702 μs ± 534.346 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▃▄▆▇▇█▇▆▄▂▁
  ▂▂▂▂▃▄▄▅▇████████████▆▅▄▄▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  23.8 μs         Histogram: frequency by time           27 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Cf64 == C
true

Apparently Octavian ramps its thread usage up more slowly than MKL does. It does better single-threaded at this size:

julia> using LinearAlgebra; BLAS.set_num_threads(1);

julia> @benchmark mul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 4972 samples with 1 evaluation.
 Range (min … max):  186.730 μs … 278.905 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     200.794 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.320 μs ±   8.450 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▅                                ▂▆▂
  ▅██▆▂▁▂▄▅▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃███▃▂▂▂▄▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  187 μs           Histogram: frequency by time          214 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul_serial!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 5877 samples with 1 evaluation.
 Range (min … max):  163.874 μs … 232.942 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     164.972 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   166.591 μs ±   5.089 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█▆▂  ▄▅▄▁                          ▂▂                       ▁
  █████▆█████▅▄▁▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▅███▇▅▅▅▇█▇▆▄▅▅▃▆▃▅▅▄▅▆▄▃▅ █
  164 μs        Histogram: log(frequency) by time        190 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

For integers, some of the performance benefit of course also comes from multithreading:

julia> @benchmark matmul_serial!($C, $A, $B)
BenchmarkTools.Trial: 1365 samples with 1 evaluation.
 Range (min … max):  672.498 μs … 839.838 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     738.069 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   728.982 μs ±  28.663 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▅                         ▆█▁
  ██▁▁▁▁▃▁▅▄▃▅▄▃▄▅▄▃▄▁▄▃▅▄▄▆▆███▆▆▅▆▇▇▅▃▄▃▃▄▃▅▃▃▄▅▃▁▃▄▃▁▅▅▆▃▃▅▅ █
  672 μs        Histogram: log(frequency) by time        814 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Does MKL’s multithreaded advantage for Float64 above simply come from it using more threads than Octavian? We can check with LinuxPerf:

julia> foreachf!(f::F, N, args::Vararg{Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method)

julia> using LinuxPerf

julia> @time @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(mul!, 100000, Cf64, Af64, Bf64)
       end
  2.814530 seconds (33.44 k allocations: 2.648 MiB, 0.49% gc time, 0.69% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.38e+11   60.0%  #  2.8 cycles per ns
┌ instructions             2.37e+11   60.0%  #  1.7 insns per cycle
│ branch-instructions      1.67e+10   60.0%  #  7.1% of insns
└ branch-misses            1.36e+07   60.0%  #  0.1% of branch insns
┌ task-clock               4.96e+10  100.0%  # 49.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.91e+10   20.0%  # 21.0% of dcache loads
│ L1-dcache-loads          9.10e+10   20.0%
└ L1-icache-load-misses    7.22e+07   20.0%
┌ dTLB-load-misses         4.70e+03   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               9.09e+10   20.0%
┌ iTLB-load-misses         4.39e+05   40.0%  # 54.7% of iTLB loads
└ iTLB-loads               8.03e+05   40.0%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @time @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(matmul!, 100000, Cf64, Af64, Bf64)
       end
  2.483556 seconds (33.44 k allocations: 2.648 MiB, 0.77% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.22e+11   60.0%  #  2.8 cycles per ns
┌ instructions             1.87e+11   60.0%  #  1.5 insns per cycle
│ branch-instructions      6.60e+09   60.0%  #  3.5% of insns
└ branch-misses            3.50e+07   60.0%  #  0.5% of branch insns
┌ task-clock               4.39e+10  100.0%  # 43.9 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.74e+10   20.0%  # 28.3% of dcache loads
│ L1-dcache-loads          6.14e+10   20.0%
└ L1-icache-load-misses    1.80e+06   20.0%
┌ dTLB-load-misses         5.85e+02   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               6.14e+10   20.0%
┌ iTLB-load-misses         4.31e+03   40.0%  # 32.9% of iTLB loads
└ iTLB-loads               1.31e+04   40.0%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Apparently not. Both used 18 threads, but Octavian required less wall time, less CPU time, and fewer instructions, though it did have about 2.5x as many branch misses.
For larger arrays, MKL does better; Octavian’s kernels have a problem that limits how well they scale to huge matrices.
