Int numerical calculation speed slower than Float?

Why is matrix multiplication with “Int” matrices so much slower than with “Float” matrices?

IDE: Jupyter
Julia version: 1.1.0

Code for Float:
s=1000
d1=rand(s,s)
d2=rand(s,s)
@time (d1*d2);
Result: 0.032192 seconds (6 allocations: 7.630 MiB)

Code for Int:
s=1000
d3=rand(Int,s,s)
d4=rand(Int, s,s)
@time (d3*d4);
Result: 1.131875 seconds (12 allocations: 7.630 MiB)

Float matrices use BLAS; Int matrices use a generic fallback. Making the fallback method multithreaded for large matrices would fix much of the problem.
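
One quick way to see this (a sketch, assuming the OpenBLAS backend that ships with Julia): only the Float64 product reacts to the BLAS thread count, because the Int fallback never calls into BLAS.

using LinearAlgebra
BLAS.set_num_threads(1); @time d1 * d2;  # Float64: slower, BLAS restricted to one thread
BLAS.set_num_threads(4); @time d1 * d2;  # Float64: fast again with more BLAS threads
BLAS.set_num_threads(4); @time d3 * d4;  # Int: unchanged, the generic fallback is single-threaded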

2 Likes

I think that for floats, BLAS is used, while for integers it is native Julia code. The latter could probably be made faster, but it is not a common use case so it is waiting for someone to do it.

(also, please quote code)

Do you think it would be worth it for mixed-type matmul to convert the arguments before multiplying? I think that should speed things up a lot (with some memory downsides). The other big thing the fallback needs is better cache-aware looping.

No, I would not convert. First, integers have specific overflow semantics in Julia, different from floats, so I am not sure what is intended and what isn’t.

Second (and more importantly), you really have to go out of your way to get a matrix with a non-concrete element type when writing idiomatic code, so I am not sure it is a common use case. I would leave it up to the user to promote if that is needed.
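
For completeness, user-side promotion could look like the sketch below (intmul_via_blas is just an illustrative name, not an existing function). It is only exact while every entry of the true product stays within the range where Float64 represents integers exactly (magnitude up to 2^53):

intmul_via_blas(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer}) =
    round.(Int, float(A) * float(B))   # promote, multiply via BLAS, round back

A = rand(1:100, 1000, 1000); B = rand(1:100, 1000, 1000)
@assert intmul_via_blas(A, B) == A * B   # small entries, so no overflow or precision loss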

We should have a specialized, very efficient Int multiplication kernel though.

4 Likes

One way to do this is LoopVectorization’s matmul example, which is faster than the generic fallback but still not as fast as floats (the kernel definitions are shown after the timings):

julia> C1 = Matrix{Int}(undef, M, N); A = rand(1:100, M, K); B = rand(1:100, K, N);

julia> C2 = similar(C1); C3 = similar(C1);

julia> @btime mygemmavx!($C1, $A, $B)
  77.412 μs (0 allocations: 0 bytes)

julia> @btime mygemm!($C2, $A, $B)
  245.869 μs (0 allocations: 0 bytes)

julia> @btime mul!($C3, $A, $B); # julia's generic_matmul
  164.278 μs (6 allocations: 336 bytes)

compared to Float64:

julia> @btime mygemmavx!($C1, $A, $B)
  14.599 μs (0 allocations: 0 bytes)

julia> @btime mygemm!($C2, $A, $B)
  290.296 μs (0 allocations: 0 bytes)

julia> @btime mul!($C3, $A, $B); # openblas, not MKL
  22.635 μs (0 allocations: 0 bytes)
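
For reference, the mygemmavx! and mygemm! kernels timed above are roughly the ones from LoopVectorization’s README (a sketch; the macro was spelled @avx in the version used here and is called @turbo in later releases):

using LoopVectorization

function mygemm!(C, A, B)   # plain triple loop, SIMD left to the compiler
    @inbounds @fastmath for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end

function mygemmavx!(C, A, B)   # same loop, vectorized and tiled by LoopVectorization
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end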

Note BTW that in your example, rand(Int, ...) produces lots of large numbers, so the products overflow:

julia> (float(d3) * float(d4)) .- (d3 * d4) |> extrema
(-4.2418818662328814e39, 4.4115065145744565e39)

julia> extrema(A)
(1, 100)

julia> (float(A) * float(B)) .- (A * B) |> extrema
(0.0, 0.0)

4 Likes

Using

julia> M, K, N = 72, 75, 71;

My results with Float64 are:

julia> BLAS.set_num_threads(1)

julia> @btime mygemmavx!($C1, $A, $B)
  7.380 μs (0 allocations: 0 bytes)

julia> @btime mygemm!($C2, $A, $B)
  231.900 μs (0 allocations: 0 bytes)

julia> @btime mul!($C3, $A, $B); # openblas (single-threaded)
  6.780 μs (0 allocations: 0 bytes)

And with Int:

julia> @btime mygemmavx!($C1, $A, $B)
  26.158 μs (0 allocations: 0 bytes)

julia> @btime mygemm!($C2, $A, $B)
  190.645 μs (0 allocations: 0 bytes)

julia> @btime mul!($C3, $A, $B); # julia's generic_matmul
  101.748 μs (6 allocations: 336 bytes)

So Int is about 3.5x slower than Float for me, while it is 5.3x slower for you.
With integers, it uses the vpmullq instruction for the multiplications. This instruction appears to be slow, with a reciprocal throughput of around 1.5-3 cycles, while the vpaddq instruction used for the additions is around 0.33 or 0.5.
The floating-point versions use fused multiply-add instructions that combine the multiplication and addition, with a reciprocal throughput of about 0.5.
You can think of “reciprocal throughput” as how many clock cycles it takes per completed instruction when a core is executing many of them simultaneously. It generally takes many more clock cycles to complete any given instruction (e.g., 4 for the fma instructions), but a core can work on many simultaneously, so the rate at which they’re completed can be much faster.
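
If you want to check which instructions your own CPU ends up using, one way (a sketch; dotk is just an illustrative reduction kernel, and the exact instruction names depend on your hardware, e.g. vpmullq requires AVX-512DQ) is to look at the native code for a simple dot product:

function dotk(a, b)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        s += a[i] * b[i]
    end
    return s
end

@code_native dotk(rand(Int, 256), rand(Int, 256))          # look for vpmullq + vpaddq in the loop
@code_native dotk(rand(Float64, 256), rand(Float64, 256))  # look for fused multiply-adds, e.g. vfmadd231pd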

3 Likes

Realize that implementing a highly optimized matrix–matrix multiplication is nontrivial. Optimized BLAS libraries typically involve tens of thousands of lines of code and painstaking performance tuning. While there is no theoretical reason why this cannot be replicated in Julia, it is a huge undertaking.
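
To give a flavour of what that involves, the sketch below shows only the very first layer, blocking the loops so tiles of the matrices stay in cache (blocked_matmul! and the block size bs are illustrative choices, not from any existing library); real BLAS implementations add packing of panels, hand-written SIMD microkernels, threading, and per-architecture tuning on top of this:

function blocked_matmul!(C, A, B; bs = 64)
    n, k = size(A)
    k2, m = size(B)
    @assert k == k2 && size(C) == (n, m)
    fill!(C, zero(eltype(C)))
    @inbounds for jj in 1:bs:m, kk in 1:bs:k, ii in 1:bs:n
        # accumulate one bs-by-bs tile of C at a time so A, B and C stay in cache
        for j in jj:min(jj + bs - 1, m), l in kk:min(kk + bs - 1, k)
            b = B[l, j]
            for i in ii:min(ii + bs - 1, n)
                C[i, j] += A[i, l] * b
            end
        end
    end
    return C
end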

1 Like

Thanks, interesting. And the reason for this difference in the first place is perhaps that there’s just more demand for vectorised floating-point operations, which justifies spending a lot of silicon on them?

I have no doubt. “We should” in open source can perhaps only mean “it would be appreciated”.

Not having a BLAS implementation for a type really hurts.

This is why Arraymancer (a BLAS written in Nim that, as I understand it, is generic over a bunch of element types) trashes everyone at integer matmul.

Out of curiosity, how come JuliaBLAS is closing in on BLAS for floats but is not fundamentally faster than generic matmul for ints?