I have a matrix (size 35x35) that I build at the beginning of my code, in which does not change later in the code. This matrix is used to be multiped with other matrices in a for loop. Since I care about the performance, I tested two cases:
1- I built it to be a normal matrix and tested the execution time,
2- I built it to be βStaticArrayβ however, the  execution time increased is a little.
Is there any type or structure in Julia that can I build this matrix to be to give better performance in this case?
(1) How did you benchmark this? (2) How did you use StaticArrays?
From the StaticArrays readme:
Note that in the current implementation, working with large StaticArrays puts a lot of stress on the compiler, and becomes slower than Base.Array as the size increases. A very rough rule of thumb is that you should consider using a normal Array for arrays larger than 100 elements.
You have over 1,000 elements in your array.
Maybe MKL.jl or this (or some of the alternatives, listed there) are appropriate for you:
Whatβs the size of the other matrices and how do you multiply them together (the order could matter for speed)? Is there some special structure to them, such as them sparse?
LoopVectorization.jl will be faster than the alternatives, except Octavian, which it should match, but Octavian will probably have longer compile times.
I suggest making the size static, i.e. MArrays or StrideArrays, so that these can specialize on the size.
I am new for LoopVectorization.jl. Could you please modify my MWE?
In = 35;
ik = @MMatrix zeros(10*3,In);
SS = SMatrix{In,In,Float64,In*In}([zeros(1,In); Matrix(1.0I, In-1, In-1) zeros(In-1,1)])
@time begin
for t in 0:2000
for i in 1:10
ik[(i*3-2):(i*3),:] = ik[(i*3-2):(i*3),:]*SS;
end #for i in 1:10
end #for t in 0:2000
end #@time
The first thing is to define globals as constant:
const In=35
const ik=
const SS β¦
What is your actual application?
Multiplying 3 x 35 matrices with the 35 x 35?
Multiplying a 30 x 35 matrix with a 35 x 35 matrix would be a lot faster than multiplying 10 instances of 3 x 35 matrices by a 35 x 35.
Iβm wondering why not just do @benchmark with a 3 x 35 * 35 x 35.
Additionally, could you perhaps transpose these?
35 x 35 * 35 x 3 would be a lot faster than 3 x 35 * 35 x 35.
using LinearAlgebra
const In = 35;
const ik = zeros(10*3, In);
const SS = [zeros(1,In);  Matrix(1.0I, In-1, In-1) zeros(In-1,1)]
@time begin
    for t in 0:0.1:200000
        for i in 1:10
            ik[(i*3-2):(i*3),:] = ik[(i*3-2):(i*3),:]*SS;
        end
    end
end
@time begin
    for t in 0:0.1:200000
        for i in 1:10
            ik[(i*3-2):(i*3),:] = ik[(i*3-2):(i*3),:]*SS;
        end
    end
end
Gives me 26s, running it twice makes no differenceβ¦
I would usually listen to what Chris says, but it seems to me a sparse array is in order, with the structure simple and seemingly good here:
julia> SS = sparse(SS)
35Γ35 SparseMatrixCSC{Float64, Int64} with 34 stored entries:
β ’β‘β β β β β β β β β β β β β β β β 
β β β ’β‘β β β β β β β β β β β β β β 
β β β β β ’β‘β β β β β β β β β β β β 
β β β β β β β ’β‘β β β β β β β β β β 
β β β β β β β β β ’β‘β β β β β β β β 
β β β β β β β β β β β ’β‘β β β β β β 
β β β β β β β β β β β β β ’β‘β β β β 
β β β β β β β β β β β β β β β ’β‘β β 
β β β β β β β β β β β β β β β β β ’β 
I couldnβt actually confirm your 10 sec. I gave up waiting (so I commented out the outer useless loop, something seems missingβ¦):
julia> using SparseArrays
julia> function test(SS, ik)
         #for t in 0:0.0001:200000
           for i in 1:10
              ik[(i*3-2):(i*3),:] = ik[(i*3-2):(i*3),:]*SS;
           end #for i in 1:10
         #end #for t in 0:0.0001:200000
       end
test (generic function with 2 methods)
julia> @time test(SS, ik)
  1.030346 seconds (2.88 M allocations: 132.260 MiB, 4.51% gc time, 99.99% compilation time)
julia> SS = sparse(SS)
julia> @time test(SS, ik)
  0.280222 seconds (337.76 k allocations: 17.257 MiB, 99.97% compilation time)
julia> ik = sparse(ik)  # might not apply for this matrix, is it for sure supposed to be zeros...?
30Γ35 SparseMatrixCSC{Float64, Int64} with 0 stored entries:
julia> @time test(SS, ik)
  0.000050 seconds (120 allocations: 16.406 KiB)
using MKL, LinearAlgebra
const In = 35;
const ik = zeros(10*3, In);
const SS = [zeros(1,In);  Matrix(1.0I, In-1, In-1) zeros(In-1,1)]
@time begin
    for t in 0:0.1:200000
        for i in 1:10
            ik[(i*3-2):(i*3),:] = ik[(i*3-2):(i*3),:]*SS;
        end
    end
end
@time begin
    for t in 0:0.1:200000
        for i in 1:10
            ik[(i*3-2):(i*3),:] = ik[(i*3-2):(i*3),:]*SS;
        end
    end
end
This gives me 16s instead of 26s without MKL.
Anyway, questions aside, following your example:
using StaticArrays, LoopVectorization
@inline function AmulB!(C, A, B)
    @turbo for n β indices((B,C), 2), m β indices((A,C), 1)
         Cmn = zero(eltype(C))
         for k β indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
     end
end
function iterate!(ik, SS)
    for i in 0:9
        A = MMatrix{3,size(ik,2),eltype(ik)}(undef)
        @inbounds for j = axes(ik,2), k = 1:3
            A[k,j] = ik[i*3+k, j]
        end
        AmulB!(@view(ik[(i*3+1):(i*3+3),:]), A, SS)
    end #for i in 1:10
end
In = 35;
ik = @MMatrix zeros(10*3,In);
SS = MMatrix{In,In,Float64,In*In}([zeros(1,In); Matrix(1.0I, In-1, In-1) zeros(In-1,1)]);
@time begin
for t in 0:2000
iterate!(ik, SS)
end #for t in 0:2000
end #@time
I get:
julia> @time begin
       for t in 0:2000
       iterate!(ik, SS)
       end #for t in 0:2000
       end #@time
  0.111660 seconds (284.39 k allocations: 16.077 MiB, 95.54% compilation time)
julia> @time begin
       for t in 0:2000
       iterate!(ik, SS)
       end #for t in 0:2000
       end #@time
  0.005728 seconds
0.005728 seconds is a lot better than the 16s reported above.
I would strongly recommend doing things differently.
Anyway, to illustrate my point with transposing the computations:
julia> A35x35 = @MMatrix randn(35,35);
julia> C3x35 = @MMatrix randn(3,35);
julia> A3x35 = @MMatrix randn(3,35);
julia> A35x3 = @MMatrix randn(35,3);
julia> C35x3 = @MMatrix randn(35,3);
julia> @benchmark AmulB!($C3x35, $A3x35, $A35x35)
BenchmarkTools.Trial: 10000 samples with 540 evaluations.
 Range (min β¦ max):  203.233 ns β¦ 285.480 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
 Time  (median):     204.417 ns               β GC (median):    0.00%
 Time  (mean Β± Ο):   204.889 ns Β±   1.771 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%
             βββ
  βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
  203 ns           Histogram: frequency by time          209 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark AmulB!($C35x3, $A35x35, $A35x3)
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
 Range (min β¦ max):  89.336 ns β¦ 146.673 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
 Time  (median):     90.070 ns               β GC (median):    0.00%
 Time  (mean Β± Ο):   90.225 ns Β±   0.851 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%
          ββββ ββββ
  βββββββββββββββββββββββ
ββ
βββββββββββββββββββββββββββββββββββ β
  89.3 ns         Histogram: frequency by time         92.3 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
So itβs well worth changing your data representation.
Yes, however the 10 instances here represents the rows of a vector of structures in my real code. So, it is hard for me for the current moment to change the configuration.
Why is it faster? is it because Julia column based? I am not sure if I can do the transpose. However, is there a trick to do that?
Thank you very much for your explanation. I will try to utilize it if possible in my code. What does the @view means here AmulB!(@view(ik[(i*3+1):(i*3+3),:]), A, SS)?
B[inds,:] allocates a new array, and copies the elements over to the new array.
This is slow, and β even worse β writing into the copy will not change the original array.
But changing the original array is exactly our intent, so we create a view.
For A, we do need to make an explicit copy, because the source and destination are not allowed to alias.
I create a copy MArray manually, because this MArray is stack allocated, making it very fast (note that there were 0 allocations from my second @time; stack allocations donβt count).
The intent there was also to make sure the array A was statically sized.
Yes, but thatβs not a very helpful answer, as the next question would be βWhy does Julia being column based make it faster?β
Note that most people will see a much smaller difference than I reported.
I benchmarked on a CPU with AVX512, allowing it to operate on 8 Float64 at a time.
But with only 3 rows in a column major array, it could only operate on 3 of them at a time.
With 35 rows, it could operate on 8 of them at a time instead (except for the last batch of rows, which is still only 3), hence over 2x faster.
However, on an M1 mac:
julia> @benchmark AmulB!($C3x35, $A3x35, $A35x35)
BenchmarkTools.Trial: 10000 samples with 520 evaluations.
 Range (min β¦ max):  216.346 ns β¦ 253.687 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
 Time  (median):     216.667 ns               β GC (median):    0.00%
 Time  (mean Β± Ο):   217.130 ns Β±   2.078 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%
  β
βββ                     ββ                                   β
  ββββββ
β
β
ββββββ
β
ββββββ
ββββ
ββββββ
βββββββββ
βββββββββββββββββ
β
βββ β
  216 ns        Histogram: log(frequency) by time        229 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark AmulB!($C35x3, $A35x35, $A35x3)
BenchmarkTools.Trial: 10000 samples with 780 evaluations.
 Range (min β¦ max):  161.912 ns β¦ 575.427 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
 Time  (median):     162.126 ns               β GC (median):    0.00%
 Time  (mean Β± Ο):   162.665 ns Β±   7.655 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%
  βββ
                 β                                         β
  ββββββ
βββ
β
β
ββββββ
ββββββββββββββ
ββββ
ββ
βββββββββ
ββββ
βββ
ββ
ββββ
ββ β
  162 ns        Histogram: log(frequency) by time        173 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
The advantage is much smaller.
That is because the M1 can only operate on up to 2 numbers at a time, so operating on 3 rows means it requires 2 operations, 1/4 of which is wasted, instead of 1 operation of which 5/8 is wasted.
The M1 can do twice as many operations at a time, which is why it gets comparable performance in the 3x35 case, but its operations doing 1/4 of the work is why it takes twice as long for the 35x3 case.
Not exactly sure why the iterate! example is much slower on the M1:
julia> @time begin
       for t in 0:2000
       iterate!(ik, SS)
       end #for t in 0:2000
       end #@time
  0.009869 seconds
It would be difficult with StaticArrays.jl, which makes transposes eager.
Itβd be easier with StrideArrays.jl. With them, a transpose is lazy (like with ordinary Julia matrices). A lazy transpose of a column major array is basically a row major array.
Basically, Pythonβs MKLβs symbols clash with those from the MKL used by MKL.jl.
Should be fixed by USE LBT 5.0 and the new MKL ILP64 suffixes by ViralBShah Β· Pull Request #104 Β· JuliaLinearAlgebra/MKL.jl Β· GitHub
This might be fixed on Julia 1.8-beta3, give it a tryβ¦
Excuse me, why did you declare ik to be const, though, it is changing in the for-loop?
const refers to the variable binding which isnβt changed when mutating an object.
Thanks for your reply! Am, could you please give a bit more details of some examples?