How to do a matrix multiplication efficiently?

Hello!
I’m a Julia beginner and I’d like to know whether there is any way to make this code more memory efficient.

function myAllocTest()
    p = rand(50,1);
    A = rand(50,50);
    W = similar(A);
    D = similar(A);
    m = 2.0
    
    for i=1:50
        p = D[i:end, i]
        D[i:end, :] = D[i:end, :] - (m*p)*(p'*D[i:end,:])
        W[:, i:end] = W[:, i:end] - (W[:,i:end]*p)*(m*p)';
    end
end

Actually, my results for the code above are 1.473 s (1187000 allocations: 4.01 GiB).

So far I have tried:

  • adding the @views macro before the matrix multiplications, which gave better results:
    827.542 ms (1297000 allocations: 2.10 GiB)

  • making a mutable struct (the result wasn’t very good)

  • I heard that mul!(out, A, B) is really efficient, but I don’t know exactly how to use it in my example (see the sketch below).
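
For context: mul!(out, A, B) from the LinearAlgebra standard library computes A*B and writes the result into a preallocated out, so the product itself does not allocate a new matrix. A minimal sketch with made-up matrices:

using LinearAlgebra

A = rand(50, 50)
B = rand(50, 50)
out = similar(A)    # preallocate the destination once
mul!(out, A, B)     # writes A*B into out; the call itself allocates nothing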

I’m sure you know a few really good tips for accelerating this code (memory efficiency is the most important thing for me; 4 GiB or even 2 GiB of allocations is a really large amount).

Thank you very much for reading and for any clues! :smile:

I get

ERROR: LoadError: UndefVarError: R not defined

I’m sorry, one matrix was not updated in my post. I have fixed it and now it’s working fine. Note that you need to run using BenchmarkTools first.

What exactly are you timing?

As @sgaure points out, R is not defined within myAllocTest.
I assume you could add R = similar(A) within the function, but I couldn’t replicate the timings.

using BenchmarkTools

function myAllocTest()
    p = rand(50,1);
    A = rand(50,50);
    W = similar(A);
    D = similar(A);
    R = similar(A);
    m = 2.0
           
    for i=1:50
        p = D[i:end, i]
        D[i:end, :] = D[i:end, :] - (m*p)*(p'*R[i:end,:])
        W[:, i:end] = W[:, i:end] - (W[:,i:end]*p)*(m*p)';
    end
end


@btime myAllocTest() # 1.003 ms (839 allocations: 4.11 MiB)

A small portion of the time goes to allocating the matrices initially, which I don’t think you want to measure. You could separate this into two functions.

using BenchmarkTools

function mytest!(W,D,R,m)
    for i = 1:50
        p = D[i:end, i]
        D[i:end, :] .-= (m*p) * (p' * R[i:end,:])
        W[:, i:end] .-= (W[:, i:end] * p) * (m*p)'
    end
end

function bench()
    W, D, R = (rand(50,50) for _ in 1:3)
    m = 2.0
    @btime mytest!($W, $D, $R, $m)
end

bench() # 711.449 μs (710 allocations: 3.05 MiB)

I get

370.069 μs (449 allocations: 2.11 MiB)

when I put @views in front of the for loop, and 670 μs without @views.
But note that your matrices W and D are uninitialized (similar(A) creates an uninitialized array with the same shape and element type as A), so they can contain NaNs. On some CPUs and operating systems, depending on how things are compiled, NaNs can cause various performance problems.
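
A quick way to check for this (just a sketch; whether you actually see NaNs depends on whatever happened to be in the memory you get back):

A = rand(50, 50)
W = similar(A)      # uninitialized: contents are arbitrary, possibly NaN or Inf
any(isnan, W)       # may be true or false; it is not deterministic
W .= 0.0            # or use zeros(size(A)) / copy(A) if you need defined values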

And it is a good idea to separate allocation from computation in a test like this, as @eliassno suggests.


I’d like to minimize the memory allocation of this function. I’m timing:
@btime for i=1:1000 myAllocTest(); end

This is only an example function. I have to call a similar function many times, for example 1000 times or more (for a QR decomposition of a given matrix).

Oh, separating allocation from computation is a really good suggestion that I have to try, thank you.

There isn’t really a reason to benchmark your function 1,000 times: @btime already runs your function many times over to get a reliable estimate of runtime and per-call allocations (as you can see here, others in the thread are getting 2-4 MiB, while you are getting 1,000 times that, 2-4 GiB).
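
In other words, a single call is all you need; a sketch, reusing the myAllocTest defined above:

using BenchmarkTools

# @btime repeats the call itself and reports time and allocations per call,
# so there is no need for a manual 1:1000 loop
@btime myAllocTest()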

using LinearAlgebra, BenchmarkTools
function myAllocTest(A)
    W = copy(A); # let's not use uninitialized arrays here
    D = copy(A);
    m = 2.0
    pb = Vector{Float64}(undef, 50)
    Dp = Vector{Float64}(undef, 50)
    @views for i=1:50
        p = (pb[1:end-i+1] .= D[i:end, i])   # copy the current column into the preallocated buffer pb
        mul!(Dp, D[i:end, :]', p)            # Dp = D[i:end, :]' * p, computed in place
        D[i:end, :] .-= m .* p .* Dp'        # rank-1 update as a fused broadcast, no temporaries
        mul!(Dp, W[:, i:end], p)             # reuse Dp: Dp = W[:, i:end] * p
        W[:, i:end] .-= Dp .* m .* p';       # second rank-1 update, also in place
    end
    D, W
end
A = rand(50,50);
D1, W1 = myAllocTest(A);

Testing vs original version:

function myAllocTest_old(A)
    W = copy(A);
    D = copy(A);
    m = 2.0
    for i=1:50
        p = D[i:end, i]
        D[i:end, :] = D[i:end, :] - (m*p)*(p'*D[i:end,:])
        W[:, i:end] = W[:, i:end] - (W[:,i:end]*p)*(m*p)';
    end
    D, W
end
D2, W2 = myAllocTest_old(A);

Correctness test:

julia> D1, W1 = myAllocTest(A);

julia> D2, W2 = myAllocTest_old(A);

julia> D1[isnan.(D1)] .= 12312431234.12341;

julia> D2[isnan.(D2)] .= 12312431234.12341;

julia> W1[isnan.(W1)] .= 12312431234.12341;

julia> W2[isnan.(W2)] .= 12312431234.12341;

julia> D1 ≈ D2
true

julia> W1 ≈ W2
true

I set the (huge number of) NaN values to the same non-NaN number so that the comparison returns true where both results have NaNs (NaN ≈ NaN is false, so ≈ would otherwise fail). But I’d be willing to bet that your original code is not doing what you intended, because the results look like total junk.
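
As a side note, instead of overwriting the NaNs you could compare elementwise and treat NaN-vs-NaN positions as matching; a sketch with a hypothetical helper name:

# nan_aware_isapprox is just an illustrative name, not a library function
nan_aware_isapprox(X, Y) = all(
    (isnan(x) && isnan(y)) || isapprox(x, y) for (x, y) in zip(X, Y)
)

nan_aware_isapprox(D1, D2)    # true if non-NaN entries match approximately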

Benchmarks of three functions returning total junk (I’m adding foo(A) = nothing to the mix of useless functions):

julia> @benchmark myAllocTest($A)
BenchmarkTools.Trial:
  memory estimate:  40.25 KiB
  allocs estimate:  6
  --------------
  minimum time:     137.796 μs (0.00% GC)
  median time:      139.075 μs (0.00% GC)
  mean time:        140.731 μs (0.64% GC)
  maximum time:     1.523 ms (88.13% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark myAllocTest_old($A)
BenchmarkTools.Trial:
  memory estimate:  4.07 MiB
  allocs estimate:  734
  --------------
  minimum time:     645.335 μs (0.00% GC)
  median time:      666.032 μs (0.00% GC)
  mean time:        762.999 μs (11.55% GC)
  maximum time:     1.969 ms (56.75% GC)
  --------------
  samples:          6549
  evals/sample:     1

julia> @benchmark foo($A)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.016 ns (0.00% GC)
  median time:      0.020 ns (0.00% GC)
  mean time:        0.020 ns (0.00% GC)
  maximum time:     0.331 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

The point of foo is that it’s easy to be fast if you don’t care about being correct or useful…

On that note, you can’t just add @views to your original code, because that will change the answer. If p is a view into D, it changes when you update D[i:end, :], so you get a different answer when you then compute (W[:,i:end]*p)*(m*p)'. Notice how I explicitly copy the view into pb, so that it won’t change and I get the same (mostly NaN-filled) result.
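
A minimal illustration of the aliasing issue, with a small made-up matrix:

D = [1.0 2.0; 3.0 4.0]
p = @view D[:, 1]    # p aliases the first column of D
pcopy = copy(p)      # a copy taken before mutating D (this is what pb does above)
D[:, 1] .= 0.0       # updating D also changes what p sees
p                    # [0.0, 0.0]: the view changed along with D
pcopy                # [1.0, 3.0]: the copy still holds the original values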
