Benchmark MATLAB & Julia for Matrix Operations

This is your Julia code. The results: 970 microseconds (the @benchmark answer), 1550 microseconds, 1090 microseconds.
This is for a large 500×500 matrix. If you decrease the matrix size, the noise becomes much more pronounced!
In your validation example, you combined three operations (addition, svd, and eigen), so the noise is less apparent.

You are using seconds as the unit and linear plots, which is why you don't notice the noise.


using BenchmarkTools;
using LinearAlgebra;
using Statistics;

# The operation under test: a fused scalar multiply-and-add.
function SimpleMatrixOperation( mX, mY )
    mA = (0.2 .* mX) .+ (0.6 .* mY);
    return mA;
end

# Time the operation numIterations times with @elapsed and keep the minimum.
function MeasureTimeIterations( mX, mY, numIterations )
    vRunTime = zeros(numIterations);

    for ii = 1:numIterations
        vRunTime[ii] = @elapsed begin
            mA = SimpleMatrixOperation(mX, mY);
        end
    end

    runTime = minimum(vRunTime);

    return runTime;
end

# @belapsed runs a full benchmark and returns the minimum time in seconds.
# The arguments are interpolated with $ so they are not treated as globals.
function MeasureTimeBelapsed( mX, mY )
    runTime = @belapsed SimpleMatrixOperation($mX, $mY);

    return runTime;
end

# @benchmark returns the whole trial (minimum / median / mean / maximum).
function MeasureTimeBenchmark( mX, mY )
    runTime = @benchmark SimpleMatrixOperation($mX, $mY)

    return runTime;
end

# A single direct measurement with @elapsed.
function directMeasurement( mX, mY )
    runTime = @elapsed begin
        mA = (0.2 .* mX) .+ (0.6 .* mY);
    end

    return runTime;
end

# Repeat the direct measurement and keep the minimum.
function forDirectMeasurement( mX, mY, numIterations )
    vRunTime = Array{Float64}(undef, numIterations);
    for ii = 1:numIterations
        vRunTime[ii] = directMeasurement(mX, mY);
    end
    runTime = minimum(vRunTime);

    return runTime;
end

const matrixSize = 500;
const numIterations = 7;

const mX = randn(matrixSize, matrixSize);
const mY = randn(matrixSize, matrixSize);

runTimeA = MeasureTimeIterations(mX, mY, numIterations);
# runTimeB = MeasureTimeIterations(mX, mY, 10 * numIterations);
runTimeD = MeasureTimeBenchmark(mX, mY);
runTimeE = forDirectMeasurement(mX, mY, numIterations);

# @btime SimpleMatrixOperation($mX, $mY)
@benchmark SimpleMatrixOperation($mX, $mY)

But this is for Julia, which has a very good compiler. For MATLAB it is even worse; see the big fluctuation in the direct measurement results!


% validation
matrixSize = 500;
numIterations = 7; % also tried 700

mX = randn(matrixSize, matrixSize);
mY = randn(matrixSize, matrixSize);

%% timeit() based measurement
mRunTime = zeros(1, numIterations);
for kk = 1:numIterations
    fun = @() SimpleMatrixOperation(mX, mY);
    mRunTime(kk) = timeit(fun) * 1e6; % timeit() returns the median of its internal runs; convert to microseconds
end
timeItResult = min(mRunTime);

disp(timeItResult)

%% direct tic()/toc() measurement
mRunTime = zeros(1, numIterations);
for kk = 1:numIterations
    [mA, runTime] = DirectSimpleMatrixOperation(mX, mY);
    mRunTime(kk) = runTime * 1e6; % convert to microseconds to match timeit() above
end
directResult = min(mRunTime);

disp(directResult)
%%
function [mA, runTime] = DirectSimpleMatrixOperation( mX, mY )

tic();
mA = (0.2 * mX) + (0.6 * mY);
runTime = toc();

end
%%
function mA = SimpleMatrixOperation( mX, mY )
    mA = (0.2 * mX) + (0.6 * mY);
end
% Results for different runs (first value is the timeit() result, second is the direct tic()/toc() result):
% using numIterations = 7:
>> validation
       642.38
       530.5
>> validation
       648.18
       721.6
>> validation
       647.03
       583.9
% using numIterations = 700:
>> validation
       642.2
       457.5
>> validation
       644.03
       437

Also, the median is a better statistical measure than the minimum, since it shows what happens most of the time.
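
For example, with the trial object returned by @benchmark above, both statistics are one call away (a minimal sketch; trial.times holds BenchmarkTools' raw timings in nanoseconds):

using BenchmarkTools;
using Statistics;

trial = @benchmark SimpleMatrixOperation($mX, $mY);

minimum(trial)  # best case: the least noise from the OS and other processes
median(trial)   # typical case: what happens most of the time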


Could a moderator please move the following messages into a different thread (called “Techniques to Measure Run Time of a Function and Their Validity”):

  1. Benchmark MATLAB & Julia for Matrix Operations - #114 by Amin_Yahyaabadi
  2. Benchmark MATLAB & Julia for Matrix Operations - #115 by RoyiAvital (The lower part).
  3. Benchmark MATLAB & Julia for Matrix Operations - #116 by jling.
  4. Benchmark MATLAB & Julia for Matrix Operations - #117 by Amin_Yahyaabadi.
  5. Benchmark MATLAB & Julia for Matrix Operations - #118 by RoyiAvital.
  6. Benchmark MATLAB & Julia for Matrix Operations - #119 by Amin_Yahyaabadi.
  7. Benchmark MATLAB & Julia for Matrix Operations - #120 by RoyiAvital.
  8. Benchmark MATLAB & Julia for Matrix Operations - #121 by Amin_Yahyaabadi.

Thank You.

P. S.
Of course delete this message afterwards.

But the plot is of matrix addition, which doesn’t use BLAS?

Have you seen the code?
It runs 7 times and the measurement is the median.

Oops, I just saw that, sorry. However, it’s better for this kind of benchmarking to be done closer to the actual computation, and to not do anything between two runs (in particular, memory allocations). Just use BenchmarkTools, which does all of this itself and avoids some common pitfalls.

Regarding the code, please review it and point me to any place you think I should add it, and I will.
Just out of curiosity: I’d understand that for an existing array it is important to broadcast / fuse the assignment with .= . But I’d think that for a newly generated array it won’t matter. That isn’t the reason they are missing, though; I just wasn’t aware enough of it. I’m not very experienced with Julia.

I told you, just add dots. When you do a = b .+ c, you allocate an array for a, which is slow in Julia (slower than in MATLAB, which has an allocator optimized for that extremely common pattern). The way to get optimal performance in Julia is to systematically track down allocations. a .= b .+ c does not allocate anything (well, at least, no arrays).
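
A minimal sketch of the difference (the variable names here are illustrative):

using BenchmarkTools;

b = randn(1000, 1000); c = randn(1000, 1000);
a = similar(b);  # preallocated destination

@btime $a .= $b .+ $c;  # fused in-place broadcast: 0 allocations
@btime $b .+ $c;        # allocating broadcast: a fresh 1000×1000 array every call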


I am not sure about the reason for that, as I am not aware of what Julia is doing internally.
@ChrisRackauckas said it is a known open problem that fusion doesn’t generate code hinting to the compiler that there is no aliasing. Others said maybe MATLAB parallelizes this while Julia doesn’t (though it seems like a memory-bound operation, so I’m not sure parallelization can explain the result).

This is the line of the function: mA = (scalarA .* mX) .+ (scalarB .* mY);.
If there is a mistake on my side in how the function should be written, please advise and I will change it.

Regarding MATLAB, this is something they have invested in heavily over the last few releases: working with multi-term, element-wise assignments. They might even infer fusion by themselves in this case.

I agree it doesn’t look reasonable. I will run it again.

I compared my measurement technique to BenchmarkTools above. The results are comparable, within a reasonable accuracy for CPU timing.

The dots were there to begin with (with one exception I missed: mE = exp.(-(mA .^ 2)); has no . before the -).
What I wrote is that it didn’t make sense to me that you need a . before the assignment operator for a newly allocated array.

For instance, there is no reason mZ .= mX .+ mY; would be faster than mZ = mX .+ mY; in the case where mZ is not defined. That was my logic, and I even tried it in Julia.
It seems to be Julia’s logic as well, because you can only use mZ .= mX .+ mY; if mZ was preallocated.

In my code the left-hand side isn’t preallocated, hence there is no need for a . before =, only between operations on the right-hand side. The code was like that to begin with.

mA .= (scalarA .* mX) .+ (scalarB .* mY);

Note the .=.

@PetrKryslUCSD, I addressed that above.
First, it doesn’t make sense that it would make any difference in the case where mA isn’t preallocated. That was my logic.
But I’m surprised, because not only does it not make sense, it won’t even work in Julia (it generates an error). I’m not experienced, so I wasn’t aware, but you should be aware of that.

You can use the fused assignment .= only if the left side is preallocated.
In my tests, as I wrote above, it is not preallocated, neither in Julia nor in MATLAB.

Try this in Julia (fresh session, so nothing is defined):

matrixSize = 10;
mX = randn(matrixSize, matrixSize); mY = randn(matrixSize, matrixSize);
mZ .= mX .+ mY;

This will yield: ERROR: UndefVarError: mZ not defined.

Yes, that’s the point. You should preallocate your arrays, and then benchmark writing into them with dot assignments. That’s the representative case for optimized code. Otherwise you’re benchmarking the allocator, which is not very well optimized for this in Julia (as far as I understand).
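
A sketch of what that looks like with the operation from this thread (assuming the 500×500 mX and mY from above):

using BenchmarkTools;

mA = similar(mX);  # preallocate once, outside the benchmark

@btime $mA .= 0.2 .* $mX .+ 0.6 .* $mY;  # times only the computation, with no allocation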

No, it is not.
Preallocation pays off if you explicitly loop over the elements. Since the array is used only once, there is no gain here, as it is allocated in the assignment anyway.
What’s happening is an allocation plus binding a name to the result, exactly as if I had preallocated and used fused assignment.

Moreover, it is the same as in MATLAB. Since the idea here is to compare to MATLAB on linear-algebra-related operations, there are no real issues here.

You are right, testing the allocation of the result is a legitimate target of a benchmark. I misunderstood your test.


Given that the @. macro does not produce optimal results at the moment, I think it would be useful to also test Julia on these operations with explicit loops. I believe that the factor of around 3.3 (https://github.com/JuliaLang/julia/issues/28126) by which Julia is currently trailing Matlab on some benchmarks would be wiped out if the results were computed with loops.
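
For reference, a loop version of the operation from this thread might look like the following (a sketch; the function name and the @inbounds and @simd annotations are my additions):

function SimpleMatrixOperationLoop!(mA, mX, mY)
    # eachindex over all three arrays also checks that their shapes are compatible
    @inbounds @simd for ii in eachindex(mA, mX, mY)
        mA[ii] = 0.2 * mX[ii] + 0.6 * mY[ii]
    end
    return mA
end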

The point isn’t to show that it is possible to achieve the same speed as MATLAB no matter what. That will almost surely be the case (explicit loop + threading annotations etc). The point is that one wants to write convenient broadcasting functions and have that perform as well as possible. Until that is the case there is work to do.

It would be interesting to benchmark Strided.jl (https://github.com/Jutho/Strided.jl), a Julia package for strided array views and efficient manipulations thereof, and compare with Base Julia and MATLAB.
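
A sketch of such a comparison (assuming Strided.jl’s @strided macro, which routes the broadcast through its strided-view kernels and can multithread it when Julia is started with threads):

using Strided;
using BenchmarkTools;

mX = randn(2000, 2000); mY = randn(2000, 2000);
mA = similar(mX);

@btime $mA .= 0.2 .* $mX .+ 0.6 .* $mY;            # Base broadcast
@btime @strided $mA .= 0.2 .* $mX .+ 0.6 .* $mY;   # Strided.jl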


That’s not the only point.

I think, at least if everything follows basic logic, there should be no performance difference between:

matrixSize = 1000;
mZ = Array{Float64, 2}(undef, matrixSize , matrixSize);
mX = randn(matrixSize, matrixSize);
mY = randn(matrixSize, matrixSize);
mZ .= mX .+ mY;

To:

matrixSize = 1000;
mX = randn(matrixSize, matrixSize);
mY = randn(matrixSize, matrixSize);
mZ = mX .+ mY;

The allocation will be the same.
The only difference is that in the first you allocate memory and then operate on it as an array, while in the second you allocate an array, evaluate the right-hand side, and then bind the name mZ to that array.

Think in terms of machine operations: there should be no difference in the number of allocations between the two.
Hence it is not only that it is done the same way in MATLAB and Julia, which makes it a fair comparison; it also doesn’t make sense to preallocate for a one-time assignment.

If Julia did two allocations in the second case, there would be low-hanging fruit there.
But I find it hard to believe this is the case.

But nothing like experimenting.

I wrote this code:

using BenchmarkTools;

# Allocate the output explicitly, then fill it with a fused dot assignment.
function MatrixAdditionPreAllocate( matrixSize, mX, mY )
    mA = Array{Float64, 2}(undef, matrixSize, matrixSize);
    mA .= mX .+ mY;
    return mA;
end

# Let the broadcast allocate the output itself.
function MatrixAdditionDirectAssignment( matrixSize, mX, mY )
    mA = mX .+ mY;
    return mA;
end

matrixSize = 2000;
mX = randn(matrixSize, matrixSize);
mY = randn(matrixSize, matrixSize);

@btime MatrixAdditionPreAllocate($matrixSize, $mX, $mY);
@btime MatrixAdditionDirectAssignment($matrixSize, $mX, $mY);

The results are:

13.691 ms (2 allocations: 30.52 MiB)
13.708 ms (2 allocations: 30.52 MiB)

This is expected, as they are identical!

Remark:
Things would have been different if the loop were inside the functions and each iteration were timed individually. But just like in my benchmark, we time whole functions; the loop is around the function, not inside it.

I certainly agree with the idea of benchmarking Strided.jl arrays.

I don’t agree with your first statement. No doubt you know, as do many who frequent this forum, that Julia can beat Matlab. However, the consumers of these benchmarks may not have the benefit of having used Julia in the past, perhaps they are just interested in finding out about it. When they see tests in which Julia lags behind Matlab, lags behind quite a bit at that, their first reaction will probably be: Why bother with Julia?

The situation is not dissimilar to that discussed many times in this forum where a benchmark was run in the global scope, and correspondingly Julia’s performance was dismal.


I think it is dissimilar. Telling people not to use global variables is usually a quite small change, and it is, in general, good programming practice anyway. Manually rewriting a broadcast to a loop is nothing of the sort; it arguably makes the code harder to read, and it is a significant effort.


To rewrite a = 9 * b + 0.3 * (c - d) as a double loop is significant effort? Kind of like this?

for i in 1:N, j in 1:m
    a[i, j] = 9 * b[i, j] + 0.3 * (c[i, j] - d[i, j])
end

Yes! In passing, you should use eachindex; who knows what N and m are?

Just benchmarking is useless because it just tells you that things are different. But the purpose of benchmarking is to try to uncover why things are different.

Here, a good portion of this difference might be this broadcast lowering performance bug. We know from DiffEq that you take a massive hit if you don’t work around it on longer broadcast expressions (hence DiffEqBase.@.. https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/master/src/diffeqfastbc.jl). I would be curious to know whether this falls into this case and, if it does, whether it’s the whole story or whether there’s something more. That would tell us what to do about it, which to me is more interesting than the fact that there’s a difference. My guess is there’s something like this causing the majority of the difference, with the last <2x being MKL doing fancy footwork (since that seems to be what most of these benchmarks come out to after being optimized).
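
A sketch of what the workaround looks like in use (assuming DiffEqBase.@.. mirrors the syntax of Base’s @. macro, per the linked source):

using DiffEqBase;

mA = similar(mX);
DiffEqBase.@.. mA = 0.2 * mX + 0.6 * mY  # fused and in-place, sidestepping the slow lowering path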


You need to take off the developer goggles sometimes. Users don’t care why things are different; they care that things are different, and will choose what software to use based on it.


OK.

for i in eachindex(a)
    a[i] = 9 * b[i] + 0.3 * (c[i] - d[i])
end

Is this really a big deal?