Benchmark MATLAB & Julia for Matrix Operations

@Balinus, do you have any conclusions from the results?
Do you think they’d change using another estimator?

Anyhow, what interests me is why MATLAB was sensitive to the number of functions (or the size of the code, I’m not sure) in the same file.

I wonder if Julia is also sensitive to that.

It just means that the error associated with the median estimator is 25% higher than with the mean estimator. I haven’t looked at your results, so I can’t comment on that specifically.

You can estimate the error of the mean by doing, say, 30-50 iterations of your tests and estimating the standard deviation (s) of your parent distribution. Then, when you do your 5 iterations (N = 5), you can estimate your error:

E = s / √N, which gives E = s / √5 for N = 5, using your estimated standard deviation s.

If you use the median, just multiply your E by 1.25.

If you do that, you’ll be able to see whether the difference between Julia and MATLAB is inside the confidence interval of the estimators (i.e. whether your results are significant with respect to your choice of estimator and your sample size).

edit - In other words, for small samples the mean is usually a better choice if your aim is to estimate the location of your distribution. Considering that the 1st run in Julia will be the warm-up, you should discard it as an outlier (i.e. it doesn’t share the same parent distribution as the rest).
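
In code, the procedure might look like this (a minimal sketch; pilot and sample5 are hypothetical vectors of measured run times — 30-50 calibration runs and the 5 runs you actually report):

# Hypothetical measurements; replace with your own timings
pilot   = 1.0 .+ 0.1 .* randn(40)   # 30-50 calibration runs
sample5 = 1.0 .+ 0.1 .* randn(5)    # the N = 5 reported runs

s = std(pilot)                # estimated standard deviation of the parent distribution
N = length(sample5)
E = s / sqrt(N)               # standard error of the mean, E = s / √N
E_med = 1.25 * E              # approximate standard error of the median

m  = mean(sample5)
ci = (m - 1.96 * E, m + 1.96 * E)   # ~95% normal-theory interval for the mean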

p.s. - Does the forum support LaTeX notation?

If this were true in general, then no one would ever use the median for anything. It might be true for some specific statistical distributions, but it’s definitely not true for others, and I’m pretty sure it’s not true for the measurement noise in software benchmarking. As I’ve argued, it is quite likely that the minimum estimator is optimal for this kind of measurement.


Well, the median is more robust in the presence of outliers. The standard error will usually still be higher, though. The true error of the estimator is always unknown in any case.

I can’t comment on the optimal estimator for software benchmarking though.

The 25% figure applies to data with a sample size sufficient for the Central Limit Theorem, an asymptotic result, to hold. Exactly what constitutes “sufficient sample size” will depend on the distribution of the population of benchmark times. If the population has a reasonably symmetric distribution, then a sample size of 10-15 may be sufficient. But benchmark times are usually skewed, in which case the distribution of the median will be more stable than that of the mean until you get to very large sample sizes.

Try a simulation from, say, an Exponential:

using Distributions

# Draw N samples of size n from an Exponential(1) distribution and record the
# sample mean and sample median of each, to compare the spread of the two estimators.
function simulatemeanmedian(N, n)
    means = Float64[]
    medians = Float64[]
    samp = zeros(n)                  # reusable buffer for each sample
    for i in 1:N
        push!(means, mean(rand!(Exponential(), samp)))
        push!(medians, median(samp))
    end
    (means, medians)
end

I get

julia> srand(1234321);

julia> mns, meds = simulatemeanmedian(100000, 10);

julia> StatsBase.mean_and_std.([mns, meds])
2-element Array{Tuple{Float64,Float64},1}:
 (0.999299,0.316037)
 (0.745321,0.310037)

Interesting! :slight_smile:

Benchmark results are indeed skewed.

The discussion of whether to use the minimum, median, or mean as a representative value of the sample of benchmark times is not particularly relevant, certainly not the mean vs. median part. We are discussing results from very powerful software that is particularly well suited to data science and data visualization. Instead of worrying about 100-year-old theoretical results that may or may not apply, just look at the data. See

for an example of visualizing benchmark results.

Well, they’re skewed because I used an Exponential distribution to generate the samples. As I said in a later comment, it is best to just look at the data.

I was more referring to some data I had here (i.e. “looking at the data”):

You need only to inspect the output of something like this:

julia> A = rand(100,100);
julia> @benchmark sin.(A)
BenchmarkTools.Trial: 
  memory estimate:  79.25 kb
  allocs estimate:  29
  --------------
  minimum time:     115.013 μs (0.00% GC)
  median time:      120.040 μs (0.00% GC)
  mean time:        139.268 μs (5.61% GC)
  maximum time:     2.674 ms (93.73% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

to see that benchmark data is far from Gaussian in many (most?) realistic cases. The fact that run times can never be negative suggests that the distribution is always skewed, no matter what.
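
One quick way to quantify this on real data (a sketch; it assumes the times field of a BenchmarkTools.Trial and the skewness function from StatsBase):

using BenchmarkTools, StatsBase

A = rand(100, 100)
t = @benchmark sin.($A)   # interpolate A to avoid global-variable overhead
skewness(t.times)         # a positive value confirms the right-skewed distribution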

I find the code for LeastSquaresRunTime to be not very representative of least squares calculations. The matrix mA, called the “model matrix” by statisticians, is square, so the calculations are just inefficient ways of computing mA \ vB and mA \ mB. For a least squares calculation you would usually want the model matrix to be rectangular with more rows than columns. In Julia the backslash operator provides the least squares fit directly in such cases. I don’t know about MATLAB.

In general it is not a good idea to write an expression like

vX = (mA.' * mA) \ (mA.' * vB)

because you create a positive-definite symmetric matrix, mA.' * mA, but do not retain the information about the special structure. The backslash operator has to check the matrix to determine its symmetry and positive-definiteness.
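
If you want to keep the normal-equations form anyway, one option is to factor the Gram matrix explicitly (a sketch in the 0.5-era API used in this thread; mA and vB as above):

G = mA.' * mA         # Gram matrix: symmetric positive-definite by construction
C = cholfact(G)       # Cholesky factorization; no structure detection needed
vX = C \ (mA.' * vB)  # solved with two triangular solves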

Having said that, I have seen far too many statisticians use the R equivalent of

vX = inv(mA.' * mA) * (mA.' * vB)

which is even worse, for me to complain about this usage.

@dmbates,

I totally agree.
Usually for LS problems the model matrix is a “tall” matrix.

Yet I wanted to be consistent with all the other tests, so I used a square matrix in all of them.
I think the result in the square case also suggests that in practice, for tall matrices, the results will be in favor of MATLAB.

@Balinus,
I’d be even happier to talk about estimators than Julia / MATLAB :-).
First, in order to ask which estimator is better, one needs to define precisely which parameter is being estimated.

The sample mean, for example, is a great estimator of the mean of the distribution of the samples (actually it is the maximum-likelihood estimator for many common distributions).
Yet not necessarily for the distribution of run times (you can think of it as either a mixture of two distributions, due to the first warm-up sample, or as something with a long tail, and then things get messy).

But again, the actual parameter to estimate here is the performance factor between MATLAB and Julia, not the run time itself. In that case the median is good enough and, more importantly, very robust.
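
For instance (a minimal sketch; matlab_times and julia_times are hypothetical vectors of measured run times for the same test):

# matlab_times, julia_times: hypothetical vectors of measured run times
speedfactor = median(matlab_times) / median(julia_times)

# Rough bootstrap interval for the factor, resampling each side with replacement:
boot = [median(rand(matlab_times, length(matlab_times))) /
        median(rand(julia_times, length(julia_times))) for _ in 1:10000]
ci = quantile(boot, [0.025, 0.975])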

Any thoughts on the issue I raised before?

Aside from that, this is often a terrible way to solve least-squares problems because it squares the condition number, so it exacerbates the sensitivity to roundoff errors and other problems. Doing mA \ vB solves it by a much more accurate method. (I think the default for mA \ vB when mA is non-square is to use QR factors … if you want to do least-squares for many right-hand sides you should create a QR = qrfact(mA) object and do QR \ vB for each right-hand side.)
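
For example (a sketch of that suggestion; mA is a hypothetical tall matrix, vB1 and vB2 two right-hand sides):

mA  = rand(1000, 10);  vB1 = rand(1000);  vB2 = rand(1000)  # hypothetical data
QR  = qrfact(mA)   # factor once, the expensive O(m·n²) step
vX1 = QR \ vB1     # each least-squares solve reuses the stored factors
vX2 = QR \ vB2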


@stevengj,

All you said is known and correct (though sometimes the Gram matrix is pre-calculated for other purposes, hence the form above is what actually gets computed).
Doing mA \ vB is done in the other tests; no point in doing it again.

In this case I wanted to see what happens when one solves the LS problem in its vanilla form (for instance, does \ check for a PSD matrix?).

Guys, it is not about numerical analysis, it is about performance of real-world code.
Since it is the same code in Julia and MATLAB (in MATLAB a user is also advised to just do mA \ vB), the test makes sense.


I feel like some of this discussion has been a bit weird. This tests similar code doing the same thing. Don’t fall into the trap of people who say the recursion tests on the Julia speed page are useless. This is testing very common operations in a very normal way for a lot of users. The syntax and applications (although I would love to see other setups than square matrices) seem quite standard for “scientific computing”.

On the face of it, Julia comes out as a loser in these tests, and the numbers should, if nothing else, be seen as a challenge to the community and developers. It will be interesting to see what Julia is doing here, and what can be optimized.

However, it is very important to be 100% sure which versions of components like BLAS and LAPACK are used. Could you output that in the README? It seems to be almost the most important information for benchmarks like this, but I can’t find it anywhere.


@pkofod, I totally agree with your point of view.
This should be a target to aim for.

Moreover, it might require collaboration with the people behind OpenBLAS, as Intel MKL isn’t an “out of the box” option.

I will add MATLAB’s BLAS and LAPACK versions.
In MATLAB I can run version -blas and version -lapack to get the version of the respective library.
How can I extract those versions in Julia?

Thank You.

julia> versioninfo()
Julia Version 0.5.0
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

edit: oh it doesn’t have versions…
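
One possible workaround exists on Julia versions newer than the 0.5 shown above (so treat the availability as an assumption about your setup): LAPACK exposes its version through the ILAVER routine, which LinearAlgebra wraps, and Julia 1.7+ can also list the loaded BLAS/LAPACK libraries.

julia> using LinearAlgebra              # newer Julia versions only, not 0.5

julia> LinearAlgebra.LAPACK.version()   # wraps LAPACK's ILAVER routine
v"3.9.0"                                # example output; yours will differ

julia> BLAS.get_config()                # Julia 1.7+: lists the loaded BLAS/LAPACK libraries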

See e.g. this paper, which argues that for most purposes you should use the minimum time. (Similar conclusions were also drawn in other benchmarking contexts, e.g. the classic lmbench software.)


I updated the results with some optimizations to Julia and updated the tests (added K-Means).

@ChrisRackauckas, I also added “Julia Optimized” files.
Feel free to optimize those (Don’t change the time measurement, just the code).

Julia is now at parity with MATLAB for the Quadratic Matrix Form and K-Means tests.
I need to verify that.

Thank You.

It’s mostly because of implicit parallelism. I noted above that when I ran some of the tests with MATLAB’s implicit threading off (like the broadcasting multiplication/addition tests), Julia was much faster than MATLAB. That’s why I am suggesting an extra line should be added:

maxNumCompThreads(1)
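
and, to make the comparison symmetric, the matching knob on the Julia side to pin its BLAS to a single thread (the call lives in Base.LinAlg.BLAS in the 0.5 era, LinearAlgebra.BLAS on current versions):

BLAS.set_num_threads(1)   # restrict OpenBLAS to one thread, matching MATLAB above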

The major problem with this benchmark is that, while it’s supposed to be enlightening, it just confounds too many details. Instead, each benchmark should isolate what’s really going on. Take, for example, linear least squares. It does the calculation:

vX = (mA.' * mA) \ (mA.' * vB)

But there’s already a benchmark for \ and *, so this just adds noise. Hint: the difference looks to be around those.

If you put in a line showing how much the multithreading contributes, people would understand what all this actually means. It would make an actual point and would underline how much implicit multithreading matters… instead, this is not shown at all in the benchmarks.