Why is this Julia code considerably slower than Matlab

btw: sincos, a function that computes sin and cos of the same argument, should AFAIR be available as an FPU operation. I don’t know which libm exposes it, but I’m pretty sure Matlab detects the exp(1i * pi * something) pattern and uses it.

It’s a slow x87 instruction, so I don’t think anything is using it.

Calling exp(im * x) is extremely common in my line of work. An expim(::Real) function or something like that would be quite useful.


You’re looking for the cis function.
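(For reference, cis(x) returns cos(x) + im*sin(x), i.e. the same value as exp(im*x) without going through the general complex exponential; a quick REPL check:)

julia> cis(0.5)
0.8775825618903728 + 0.479425538604203im

julia> exp(im * 0.5)
0.8775825618903728 + 0.479425538604203im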


The difference is marginal. Sometimes I see no difference, and sometimes it saves less than a second. That’s odd, because the timing of

carrier_signal .= exp.(im.*α)

vs

carrier_signal .= complex.(cos.(α), sin.(α))

indicates that I should be saving 2-3 sec.
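(For what it’s worth, a minimal benchmarking sketch of the three variants; the phase vector α built from the thread’s parameters and the use of BenchmarkTools are assumptions here, not something from the original posts:)

using BenchmarkTools

α = collect(2π * 1.023e6 .* (1:2_000_000) ./ 4e6 .+ 40π / 180)  # hypothetical phase vector
carrier_signal = zeros(Complex{Float64}, length(α))

@btime $carrier_signal .= exp.(im .* $α)                # complex exponential
@btime $carrier_signal .= complex.(cos.($α), sin.($α))  # explicit sin/cos
@btime $carrier_signal .= cis.($α)                      # cis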

Indeed I am! Thanks!

I also see that there is an AppleAccelerate.cis! function. Using it drops the single-threaded runtime to 325 ms.
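(I haven’t verified the exact signature, but assuming it follows the package’s usual f!(out, in) in-place convention, the call would look something like:)

using AppleAccelerate  # macOS only

# assumed in-place convention: write cis.(α) into the preallocated buffer
AppleAccelerate.cis!(carrier_signal, α)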


So it seems that the biggest gains come from avoiding the exponential of a complex argument and from handling memory efficiently.

I conclude this because @zsoerenm, in his post, managed to beat MATLAB using Julia (even in his result without Yeppp).

Which means, as I guessed at first, this was all about the efficiency of the code and not only about MATLAB’s multi-threading.

Namely, MATLAB is better than Julia when the code is written in vectorized form not only because of multi-threading, but also because it executes that form more efficiently.

@TsurHerman, you are using a very old MATLAB.
MATLAB R2016b (and R2017a, which is around the corner) is much, much more efficient.

Yet if the Julia code is written in the most efficient manner, it beats MATLAB.

Hence the conclusion is the same as in the previous thread.
Priority number one is to minimize the gap between the code Julia generates for the vectorized form and what Julia can do in devectorized form.

See v0.6 + future PRs for stack-allocated views. This has already been mentioned.

And just to be clear, the logic here is wrong.

There are multiple ways to improve performance, and observing that you can improve performance in one way in Julia doesn’t mean that way accounted for the difference between the original Julia and MATLAB versions.

If the question is “Why is this Julia code considerably slower than Matlab”, which is still the title, then the answer is absolutely that MATLAB is doing the most expensive part, which you correctly identified as the exp part, with multiple threads.

All of the methods mentioned in this thread that beat MATLAB with single-threaded Julia code use SIMD, which is not what MATLAB does. It is very easy to get a speedup from SIMD that beats multiple threads, since all of the CPU models mentioned in this thread support no more than 4 hardware cores but also support <4 x double>, which basically means a 4x speedup (AFAICT, with the libmvec implementation that gcc uses, the speedup is pretty linear in vector width). So even though these methods give you a big speedup that is reasonably easy to achieve in Julia, they are still not the answer to the question.
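(To make the <4 x double> point concrete, here is a hypothetical loop, not from this thread, whose generated IR you can inspect; on an AVX2 machine the inner loop should show <4 x double> operations:)

# hypothetical axpy-style example, just to show SIMD code generation
function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

x = rand(1024); y = rand(1024)
@code_llvm axpy!(y, 2.0, x)   # look for `<4 x double>` in the vector loop body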

That said, vectorization (SIMD) of function calls is a hard problem that is being worked on. It requires changes to LLVM so that it can recognize Julia functions as candidates for vectorization.


@yuyichao, you keep saying that, but it is wrong.
I have looked carefully at all the data spread across the various posts here.

For the original code (vectorized form), MATLAB is faster even single-threaded (see @zsoerenm’s post: he gets ~3.3 sec for single-threaded MATLAB vs. ~3.95 sec for Julia).
Moreover, @DNF, in his post answering my question, wrote that most of the gain seems to come from calculating sin and cos in an efficient way (in place) instead of computing exp() of a complex argument.

So the facts are simple (we are talking about single-threaded mode only, for both Julia and MATLAB):

  1. MATLAB is faster single-threaded on the vectorized code form (see the original post).
  2. It seems that neither MATLAB’s nor Julia’s implementations of exp(), cos(), and sin() are SIMD-accelerated in their current form.
  3. Julia, using the built-in sin() and cos() and avoiding the computation of exp() for a complex argument, is faster in devectorized form.
  4. The devectorized form mainly handles memory better and avoids intermediate variables.

Wrap all those facts together and what you get is that MATLAB, single-threaded on the vectorized form, was faster due to:

  1. Better memory management.
  2. Probably decomposing exp(1i * realArg) into complex(cos(realArg), sin(realArg)).

Regarding the first point, it seems this is fixed in Julia 0.6 (though that requires the . syntax; I wonder whether some of the optimization could be applied even without it).

Regarding the second point, there is no need to change anything in Julia: as @giordano pointed out, it is already there (see cis()).
The user just needs to be aware of it (better, in my opinion, than adding a heuristic, so the choice is always correct).
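Concretely, the rewrite the user needs to know about is just (α being a real phase vector, as earlier in this thread):

carrier_signal .= exp.(im .* α)  # general complex exponential
carrier_signal .= cis.(α)        # same result, computed from cos/sin directly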


I’ll say it’s been a pleasure to see one of these now-classic “Matlab is faster than Julia” threads come to this resolution. It is also a very good sign for the development of Julia to see the 0.6 code be even faster. And @TsurHerman’s code for 0.6 is nicely readable and intuitive, while being fast :slight_smile:


I’m not saying those aren’t factors. But I am saying that multi-threading is what causes the biggest difference. I also just don’t count this small difference, because the multi-threaded MATLAB result is also faster than what’s in the original post by roughly the same factor.


I couldn’t resist running this too, on 0.6. Modifying the original code to use broadcasting and threading for the main loop:

function performance_test()
    range = 1:2000000

    steering_vectors = complex(randn(4,11), randn(4,11)*2)

    sum_signal = zeros(Complex{Float64}, 4, length(range))
    steered_signal = zeros(Complex{Float64}, 4, length(range))
    carrier_signal = zeros(Complex{Float64}, length(range))
    Threads.@threads for i = 1:11
        carrier_signal .= exp.(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)
        steered_signal .= steering_vectors[:,i] .* carrier_signal.'
        sum_signal .+= steered_signal
    end
    return sum_signal
end

Julia timings:

  0.999100 seconds (6.83 k allocations: 275.003 MiB, 1.72% gc time)
  1.080039 seconds (131 allocations: 274.674 MiB, 8.55% gc time)

With Matlab 2016b on the same machine, default options (i.e. using threading), and the original code, I get 1.57 s, so using the latest Julia with the recommended practices does result in faster execution than Matlab. I could not find any difference between libm and openlibm on my machine.


Is that with -O3?
Would simplifying the calculation of carrier_signal help? I.e. using cis, possibly, or using the simplified calculation I’d come up with for the imaginary part: muladd(1.606924642311179, k, 0.6981317007977318) (where k is the range value).
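(For anyone checking, those muladd constants are just the phase terms from the original expression:)

julia> 2π * 1.023e6 / 4e6   # coefficient multiplying the range value k
1.6069246423111792

julia> 40π / 180            # constant phase offset
0.6981317007977318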

It was with the default -O2, but -O3 didn’t change anything.

I’d be happy to try a simplified version, but please feed me a line to copy/paste in that case :wink: Also, for fairness, the equivalent simplification should then be tested in Matlab as well.

I also tested v0.6.0-pre.alpha.34 and found something interesting.
Shouldn’t the following be the fastest version, given loop fusion?

function test_perf4()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vectors = complex.(ones(4,11), ones(4,11))

    sum_signal = zeros(Complex{Float64}, 4, length(range))
    for i = 1:11
        sum_signal .+= steering_vectors[:,i] .* cis.((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180));
    end
    return sum_signal
end

However, this gives me:

BenchmarkTools.Trial: 
  memory estimate:  137.33 MiB
  allocs estimate:  41
  --------------
  minimum time:     14.918 s (0.07% GC)
  median time:      14.918 s (0.07% GC)
  mean time:        14.918 s (0.07% GC)
  maximum time:     14.918 s (0.07% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

However, when I remove the dot after cis, I get the following:

BenchmarkTools.Trial: 
  memory estimate:  641.00 MiB
  allocs estimate:  899
  --------------
  minimum time:     3.818 s (2.12% GC)
  median time:      3.852 s (3.95% GC)
  mean time:        3.852 s (3.95% GC)
  maximum time:     3.887 s (5.75% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This gives me the following warning, though:

WARNING: cis{T <: Number}(x::AbstractArray{T}) is deprecated, use cis.(x) instead.

So it seems that loop fusion still needs some adjustments.
BTW, these tests were done with -O3.
EDIT: The same goes for exp() and exp.().
EDIT: I filed an issue for this: https://github.com/JuliaLang/julia/issues/20875

Loop fusion will actually be slower here. This is clearer if we simplify the code a bit. Suppose we are doing X .= colvec .* cis.(rowvec) (i.e. combining a column vector and a row vector to make a matrix). This is lowered to broadcast!((x,y) -> x * cis(y), X, colvec, rowvec), which is essentially equivalent to:

for j=1:length(rowvec), i=1:length(colvec)
    X[i,j] = colvec[i] * cis(rowvec[j])
end

The problem here is that if X is m×n, then we end up calling the cis function mn times.

If, instead, we use

tmp = cis.(rowvec)
X .= colvec .* tmp

it only calls the cis function n times, at the expense of requiring more storage.

This is the sort of space/time tradeoff that is hard to automate. As a rule of thumb, however, if you are doing a broadcast operation combining a row vector and a column vector, it will be faster if you apply any expensive operations to the row and column vectors before doing the broadcast combination.

In this particular case, I would suggest doing something like:

function test_perf5()
    rangeᵀ = (1:2000000)'
    rowtemp = similar(rangeᵀ, Complex128)  # row-shaped complex buffer
    steering_vectors = complex.(ones(4,11), ones(4,11)) # placeholder for actual vectors?

    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        rowtemp .= cis.(1.6069246423111792 .* rangeᵀ .+ 0.6981317007977318)
        sum_signal .+= steering_vectors[:,i] .* rowtemp
    end
    return sum_signal
end

You could also maybe try @views, or simply allocate steering_vectors as an array of vectors, to avoid allocating a copy in steering_vectors[:,i]. You can also use @. before the for and then omit all of the dots.
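A minimal sketch combining those two suggestions (test_perf6 is a hypothetical name; untested, written against the 0.6 syntax used above):

function test_perf6()
    rangeᵀ = (1:2000000)'
    rowtemp = similar(rangeᵀ, Complex128)
    steering_vectors = complex.(ones(4,11), ones(4,11))

    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        # @. adds the dots for us; the row vector is still computed first
        @. rowtemp = cis(1.6069246423111792 * rangeᵀ + 0.6981317007977318)
        # @views turns steering_vectors[:,i] into a view, avoiding a copy
        @views sum_signal .+= steering_vectors[:,i] .* rowtemp
    end
    return sum_signal
end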


Could I ask the original poster: how did you determine the memory usage (62.7M) of your Matlab function? I don’t know how to do this with the Matlab profiler, and I couldn’t find anything about it in the Matlab docs.

Thanks,
Steve Vavasis

62.7M wasn’t correct; I had to put the code into a function.

I used:

profile -memory on
...
profile report

Thanks for the info about how to get memory usage data from the Matlab profiler! Apparently this is an undocumented feature of Matlab. I just played with it a bit and seemed to get inconsistent results, so perhaps it’s unreliable. Or perhaps I’m using it incorrectly. Hard to say, since it’s undocumented…