I moved

carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);

outside of the loop because it is static, and ran a comparison between Julia 0.6 and MATLAB 2016a.
MATLAB code:
function Untitled()
range = 1:2000000;
steering_vectors = complex(randn(4,11), randn(4,11));
sum_signal = zeros(4,length(range));
carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);
for i = 1:11
steered_signal = steering_vectors(:,i) * carrier_signal;
sum_signal = sum_signal + steered_signal;
end
Julia code, after fixing all the complaints from the compiler, which resulted in the fully "fused" code:
@inbounds function test_perf3()
range = 1:2000000;
steering_vectors = complex.(randn(4,11), randn(4,11));
sum_signal = complex.(zeros(4,length(range)));
carrier_signal = exp.(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)';
for i = 1:11
sum_signal .+= steering_vectors[:,i] .* carrier_signal;
end
end
Note that the code is very much the same between MATLAB and Julia.
OK, now for the results on my machine:
MATLAB: 1.35 seconds
Don't do that, since it has been repeatedly mentioned in the thread that this can't be done in the real code, and it's actually where most of the time is spent.
BTW: sincos, as a function that provides sin and cos for the same argument, should AFAIR be available as an operation in the FPU. I don't know which libm exposes this, but I'm pretty sure MATLAB detects the exp(1i * pi * something) pattern and uses it.
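To make that point concrete, here is a minimal Julia sketch (variable names are illustrative) showing that the complex exponential of a purely imaginary argument reduces to one cos and one sin of the real phase, which is exactly what Julia's built-in cis computes:

```julia
# Euler's formula: for real x, exp(im * x) == cos(x) + im * sin(x),
# so the complex exp of a purely imaginary argument needs only one
# cos and one sin evaluation, not a full complex exponential.
x = 40 * pi / 180

via_exp    = exp(im * x)              # complex exp of an imaginary argument
via_sincos = complex(cos(x), sin(x))  # explicit cos/sin decomposition
via_cis    = cis(x)                   # Julia's built-in shortcut

@assert via_exp ≈ via_sincos ≈ via_cis
```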
So it seems that the biggest gain comes from avoiding the exponential of a complex number and from working efficiently with memory.
I conclude this since @zsoerenm, in his post, managed to beat MATLAB using Julia (even his result without Yepp).
Which means, as I guessed at first, this was all about the efficiency of the code and not only MATLAB's multithreading.
Namely, MATLAB is better than Julia when the code is written in vectorized form not only because of multithreading but because it handles that form more efficiently.
@TsurHerman, you are using a very old MATLAB.
MATLAB R2016b (and R2017a, which is around the corner) is much, much more efficient.
Yet if the Julia code is written in the most efficient manner, it beats MATLAB.
Hence the conclusion is the same as in the previous thread.
Priority number 1 is to minimize the gap between the code Julia generates for the vectorized form and what Julia can do in the devectorized form.
There are multiple ways to improve performance, and observing that you can improve the performance in one way in Julia doesn't mean that it was the difference between the original Julia and MATLAB versions.
If the question is "Why is this Julia code considerably slower than Matlab", which is still the title, then the answer is absolutely that MATLAB is doing the most expensive part, which you correctly identified as the exp part, with multiple threads.
All of the methods mentioned in this thread that beat MATLAB with single-threaded Julia code use SIMD, which is not what MATLAB does. It is very easy to get a speedup from SIMD that beats multiple threads, since all the CPU models mentioned in this thread support no more than 4 hardware cores but do support <4 x double>, which basically means a 4x speedup (AFAICT, with the libmvec implementation that GCC uses, the speedup is pretty linear in the vector size). Basically, even though these methods give you a big speedup that is reasonably easy to get in Julia, they are still not the answer to the question.
That said, vectorization (SIMD) of function calls is a hard problem that is being worked on. It requires changes to LLVM so that it can recognize Julia functions to be vectorized.
@yuyichao, you keep saying that, but it is wrong.
I have really looked carefully at all the data spread here across various posts.
For the original code (vectorized form), MATLAB is faster in single-threaded mode (see @zsoerenm's post: he gets ~3.3 [Sec] for single-threaded MATLAB vs. ~3.95 [Sec] for Julia).
Moreover, @DNF, in his post answering my question, wrote that it seems most of the gain is due to calculating sin and cos in an efficient way (in place) instead of computing exp() of a complex argument.
So the facts are easy (we are talking single-thread mode only, both Julia and MATLAB):
MATLAB is faster in single-threaded mode in the vectorized code form (see the original post).
It seems that neither the MATLAB nor the Julia implementations of exp(), cos() and sin() are SIMD accelerated in their current form.
Julia, using the built-in sin() and cos() and avoiding the calculation of exp() for a complex argument, is faster in devectorized form.
The devectorized form mainly handles memory better and avoids intermediate variables.
Wrap up all those facts, and what you get is that MATLAB, in single-threaded mode for the vectorized form, was faster due to:
Better memory management.
Probably decomposing exp(1i * realArg) into complex(cos(realArg), sin(realArg)).
Regarding the first, it seems those issues are fixed in Julia 0.6 (though with the usage of `.`; I wonder if some optimization could be made even without it).
About the second remark, there is no need to change anything in Julia; as @giordano pointed out, it is already there (see cis()).
The user just needs to be aware of it (better than adding a heuristic, in my opinion, so the choice is always correct).
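As a sketch of the devectorized form being discussed (the function name and the use of cis are illustrative, following the constants from the code earlier in the thread), avoiding both the complex exp and the intermediate 4×N steered_signal arrays looks roughly like this:

```julia
# Devectorized sketch: compute the carrier phase per sample with cis
# (one cos and one sin) instead of exp of a complex argument, and
# accumulate directly into sum_signal without intermediate 4xN arrays.
function sum_signal_devec(steering_vectors, N)
    nrows, ncols = size(steering_vectors)
    sum_signal = zeros(Complex{Float64}, nrows, N)
    for k in 1:N
        c = cis(2 * pi * 1.023e6 * k / 4e6 + 40 * pi / 180)
        for i in 1:ncols, j in 1:nrows
            sum_signal[j, k] += steering_vectors[j, i] * c
        end
    end
    return sum_signal
end

steering_vectors = complex.(randn(4, 11), randn(4, 11))
sum_signal = sum_signal_devec(steering_vectors, 2_000_000)
```

This is only a sketch of the memory-handling idea, not the exact code benchmarked in the thread.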
I'll say it's been a pleasure to see one of these now-classic "MATLAB is faster than Julia" threads come to this resolution. Also very good signs for the development of Julia to see the 0.6 code even faster. And @TsurHerman's code for 0.6 is nicely readable and intuitive, while being fast.
I'm not saying those aren't the case. But I am saying that the multithreading is what causes the biggest difference. I also just don't count this small difference, because the multithreaded MATLAB result is actually also faster than what's in the original post by the same factor.
With MATLAB 2016b on the same machine, default options (i.e., using threading), and the original code, I get 1.57 s, so using the latest Julia with the recommended practices does result in faster execution than MATLAB. I could not find any difference between libm and openlibm on my machine.
Is that with -O3?
Would simplifying the calculation of carrier_signal help? I.e., using cis, possibly? Or using the simplified calculation I'd come up with for the imaginary part: muladd(1.606924642311179, k, 0.6981317007977318) (where k is the range value).
It was with the default -O2, but -O3 didn't change anything.
I'd be happy to try a simplified version, but please feed me a line to copy/paste in that case. Also, for fairness, the equivalent simplification should be tested in MATLAB in that case.
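For reference, the muladd form suggested above is just the original phase expression with the constants folded: 2π · 1.023e6 / 4e6 ≈ 1.606924642311179 and 40π / 180 ≈ 0.6981317007977318. A quick sketch (k chosen arbitrarily) to check the equivalence:

```julia
# Constant folding of the carrier phase:
#   2pi * 1.023e6 / 4e6  ≈ 1.606924642311179   (slope)
#   40 * pi / 180        ≈ 0.6981317007977318  (intercept)
k = 12345  # an arbitrary sample index from `range`

phase_original = 2 * pi * 1.023e6 * k / 4e6 + 40 * pi / 180
phase_folded   = muladd(1.606924642311179, k, 0.6981317007977318)

@assert phase_original ≈ phase_folded
@assert cis(phase_original) ≈ cis(phase_folded)
```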