# Why is this Julia code considerably slower than Matlab

#1

If I compare this Julia code:

```julia
module Testing

function performance_test()
    range = 1:2000000

    steering_vectors = complex(randn(4,11), randn(4,11))

    sum_signal = zeros(Complex{Float64}, 4, length(range))
    for i = 1:11
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range)
        # carrier_signal = exp(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)
        steered_signal = steering_vectors[:,i] * carrier_signal.'
        sum_signal = sum_signal + steered_signal
    end
    return sum_signal
end

end
```

against this Matlab code:

```matlab
range = 1:2000000;

steering_vectors = complex(randn(4,11), randn(4,11));

sum_signal = zeros(4, length(range));
for i = 1:11
    carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);
    steered_signal = steering_vectors(:,i) * carrier_signal;
    sum_signal = sum_signal + steered_signal;
end
```

I get the following results:

• Julia: 3.956940 seconds (207 allocations: 3.397 GB, 19.13% gc time)
• Matlab: Elapsed time is 1.623101 seconds (Allocated Memory: 62714.44 Kb)

Viewing the profile with `ProfileView.view()`, the array transposition in Julia takes only a small amount of time compared to the rest.

How can I get performance similar to MATLAB's?

#2

This could again be one of those multi-threading (MATLAB's default) vs. single-threading (Julia's default) cases.

One comment, though: don't use `i` as the loop counter in MATLAB when you also use it as the imaginary unit (as in your `exp(2i * pi * ...)`).
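On more recent Julia versions the elementwise work can also be multithreaded explicitly; a minimal sketch, assuming Julia 1.3+ started with threads enabled (e.g. `julia -t 4` or `JULIA_NUM_THREADS`), which postdates the Julia version discussed in this thread:

```julia
# Spread the expensive elementwise exp across threads.
# `threaded_carrier` is a hypothetical helper name, not from the thread.
function threaded_carrier(range)
    carrier_signal = zeros(Complex{Float64}, length(range))
    Threads.@threads for k in eachindex(range)
        carrier_signal[k] = exp(2im * pi * 1.023e6 * range[k] / 4e6 + 1im * 40 * pi / 180)
    end
    return carrier_signal
end
```

With a single thread this falls back to a plain loop, so the result is identical either way.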

#3

Also, as the code stands you can speed up both versions considerably by moving the `carrier_signal` assignment outside the loop.

#4

Oh yeah, sure. I simplified my code to the code above. In my actual code I cannot move `carrier_signal` outside the first loop, because it depends on `i`.

#5

When I use `pmap` for the first loop, using all available processors with `julia -p7`:

```julia
module Testing

function performance_test()
    range = 1:2000000

    steering_vectors = complex(randn(4,11), randn(4,11))
    signals = pmap(1:11) do i
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range)
        steering_vectors[:,i] * carrier_signal.'
    end
    return sum(signals)
end

end
```

I get:

• Julia: 2.433277 seconds (10.97 k allocations: 2.504 GB, 18.81% gc time)

But this is still slower than the 1.679762 seconds from MATLAB (with the loop variable `i` renamed).

#6

50% of the time is spent in `sin` and `cos`, so perhaps look at https://github.com/JuliaMath/VML.jl or https://github.com/JuliaMath/Yeppp.jl. Do your MATLAB timings change a lot if you let it run with one thread?
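Since the exponent in the `map` above is purely imaginary, one cheap change (a sketch, not benchmarked here) is `cis`, which computes `cos(θ) + im*sin(θ)` directly instead of going through the general complex `exp`:

```julia
# exp(im * θ) == cis(θ); the phase below matches the expression in the thread.
range = 1:2_000_000
carrier_signal = map(x -> cis(2pi * 1.023e6 * x / 4e6 + 40pi / 180), range)
```

The result is identical to the `exp`-based version, element for element.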

#7

Why is Julia allocating so much more? There must be an issue somewhere. The same algorithm should allocate similar amounts.

#8

Matlab is probably doing some in-place updates under the hood, while the Julia code doesn't do that.

#9

Here's how I would've written it:

```julia
module Testing

using InplaceOps

function performance_test()
    range = 1:2000000
    range_transp = range'
    steering_vectors = complex(randn(4,11), randn(4,11))
    steered_signal = zeros(Complex{Float64}, 4, length(range))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    for i = 1:11
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range_transp)
        # carrier_signal = exp(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)
        @into! steered_signal = @view(steering_vectors[:,i]) * carrier_signal
        for j in eachindex(steered_signal)
            sum_signal[j] = sum_signal[j] + steered_signal[j]
        end
    end
    return sum_signal
end

end
```
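As an aside, the `@into!` step can also be done without InplaceOps, using the standard library's in-place multiply (`mul!` on Julia 1.0+, `A_mul_B!` on the 0.5/0.6 versions discussed here). A small self-contained sketch with made-up sizes, not the thread's full benchmark:

```julia
using LinearAlgebra

v = randn(4) .+ im .* randn(4)        # one steering vector (illustrative size)
c = randn(6) .+ im .* randn(6)        # carrier signal (illustrative size)
out = zeros(Complex{Float64}, 4, 6)   # preallocated output buffer
mul!(out, v, transpose(c))            # in-place outer product, no temporaries
```

`mul!` writes the outer product directly into `out`, so the buffer can be reused across loop iterations.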

On my computer, that change takes it from 7.5 seconds to 4 seconds. How much it helps depends on how much memory you have.

Since there's a matrix multiplication in there, the difference between MKL and OpenBLAS might be showing up again?

#10

What are the allocations for that code?

#11

Goes from

```
7.520755 seconds (732.83 k allocations: 3.417 GB, 23.79% gc time)
```

to

```
4.818585 seconds (27.87 k allocations: 764.075 MB, 2.71% gc time)
```

I haven't done much with it. I just threw in some optimizations without much testing, so it can probably be made better.

#12

Also, to cut down on the allocations, one could use the in-place version of `map`, i.e.

`map!(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), carrier_signal, range_transp)`

#13

Or on v0.6 just use broadcast fusion.
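A sketch of what that fused broadcast looks like (dot syntax; `.=` needs a preallocated buffer, and on 0.6+ the whole right-hand side fuses into a single loop with no temporaries):

```julia
range = 1:2_000_000
carrier_signal = zeros(Complex{Float64}, length(range))
# One fused elementwise pass; no intermediate arrays are allocated.
carrier_signal .= exp.(2im * pi * 1.023e6 .* range ./ 4e6 .+ 1im * 40 * pi / 180)
```

This replaces the `map` call without changing the result.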

#14

Full devectorization

```julia
function performance_test()
    N = 2000000
    range = 1:N

    steering_vectors = complex(randn(4,11), randn(4,11))

    sum_signal = zeros(Complex{Float64}, 4, length(range))
    carrier_signal = zeros(Complex{Float64}, length(range))
    for i = 1:11
        for k = 1:N
            carrier_signal[k] = exp(2im * pi * 1.023e6 * range[k] / 4e6 + 1im * 40 * pi / 180)
        end

        for j = 1:4
            for k = 1:N
                sum_signal[j,k] += steering_vectors[j,i] * carrier_signal[k]
            end
        end
    end
    return sum_signal
end
```

brings it down on my computer from
`8.143687 seconds (207 allocations: 3.397 GB, 8.29% gc time)`
to
`3.646278 seconds (11 allocations: 152.590 MB, 0.22% gc time)`

Not really tested code, so hopefully it computes the same thing.
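One further micro-optimization on the same idea (a sketch, not benchmarked): the constant phase term never changes inside the loop, so `exp(im*(ω*k + φ))` can be split as `cis(φ) * cis(ω*k)`, hoisting the constant factor:

```julia
ω = 2pi * 1.023e6 / 4e6        # varying part of the phase per sample
phase = cis(40pi / 180)        # constant phase factor, hoisted out of the loop
carrier_signal = [phase * cis(ω * k) for k in 1:2_000_000]
```

Each element still equals the original `exp`-based expression.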

#15

Wow!

This is code with a loop and MATLAB is faster?

Which version of MATLAB are you using?
Did you put it inside a function when testing in MATLAB?

I'd pay attention to the amount of memory consumed.
MATLAB is doing much better in that department.

@ChrisRackauckas,
Your example is exactly what shouldn't happen:
very clean and simple code becomes messy to gain some performance, and it still loses to MATLAB on a loop.

I'd say, based on the memory figures, MATLAB intelligently creates more efficient code (less memory work).
Looking at the better memory allocation, I'd say it is not all because of multi-threading (easy to blame, but not the right conclusion).

#16

No, it's not a loop.

Yes, though allocating big arrays this way isn't the problem. It's cheap compared to the computation of `exp(::Complex128)`.

Yes, it is multithreading. As already pointed out, most of the time is spent in the `exp` (or `sin` and `cos`) functions, and as I tested locally on MATLAB 2016b, it is certainly using 4 physical cores for that.

#17

After increasing the count to 101 instead of 11, I get 11 s in MATLAB and 44 s in Julia. With Julia on 4 threads, I get 26 s with openlibm and 9 s with the system libm. The difference between the two libm versions is a known issue: https://github.com/JuliaLang/julia/issues/17395. It's unclear what's triggering it...

#18

I wasn't saying MATLAB doesn't use multi-threading.
I was just saying that, from a user's standpoint, this is MATLAB vs. Julia, so why would he hold this against MATLAB?

The thing is inside a loop, and MATLAB used to be very bad in those cases.

The low memory consumption tells me that, besides the multi-threading, MATLAB (maybe?) generated more efficient code.
I'm not an expert, but if the memory use is low, it means less intermediate data is created; isn't that a symptom of something?
I'd like to see the same memory consumption and performance as in the devectorized code, or at least close to that.

Hopefully this kind of code will be (significantly) faster in Julia 0.6.

#19

Well, you said it wasn't the issue, but in fact it is.

It used to be bad for cheap loops with many iterations. This is nothing close to that.

Correct, but that's not the issue here, and that's what the more generic loop-fusion syntax in 0.5 and 0.6 is for, as already mentioned earlier in the thread.

#20

Not this kind of loop. The "devectorized" loops? Yes. Looping over a bunch of vectorized operations? That's fine.

Note that we also haven't re-run the MATLAB code... and our Julia code has halved in its time. So we're likely about even now.