Why is this Julia code considerably slower than Matlab

performance

#1

If I compare this Julia code:

module Testing

  function performance_test()
    range = 1:2000000;

    steering_vectors = complex(randn(4,11), randn(4,11));

    sum_signal = zeros(Complex{Float64}, 4, length(range));
    for i = 1:11
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range);
        # carrier_signal = exp(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180);
        steered_signal = steering_vectors[:,i] * carrier_signal.';
        sum_signal = sum_signal + steered_signal;
    end
    return sum_signal;
  end

end

against this Matlab code:

range = 1:2000000;

steering_vectors = complex(randn(4,11), randn(4,11));

sum_signal = zeros(4,length(range));
for i = 1:11
    carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);
    steered_signal = steering_vectors(:,i) * carrier_signal;
    sum_signal = sum_signal + steered_signal;
end

I get the following results:

  • Julia: 3.956940 seconds (207 allocations: 3.397 GB, 19.13% gc time)
  • Matlab: Elapsed time is 1.623101 seconds (Allocated Memory: 62714.44 Kb)

According to ProfileView.view(), the array transposition in Julia takes only a small amount of time compared to the rest.

How can I get similar results to Matlab?


#2

This could again be one of those multi-threading (Matlab's default) vs. single-threading (Julia's default) cases.

But one comment: don't use i as a loop counter in Matlab when you also use it as the imaginary unit (as in your exp(2i * pi * …)).


#3

Also, as the code stands you can speed up both versions considerably by moving the carrier_signal assignment outside the loop.
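For the simplified code as posted, a hoisted version might look like this sketch (in current Julia syntax, so transpose instead of .'; the function name is just for illustration):

```julia
# Sketch: in the simplified code above, carrier_signal does not depend on i,
# so the expensive exp() map can run once instead of eleven times.
function performance_test_hoisted()
    rng = 1:2_000_000
    steering_vectors = complex(randn(4, 11), randn(4, 11))
    # Computed once, outside the loop.
    carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), rng)
    sum_signal = zeros(Complex{Float64}, 4, length(rng))
    for i = 1:11
        # Outer product of the i-th steering vector with the carrier signal,
        # accumulated in place via the dotted .+=
        sum_signal .+= steering_vectors[:, i] * transpose(carrier_signal)
    end
    return sum_signal
end
```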


#4

Oh yeah, sure. I simplified my code down to the code above. In my actual code I cannot move carrier_signal outside the first loop, because it depends on i.


#5

When I use pmap for the first loop using all available processors with julia -p7:

module Testing

  function performance_test()
    range = 1:2000000;

    steering_vectors = complex(randn(4,11), randn(4,11));
    signals = pmap(1:11) do i
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range);
        steered_signal = steering_vectors[:,i] * carrier_signal.';
    end
    return sum(signals);
  end

end

I get:

  • Julia: 2.433277 seconds (10.97 k allocations: 2.504 GB, 18.81% gc time)

But this is still slower than the 1.679762 seconds from Matlab (with the loop variable i corrected).


#6

50% is spent in sin and cos, so perhaps look at https://github.com/JuliaMath/VML.jl or https://github.com/JuliaMath/Yeppp.jl. Do your timings for Matlab change a lot if you let it run with one thread?
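Another option that needs no external library: the exponent in the original code is purely imaginary, i.e. im * θ with real θ, and Julia provides cis(θ) for exactly cos(θ) + im * sin(θ). A sketch of the carrier computation rewritten this way (the helper name carrier_phase is made up):

```julia
# Sketch: exp(im * θ) for real θ is cos(θ) + im * sin(θ), which cis(θ)
# computes directly, avoiding the general complex exp.
carrier_phase(x) = 2 * pi * 1.023e6 * x / 4e6 + 40 * pi / 180
carrier_signal = map(x -> cis(carrier_phase(x)), 1:2_000_000)
```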


#7

Why is Julia allocating so much more? There must be an issue somewhere. The same algorithm should allocate similar amounts.


#8

Matlab is probably doing some in-place updates under the hood, while the Julia code doesn’t do that.


#9

Here’s how I would’ve written it:

module Testing

  using InplaceOps 
  
  function performance_test()
    range = 1:2000000
    range_transp = range'
    steering_vectors = complex(randn(4,11), randn(4,11))
    steered_signal = zeros(Complex{Float64}, 4, length(range))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    for i = 1:11
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range_transp)
        # carrier_signal = exp(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)
        @into! steered_signal = @view(steering_vectors[:,i]) * carrier_signal
        for j in eachindex(steered_signal)
          sum_signal[j] = sum_signal[j] + steered_signal[j]
        end
    end
    return sum_signal
  end

end

On my computer, that changes it from 7.5 seconds to 4 seconds. Depends on how much memory you have.
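For reference, a sketch of the same in-place idea in later Julia syntax, using LinearAlgebra.mul! instead of InplaceOps; the function name and signature are made up, and the inputs are passed in so nothing is random:

```julia
using LinearAlgebra  # for mul!

# Sketch: preallocate every buffer once; map! fills carrier_signal in place,
# mul! writes the outer product into steered_signal, and the dotted .+=
# accumulates into sum_signal without creating a temporary.
function sum_steered(steering_vectors, rng)
    carrier_signal = Vector{ComplexF64}(undef, length(rng))
    steered_signal = zeros(ComplexF64, size(steering_vectors, 1), length(rng))
    sum_signal = zeros(ComplexF64, size(steering_vectors, 1), length(rng))
    for i in axes(steering_vectors, 2)
        map!(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180),
             carrier_signal, rng)
        mul!(steered_signal, view(steering_vectors, :, i), transpose(carrier_signal))
        sum_signal .+= steered_signal
    end
    return sum_signal
end
```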

Since there’s a matrix multiplication in there, the difference between MKL and OpenBLAS might be showing up again?


#10

What are the allocations for that code?


#11

Goes from

  7.520755 seconds (732.83 k allocations: 3.417 GB, 23.79% gc time)

to

  4.818585 seconds (27.87 k allocations: 764.075 MB, 2.71% gc time)

I haven’t done much with it. I just pooped out some optimizations without much testing, so it can probably be made better.


#12

Also, to cut down on the allocations, one could use the in-place version of map, i.e.

map!(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), carrier_signal, range_transp)


#13

Or on v0.6 just use broadcast fusion.
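A sketch of what the fused carrier computation looks like with the dot syntax: every dotted call and operator fuses into a single elementwise loop that writes straight into the preallocated output.

```julia
rng = 1:2_000_000
carrier_signal = Vector{Complex{Float64}}(undef, length(rng))
# @. dots every call and operator; the whole right-hand side fuses into one
# loop writing into carrier_signal in place, with no intermediate arrays.
@. carrier_signal = exp(2im * pi * 1.023e6 * rng / 4e6 + 1im * 40 * pi / 180)
```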


#14

Full devectorization

    function performance_test()
      N = 2000000
      range = 1:N

      steering_vectors = complex(randn(4,11), randn(4,11))

      sum_signal = zeros(Complex{Float64}, 4, length(range))
      carrier_signal = zeros(Complex{Float64}, length(range))
      for i = 1:11
          for k = 1:N
              carrier_signal[k] = exp(2im * pi * 1.023e6 * range[k] / 4e6 + 1im * 40 * pi / 180)
          end

          for j = 1:4
              for k = 1:N
                  sum_signal[j,k] += steering_vectors[j,i] * carrier_signal[k]
              end
          end
      end
      return sum_signal
    end

brings it down on my computer from
  8.143687 seconds (207 allocations: 3.397 GB, 8.29% gc time)

to

  3.646278 seconds (11 allocations: 152.590 MB, 0.22% gc time)

Not really tested code so hopefully it computes the same thing.


#15

Wow!

This is code with a loop, and MATLAB is faster?

Which version of MATLAB are you using?
Did you put the code inside a function when testing in MATLAB?

I'd pay attention to the amount of memory consumed.
MATLAB is doing much better in that department.

@ChrisRackauckas,
Your example is exactly what shouldn't happen.
Very clean and simple code becomes messy to gain some performance, and it still loses to MATLAB on a loop.

I'd say, based on the memory numbers, that MATLAB intelligently creates more efficient code (less memory work).
Looking at the better memory allocation, I'd say it is not all because of multi-threading (easy to blame, but not the right conclusion).


#16

No, it's not a loop.

Yes, though allocating big arrays this way isn't the problem; it's cheap compared to the computation of exp(::Complex128).

Yes, it is multithreading. As already pointed out, most of the time is spent in the exp (or sin and cos) functions, and as I tested locally, MATLAB 2016b is certainly using 4 physical cores for that.
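To illustrate, here is a sketch of what explicit threading could look like in later Julia (run with JULIA_NUM_THREADS set; the function name and signature are made up). Each column of the result is independent, so the element loop parallelizes safely:

```julia
# Sketch: each k (column) is independent, so Threads.@threads can split the
# expensive exp() calls across cores, much like MATLAB does internally.
function sum_carrier_threaded(steering_vectors, rng)
    sum_signal = zeros(ComplexF64, size(steering_vectors, 1), length(rng))
    Threads.@threads for k in eachindex(rng)
        c = exp(2im * pi * 1.023e6 * rng[k] / 4e6 + 1im * 40 * pi / 180)
        for i in axes(steering_vectors, 2), j in axes(steering_vectors, 1)
            sum_signal[j, k] += steering_vectors[j, i] * c
        end
    end
    return sum_signal
end
```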


#17

After increasing the count to 101 instead of 11, I get 11 s on Matlab and 44 s on Julia. With Julia with 4 threads, I get 26 s with openlibm and 9 s with the system libm. The difference between the two libm versions is a known issue (https://github.com/JuliaLang/julia/issues/17395); it's unclear what's triggering it.


#18

@yuyichao,

I wasn't saying MATLAB doesn't use multi-threading.
I was just saying that this is MATLAB vs. Julia from a user's standpoint, so why would he hold this against MATLAB?

The thing is inside a loop, and MATLAB used to be very bad in those cases.

The low memory consumption tells me that, besides the multi-threading, MATLAB (maybe?) generated more efficient code.
I'm not an expert, but if memory use is low, it means less intermediate data is created; isn't that a symptom of something?
I'd like to see the same memory consumption and performance as in the devectorized code, or at least close to that.

Hopefully this kind of code will be (significantly) faster in Julia 0.6.


#19

Well, you said it wasn’t the issue but in fact it is.

It used to be bad for cheap loops with many iterations. This is nothing close to that.

Correct, but that's not the issue here; that's what the more generic loop-fusion syntax in 0.5 and 0.6 is for, which was already mentioned earlier in the thread.


#20

Not this kind of loop. The “devectorized” loops? Yes. Looping over a bunch of vectorized operations? That’s fine.

Note that we also haven’t re-run the MATLAB code… and our Julia code has halved in its time. So we’re likely about even now.