Why is this Julia code considerably slower than Matlab

performance

#1

If I compare this Julia code:

module Testing

  function performance_test()
    range = 1:2000000;

    steering_vectors = complex(randn(4,11), randn(4,11));

    sum_signal = zeros(Complex{Float64}, 4, length(range));
    for i = 1:11
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range);
        # carrier_signal = exp(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180);
        steered_signal = steering_vectors[:,i] * carrier_signal.';
        sum_signal = sum_signal + steered_signal;
    end
    return sum_signal;
  end

end

against this Matlab code:

range = 1:2000000;

steering_vectors = complex(randn(4,11), randn(4,11));

sum_signal = zeros(4,length(range));
for i = 1:11
    carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);
    steered_signal = steering_vectors(:,i) * carrier_signal;
    sum_signal = sum_signal + steered_signal;
end

I get the following results:

  • Julia: 3.956940 seconds (207 allocations: 3.397 GB, 19.13% gc time)
  • Matlab: Elapsed time is 1.623101 seconds (Allocated Memory: 62714.44 Kb)

According to ProfileView.view(), the array transposition in Julia takes only a small amount of time compared to the rest.

How can I get similar results to Matlab?


#2

This could again be one of those multi-threading (Matlab's default) vs. single-threading (Julia's default) cases.

But one comment: don't use i as a loop counter in Matlab when you also use it as the imaginary unit (as in your exp(2i * pi * …)).


#3

Also, as the code stands you can speed up both versions considerably by moving the carrier_signal assignment outside the loop.
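For the simplified code as posted, a hoisted version might look like this sketch (in current Julia syntax, so transpose instead of .'; the function name is just for illustration):

```julia
# Sketch: in the simplified code above, carrier_signal does not depend on i,
# so the expensive exp() map can run once instead of eleven times.
function performance_test_hoisted()
    rng = 1:2_000_000
    steering_vectors = complex(randn(4, 11), randn(4, 11))
    # Computed once, outside the loop.
    carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), rng)
    sum_signal = zeros(Complex{Float64}, 4, length(rng))
    for i = 1:11
        # Outer product of the i-th steering vector with the carrier signal,
        # accumulated in place via the dotted .+=
        sum_signal .+= steering_vectors[:, i] * transpose(carrier_signal)
    end
    return sum_signal
end
```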


#4

Oh yeah, sure. I simplified my code down to the code above. In my actual code I cannot move carrier_signal outside the first loop, because it depends on i.


#5

When I use pmap for the first loop using all available processors with julia -p7:

module Testing

  function performance_test()
    range = 1:2000000;

    steering_vectors = complex(randn(4,11), randn(4,11));
    signals = pmap(1:11) do i
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range);
        steered_signal = steering_vectors[:,i] * carrier_signal.';
    end
    return sum(signals);
  end

end

I get:

  • Julia: 2.433277 seconds (10.97 k allocations: 2.504 GB, 18.81% gc time)

But this is still slower than the 1.679762 seconds from Matlab (with the loop variable i corrected).


#6

50% is spent in sin and cos, so perhaps look at https://github.com/JuliaMath/VML.jl or https://github.com/JuliaMath/Yeppp.jl. Do your timings for Matlab change a lot if you let it run with one thread?
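Another option that needs no external library: the exponent in the original code is purely imaginary, i.e. im * θ with real θ, and Julia provides cis(θ) for exactly cos(θ) + im * sin(θ). A sketch of the carrier computation rewritten this way (the helper name carrier_phase is made up):

```julia
# Sketch: exp(im * θ) for real θ is cos(θ) + im * sin(θ), which cis(θ)
# computes directly, avoiding the general complex exp.
carrier_phase(x) = 2 * pi * 1.023e6 * x / 4e6 + 40 * pi / 180
carrier_signal = map(x -> cis(carrier_phase(x)), 1:2_000_000)
```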


#7

Why is Julia allocating so much more? There must be an issue somewhere. The same algorithm should allocate similar amounts.


#8

Matlab is probably doing some in-place updates under the hood, while the Julia code doesn’t do that.


#9

Here’s how I would’ve written it:

module Testing

  using InplaceOps 
  
  function performance_test()
    range = 1:2000000
    range_transp = range'
    steering_vectors = complex(randn(4,11), randn(4,11))
    steered_signal = zeros(Complex{Float64}, 4, length(range))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    for i = 1:11
        carrier_signal = map(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), range_transp)
        # carrier_signal = exp(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)
        @into! steered_signal = @view(steering_vectors[:,i]) * carrier_signal
        for j in eachindex(steered_signal)
          sum_signal[j] = sum_signal[j] + steered_signal[j]
        end
    end
    return sum_signal
  end

end

On my computer, that changes it from 7.5 seconds to 4 seconds. Depends on how much memory you have.
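For reference, a sketch of the same in-place idea in later Julia syntax, using LinearAlgebra.mul! instead of InplaceOps; the function name and signature are made up, and the inputs are passed in so nothing is random:

```julia
using LinearAlgebra  # for mul!

# Sketch: preallocate every buffer once; map! fills carrier_signal in place,
# mul! writes the outer product into steered_signal, and the dotted .+=
# accumulates into sum_signal without creating a temporary.
function sum_steered(steering_vectors, rng)
    carrier_signal = Vector{ComplexF64}(undef, length(rng))
    steered_signal = zeros(ComplexF64, size(steering_vectors, 1), length(rng))
    sum_signal = zeros(ComplexF64, size(steering_vectors, 1), length(rng))
    for i in axes(steering_vectors, 2)
        map!(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180),
             carrier_signal, rng)
        mul!(steered_signal, view(steering_vectors, :, i), transpose(carrier_signal))
        sum_signal .+= steered_signal
    end
    return sum_signal
end
```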

Since there’s a matrix multiplication in there, the difference between MKL and OpenBLAS might be showing up again?


#10

What are the allocations for that code?


#11

Goes from

  7.520755 seconds (732.83 k allocations: 3.417 GB, 23.79% gc time)

to

  4.818585 seconds (27.87 k allocations: 764.075 MB, 2.71% gc time)

I haven’t done much with it. I just pooped out some optimizations without much testing, so it can probably be made better.


#12

Also, to cut down on the allocations, one could use the in-place version of map, i.e.

map!(x -> exp(2im * pi * 1.023e6 * x / 4e6 + 1im * 40 * pi / 180), carrier_signal, range_transp)


#13

Or on v0.6 just use broadcast fusion.
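A sketch of what the fused carrier computation looks like with the dot syntax: every dotted call and operator fuses into a single elementwise loop that writes straight into the preallocated output.

```julia
rng = 1:2_000_000
carrier_signal = Vector{Complex{Float64}}(undef, length(rng))
# @. dots every call and operator; the whole right-hand side fuses into one
# loop writing into carrier_signal in place, with no intermediate arrays.
@. carrier_signal = exp(2im * pi * 1.023e6 * rng / 4e6 + 1im * 40 * pi / 180)
```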


#14

Full devectorization

    function performance_test()
      N = 2000000
      range = 1:N

      steering_vectors = complex(randn(4,11), randn(4,11))

      sum_signal = zeros(Complex{Float64}, 4, length(range))
      carrier_signal = zeros(Complex{Float64}, length(range))
      for i = 1:11
          for k = 1:N
              carrier_signal[k] = exp(2im * pi * 1.023e6 * range[k] / 4e6 + 1im * 40 * pi / 180)
          end

          for j = 1:4
              for k = 1:N
                  sum_signal[j,k] += steering_vectors[j,i] * carrier_signal[k]
              end
          end
      end
      return sum_signal
    end

brings it down on my computer from
  8.143687 seconds (207 allocations: 3.397 GB, 8.29% gc time)

to

  3.646278 seconds (11 allocations: 152.590 MB, 0.22% gc time)

Not really tested code so hopefully it computes the same thing.


#15

Wow!

This is code with a loop, and MATLAB is faster?

Which version of MATLAB are you using?
Did you put the code inside a function when testing in MATLAB?

I'd pay attention to the amount of memory consumed.
MATLAB is doing much better in that department.

@ChrisRackauckas,
Your example is exactly what shouldn't happen.
Very clean and simple code becomes messy to gain some performance, and it still loses to MATLAB on a loop.

I'd say, based on the memory numbers, that MATLAB intelligently creates more efficient code (less memory work).
Looking at the better memory allocation, I'd say it is not all because of multi-threading (easy to blame, but not the right conclusion).


#16

No, it's not a loop.

Yes, though allocating big arrays this way isn't the problem; it's cheap compared to the computation of exp(::Complex128).

Yes, it is multithreading. As already pointed out, most of the time is spent in the exp (or sin and cos) functions, and as I tested locally, MATLAB 2016b is certainly using 4 physical cores for that.
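To illustrate, here is a sketch of what explicit threading could look like in later Julia (run with JULIA_NUM_THREADS set; the function name and signature are made up). Each column of the result is independent, so the element loop parallelizes safely:

```julia
# Sketch: each k (column) is independent, so Threads.@threads can split the
# expensive exp() calls across cores, much like MATLAB does internally.
function sum_carrier_threaded(steering_vectors, rng)
    sum_signal = zeros(ComplexF64, size(steering_vectors, 1), length(rng))
    Threads.@threads for k in eachindex(rng)
        c = exp(2im * pi * 1.023e6 * rng[k] / 4e6 + 1im * 40 * pi / 180)
        for i in axes(steering_vectors, 2), j in axes(steering_vectors, 1)
            sum_signal[j, k] += steering_vectors[j, i] * c
        end
    end
    return sum_signal
end
```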


#17

After increasing the count to 101 instead of 11, I get 11 s on Matlab and 44 s on Julia. With Julia with 4 threads, I get 26 s with openlibm and 9 s with the system libm. The difference between the two libm versions is a known issue (https://github.com/JuliaLang/julia/issues/17395); it's unclear what's triggering it.


#18

@yuyichao,

I wasn't saying MATLAB doesn't use multi-threading.
I was just saying that this is MATLAB vs. Julia from a user's standpoint, so why would he hold this against MATLAB?

The thing is inside a loop, and MATLAB used to be very bad in those cases.

The low memory consumption tells me that, besides the multi-threading, MATLAB (maybe?) generated more efficient code.
I'm not an expert, but if memory use is low, it means less intermediate data is created; isn't that a symptom of something?
I'd like to see the same memory consumption and performance as in the devectorized code, or at least close to that.

Hopefully this kind of code will be (significantly) faster in Julia 0.6.


#19

Well, you said it wasn’t the issue but in fact it is.

It used to be bad for cheap loops with many iterations. This is nothing close to that.

Correct, but that's not the issue here; that's what the more generic loop-fusion syntax in 0.5 and 0.6 is for, which was already mentioned earlier in the thread.


#20

Not this kind of loop. The “devectorized” loops? Yes. Looping over a bunch of vectorized operations? That’s fine.

Note that we also haven’t re-run the MATLAB code… and our Julia code has halved in its time. So we’re likely about even now.