Why is this Julia code considerably slower than Matlab

No, I wasn’t saying this is the issue.
I was saying you can hold it against MATLAB.
And I doubted it was the whole story (I thought something was strange about the large memory allocation, which suggests intermediate [unnecessary?] data was being created).

This is the MATLAB code and the Julia code.
The user here cares about how much time it takes to run it.

But I may be wrong, and it is worth getting to the bottom of it, as I’m curious as well and we’re here to learn.

I made this MATLAB function:

function [ runTime, mB ] = TimeExpLargeMat( matSize, numThreads )
%TIMEEXPLARGEMAT Time exp() over a large complex matrix using a given
%   number of computation threads; returns the mean run time per iteration.

defaultNumThreads = maxNumCompThreads(numThreads);

NUM_ITER = 50;

mA = randn(matSize) + (1i * randn(matSize));
mB = zeros(matSize);

hRunTimer = tic();

for ii = 1:NUM_ITER
    mB(:) = exp(mA);
end

runTime = toc(hRunTimer) / NUM_ITER;

maxNumCompThreads(defaultNumThreads);

end
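
For cross-checking, a rough Julia counterpart (my sketch, hypothetical name; note that Julia’s broadcast exp is single-threaded at the time of writing, so there is no thread count to sweep):

function TimeExpLargeMatJulia(matSize)
    NUM_ITER = 50
    mA = complex(randn(matSize, matSize), randn(matSize, matSize))
    mB = zeros(Complex{Float64}, matSize, matSize)
    runTime = @elapsed for ii = 1:NUM_ITER
        mB .= exp.(mA)  # elementwise exp, written in place
    end
    return runTime / NUM_ITER, mB
end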

I have an i7-6800K with 4 channels of memory (which doubles the bandwidth of a regular desktop CPU) and 6 cores.
On my system the performance scales by x3.8 when the number of threads is set to 6 (using Hyper-Threading isn’t recommended here).
So the exp() operation becomes memory bound on my system.
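
(Rough arithmetic behind that claim: each pass reads mA and writes mB, matSize^2 complex doubles each at 16 bytes per element, so the achieved bandwidth follows directly from the measured run time. A hypothetical helper:)

bytes_moved(matSize) = 2 * matSize^2 * 16  # read mA + write mB, 16 B per Complex{Float64}
bandwidth_GBs(matSize, runTime) = bytes_moved(matSize) / runTime / 1e9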

Could someone try this function on a regular system (i7 x7xx) and check how it scales with the number of threads?
I’d expect it to become memory bound sooner (as the bandwidth is lower).
If it doesn’t, and it scales well on @zsoerenm’s CPU, then it is all about the multithreading in MATLAB.

@zsoerenm, what is your MATLAB version? Computer specification?
Your function runs on my system in 1.07 [Sec] (MATLAB code) using 6 threads.
It takes 3.05 [Sec] using 1 thread, which is a scaling factor of ~3 (again, on a computer with good memory bandwidth).
Using the MATLAB Profiler, it seems the exp() line and the inner-product line consume about the same time (~45% each).

@zsoerenm, Could you run this:

function [ vRunTimeStat ] = TimeRun( numThreads )
% Times TestFun() NUM_ITER times with the given number of computation
% threads and returns [min; median; mean; max] of the run times.

NUM_ITER = 20;

defaultNumThreads = maxNumCompThreads(numThreads);

vRunTime = zeros(NUM_ITER, 1); % Preallocate instead of growing in the loop

for ii = 1:NUM_ITER
	hRunTimer 		= tic();
	mA 				= TestFun();
	vRunTime(ii) 	= toc(hRunTimer);
end

maxNumCompThreads(defaultNumThreads);

vRunTimeStat 	= zeros([4, 1]);
vRunTimeStat(1) = min(vRunTime);
vRunTimeStat(2) = median(vRunTime);
vRunTimeStat(3) = mean(vRunTime);
vRunTimeStat(4) = max(vRunTime);

end

function [ sum_signal ] = TestFun(  )

range = 1:2000000;

steering_vectors = complex(randn(4,11), randn(4,11));

sum_signal = zeros(4,length(range));
for ii = 1:11
    carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);
    steered_signal = steering_vectors(:, ii) * carrier_signal;
    sum_signal = sum_signal + steered_signal;
end


end

With numThreads = 1; and numThreads = numCores; where numCores is the number of cores in your computer?

On second thought, this should have been my first post in the thread.


Starting julia with

julia -O3

Then

using AppleAccelerate  # vForce-backed vectorized math (macOS)

function perf_test2()
    N = 2_000_000
    range = 1:N
    steering_vectors = complex(randn(4,11), randn(4,11))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    x, y = zeros(N), zeros(N)
    α = collect(2π * 1.023e6 / 4e6 * range + 40π/180)  # note the / 4e6 from the original MATLAB code
    for i = 1:11
       AppleAccelerate.sincos!(x, y, α)  # x ← sin(α), y ← cos(α)
       carrier_signal = complex(y, x)    # cos + im*sin == exp(im*α)
       for k = 1:N
           for j = 1:4
              @inbounds sum_signal[j,k] += steering_vectors[j,i] * carrier_signal[k]
           end
       end
    end
end

julia> @benchmark perf_test2()
BenchmarkTools.Trial: 
  memory estimate:  503.54 MiB
  allocs estimate:  34
  --------------
  minimum time:     431.948 ms (0.00% GC)
  median time:      578.867 ms (23.27% GC)
  mean time:        573.792 ms (23.63% GC)
  maximum time:     709.043 ms (33.76% GC)
  --------------
  samples:          9
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

On Matlab 2016b, the code runs in 1.5477 sec, fully multithreaded. There is still performance left on the table in Julia: I could add multithreading, and I also think there is some memory waste, because I cannot figure out how to update the complex array carrier_signal in-place.

But using the vector-optimized AppleAccelerate made all the difference here, really.

Hope I didn’t make any blatant mistakes here.


Using four threads, like Matlab, and adding a @threads decoration in front of the main loop, dropped the runtime to 274us. Not sure how it interacts with AppleAccelerate. Is that multithreaded?
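
Roughly, the decorated loop looked like this (pulled into a hypothetical helper here so it’s self-contained; assumes Julia was started with JULIA_NUM_THREADS=4 and only the accumulation loop is threaded):

using Base.Threads

# Each thread gets a disjoint range of k, i.e. disjoint columns of
# sum_signal, so there are no write conflicts.
function accumulate_threaded!(sum_signal, steering_vectors, carrier_signal, i, N)
    @threads for k = 1:N
        for j = 1:4
            @inbounds sum_signal[j,k] += steering_vectors[j,i] * carrier_signal[k]
        end
    end
end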


Thank you for digging into the problem.
I have an Intel i7-6820HQ CPU @ 2.70GHz and Matlab 2015b.

I ran your code for a single thread and it output 0.35 - 0.37 s, but shouldn’t NUM_ITER be 11? In that case it’s between 0.31 - 0.32 s. For 8 threads it’s 0.1561 s, and for NUM_ITER = 11 it’s 0.1421 s.

The total time is

  • Matlab 1 thread: 3.584 s; Allocated memory: 296986.41 Kb
  • Matlab 8 threads: 1.569 s; Allocated memory: 296986.34 Kb

By the way: Where are you getting the memory allocation from in Matlab? Using the profiler I get Allocated memory 298379.38 Kb.

Yes, when I put that code into a function and return sum_signal, I get 296916.16 Kb. That was my fault.

Don’t you want j & k reversed in that loop? (because Julia arrays are column-major instead of row-major)

    for k = 1:N, j = 1:4
        sum_signal[j,k] += steering_vectors[j,i] * carrier_signal[k]
    end

@yuyichao, Which library in Julia calculates the Inner Product?
Does Julia use OpenBLAS for that as well?

@zsoerenm,
I think I was wrong with my first code.
I have updated it now; please try it again.

@DNF, wow, this is amazing, a great showcase of Julia.
Could you explain what you did?
It seems you use Apple Accelerate; is that like Intel’s VML?
I will try to do something similar in MATLAB to see what happens.

By the way, why not use a library for the Inner Product?
Does LLVM vectorize it perfectly?

Thank You.

First off: I mistyped ms as us in my previous post! Yikes.

Apple Accelerate supplies a lot of vector-optimized math operations. I presume that it is similar to VML, and also Yeppp. I could have tried Yeppp, but problems with the package manager mean I’m unable to install any packages at the moment(!)

The main time saving was realizing that most of the time is spent doing exp(im...). I’m sure Matlab has a super-optimized, multithreaded version, while Julia’s exp is just broadcast with a dot, with relatively few bells and whistles, for now at least. I’m pretty sure multithreading and better SIMD and so on will arrive in the future, so it’s a good idea to keep that in mind when comparing e.g. Julia and Matlab.

Beyond that, you just need to take a bit of care to reduce allocations (I have realized that I can do carrier_signal .= complex.(y, x)) to bring them below Matlab’s. The loop doesn’t seem particularly relevant for this example, since the iterations are few and expensive.

After adding a @threads decoration, there is some additional speedup, so now it’s down to ~210-220ms (not us), but if AppleAccelerate.sincos is multithreaded (not sure) that might cause some problems. One advantage with explicit multithreading (as opposed to Matlab’s threading) is that you get to choose how to parallelize your problem.

@RoyiAvital For 1 thread I get:

  • Min: 3.3162
  • Median: 3.3204
  • Mean: 3.3284
  • Max: 3.3925

whereas for 8 threads I get:

  • Min: 1.4319
  • Median: 1.4568
  • Mean: 1.4599
  • Max: 1.4910

@zsoerenm,
Could you check with 4 threads?
Your CPU has only 4 cores, and in those cases it is better to tell MATLAB to use only 4 threads (this is not the kind of task Hyper-Threading helps with).

@DNF, thank you for the great explanation.
Look at @zsoerenm’s updated timings; they show clearly that this code doesn’t scale with threads. It becomes memory bound (you also only gained ~x2).
Why didn’t you use the library for the Inner Product?

So for the original code MATLAB is faster than Julia even using 1 thread (while Julia might use MT for the Inner Product; I’m not sure, waiting for @yuyichao’s answer on that).

In order to dig deeper, would you plug Accelerate into the code as it is written above (original post) and compare to MATLAB?

By the way, what you practically did was take the generation of the vector outside of the loop (you left only the calculation of the exp() of the complex numbers), which @zsoerenm said you can’t do, since in the original code the complex numbers depend on the loop variable.

Thank You.

Try this in Matlab

N = 2000000;
a = (0:N-1)/N * 2*pi*20;

hf = @()exp(1i*a);

t = zeros(1, 8); % Preallocate instead of growing in the loop
for n = 1:8
    maxNumCompThreads(n);
    t(n) = timeit(hf);
end

and you’ll see that exp scales pretty decently with the number of threads up to 4 (close to a factor of 3 at 4 threads, and I have 4 processors). Thing is, probably only certain parts of the code scale with the threads (or are threaded at all). When I sped up exp in Julia, the whole thing sped up immensely, while in Matlab it seems that other parts of the code take a lot of time too, and benefit less from threading.

If you’re only able to effectively parallelize half of your code, then even if that part scales perfectly with the number of threads, you can never exceed 2x speedup.
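
That is just Amdahl’s law: with a parallelizable fraction $p$ of the work and $n$ threads, the overall speedup is

$$S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p},$$

so at $p = 0.5$ the ceiling is exactly 2x, no matter how many threads you add.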

Yes, but it just doesn’t matter here.


@zsoerenm played with the number of threads, and on a 4-core computer the gain for the whole code was ~x2.
So the whole algorithm is limited by memory, not by cores.

For the original code MATLAB is faster even with a single thread → for the original code MATLAB generates better and more efficient single-threaded code (or uses VML in a single thread, which is better than Julia’s).

You showed us that since Julia gives the user more control over things, one can generate even better code.
Could you show results when you calculate the vector inside the loop?

Thank You.

function perf_test2()
    N = 2_000_000
    range = collect(1:N)
    steering_vectors = complex(randn(4,11), randn(4,11))
    sum_signal = zeros(Complex{Float64}, 4, N)
    x, y, α = zeros(N), zeros(N), zeros(N)
    carrier_signal = zeros(Complex{Float64}, N)
    for i = 1:11
        α = (2π * 1.023e6 / 4e6) .* range .+ (40π/180)  # note the / 4e6 from the original MATLAB code
        AppleAccelerate.sincos!(x, y, α)  # x ← sin(α), y ← cos(α)
        carrier_signal .= complex.(y, x)  # in-place: cos + im*sin == exp(im*α)
        for k = 1:N, j = 1:4
            @inbounds sum_signal[j,k] += steering_vectors[j,i] * carrier_signal[k]
        end
    end
end

No multi-threading, but using -O3 flag:

@benchmark perf_test2()
BenchmarkTools.Trial: 
  memory estimate:  534.06 MiB
  allocs estimate:  69
  --------------
  minimum time:     592.667 ms (21.79% GC)
  median time:      635.260 ms (25.93% GC)
  mean time:        653.507 ms (28.29% GC)
  maximum time:     783.052 ms (41.78% GC)
  --------------
  samples:          8
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Memory can be reduced from this, but because .* and .+ do not fuse on 0.5, I couldn’t be bothered. (Edit: Using in-place operations here reduces allocations to 213MB, and shaves off 10% of the runtime.)
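
For reference, one way to do that update in place on 0.5 (my guess at what the edit refers to, not necessarily the exact code) is to push the whole expression through a single broadcast!:

broadcast!(r -> 2π * 1.023e6 / 4e6 * r + 40π/180, α, range)  # fills α with no temporaries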

As you can see, it’s the performance of exp that makes the big difference.

I’m not surprised in the slightest that Matlab has good performance in cases like this, since it’s dominated by calls to C libraries. Edit: In fact, I would be shocked if Matlab didn’t beat Julia the way the original code was written.


First, this is impressive.
Great results.

Now, I wonder whether Apple Accelerate is multithreaded.

Assuming it is not, you are ~x6.5 faster than single-threaded MATLAB with the world’s fastest exp() (or at least close to that).
Can it really be that much faster?

Would you create x and y using Julia’s internals?
I think once you do, we’ll be able to analyze the whole picture.

Thank You.

I don’t really know, but I started Julia with a single thread, and in those cases e.g. BLAS is single-threaded, as far as I understand; and then, probably, so is AppleAccelerate.
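
(For what it’s worth, the OpenBLAS thread count can also be pinned explicitly, independently of how Julia was started; this does not govern AppleAccelerate:)

Base.LinAlg.BLAS.set_num_threads(1)  # Julia 0.5/0.6; LinearAlgebra.BLAS.set_num_threads in 1.x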

I don’t know the internals of it, but (warning: wild speculation) it’s probably SIMD-vectorized, perhaps sacrifices some accuracy or skips nan/inf checking (who knows), and does the sin and cos in one operation. I am also exploiting the fact that the input is purely imaginary, while Julia’s exp doesn’t know that and jumps through all sorts of hoops checking the properties of the real and imaginary parts.
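
For comparison, plain Julia can also exploit a purely imaginary argument via Base’s cis, which computes exp(im*θ) as cos(θ) + im*sin(θ) directly. A minimal sketch (timings are machine-dependent):

α = collect(2π * 1.023e6 / 4e6 .* (1:2_000_000) .+ 40π/180)
carrier_generic = exp.(im .* α)  # generic complex exp: checks special cases of both parts
carrier_cis     = cis.(α)        # specialized path for a purely imaginary exponent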

I’ve tried playing with calling Julia’s sin and cos with limited success. This, in turn, calls some libm library, and that’s sort of the event horizon, as far as I’m concerned.

@DNF, could you share the timing of the run when you used Julia’s sin and cos?

Thank You.

I put all of the different versions in a file and tested all of them, along with my own version of @GunnarFarneback’s code,
and got the following numbers (on an MBP with a 4-core 2.8GHz i7), running julia -p7 -O3.

julia> @benchmark p1()
BenchmarkTools.Trial:
  memory estimate:  3.40 GiB
  allocs estimate:  203
  --------------
  minimum time:     6.413 s (23.32% GC)
  median time:      6.413 s (23.32% GC)
  mean time:        6.413 s (23.32% GC)
  maximum time:     6.413 s (23.32% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark p2()
BenchmarkTools.Trial:
  memory estimate:  2.51 GiB
  allocs estimate:  161880
  --------------
  minimum time:     3.272 s (20.70% GC)
  median time:      3.293 s (21.03% GC)
  mean time:        3.293 s (21.03% GC)
  maximum time:     3.314 s (21.36% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark p3()
BenchmarkTools.Trial:
  memory estimate:  762.94 MiB
  allocs estimate:  98
  --------------
  minimum time:     4.579 s (3.21% GC)
  median time:      4.594 s (3.43% GC)
  mean time:        4.594 s (3.43% GC)
  maximum time:     4.608 s (3.65% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark p4()
BenchmarkTools.Trial:
  memory estimate:  152.59 MiB
  allocs estimate:  7
  --------------
  minimum time:     4.295 s (0.02% GC)
  median time:      4.310 s (0.22% GC)
  mean time:        4.310 s (0.22% GC)
  maximum time:     4.324 s (0.43% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark p5()
BenchmarkTools.Trial:
  memory estimate:  503.54 MiB
  allocs estimate:  34
  --------------
  minimum time:     496.816 ms (22.87% GC)
  median time:      518.481 ms (26.79% GC)
  mean time:        519.989 ms (26.83% GC)
  maximum time:     543.588 ms (28.39% GC)
  --------------
  samples:          10
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark pspj()
BenchmarkTools.Trial:
  memory estimate:  122.07 MiB
  allocs estimate:  5
  --------------
  minimum time:     3.512 s (0.02% GC)
  median time:      3.529 s (0.01% GC)
  mean time:        3.529 s (0.01% GC)
  maximum time:     3.545 s (0.01% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

My version was:

function pspj()
    N = 2_000_000
    range = 1:N
    
    steering_vectors = complex(randn(4,11), randn(4,11))
    sum_signal = zeros(Complex{Float64}, 4, N)

    for i = 1:11, k = 1:N
        # Constants precomputed from the original MATLAB code:
        # 1.606924642311179 == 2π * 1.023e6 / 4e6, 0.6981317007977318 == 40π / 180
        cs = exp(Complex(0.0, muladd(1.606924642311179, k, 0.6981317007977318)))
        @inbounds for j = 1:4 ; sum_signal[j, k] += steering_vectors[j, i] * cs ; end
    end
    sum_signal
end
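
(For what it’s worth, cs = cis(muladd(1.606924642311179, k, 0.6981317007977318)) computes the same value, since Base’s cis(θ) is exp(im*θ); it may skip some of the generic complex-exp checks.)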

I’d like to try this parallelized, but haven’t had time to play more with it.
The AppleAccelerate is amazing though.

No, given the similarity in single-thread speed, the MATLAB one is certainly not the fastest sin/cos in the world, and it’s not even close. I’m 99% sure it’s just SIMD, and you can easily get this level of speedup with recent versions of gcc + glibc.