Performance issues - Comparison with Matlab

I have the following example of a Chebyshev polynomial differentiation matrix in Julia.

using LinearAlgebra

function cheb(N)
    N==0 ? (return D=0.0; x=0.0) : (D=0.0; x=0.0)
    x = cos.(π*(0:N)./N)
    c = [2; ones(N-1); 2].*(-1).^(0:N)
    X = repeat(x, 1, N+1)
    dX = X - X' 
    D = (c*(1 ./ c)') ./ (dX + I)
    D = D - Diagonal(vec(sum(D',dims=1)))

    D, x'
end

and I run a simple example such as,

tick()
N = 201
D, x = cheb(N)
tock()

and I get the following results:

Elapsed times are:
0.125761261s
0.133601933s
0.148246084s
0.130149234s

I ran the code 4 times to get a representative average execution time.

However, equivalent code in Matlab, such as

function [D,x] = cheb(N)
if N==0, D=0; x=1; return, end
x = cos(pi*(0:N)/N)';
c = [2; ones(N-1,1); 2].*(-1).^(0:N)';
X = repmat(x,1,N+1);
dX = X-X';
D = (c*(1./c)')./(dX+(eye(N+1))); % off-diagonal entries
D = D - diag(sum(D'));

and again a simple example,

tic
N = 200;
[D,x] = cheb(N); 
toc

Elapsed times are:
0.004155s
0.003415s
0.003218s
0.003986s

The Matlab code is on average about 30 times faster than this Julia code. I’m not an expert in Julia, but I’d appreciate it if someone could explain why the Julia example is slower than Matlab.

The biggest issue is that the main thing you are measuring is compile time. I don’t have Matlab for comparison, but if I time the second run of the function (or better yet, use BenchmarkTools), the function takes 0.00019 seconds on average.
Your code is also not especially idiomatic Julia, and can probably be sped up by at least a factor of 5 from where it is now. I’ll look at it and reply again once I figure out how.
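
For example (assuming BenchmarkTools.jl is installed; the call below is just illustrative), something like this measures the function after it has been compiled:

using BenchmarkTools
@btime cheb(201);   # repeats the call and reports the minimum time, so the one-time compilation cost is excluded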

5 Likes

Here’s my first pass at improving this.

function cheb(N)
    N==0 ? (return D=0.0; x=0.0) : (D=0.0; x=0.0)
    x = cospi.(0 : 1/N : 1)
    c = ones(Float64, N+1); 
    c[begin]=c[end]=2
    c[2:2:end] .*= -1
    D = (c*(1 ./ c)') ./ (x .- x' .+ I(N+1))
    D[diagind(D)] -= sum(D,dims=2)
    return D, x'
end

It is over 4x faster (tested on N=4001). There’s definitely still room for improvement, though (and it’s only about 2.4x faster on a small input like N=200, where the O(n) work matters more).
I think further improvements could be made by using the symmetry of the problem, but that’s probably out of scope for this answer.

2 Likes

Here’s what I see on my machine with your original code:

julia> function cheb(N)
           N==0 ? (return D=0.0; x=0.0) : (D=0.0; x=0.0)
           x = cos.(π*(0:N)./N)
           c = [2; ones(N-1); 2].*(-1).^(0:N)
           X = repeat(x, 1, N+1)
           dX = X - X' 
           D = (c*(1 ./ c)') ./ (dX + I)
           D = D - Diagonal(vec(sum(D',dims=1)))
           D, x'
       end
cheb (generic function with 1 method)

julia> @time cheb(201);
  0.118869 seconds (208.11 k allocations: 13.116 MiB)

julia> @time cheb(201);
  0.021767 seconds (76 allocations: 1.881 MiB, 98.05% gc time)

julia> @time cheb(201);
  0.000631 seconds (76 allocations: 1.881 MiB)

julia> @time cheb(201);
  0.050263 seconds (76 allocations: 1.881 MiB, 98.52% gc time)

julia> @time cheb(201);
  0.000248 seconds (76 allocations: 1.881 MiB)

It looks like the speed after compilation is highly variable depending on garbage collection, so we should really be using something like BenchmarkTools.jl:

julia> using BenchmarkTools

julia> @benchmark cheb(201)
BenchmarkTools.Trial: 
  memory estimate:  1.88 MiB
  allocs estimate:  76
  --------------
  minimum time:     217.252 μs (0.00% GC)
  median time:      266.693 μs (0.00% GC)
  mean time:        298.768 μs (6.68% GC)
  maximum time:     1.136 ms (37.38% GC)
  --------------
  samples:          10000
  evals/sample:     1

So this is currently running much faster than your reported Matlab timings, but I don’t know how reliable those are. As @Oscar_Smith mentioned, there’s a lot that can be done to speed this up further though.

1 Like

I followed the suggestions of @Oscar_Smith and @Mason to perform a fair comparison between the codes, and I appreciate the help with my misconceptions about writing and analysing Julia code.

Now I have tried to perform a better comparison, and I’m still using the same N=201 to be consistent with my first post.

In Matlab, running 4 times:

Elapsed times:

0.002614 seconds
0.002895 seconds
0.003185 seconds
0.002868 seconds
Avg = 0.0028905 s

The times are slightly different because now I only measure the cheb function itself.

Now, using BenchmarkTools with my first version of the Julia cheb function:

0.000886 seconds (46 allocations: 1.880 MiB)
0.000347 seconds (46 allocations: 1.880 MiB)
0.000275 seconds (46 allocations: 1.880 MiB)
0.000875 seconds (46 allocations: 1.880 MiB)
Avg = 0.00059575 s

and now the @Oscar_Smith version

0.000121 seconds (27 allocations: 649.906 KiB)
0.000123 seconds (27 allocations: 649.906 KiB)
0.000122 seconds (27 allocations: 649.906 KiB)
0.000120 seconds (27 allocations: 649.906 KiB)
Avg = 0.0001215 s

We can see the first version of cheb is on average 4.8 times faster than the Matlab version, and the @Oscar_Smith version is 23.8 times faster than Matlab and 4.9 times faster than mine, which is impressive!

Now I’m looking forward to finding the most efficient way to write my code. This is part of my effort to convert my existing codes (Fortran, C++, Matlab, Python, …) to Julia, and I’m trying to learn the best way to do this. I know, as Oscar mentioned, that it’s out of scope, but if it is possible for you to help me write a more efficient cheb function I’d be thankful.

4 Likes

There are a couple of things to watch out for if you’re used to Matlab.

In Matlab it is generally a good idea to prefer vector operations over explicit loops. This is also possible in Julia, but it is not necessarily a performance gain. Put the other way, there is no performance loss from writing explicit loops. On the contrary, vector operations can in some instances lead to unnecessary memory allocations, which may slow down the program.
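
To make the loop point concrete, here is a rough loop-based sketch of the cheb construction discussed above (the name cheb_loops and the exact loop structure are my own illustration, not code from this thread, and whether it actually beats the broadcasted versions would need benchmarking):

function cheb_loops(N)
    N == 0 && return (0.0, 0.0)
    # Chebyshev points and the weight vector c (2 at the endpoints, alternating sign)
    x = [cospi(j/N) for j in 0:N]
    c = [(j == 0 || j == N ? 2.0 : 1.0) * (-1)^j for j in 0:N]
    D = zeros(N+1, N+1)
    # off-diagonal entries: D[i, j] = c[i] / (c[j] * (x[i] - x[j]))
    for j in 1:N+1, i in 1:N+1
        i == j && continue
        D[i, j] = c[i] / (c[j] * (x[i] - x[j]))
    end
    # diagonal entries: negative sum of the off-diagonal entries in each row
    # (D[i, i] is still zero here, so summing the whole row gives the off-diagonal sum)
    for i in 1:N+1
        D[i, i] = -sum(@view D[i, :])
    end
    return D, x
end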

Another performance problem is type instability. Even though you do not have to specify the types of arguments in Julia, they will be available when the program is run. The compiler is invoked on the first run of a function; it will then know the types of the input arguments and can optimize the function for those types. However, if the value of a variable can change type during the function evaluation, the compiler cannot be sure what’s happening, and will insert code to convert values or make room for other types.

The typical example is

function mysum(N)
    x = 0
    for i in 1:N
        x += 1.0/i
    end
    return x
end

The variable x starts out as an Int64, but will change into a Float64 if the loop is run at least once. The net result is that the compiler doesn’t know whether your routine returns an Int64 or a Float64, and it must take this into account wherever you use your mysum() routine, e.g. if you call some other routine with the result of mysum(). This has bad consequences for optimization.

In this trivial example, one might think the compiler could know whether the loop runs, since the value of N is known at the call site, but it does not use values, only their types, because the compiled routine has to work for any other N of the same type as well. (The compiler might of course inline mysum() at a particular call site and use the actual value if it is a compile-time constant, but that is another story.)

Type stability can be checked with the @code_warntype macro; at the very least, the Body:: specification should be a single type, not a Union:

@code_warntype mysum(10000)
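
For comparison, a type-stable variant of the example above just starts x as a Float64; @code_warntype then reports a single concrete return type:

function mysum_stable(N)
    x = 0.0              # Float64 from the start, so x never changes type
    for i in 1:N
        x += 1.0/i
    end
    return x
end

@code_warntype mysum_stable(10000)   # Body::Float64, no Union in the output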

I don’t think this expression does what you think. Observe:

julia> N = 1
1

julia> N==0 ? (return D=1.0; x=2.0) : (D=3.0; x=4.0)
4.0

julia> N==1 ? (return D=1.0; x=2.0) : (D=3.0; x=4.0)
1.0

So this expression

(return D=1.0; x=2.0)

only returns D = 1.0, while the other branch only evaluates to x = 4.0. I think what you want is

D, x = 0.0, 0.0
N == 0 && return (D, x)
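
In context, the top of cheb would then look something like this (just a sketch of the first lines; the rest of the function stays as before):

function cheb(N)
    D, x = 0.0, 0.0
    N == 0 && return (D, x)   # early return for the trivial case, giving back both outputs
    x = cos.(π*(0:N)./N)
    # ... rest of the function unchanged
end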
5 Likes