Unusual slow Julia code compared to C++?

#1

Hello,
We are investigating Gap Junctions channels in cells (during cardiac arrhythmia also propagation of excitation in neural networks).
I am trying to implement our model in Julia, but first version ran about 10x slower than C++. Is this normal? because https://aaronang.github.io/2018/stl-benchmark-comparison-cpp-vs-julia/ and https://www.quora.com/Is-it-possible-for-Julia-to-perform-as-fast-as-C++-What-would-be-necessary-for-Julia-to-be-a-suitable-alternative-to-C++ states that Julia is just little slower than C++.

Model is based on Markov chain.
Our paper about model: https://www.sciencedirect.com/science/article/pii/S0006349516001600?via%3Dihub
to run - “runMc36sm.jl” https://pastebin.com/LX5jAFV1
main model - “mc36sm_ss.jl” https://pastebin.com/QyaaW3Ti
commands to run:

using Revise, Plots;
plotlyjs();
includet("runMc36sm.jl"); includet("mc36sm_ss.jl"); includet("mc36sm.jl");
(vj, gj) = runGj();

also c++ version:
“mc36sm.cpp” https://pastebin.com/cSgPUTjC

Kestutis

#2

Hi Kestutis :wave:

Welcome to Julia. It’s great to see new people with cool applications.

Let me flip this question around on you. Let’s say I have developed a cool Julia package that has been optimized over a period of time and looking at the benchmarks, think that I should be able to eek out even more performance if I rewrote the model in C++. I go ahead and try and my first attempt is 10x slower than Julia. Should I blame C++ or should I blame my own own code? :blush:

The question you should be asking is, “What did I do wrong and how can I improve it?”

Julia has some amazing tools for profiling code. Most of the performance gotchas come from unnecessary allocations and type instability.

Usually, what happens is you build your prototype. Seeing a 10x slowdown compared to C++ with a prototype is actually pretty impressive I think. Once you have improved your typed stability, removed unnecessary allocations and done some other optimizations, I am 100% confident your model will run as fast or faster in Julia than C++. I’ve seen the same thing happen over and over again.

What kind of profiling have you done on your Julia code? Have you gone through the pieces looking at @code_warntype for any type instabilities?

7 Likes
#3

This is impossible to say without a review of your code — you could facilitate that by making a small, self-contained example. But if you are just learning Julia, it is very likely that you could improve it a lot by carefully working through the performance tips.

3 Likes
#4

For this kind of code there is no reason to settle with anything worse than 2x slower than C++ and probably can get much closer than that. I skimmed through the code without seeing anything obviously bad but without a fully reproducible example, including the time measurement, I can’t say more. I understand if it’s hard to cut anything from the computational algorithm but packing the minimum necessary into a single file that can be downloaded and run would make it much easier to analyze. Skip everything about Revise and plotting but include timing measurement.

1 Like
#5

It’s most likely possible to get this code to run at the same speed of C++. However, you will get much better help if instead of just dumping your whole code base for the model, you do some profiling and spend some effort reducing the C++ and Julia code as much as possible to find where the core performance difference is.

When you have a small example, it is then possible to look at the generated code to see if the problem is due to some missed vectorization from Julia (perhaps a missing @inbounds) etc.

But to properly go through the code as it is posted here, profile it, compare with C++ etc would probably take on the order of hours and it is unlikely (but of course not impossible) that someone will put in that effort.

5 Likes
#6

Would be great if you could at least say what you have tried yourself…

… anyway, one of the reasons why your code is slower is that you’re doing different things. For example you’re allocating ggg and q over and over again, and you’re not using inplace (loops) multiplications. Just by changing these things in 3-4 places it runs in 13 seconds instead of 27 seconds on my machine. A clear indication that this was part of the problem is that it now allocates 84mb of memory instead of 21gb !

6 Likes
#7

Another thing. Is the C++ code using row major storage for the arrays? Loop orders seem quite weird.

Lastly, writing out the q=p*PPP loop and I’m on 10 instead of 27.5 seconds on my computer, so that’s 36% of the original timing, making it almost 3 times faster. Almost all of the time is spent here

for j = 1:36
    for i = 1:36
        @inbounds q[j] += p[i]*PPP[i,j]
    end
end
4 Likes
#8

Consider also changing the vectors to column vectors therefore doing PPP' * p'.
I don’t know why @pkofod code is faster than p*PPP (isn’t this highly optimized blas vector matrix multiplication?)
My benchmarks are:

q = zeros(1, 36) 
p = zeros(1, 36) 
PPP = ones(36,36)

function custom_mul(q,p,PPP)
    for j = 1:36
        for i = 1:36
            @inbounds q[j] += p[i]*PPP[i,j]
        end
    end
end

pt = collect(p')
PPP_t = collect(PPP')

using BenchmarkTools
@btime p*PPP
@btime custom_mul(q,p,PPP)
@btime PPP_t*pt

# 1.305 μs (1 allocation: 368 bytes)
# 999.000 ns (0 allocations: 0 bytes)
# 742.346 ns (1 allocation: 368 bytes)

PS:

  • 0 allocations might be a plus if you call this a lot of times. You can try calling gemv!
  • this can probably go faster with MKL as a BLAS backend (you would need a julia version with that).
#9

These matrices are quite small, and it’s a vec-mat operation. OpenBLAS has been reported to be slower than naive loops before, so … https://github.com/JuliaLang/julia/issues/3239

#10

Ah also, this might have similar performance to statically sized arrays in C++:

pts = SVector(pt...)
PPP_ts = SMatrix{36,36}(PPP_t)

@btime PPP_ts*pts
# 279.891 ns 

The issue here is that SMatrix{36,36} is pretty hard on the compiler (long compile times)

#11

Thank you all for replies, I will try improve Julia code. Here are one-file version of Julia code: https://pastebin.com/Y0p86rAU main method is run().
I will post one-file simple C++ test case soon, when finish extracting code from our model.

#12

If you want people to carefully look at the code it would be better to post a small example here on Discourse. Maybe you could come up with a (much) smaller channel model to compare with C++ as your first Julia exercise? It would then be easier for people to see where there is room for improvement in your Julia code.

Generally you should look at the performance recommendations in the Julia documentation. Some things you should certainly look at, as others have mentioned, is avoiding allocations of new arrays as much as possible, using @inbounds to turn off bounds checking on arrays, and making sure you loop through matrices in the order their data is stored in memory. This means your loop ordering should look like

for j = 1:ncols
    for i = 1:nrows
        do something with A[i,j] 
   end
end
#13

If you use a profiler on your code, you’ll find that most time is spent in p = p * PPP which is a matrix multiplication that allocates a new matrix for each iteration.
What you can do instead is put a tmp = similar(p) before you’re while loop, i.e. preallocate a matrix of size p, and then use in-place multiplication mul! from LinearAlgebra and save the result in tmp, then copy the result to p.

Overall it means adding a using LinearAlgebra before your function definitions and having the last lpart look like:

    tmp = similar(p)
    while (d_g / (g_old + g_final) > .001e-5)
        i = i + 1;
        mul!(tmp, p, PPP)
        copyto!(p, tmp)
        # p = p * PPP
        ggg = 1 ./ sum( (1 ./ g), dims=2 ); #%sums rows
        g_final = (p*ggg)[1];
        d_g = abs(g_old - g_final);
        g_old = g_final;
    end

This more than halves execution time for me and is a low-hanging fruit.

#14

Some modifications after looking at the profile (start with most expensive lines, optimize them out. I only did a single pass and did not profile the modified variant; I’d guess another 2x-10x improvement is possible). I started from your pastebin.

Before:

julia> testMC36SM.run(); @time testMC36SM.run()
starting..
done.
starting..
done.
  2.867807 seconds (8.59 M allocations: 1.922 GiB, 2.92% gc time)
(range(0.0, stop=120.0, length=100), [5.45759e-9, 5.4623e-9, 5.4656e-9, 5.46773e-9, 5.46899e-9, 5.46966e-9, 5.47005e-9, 5.4705e-9, 5.47136e-9, 5.47298e-9  …  4.54801e-10, 4.25117e-10, 3.9744e-10, 3.71625e-10, 3.47539e-10, 3.25061e-10, 3.04076e-10, 2.8448e-10, 2.66177e-10, 2.49078e-10])

After:

julia> testMC36SM.run(); @time testMC36SM.run()
starting..
done.
starting..
done.
  0.148181 seconds (7.97 k allocations: 4.565 MiB, 7.99% gc time)
(range(0.0, stop=120.0, length=100), [5.45759e-9, 5.46231e-9, 5.4656e-9, 5.46774e-9, 5.469e-9, 5.46967e-9, 5.47006e-9, 5.47051e-9, 5.47136e-9, 5.47298e-9  …  4.54801e-10, 4.25117e-10, 3.9744e-10, 3.71625e-10, 3.47539e-10, 3.25061e-10, 3.04076e-10, 2.8448e-10, 2.66177e-10, 2.49078e-10])

Pastebin: https://pastebin.com/vmp4qdHd

What I did: reuse some buffers, hoist some computations out of loops, turn some 1x36 matrices into vectors. That is nice because we can use LinearAlgebra.dot instead of calling into blas/gemm.

8 Likes
#15

My biggest question is why you are using the power method to find the fixed point. You can replace the time-consuming parts of your code:

   i = 0;
   while (d_g / (g_old + g_final) > .001e-5)
       i = i + 1;
       p=p*PPP;
       ggg = 1 ./ sum( (1 ./ g), dims=2 ); #%sums rows
       g_final = (p*ggg)[1];
       d_g = abs(g_old - g_final);
       g_old = g_final;
   end
   return g_final;

with e.g.

   ggg = 1 ./ sum( (1 ./ g), dims=2 ); #%sums rows
   q = nullspace(PPP' - I)
   return dot(q, ggg) / sum(q)

This requires

using LinearAlgebra
2 Likes
#16

As far as I can tell, there is no need for copyto! here. Just switch the names around:

p, tmp = tmp, p

That is basically a free operation.

4 Likes