Unusual slow Julia code compared to C++?

Hello,
We are investigating Gap Junctions channels in cells (during cardiac arrhythmia also propagation of excitation in neural networks).
I am trying to implement our model in Julia, but first version ran about 10x slower than C++. Is this normal? because STL Benchmark Comparison: C++ vs. Julia | Aaron Ang and https://www.quora.com/Is-it-possible-for-Julia-to-perform-as-fast-as-C++-What-would-be-necessary-for-Julia-to-be-a-suitable-alternative-to-C++ states that Julia is just little slower than C++.

Model is based on Markov chain.
Our paper about model: Stochastic Model of Gap Junctions Exhibiting Rectification and Multiple Closed States of Slow Gates - ScienceDirect
to run - “runMc36sm.jl” runMc36sm.jl - Pastebin.com
main model - “mc36sm_ss.jl” mc36sm_ss.jl - Pastebin.com
commands to run:

using Revise, Plots;
plotlyjs();
includet("runMc36sm.jl"); includet("mc36sm_ss.jl"); includet("mc36sm.jl");
(vj, gj) = runGj();

also c++ version:
“mc36sm.cpp” mc36smLib.cpp - Pastebin.com

Kestutis

Hi Kestutis :wave:

Welcome to Julia. It’s great to see new people with cool applications.

Let me flip this question around on you. Let’s say I have developed a cool Julia package that has been optimized over a period of time and looking at the benchmarks, think that I should be able to eek out even more performance if I rewrote the model in C++. I go ahead and try and my first attempt is 10x slower than Julia. Should I blame C++ or should I blame my own own code? :blush:

The question you should be asking is, “What did I do wrong and how can I improve it?”

Julia has some amazing tools for profiling code. Most of the performance gotchas come from unnecessary allocations and type instability.

Usually, what happens is you build your prototype. Seeing a 10x slowdown compared to C++ with a prototype is actually pretty impressive I think. Once you have improved your typed stability, removed unnecessary allocations and done some other optimizations, I am 100% confident your model will run as fast or faster in Julia than C++. I’ve seen the same thing happen over and over again.

What kind of profiling have you done on your Julia code? Have you gone through the pieces looking at @code_warntype for any type instabilities?

9 Likes

This is impossible to say without a review of your code — you could facilitate that by making a small, self-contained example. But if you are just learning Julia, it is very likely that you could improve it a lot by carefully working through the performance tips.

3 Likes

For this kind of code there is no reason to settle with anything worse than 2x slower than C++ and probably can get much closer than that. I skimmed through the code without seeing anything obviously bad but without a fully reproducible example, including the time measurement, I can’t say more. I understand if it’s hard to cut anything from the computational algorithm but packing the minimum necessary into a single file that can be downloaded and run would make it much easier to analyze. Skip everything about Revise and plotting but include timing measurement.

1 Like

It’s most likely possible to get this code to run at the same speed of C++. However, you will get much better help if instead of just dumping your whole code base for the model, you do some profiling and spend some effort reducing the C++ and Julia code as much as possible to find where the core performance difference is.

When you have a small example, it is then possible to look at the generated code to see if the problem is due to some missed vectorization from Julia (perhaps a missing @inbounds) etc.

But to properly go through the code as it is posted here, profile it, compare with C++ etc would probably take on the order of hours and it is unlikely (but of course not impossible) that someone will put in that effort.

5 Likes

Would be great if you could at least say what you have tried yourself…

… anyway, one of the reasons why your code is slower is that you’re doing different things. For example you’re allocating ggg and q over and over again, and you’re not using inplace (loops) multiplications. Just by changing these things in 3-4 places it runs in 13 seconds instead of 27 seconds on my machine. A clear indication that this was part of the problem is that it now allocates 84mb of memory instead of 21gb !

9 Likes

Another thing. Is the C++ code using row major storage for the arrays? Loop orders seem quite weird.

Lastly, writing out the q=p*PPP loop and I’m on 10 instead of 27.5 seconds on my computer, so that’s 36% of the original timing, making it almost 3 times faster. Almost all of the time is spent here

for j = 1:36
    for i = 1:36
        @inbounds q[j] += p[i]*PPP[i,j]
    end
end
5 Likes

Consider also changing the vectors to column vectors therefore doing PPP' * p'.
I don’t know why @pkofod code is faster than p*PPP (isn’t this highly optimized blas vector matrix multiplication?)
My benchmarks are:

q = zeros(1, 36) 
p = zeros(1, 36) 
PPP = ones(36,36)

function custom_mul(q,p,PPP)
    for j = 1:36
        for i = 1:36
            @inbounds q[j] += p[i]*PPP[i,j]
        end
    end
end

pt = collect(p')
PPP_t = collect(PPP')

using BenchmarkTools
@btime p*PPP
@btime custom_mul(q,p,PPP)
@btime PPP_t*pt

# 1.305 ÎĽs (1 allocation: 368 bytes)
# 999.000 ns (0 allocations: 0 bytes)
# 742.346 ns (1 allocation: 368 bytes)

PS:

  • 0 allocations might be a plus if you call this a lot of times. You can try calling gemv!
  • this can probably go faster with MKL as a BLAS backend (you would need a julia version with that).

These matrices are quite small, and it’s a vec-mat operation. OpenBLAS has been reported to be slower than naive loops before, so … Poor performance of OpenBLAS matrix-vector multiplication on small sizes · Issue #3239 · JuliaLang/julia · GitHub

Ah also, this might have similar performance to statically sized arrays in C++:

pts = SVector(pt...)
PPP_ts = SMatrix{36,36}(PPP_t)

@btime PPP_ts*pts
# 279.891 ns 

The issue here is that SMatrix{36,36} is pretty hard on the compiler (long compile times)

Thank you all for replies, I will try improve Julia code. Here are one-file version of Julia code: test - Pastebin.com main method is run().
I will post one-file simple C++ test case soon, when finish extracting code from our model.

If you want people to carefully look at the code it would be better to post a small example here on Discourse. Maybe you could come up with a (much) smaller channel model to compare with C++ as your first Julia exercise? It would then be easier for people to see where there is room for improvement in your Julia code.

Generally you should look at the performance recommendations in the Julia documentation. Some things you should certainly look at, as others have mentioned, is avoiding allocations of new arrays as much as possible, using @inbounds to turn off bounds checking on arrays, and making sure you loop through matrices in the order their data is stored in memory. This means your loop ordering should look like

for j = 1:ncols
    for i = 1:nrows
        do something with A[i,j] 
   end
end

If you use a profiler on your code, you’ll find that most time is spent in p = p * PPP which is a matrix multiplication that allocates a new matrix for each iteration.
What you can do instead is put a tmp = similar(p) before you’re while loop, i.e. preallocate a matrix of size p, and then use in-place multiplication mul! from LinearAlgebra and save the result in tmp, then copy the result to p.

Overall it means adding a using LinearAlgebra before your function definitions and having the last lpart look like:

    tmp = similar(p)
    while (d_g / (g_old + g_final) > .001e-5)
        i = i + 1;
        mul!(tmp, p, PPP)
        copyto!(p, tmp)
        # p = p * PPP
        ggg = 1 ./ sum( (1 ./ g), dims=2 ); #%sums rows
        g_final = (p*ggg)[1];
        d_g = abs(g_old - g_final);
        g_old = g_final;
    end

This more than halves execution time for me and is a low-hanging fruit.

Some modifications after looking at the profile (start with most expensive lines, optimize them out. I only did a single pass and did not profile the modified variant; I’d guess another 2x-10x improvement is possible). I started from your pastebin.

Before:

julia> testMC36SM.run(); @time testMC36SM.run()
starting..
done.
starting..
done.
  2.867807 seconds (8.59 M allocations: 1.922 GiB, 2.92% gc time)
(range(0.0, stop=120.0, length=100), [5.45759e-9, 5.4623e-9, 5.4656e-9, 5.46773e-9, 5.46899e-9, 5.46966e-9, 5.47005e-9, 5.4705e-9, 5.47136e-9, 5.47298e-9  …  4.54801e-10, 4.25117e-10, 3.9744e-10, 3.71625e-10, 3.47539e-10, 3.25061e-10, 3.04076e-10, 2.8448e-10, 2.66177e-10, 2.49078e-10])

After:

julia> testMC36SM.run(); @time testMC36SM.run()
starting..
done.
starting..
done.
  0.148181 seconds (7.97 k allocations: 4.565 MiB, 7.99% gc time)
(range(0.0, stop=120.0, length=100), [5.45759e-9, 5.46231e-9, 5.4656e-9, 5.46774e-9, 5.469e-9, 5.46967e-9, 5.47006e-9, 5.47051e-9, 5.47136e-9, 5.47298e-9  …  4.54801e-10, 4.25117e-10, 3.9744e-10, 3.71625e-10, 3.47539e-10, 3.25061e-10, 3.04076e-10, 2.8448e-10, 2.66177e-10, 2.49078e-10])

Pastebin: module testMC36SMusing LinearAlgebraexport runfunction run() println( - Pastebin.com

What I did: reuse some buffers, hoist some computations out of loops, turn some 1x36 matrices into vectors. That is nice because we can use LinearAlgebra.dot instead of calling into blas/gemm.

11 Likes

My biggest question is why you are using the power method to find the fixed point. You can replace the time-consuming parts of your code:

   i = 0;
   while (d_g / (g_old + g_final) > .001e-5)
       i = i + 1;
       p=p*PPP;
       ggg = 1 ./ sum( (1 ./ g), dims=2 ); #%sums rows
       g_final = (p*ggg)[1];
       d_g = abs(g_old - g_final);
       g_old = g_final;
   end
   return g_final;

with e.g.

   ggg = 1 ./ sum( (1 ./ g), dims=2 ); #%sums rows
   q = nullspace(PPP' - I)
   return dot(q, ggg) / sum(q)

This requires

using LinearAlgebra
2 Likes

As far as I can tell, there is no need for copyto! here. Just switch the names around:

p, tmp = tmp, p

That is basically a free operation.

5 Likes

Thanks to all, now Julia code runs faster than C++ compiled with default flags, but slower than C++ with optimizations enabled.

C++ with default flags:

g++ testMC36SM.cpp -o runTest                   
./runTest                     
Done. 148956 iterations.
Time required for execution: 0.652836 seconds.

C++ with optimization flags enabled:

g++ testMC36SM.cpp -O3 -march=native -o runTest
./runTest
Done. 149301 iterations.
Time required for execution: 0.112877 seconds.

Julia:

@time vj,gj=run();
done. 149301 iterations.
  0.352668 seconds (1.35 M allocations: 314.194 MiB, 8.52% gc time)

Results are quite good. But it would be interesting to see if it Julia code could be optimized further to reach closer to C++ :wink:

Also it is interesting that iterations number slightly differs, despite same initial conditions for C++ and Julia codes. Results slightly differs, about 1%.

Julia code:

using LinearAlgebra
function run()
    println("starting..")
    vj = LinRange(0, 120, 100);
    gj = zeros(length(vj))
    parAllFast = [ .15 20 20e-9 2e-9  1e10  1e10  .5 -1;
                   .15 20 20e-9 2e-9  1e10  1e10  .5  1;]
    parAllSlow = [ .1  30 40e-9 1e-15 1e10  1e10  .5 -1;
                   .1  30 40e-9 1e-15 1e10  1e10  .5  1];
    par = [parAllFast; parAllSlow];
    pc1c2 = 0;
    pc2c1 = 0;
    pc1c2 = 0.01;
    pc2c1 = 0.001;
    ppp = zeros(1, 36);
    ppp[1] = 1;
    iter = 0;
    for i = 1:length(vj)
        gj[i], ii = MC36SM_Mindaugo_SS(vj[i], par, ppp, pc1c2, pc2c1);
        iter += ii;
    end
    println("done. ", iter)
    return vj, gj
end

function MC36SM_Mindaugo_SS(vj::Float64, par::Array{Float64,2}, p::Array{Float64}, pc1c2::Float64, pc2c1::Float64)
    PPP = zeros(36,36);
    p1 = 0;
    p2 = 0;
    p3 = 0;
    p4 = 0;

    A = par[:,1]';
    v0 = par[:,2]';
    Go = par[:,3]';
    Gc = par[:,4]';
    Ro = par[:,5]';
    Rc = par[:,6]';
    Pt = par[:,7]';
    P = par[:,8]';;

   K = Pt[1];
   g = zeros(36,4);
   v = zeros(36,4);
   k = zeros(36,4);
   R = zeros(36,4);

#    @bp
   # //states conductances
   for i1=1:2
       for i2=1:3
           for i3=1:3
               for i4=1:2
                   if (i1==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Go[1];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Ro[1];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Gc[1];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Rc[1];
                   end
                   if (i2==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Go[2];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Ro[2];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Gc[2];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Rc[2];
                   end
                   if (i3==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Go[3];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Ro[3];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Gc[3];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Rc[3];
                   end
                   if (i4==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Go[4];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Ro[4];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Gc[4];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Rc[4];
                   end
               end
           end
       end
   end

   for a = 1:4 #// voltages and rectification
    @inbounds gg = 1 ./ sum( (1 ./ g), dims = 2 ); #sums rows
    @inbounds v=vj*[gg gg gg gg]./g;
       g=g .* exp.(v./R);
   end

   for i = 1:36
       for j = 1:4
           @inbounds k[i,j] = exp(A[j] * (v[i,j] * P[j] - v0[j]));
       end
   end

   # //------------Matrica P-------------------------------------------------
   for i1=1:2 for i2=1:3 for i3=1:3 for i4=1:2
    for j1=1:2 for j2=1:3 for j3=1:3 for j4=1:2
      i=(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1;
      j=(j1-1)*18+(j2-1)*6+(j3-1)*2+(j4-1)*1+1;


      # %-------be Pt---------------------
      # % p1
       if (i1==1)
           if (j1==1)
               p1=(1-K*k[i,1]/(1+k[i,1]));
           else
               p1=K*k[i,1]/(1+k[i,1]);
           end
       else
           if (j1==2)
               p1=(1-K/(1+k[i,1]));
           else
               p1=K/(1+k[i,1]);
           end
       end
   # %------------------------------
   #   % p2
       if (i2==1)
           if (j2==1)
               p2=(1-K*k[i,2]/(1+k[i,2]));
           end
           if (j2==2)
               p2=K*k[i,2]/(1+k[i,2]);
           end
           if (j2==3)
               p2=0;
           end
       end

       if (i2==2)
           if (j2==1)
               p2=(K/(1+k[i,2]))*(1-pc1c2);
           end
           if (j2==2)
               p2=(1-K/(1+k[i,2]))*(1-pc1c2);
           end
           if (j2==3)
               p2=pc1c2;
           end
       end

       if (i2==3)
           if (j2==1)
               p2=0;
           end
           if (j2==2)
               p2=pc2c1;
           end
           if (j2==3)
               p2=1-pc2c1;
           end
       end
     # %----------------------------------------
     # % p3
       if (i3==1)
           if (j3==1)
               p3=(1-K*k[i,3]/(1+k[i,3]));
           end
           if (j3==2)
               p3=K*k[i,3]/(1+k[i,3]);
           end
           if (j3==3)
               p3=0;
           end
       end

       if (i3==2)
           if (j3==1)
               p3=(K/(1+k[i,3]))*(1-pc1c2);
           end
           if (j3==2)
               p3=(1-K/(1+k[i,3]))*(1-pc1c2);
           end
           if (j3==3)
               p3=pc1c2;
           end
       end

       if (i3==3)
           if (j3==1)
               p3=0;
           end
           if (j3==2)
               p3=pc2c1;
           end
           if (j3==3)
               p3=1-pc2c1;
           end
       end
     # %----------------------------
     # %p4
       if (i4==1)
           if (j4==1)
               p4=(1-K*k[i,4]/(1+k[i,4]));
           else
               p4=K*k[i,4]/(1+k[i,4]);
           end
       else
           if (j4==2)
               p4=(1-K/(1+k[i,4]));
           else
               p4=K/(1+k[i,4]);
           end
       end

    PPP[i,j]=p1*p2*p3*p4;

   end
   end
   end
   end
   end
   end
   end
   end

   d_g = 100000;
   g_old = 100000;
   g_final = 0;
   i = 0;
#    @bp
   tmp = similar(p);
   while (d_g / (g_old + g_final) > .001e-5)
       i = i + 1;
    #    p=p*PPP;
       mul!(tmp, p, PPP) 
       p, tmp = tmp, p;
       ggg = 1 ./ sum( (1 ./ g), dims=2 ); #%sums rows
       g_final = dot(ggg, p); # (p*ggg)[1];
       d_g = abs(g_old - g_final);
       g_old = g_final;
    #    print(" ",i,", ", d_g / (g_old + g_final), " ")
   end
#    print(i, " ")
#    println(i)
#    @bp
   return g_final, i;
end
# end

C++ code:

#include <math.h>
#include <iostream>
#include <time.h>
#include <fstream>

double MC36SM_Mindaugo_SS(double vj, double *par, double *p, double Pt, double pc1c2, double pc2c1, int &iter)
{
	double *A = par;
	double *v0 = par + 1 * 4;
	double *Go = par + 2 * 4;
	double *Gc = par + 3 * 4;
	double *Ro = par + 4 * 4;
	double *Rc = par + 5 * 4;
	// double *Pt = par + 6 * 4; //not used
	double *P = par + 7 * 4;

	double g[36][4] = {};// initialize to 0
	double v[36][4] = {};
	double k[36][4] = {};
	double R[36][4] = {};

	//states conductances
	for (int i1 = 0; i1 < 2; i1++)
		for (int i2 = 0; i2 < 3; i2++)
			for (int i3 = 0; i3 < 3; i3++)
				for (int i4 = 0; i4 < 2; i4++)
				{
					if (i1 == 0)
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][0] = Go[0];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][0] = Ro[0];
					}
					else
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][0] = Gc[0];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][0] = Rc[0];
					}
					if (i2 == 0)
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][1] = Go[1];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][1] = Ro[1];
					}
					else
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][1] = Gc[1];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][1] = Rc[1];
					}
					if (i3 == 0)
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][2] = Go[2];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][2] = Ro[2];
					}
					else
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][2] = Gc[2];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][2] = Rc[2];
					}
					if (i4 == 0)
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][3] = Go[3];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][3] = Ro[3];
					}
					else
					{
						g[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][3] = Gc[3];
						R[i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1][3] = Rc[3];
					}

				}

	double h[36][4] = {};
	for (int i = 0; i < 4; i++) 	for (int j = 0; j < 36; j++)		h[j][i] = g[j][i];

	for (int a = 0; a < 4; a++) // voltages and rectification
	{
		double gg[36] = {};
		for (int i = 0; i < 36; i++) 		//gg = 1. / sum(1. / g, 2); %sums rows
		{
			for (int j = 0; j < 4; j++)
				gg[i] += 1 / g[i][j];
			gg[i] = 1 / gg[i];
		}
		for (int i = 0; i < 4; i++) //v=vj*[gg gg gg gg]./g;
			for (int j = 0; j < 36; j++)
				v[j][i] = vj * gg[j] / g[j][i];
		for (int i = 0; i < 4; i++) //g=g.*exp(v./R);
			for (int j = 0; j < 36; j++)
				g[j][i] = h[j][i] * exp(v[j][i] / R[j][i]);
	}

	for (int i = 0; i < 36; i++) //k=exp(repmat(A,16,1).*(v.*repmat(P,16,1)-repmat(v0,16,1)));
		for (int j = 0; j < 4; j++)
			k[i][j] = exp(A[j] * (v[i][j] * P[j] - v0[j]));

	double K = Pt;
	double PPP[36][36] = {};
	//------------Matrica P-------------------------------------------------
	double l = pc1c2 / pc2c1;
	double p1, p2, p3, p4;
	for (int i1 = 0; i1 < 2; i1++)
		for (int i2 = 0; i2 < 3; i2++)
			for (int i3 = 0; i3 < 3; i3++)
				for (int i4 = 0; i4 < 2; i4++)
					for (int j1 = 0; j1 < 2; j1++)
						for (int j2 = 0; j2 < 3; j2++)
							for (int j3 = 0; j3 < 3; j3++)
								for (int j4 = 0; j4 < 2; j4++)
								{
									int i = i1 * 18 + i2 * 6 + i3 * 2 + i4 * 1;
									int j = j1 * 18 + j2 * 6 + j3 * 2 + j4 * 1;
									////////////// p1
									if (i1 == 0)
									{
										if (j1 == 0)
											p1 = (1 - K * k[i][0] / (1 + k[i][0]));
										else
											p1 = K * k[i][0] / (1 + k[i][0]);
									}
									else
									{
										if (j1 == 1)
											p1 = (1 - K / (1 + k[i][0]));
										else
											p1 = K / (1 + k[i][0]);
									}
									///////////////p2
									if (i2 == 0)
									{
										if (j2 == 0)
											p2 = (1 - K * k[i][1] / (1 + k[i][1]));
										if (j2 == 1)
											p2 = K * k[i][1] / (1 + k[i][1]);
										if (j2 == 2)
											p2 = 0;
									}
									else if (i2 == 1)
									{
										if (j2 == 0)
											p2 = (K / (1 + k[i][1]))*(1 - pc1c2);
										if (j2 == 1)
											p2 = (1 - K / (1 + k[i][1]))*(1 - pc1c2);
										if (j2 == 2)
											p2 = pc1c2;
									}
									else if (i2 == 2)
									{
										if (j2 == 0)
											p2 = 0;
										if (j2 == 1)
											p2 = pc2c1;
										if (j2 == 2)
											p2 = 1 - pc2c1;
									}
									////////////////p3
									if (i3 == 0)
									{
										if (j3 == 0)
											p3 = (1 - K * k[i][2] / (1 + k[i][2]));
										if (j3 == 1)
											p3 = K * k[i][2] / (1 + k[i][2]);
										if (j3 == 2)
											p3 = 0;
									}
									else if (i3 == 1)
									{
										if (j3 == 0) //c1->o
											p3 = (K / (1 + k[i][2]))*(1 - pc1c2);
										if (j3 == 1)
											p3 = (1 - K / (1 + k[i][2]))*(1 - pc1c2);
										if (j3 == 2)
											p3 = pc1c2;
									}
									else if (i3 == 2)
									{
										if (j3 == 0)
											p3 = 0;
										if (j3 == 1)
											p3 = pc2c1;
										if (j3 == 2)
											p3 = 1 - pc2c1;
									}
									/////////////////p4
									if (i4 == 0)
									{
										if (j4 == 0)
											p4 = (1 - K * k[i][3] / (1 + k[i][3]));
										else
											p4 = K * k[i][3] / (1 + k[i][3]);
									}
									else
									{
										if (j4 == 1)
											p4 = (1 - K / (1 + k[i][3]));
										else
											p4 = K / (1 + k[i][3]);
									}
									PPP[i][j] = p1 * p2*p3*p4;
								}

	// -------------- SS skaiciavimas
	double d_g = 100000;
	double g_old = 100000;
	double g_final = 0;
	int i = 0;
	while (d_g / (g_old + g_final) > .001e-5)
	{
		i++;
		double q[36] = {};
		for (int j = 0; j < 36; j++) for (int i = 0; i < 36; i++)	 q[j] += p[i] * PPP[i][j]; // q=p*PPP;
		double ggg[36] = {};
		for (int i = 0; i < 36; i++) 		//ggg = 1. / sum(1. / g, 2); %sums rows
		{
			for (int j = 0; j < 4; j++)	ggg[i] += 1 / g[i][j];
			ggg[i] = 1 / ggg[i];
		}

		g_final = 0;
		for (int j = 0; j < 36; j++) //g_final = q*ggg';
			g_final += q[j] * ggg[j];

		for (int i = 0; i < 36; i++)	p[i] = q[i]; //grazina p

		d_g = fabs(g_old - g_final);
		g_old = g_final;
	}
	iter += i;
	return g_final;
}

int main()
{
    const int T = 100;
	double gj[T] = {};
	double vj[T] = {};
	double par[4*8] = { .15, .15, .1, .1,   20, 20, 30, 30,   20e-9, 20e-9, 40e-9, 40e-9,  2e-9, 2e-9, 1e-15, 1e-15,
						1e10, 1e10, 1e10, 1e10,    1e10, 1e10, 1e10, 1e10,    0.5, 0.5, 0.5, 0.5,  -1, 1, -1, 1  };
    // parAllFast = [ .15 20 20e-9 2e-9  1e-15 1e-15 .5 -1;
    //                .15 20 20e-9 2e-9  1e-15 1e-15 .5  1;]
    // parAllSlow = [ .1  30 40e-9 1e-15 1e10  1e10  .5 -1;
    //                .1  30 40e-9 1e-15 1e10  1e10  .5  1];						
    // double par[4*8] = { .15, 28, 24e-9, 24e-9/10, 800, 50, .01, -1,   .1, 36, 140e-9, 140e-9/10, 1e10, 1e10, .01, 1,
    //                     .15, 28, 24e-9, 24e-9/10, 800, 50, .01, -1,   .1, 36, 140e-9, 140e-9/10, 1e10, 1e10, .01, 1  };
    double p[36] = {};
    p[0] = 1;
	std::cout << "Starting..\n";
	for (int t = 0; t < T; t++)
	{
			vj[t] = (double) 120 * t / T;
	}
	clock_t t_start, t_end;
	t_start = clock();
	int iter = 0;
	for (int t = 0; t < T; t++)
	{
		gj[t] = MC36SM_Mindaugo_SS(vj[t],par,p,0.5,0.01,0.001,iter);
	}
	t_end = clock();
	std::cout << "Time required for execution: "
		<< (double)(t_end - t_start) / CLOCKS_PER_SEC
		<< " seconds." << "\n";
	std::ofstream outfile("outfile.txt");
	for (int t = 0; t < T; t++)
	{
		outfile << vj[t] << '\t' << gj[t] << std::endl;
	}
	std::cout << "Done. " << iter << "\n";
	outfile.close();
	return 0;
}

As I noted above, the line ggg = 1 ./ sum( (1 ./ g), dims=2 ); is catastrophically bad. You should relocate this line outside of the while loop (compute ggg only once).

Modern compilers can often fix bad code like this, by recognizing that ggg is independent of the loop (“loop invariant”) and hence moving (“hoisting”) its computation outside of the loop. This requires a lot of analysis, though: Can any pointers alias? Is there any function call that can have side-effects that change this?

In your C++ code, the compiler gets all the required info: The loop has no function calls / allocations, and ggg is double[32]. In your julia code, you refuse to tell the compiler about the size (i.e. use MVector from StaticArrays) and instead use the built-in dynamically sized heap-allocated Array. Hence, runtime calls and no LICM.

But the issue is not C++ vs julia or compiler flags, it is simply that you should hoist these loop invariants by hand, both in julia and C++.

Never rely on compilers to give complexity-class improvements: This makes code unreadable and performance unreliable.

Furthermore, you can write the loop computing ggg in 3 lines of C++. If this was inside a hot loop (i.e. if it could not be manually hoisted), then it would be better to write out the loop in julia as well: This is more readable and faster than a complicated dot-notation construct.

12 Likes

The difference in iteration counts is telling. For x86 registers have more precision than memory for doubles ( in a register it’s 80-bit ), so you can end up with different results given initial conditions depending on how the registers are utilized. You may have denormals as well. Try setting the cpu to flush them in Julia, might be a source of the performance difference.

from: 50x speed difference in gemv for different values in vector

set_zero_subnormals(true)
1 Like

~ 7.2x speedup:

using LinearAlgebra
using BenchmarkTools
#%%
function run()
    println("starting..")
    vj = LinRange(0, 120, 100);
    gj = zeros(length(vj))
    parAllFast = [ .15 20 20e-9 2e-9  1e10  1e10  .5 -1;
                   .15 20 20e-9 2e-9  1e10  1e10  .5  1;]
    parAllSlow = [ .1  30 40e-9 1e-15 1e10  1e10  .5 -1;
                   .1  30 40e-9 1e-15 1e10  1e10  .5  1];
    par = [parAllFast; parAllSlow];
    pc1c2 = 0;
    pc2c1 = 0;
    pc1c2 = 0.01;
    pc2c1 = 0.001;
    ppp = zeros(36); #made it column wise instead of row-wise
    ppp[1] = 1;
    iter = 0;
    for i = 1:length(vj)
        gj[i], ii = MC36SM_Mindaugo_SS(vj[i], par, ppp, pc1c2, pc2c1);
        iter += ii;
    end
    println("done. ", iter)
    return vj, gj
end

function MC36SM_Mindaugo_SS(vj::Float64, par::Array{Float64,2}, p::Array{Float64}, pc1c2::Float64, pc2c1::Float64)
    PPP = zeros(36,36);
    p1 = 0;
    p2 = 0;
    p3 = 0;
    p4 = 0;

    A = par[:,1]';
    v0 = par[:,2]';
    Go = par[:,3]';
    Gc = par[:,4]';
    Ro = par[:,5]';
    Rc = par[:,6]';
    Pt = par[:,7]';
    P = par[:,8]';;

   K = Pt[1];
   g = zeros(36,4);
   gg = zeros(36) #buffer for the sums
   v = zeros(36,4);
   k = zeros(36,4);
   R = zeros(36,4);

   for i1=1:2
       for i2=1:3
           for i3=1:3
               for i4=1:2
                   if (i1==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Go[1];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Ro[1];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Gc[1];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,1]=Rc[1];
                   end
                   if (i2==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Go[2];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Ro[2];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Gc[2];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,2]=Rc[2];
                   end
                   if (i3==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Go[3];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Ro[3];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Gc[3];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,3]=Rc[3];
                   end
                   if (i4==1)
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Go[4];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Ro[4];
                   else
                       g[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Gc[4];
                       R[(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1,4]=Rc[4];
                   end
               end
           end
       end
   end

   for a = 1:4 #// voltages and rectification
	   fill!(gg, 0.0)
	   row_sum!(gg, g) #happens twice so took it out into function
	   for j = 1:4, i = 1:36 #loop so no allocations
   	       @inbounds v[i, j] = vj * gg[i] / g[i, j];
	   end
	   g .*= exp.(v./R);
   end
   for i = 1:36
       for j = 1:4
           @inbounds k[i,j] = exp(A[j] * (v[i,j] * P[j] - v0[j]));
       end
   end

   for i1=1:2, i2=1:3, i3=1:3, i4=1:2, j1=1:2, j2=1:3, j3=1:3, j4=1:2
      i=(i1-1)*18+(i2-1)*6+(i3-1)*2+(i4-1)*1+1;
      j=(j1-1)*18+(j2-1)*6+(j3-1)*2+(j4-1)*1+1;


       if (i1==1)
           if (j1==1)
               p1=(1-K*k[i,1]/(1+k[i,1]));
           else
               p1=K*k[i,1]/(1+k[i,1]);
           end
       else
           if (j1==2)
               p1=(1-K/(1+k[i,1]));
           else
               p1=K/(1+k[i,1]);
           end
       end
       if (i2==1)
           if (j2==1)
               p2=(1-K*k[i,2]/(1+k[i,2]));
           end
           if (j2==2)
               p2=K*k[i,2]/(1+k[i,2]);
           end
           if (j2==3)
               p2=0;
           end
       end

       if (i2==2)
           if (j2==1)
               p2=(K/(1+k[i,2]))*(1-pc1c2);
           end
           if (j2==2)
               p2=(1-K/(1+k[i,2]))*(1-pc1c2);
           end
           if (j2==3)
               p2=pc1c2;
           end
       end

       if (i2==3)
           if (j2==1)
               p2=0;
           end
           if (j2==2)
               p2=pc2c1;
           end
           if (j2==3)
               p2=1-pc2c1;
           end
       end
       if (i3==1)
           if (j3==1)
               p3=(1-K*k[i,3]/(1+k[i,3]));
           end
           if (j3==2)
               p3=K*k[i,3]/(1+k[i,3]);
           end
           if (j3==3)
               p3=0;
           end
       end

       if (i3==2)
           if (j3==1)
               p3=(K/(1+k[i,3]))*(1-pc1c2);
           end
           if (j3==2)
               p3=(1-K/(1+k[i,3]))*(1-pc1c2);
           end
           if (j3==3)
               p3=pc1c2;
           end
       end

       if (i3==3)
           if (j3==1)
               p3=0;
           end
           if (j3==2)
               p3=pc2c1;
           end
           if (j3==3)
               p3=1-pc2c1;
           end
       end
       if (i4==1)
           if (j4==1)
               p4=(1-K*k[i,4]/(1+k[i,4]));
           else
               p4=K*k[i,4]/(1+k[i,4]);
           end
       else
           if (j4==2)
               p4=(1-K/(1+k[i,4]));
           else
               p4=K/(1+k[i,4]);
           end
       end
# 
    PPP[j, i]=p1*p2*p3*p4; #fill in transpose so no later we can do column wise mult!

   end

   d_g = 100000;
   g_old = 100000;
   g_final = 0;
   i = 0;
   tmp = similar(p);
   test = zeros(36)
   while (d_g / (g_old + g_final) > .001e-5)
       i = i + 1;
       mul!(tmp, PPP, p) # column-wise mult = around 2x speedup
       p, tmp = tmp, p;
       fill!(gg, 0.0)
       row_sum!(gg, g)
       g_final = dot(gg, p); # (p*ggg)[1];
       d_g = abs(g_old - g_final);
       g_old = g_final;
   end
   return g_final, i;
end

@inline function row_sum!(buffer, g) #fill the buffer in a loop -> no allocations
   for j = 1:size(g, 2), i = 1:size(g, 1)
	   @inbounds buffer[i] += 1 / g[i, j]
   end
   buffer .= 1 ./ buffer
end
#%%
@btime vj, gj=run()

  1. buffer gg for intermediate summing
  2. column vectors instead of row vectors

went from 260ms to 36ms on my computer, optimized c++ takes 50 ms

EDIT: cleaning up the ifs into if/elseif/elses and id recalculation gives me another 2ms, rest is up to you buddy :smiley:

7 Likes