Creating the same matrix in Julia takes longer than in Matlab

I’m moving my question here from GitHub, where I asked it a few hours ago. The issue is that one line makes my Julia code run slower than my Matlab code, and I don’t see why or how to solve it. In Matlab I measure time with ( tCPU=cputime; “code”; cputime-tCPU ) and in Julia I use @time.

I’m posting my example (Julia code) here; maybe you can spot something strange even without the rest of the code. I time my function like this:

@time ip( edgesTotal,trianglesTotal,
trianglePlus,triangleMinus,center,center2)

And the function is:

function ip( edgesTotal,trianglesTotal,
             trianglePlus,triangleMinus,center,center2)

    gP=complex(zeros(1,9,edgesTotal))   #K complex -> g complex -> gP complex

    for p=1:trianglesTotal
        plus=find(trianglePlus[1:736]-p.==0)
        minus=find(triangleMinus[1:736]-p.==0)
        D=center2.-center[:,p]
        R=sqrt.(sum(D.*D,1))            # [1 9 trianglesTotal]
        g=exp(-K.*R)./R                 # [1 9 trianglesTotal]
        gP=g[:,:,trianglePlus]
    end

end

This gives the following result (note the memory allocation):
0.943518 seconds (7.22 M allocations: 559.592 MB, 15.35% gc time)

When I comment out the line gP=g[:,:,trianglePlus], I get the following result instead:
0.375961 seconds (53.91 k allocations: 294.959 MB, 6.79% gc time)

Corresponding times in Matlab:
Entire code: 0.5148
Excluding gP-line: 0.3900

The times differ a bit each time I run the code (about 0.1 s difference), but these are typical values for me, and as you can see there’s something strange going on in the Julia code. Does anyone have an idea what’s making my Julia code slow?

It would be helpful to measure each line’s performance separately to find the difference, but at the very least you can convert all the array copies produced by slicing (e.g. trianglePlus[1:736]) to views. The simplest way to do so is to wrap the whole loop in the @views macro.
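
For example, something along these lines (untested, and just reusing the names from your loop):

    @views for p = 1:trianglesTotal
        # every slicing expression in this block now yields a view instead of a copy;
        # the broadcasted comparisons still allocate, but the slice copies are gone
        plus  = find(trianglePlus[1:736] .- p .== 0)
        minus = find(triangleMinus[1:736] .- p .== 0)
        # ... rest of the loop body unchanged ...
    end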


Please quote your code (I edited your GitHub version, but I can’t do that here).

Given that you do gP=complex(zeros(1,9,edgesTotal)) outside the loop, I assume you intended to preallocate the array, but it actually only causes extra allocation. In Julia, variables are always references, and assignment always means rebinding the reference. There are countless ways to improve the performance, but a few very simple ones are:

function ip(edgesTotal, trianglesTotal, trianglePlus, triangleMinus, center, center2)

    gP = zeros(Complex128, 1, 9, edgesTotal)     #K complex -> g complex -> gP complex

    for p = 1:trianglesTotal
        plus = find(trianglePlus[1:736] .- p .== 0)
        minus = find(triangleMinus[1:736] .- p .== 0)
        D = center2 .- center[:, p]
        R = sqrt.(sum(D .* D, 1))                             # [1 9 trianglesTotal]
        g = exp.(.-K .* R) ./ R                               #  [1 9 trianglesTotal]
        gP .= @view g[:,:,trianglePlus]
    end
end

Not tested since you didn’t give the input.
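
To illustrate the point about assignment only rebinding a name (a toy example, nothing to do with your data):

    a = zeros(3)   # allocate an array and bind the name `a` to it
    b = a          # `b` refers to the same array; no copy is made
    a = ones(3)    # rebinds `a` to a brand-new array; `b` still holds the zeros
    b .= 2         # `.=` writes into the existing array in place, which is why
                   # `gP .= ...` above actually reuses the preallocated array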

Don’t optimize Julia code like you would Matlab code. Don’t turn everything into vector operations that generate zillions of temporary arrays in your inner loops. From looking at your code, I can confidently predict that writing it in a more Julian style will speed it up by more than an order of magnitude.

However, it’s hard to be more specific without a complete working example. (e.g. you could post sample data for the arrays in a gist somewhere. Note that you don’t even use the plus and minus arrays in your inner loop.)


It’s usually a little bit disturbing to Julia fans, but there are problems out there that can be formulated best in a readable vectorised form. Down-selecting from a bigger matrix by an index, a conditional, etc. is a valid operation, not something someone invented as a code optimization.
I still don’t understand why (almost) everyone in the Julia fan camp assumes that vectorisation is only an optimization that forces you to write code you didn’t want.

I’m sorry, the first line is left over from when I tried to solve the issue with a for-loop instead (which, strangely enough, was faster, but not as fast as Matlab, and undoubtedly very ugly…). But thank you for the simpler way of writing it!

I changed to “gP .= @view g[:,:,trianglePlus]” as you recommended, but the time was the same as before. I’ve also tried gP=view(g,:,:,trianglePlus), and that part did become very much faster. The problem is that this is only one part of the function I want to write, and the rest of the code became very slow when I used view. Still, it could be that I should change the rest of the code instead. I only posted this part because the code was already very slow at this point.

I haven’t tried to improve the rest of the code since I wanted to solve this part first, and there’s probably a lot I can improve in the rest as well. I’m posting the entire function below so you’ll know what I want to achieve (since some solutions might affect the rest of the code’s performance, it’s probably good for you to know); you don’t have to look too closely at the rest of the code.
(How do I quote code?)

 function ip( edgesTotal,trianglesTotal,
                            edgeLength,K,
                            center,center2,
                            trianglePlus,triangleMinus,
                            R_P,R_M,
                            R_Plus2,R_Minus2,
                           cA,cFi)   
    Z=zeros(edgesTotal,edgesTotal)+im*zeros(edgesTotal,edgesTotal)

    for p=1:trianglesTotal
        plus=find(trianglePlus[1:736]-p.==0)
        minus=find(triangleMinus[1:736]-p.==0)
        D=center2.-center[:,p] #repeat(rep,inner=[1,1,trianglesTotal])
        R=sqrt.(sum(D.*D,1))                             # [1 9 trianglesTotal]
        g=exp(-K.*R)./R                               #  [1 9 trianglesTotal]   
        gP=g[:,:,trianglePlus]
        gM=g[:,:,triangleMinus]
        Fi=sum(sum(gP,1),2).-sum(sum(gM,1),2)                            # [1 1 edgesTotal]
        ZF= transpose(cFi).*reshape(Fi,edgesTotal,1)        # [edgesTotal 1]
        
        for k=1:length(plus)       
            n=plus[k]
            RP=R_Plus2[:,:,n]
            A=sum(sum(RP.*R_P,1).*gP,2)+sum(sum(RP.*R_M,1).*gM,2)  #SLOW  
            Z1= transpose(cA).*reshape(A,edgesTotal,1)  
            Z[:,n]=Z[:,n]+edgeLength[n]*(Z1+ZF)    
        end 

        for k=1:length(minus)       
            n=minus[k]
            RP=R_Minus2[:,:,n]
            A=sum(sum(RP.*R_P,1).*gP,2)+sum(sum(RP.*R_M,1).*gM,2)     
            Z1= transpose(cA).*reshape(A,edgesTotal,1)  
            Z[:,n]=Z[:,n]+edgeLength[n]*(Z1-ZF)    
        end 
    end
    return Z
end 

The OP is specifically asking about optimizations and how to reduce allocations. Suggesting that they avoid unneeded temporary arrays seems entirely appropriate.

I understand; thank you for your tips all the same! The problem is that I don’t know how to do that, but I might look into it shortly.

By quoting code I mean using markdown code quoting syntax. You can see the edit in the GitHub PR. (Basically, put ``` on its own line before and after the code block, and put ` around inline code.)

You should also give a sample input.

There are, unfortunately, problems with the code you posted: you’re missing an end after a for loop, the variable RHO_P is undefined, and probably more.

You should try to set up a full example with inputs, and test that it actually runs; then post the full example.

People here really like speeding up code, so there’s a good chance of getting help. But it works best if you make it easy and convenient to test the code.

Some background information here about allocations introduced by the Julia compiler. Also mentioned above:

[quote=“dfdx, post:2, topic:4410”]
all array copies produced by slicing (e.g. trianglePlus[1:736])
[/quote]

Someone used to writing vectorised code cannot understand the concept that accessing data (via slicing) creates a copy. There is the concept of @view, but I’d rather see the compiler decide when to copy (very seldom) and when to make a view (very often).

In what way does that render the answer unhelpful? And many of the allocations in the code are not due to slicing.

[quote=“lobingera, post:12, topic:4410”]
Someone used to writing vectorised code cannot understand the concept that accessing data (via slicing) creates a copy.
[/quote]

I am confident that they can.

There was a discussion a while back about making slices “copy on write”; wouldn’t this solve at least this particular allocation problem? Any chance this will come in 1.0?

I don’t think there are any plans to do this. You can always put @views in front of a block of code (e.g. a loop or a function) if you want all slices to be views.

Honestly, the slicing is the least of your worries with this sort of Matlab-style code. Every array expression (-, D.*D, find, etcetera) in your loop is allocating a new array. From the looks of things, you really should be able to collapse all of these array operations into one or two loops that do no allocations.
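
As a rough, untested sketch of what I mean, using the names from your first example and assuming K is a scalar: the D/R/g computation inside the p loop can be written as plain loops that fill preallocated buffers instead of building a chain of temporaries:

    # allocate once, before the `for p = 1:trianglesTotal` loop
    R = zeros(1, 9, trianglesTotal)
    g = zeros(Complex128, 1, 9, trianglesTotal)

    # then, inside the p loop, replace the D/R/g lines with:
    for t = 1:trianglesTotal, j = 1:9
        s = 0.0
        for i = 1:size(center, 1)
            d = center2[i, j, t] - center[i, p]   # same as D = center2 .- center[:, p]
            s += d * d
        end
        r = sqrt(s)
        R[1, j, t] = r
        g[1, j, t] = exp(-K * r) / r
    end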


But if you put @view in front then you may unintentionally overwrite your data.
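
A toy example of the kind of accident I mean (made-up names):

    A = collect(1:5)
    B = @view A[2:4]   # B aliases A; no copy is made
    B[1] = 100         # silently modifies A as well: A is now [1, 100, 3, 4, 5]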

Wouldn’t copy-on-write supply the best of both worlds here? Lightweight and non-allocating when you don’t want to write into them, but still safe. On the surface it looks like the perfect compromise, right? Or at least it seems strictly preferable to the current behaviour, and arguably better than default views.

What’s the downside?

Not lightweight at all, and it can be extremely hard to implement.

Lighter than a full copy, surely? Hard to implement, ok, I can get that.

It is not clear why that would be more of a problem for @view than for anything else.


With copy-on-write it wouldn’t be a problem at all.

My understanding of the dilemma: There was previously a push for making slices into views instead of copies. The reason was to avoid allocations and improve performance. The big drawback was that it was massively breaking, and the performance gains were not sufficient to justify this (at least for now). It also makes it easier to accidentally destroy your data, introducing hard-to-find bugs.

Explicitly using @view saves the allocations, but then you need to be very careful about overwriting. Also, it makes the code appear more cluttered.

Copy-on-write saves allocations when you just want a view, but is also safe. Would it even be a breaking change? It seems to combine the strengths of views and copies, and avoids the binary choice between them.

Please note that I am not making a strong claim about this. It is more of a question: what are the downsides?

As @yuyichao says, @views may not be lightweight (but stack-allocated ones may be?), and it’s hard to implement.

Anything else?