Trying to understand how to write good code (example)

Hi folks,

I have written a diffusion Monte Carlo code in Julia 0.5.0 that, unfortunately, performs about 5 or 6 times slower than the corresponding Fortran version. I have learned that it is very easy to write bad code in Julia, and that several details have to be taken into account to make code fast, but somehow I keep failing to understand the basics. I hope somebody here can help me understand what happens in this piece of code, to start with.
Consider the following array:

WW = rand(64,3,10);

and the following function

function Drift_Force(Wr::Array{Float64},Lbox::Float64)
    Np,dim,NW = size(Wr)
    F         = zeros(Np,dim,NW)
    ri        = zeros(dim,1)    
    rj        = zeros(dim,1)
    rij       = zeros(dim,1)
    for iw in 1:NW
        for i in 1:Np-1
            ri  = zeros(3,1);
            for j in i+1:Np
                rj = (1.0,2.0,3.0);
            end;
        end;        
    end;    
end;

Now I try

Lbox = 5.75;
@time Drift_Force(WW,Lbox);

to get

0.000081 seconds (638 allocations: 84.516 KB)

But if I replace the rj = (1.0,2.0,3.0) line with rj = zeros(3,1)
I get

0.001496 seconds (20.80 k allocations: 2.236 MB)

instead. This last version takes much longer and, surprisingly, allocates much more memory.
Would you say this is normal, expected behaviour? How can it be that by adding three Floats to an
array I get an overhead of 3 MB? I ask because this is just part of a more complex code that, in the end, keeps allocating and allocating memory to the point that it becomes much slower than the Fortran version…

Thanks in advance for your kind help,

Ferran.

Welcome! I’ve added code formatting to your post. You can surround code blocks with ``` to make them a bit easier to read.

Your function isn’t doing what you think it’s doing. Take a look at this blog post for details.


A few things. First of all, you likely don’t want to be changing the type of rj if this corresponds to a real application. Either have it both start and end as a tuple, or both start and end as an Array. It doesn’t make a big difference here, but type-stability will make a difference if you’re not hiding these allocations behind some function barrier.
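A minimal sketch of what type-stability means here (toy functions of my own, not from the original code). In the first function, rj starts as a Matrix and is rebound to a Tuple, so the compiler cannot pin down its type; in the second, rj is a Tuple throughout:

```julia
# Type-unstable: rj is first a Matrix{Float64}, then a Tuple,
# so its type cannot be inferred inside the loop.
function unstable(n)
    rj = zeros(3, 1)
    for _ in 1:n
        rj = (1.0, 2.0, 3.0)
    end
    return rj
end

# Type-stable: rj is a Tuple from start to finish.
function stable(n)
    rj = (0.0, 0.0, 0.0)
    for _ in 1:n
        rj = (1.0, 2.0, 3.0)
    end
    return rj
end
```

Running `@code_warntype unstable(1)` should flag rj with a union of types, while `stable` infers cleanly.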

But more importantly, as @mbauman notes, you’re creating new arrays in a loop.

        for i in 1:Np-1
            ri  = zeros(3,1);
            for j in i+1:Np
               rj = (1.0,2.0,3.0);
            end            
        end

That isn’t mutating arrays: it’s creating and binding new ones. What you probably want to do instead is use

        for i in 1:Np-1
            ri  .= zero(Float64)
            for j in i+1:Np
               rj .= zero(Float64)
            end            
        end

or fill!(rj,zero(Float64)), or any variants of that which mutate instead of making new arrays. That’s where your allocations are coming from.
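To make the rebinding-vs-mutating difference concrete, here is a small sketch of my own (the function names are mine, not from the thread's code):

```julia
# Rebinding: every pass calls zeros(3, 1) and allocates a brand-new array.
function rebind_zeros(n)
    rj = zeros(3, 1)
    for _ in 1:n
        rj = zeros(3, 1)   # new array, new allocation, every iteration
    end
    return rj
end

# Mutating: fill! writes zeros into the one array allocated up front.
function fill_zeros(n)
    rj = zeros(3, 1)
    for _ in 1:n
        fill!(rj, 0.0)     # reuses rj's memory; nothing new is allocated
    end
    return rj
end
```

Timing each after a warm-up call (`@time rebind_zeros(10^4)` vs `@time fill_zeros(10^4)`) should show the first version's allocation count growing with n while the second stays constant.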

@ChrisRackauckas are you sure about that?
I just tested the .= version (on Julia 0.5) and I get

0.001230 seconds (20.80 k allocations: 2.236 MB)

vs the Kb I get if in both ri and rj I just put tuples

0.000009 seconds (8 allocations: 15.609 KB)

You’re testing it wrong. What did you test?

function Drift_Force(Wr::Array{Float64},Lbox::Float64)
    Np,dim,NW = size(Wr)
    F         = zeros(Np,dim,NW)
    ri        = zeros(dim,1)    
    rj        = zeros(dim,1)
    rij       = zeros(dim,1)
    for iw in 1:NW
        for i in 1:Np-1
            ri  .= zero(Float64)
            for j in i+1:Np
                rj .= zero(Float64)
            end
        end     
    end   
end

WW = rand(64,3,10)
Lbox = 5.75
@time Drift_Force(WW,Lbox)
@time Drift_Force(WW,Lbox)


0.000362 seconds (8 allocations: 15.609 KB)

I tested the exact same code that you wrote above (using .=) but with the function zeros instead of zero. Isn’t your code replacing ri with a single Float instead of an Array of floats?

zeros creates an array. zero does not. I am mutating the array to have a new value, which is zero. It’s the same thing as:

for i in eachindex(ri)
  ri[i] = 0.0
end

The array is never created.
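In other words, a tiny illustration:

```julia
ri = ones(3)         # one array, allocated up front
ri .= zero(Float64)  # broadcasts the scalar 0.0 over ri's existing memory
# ri is now all zeros; no second array was created
```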


But you might as well write

ri  .= 0

It gets promoted to the correct type anyway.

BTW. It might be intentional on your part, but in many cases zeros(dim, 1) can be replaced with zeros(dim). An Nx1 matrix is not quite the same as a length N vector, although in most cases the difference doesn’t matter too much.
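For example:

```julia
v = zeros(3)     # 3-element Vector{Float64}
m = zeros(3, 1)  # 3×1 Matrix{Float64}

size(v)          # (3,)
size(m)          # (3, 1)
v == m           # false: arrays with different shapes never compare equal
v == vec(m)      # true once the matrix is flattened to a vector
```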

In this case, yes. But there are many cases where the promotion will fail, and then zero(eltype(ri)) is the only safe way. Those are edge cases, though, and probably only matter in package/library development.

Even when just filling an array like this?

Example: Unitful numbers will throw an error when converting an integer to a number with units.

Interesting. But zero(Float64) seemed a bit like overkill.

To the OP: If you can use StaticArrays.jl you might get some serious speed-ups. (You might be able to use it if dim is a fixed, relatively small number, that can be decided at compile time.)


Hi again,
thanks for the responses 🙂 I see your modifications do the trick, although I fail to understand why zero(Float64) fills the object instead of allocating space. But in any case, as you can imagine, this code was just a toy. In fact I want ri and rj to be columns of Wr. If I try

function Drift_Force(Wr::Array{Float64},Lbox::Float64)
    Np,dim,NW = size(Wr)
    F         = zeros(Np,dim,NW)
    ri        = zeros(dim,1)    
    rj        = zeros(dim,1)
    rij       = zeros(dim,1)
    for iw in 1:NW
        for i in 1:Np-1
            ri  = Wr[i,:,iw];
            for j in i+1:Np
                rj = Wr[j,:,iw];
            end;            
        end;        
    end;    
end;

then once again I get

0.002811 seconds (41.59 k allocations: 2.553 MB)

however, in the spirit of avoiding array allocations, I could instead use

function Drift_Force(Wr::Array{Float64},Lbox::Float64)
    Np,dim,NW = size(Wr)
    F         = zeros(Np,dim,NW)
    ri        = zeros(dim,1)    
    rj        = zeros(dim,1)
    rij       = zeros(dim,1)
    for iw in 1:NW
        for i in 1:Np-1
            ri  = Wr[i,:,iw];
            for j in i+1:Np
                for k in 1:3
                    rj[k] = Wr[j,k,iw];
                end;
            end;            
        end;        
    end;    
end;

and this time things improve

0.000255 seconds (1.27 k allocations: 94.359 KB)

I could do the same transformation with ri, but that assignment runs far fewer times than the rj one, so I don’t care much.

Bottom line: I like the array manipulation capabilities of Julia, but as you can see, I fail to understand how to use them properly. How should I transform the code to get as few allocations, and the same speed, as this last version? I understand I can use @view as follows

function Drift_Force(Wr::Array{Float64},Lbox::Float64)
    Np,dim,NW = size(Wr)
    F         = zeros(Np,dim,NW)
    ri        = zeros(dim,1)    
    rj        = zeros(dim,1)
    rij       = zeros(dim,1)
    for iw in 1:NW
        for i in 1:Np-1
            ri  = @view Wr[i,:,iw];
            for j in i+1:Np
                rj = @view Wr[j,:,iw];
            end;            
        end;        
    end;    
end;

and this time I get

0.001843 seconds (20.80 k allocations: 990.141 KB)

which is much better than the first version, but not close to the element-wise one. What, then, would be the right way to write it to achieve the same results?

Thanks in advance once again,

Ferran.

Code like:

rj = zeros(dim,1)
# ...
rj = @view Wr[i,:,iw]

does not mutate rj in-place. It completely loses track of whatever it was before. If you’d like to mutate the array, you need to either use indexed assignment (like rj[:] = …) or the new .= syntax. I highly recommend reading the blog post I linked above.
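A small sketch of mine showing the difference:

```julia
Wr = rand(4, 3, 2)

rj = zeros(3)
old = rj
rj = @view Wr[1, :, 1]   # rebinding: the name rj now refers to a view;
                         # the original zeros(3) array is untouched (and abandoned)

rj2 = zeros(3)
rj2 .= @view Wr[1, :, 1] # mutating: the row's values are copied into rj2's memory
```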


Again, you’re replacing rj instead of mutating it.

ri .= @view Wr[i,:,iw]
rj .= @view Wr[j,:,iw]

Hi,
regarding mutating/replacing values, what I can’t figure out is why this

rj = @view Wr[j,:,iw]

is replacing instead of mutating. I would even say that

rj = Wr[j,:,iw]

should be mutating by default, but I’m learning that, for some reason I can’t understand, it does not. Anyway, by changing the code in the way suggested

function Drift_Force(Wr::Array{Float64},Lbox::Float64)
    Np,dim,NW = size(Wr)
    F         = zeros(Np,dim,NW)
    ri        = zeros(dim,1)    
    rj        = zeros(dim,1)
    rij       = zeros(dim,1)
    for iw in 1:NW
        for i in 1:Np-1
            ri  .= @view Wr[i,:,iw];
            for j in i+1:Np
                rj .= @view Wr[j,:,iw];
            end;            
        end;        
    end;    
end;

one still gets

0.002208 seconds (41.59 k allocations: 1.284 MB)

which performs worse than the explicit element-wise version I posted above. Similar things happen when I use

ri[:]  = @view Wr[i,:,iw];

0.001705 seconds (20.80 k allocations: 990.141 KB)

I would still like to know what version performs as well as the element-wise one above…

Thanks again,

Ferran.

As far as why normal assignment in Julia doesn’t mutate… I think it’s best to think about each operation independently. A[i,:,iw] is just a normal function call. In fact, it is immediately transformed into getindex(A, i, :, iw). This function call has a return value. The thing that the function returns is an array. It’s an array that contains all the values of the row you selected. The function doesn’t know that it’s being used as the right hand side of a = expression. It doesn’t know that the binding on the left hand side had previously been associated with a similarly-shaped array. You could just as easily have written sum(A[i,:,iw]), which means that it never gets associated with a name. So it needs a place to store all the values of that row. It needs to allocate its own storage.
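To make that concrete (a toy example of mine): indexing with a slice calls getindex, which returns a freshly allocated copy, while view returns a lightweight wrapper over the same memory:

```julia
A = rand(4, 3, 2)

row_copy = A[1, :, 1]        # sugar for getindex(A, 1, :, 1): allocates a fresh Vector
row_view = view(A, 1, :, 1)  # a lightweight wrapper over A's own memory

A[1, 1, 1] = 99.0
row_copy[1]  # still the old value: the copy was made before the write
row_view[1]  # 99.0: the view reads A's memory directly
```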

Your toy example is still not representative of any real work. In fact, a sufficiently advanced compiler could detect that there are no observable side-effects from those for loops and simply jump straight to returning nothing. That’s even more likely to occur in the scalar case. It doesn’t look like that’s what is happening, but it’s something to look out for.


The . version did get some performance improvements in v0.6, but it will always have a slight overhead over looping when the arrays are really small. This is because it still has to allocate the views, so if the computation is very tiny, the cost of making the views actually becomes part of the timing. This will never matter for arrays larger than, I don’t know, size 10? The blog post on .-fusion mentions this (again, on v0.6, where broadcast is slightly faster). In the end, it’s a tradeoff between convenience/readability and a tiny amount of speed, but in large computations I don’t think you’d notice it very much.
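As a sketch of the tradeoff being described (function names are mine): both of these copy one tiny row of Wr into rj, but the first does pure scalar work while the second goes through a view:

```julia
# Explicit loop: scalar reads and writes, no temporaries at all.
function copy_loop!(rj, Wr, j, iw)
    for k in 1:size(Wr, 2)
        rj[k] = Wr[j, k, iw]
    end
    return rj
end

# Broadcast over a view: same result, but on 0.5/0.6 the view wrapper
# itself was heap-allocated, which shows up when the row is tiny.
function copy_bcast!(rj, Wr, j, iw)
    rj .= @view Wr[j, :, iw]
    return rj
end
```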

That said, I think there are improvements planned for things like stack-allocated views to reduce this overhead even further? I don’t know the details there, but my takeaway from the whole thing is that you only need to loop if you expect tiny arrays.

Just curious - What do you mean by Kb here? Thanks.

But that should be ok here, right? In fact, I would expect not mutating to be faster: since the views do not allocate, you’re just applying the label rj to a view. And, in fact, benchmarking seems to bear that out: dropping the dot gives the same number of allocations but a ~40% speedup on my PC, because you’re not copying data into rj.