# Trying to understand how to write good code (example)

#1

Hi folks,

I have writen a diffusion Monte Carlo code in Julia 0.5.0 that, unfortunately, performs about 5 or 6 times slower than the corresponding fortran version. I have learned that it is very easy to write bad code in Julia, and that several detailed things should be taken into account in order to make the code fast -but somehow I fail to understand basic things. I hope somebody here can help me understand what happens to this piece of code, to start with.
Consider the following array:

``````WW = rand(64,3,10);
``````

and the following function

``````function Drift_Force(Wr::Array{Float64},Lbox::Float64)
Np,dim,NW = size(Wr)
F         = zeros(Np,dim,NW)
ri        = zeros(dim,1)
rj        = zeros(dim,1)
rij       = zeros(dim,1)
for iw in 1:NW
for i in 1:Np-1
ri  = zeros(3,1);
for j in i+1:Np
rj = (1.0,2.0,3.0);
end;
end;
end;
end;
``````

Now I try

``````Lbox = 5.75;
@time Drift_Force(WW,Lbox);
``````

to get

``````0.000081 seconds (638 allocations: 84.516 KB)
``````

But if I replace the rj = (1.0,2.0,3.0) line with rj = zeros(3,1)
I get

``````0.001496 seconds (20.80 k allocations: 2.236 MB)
``````

instead. This last version takes much longer and surprisingly allocates much more memory.
Would you say this is normal, expected behaviour? How can it be that by adding three Floats in an
array I get an overhead of 3MB? I ask because this is just part of a more complex code that, in the end, keeps allocating and allocating memory to the point that it becomes much slower than the fortran versionâ€¦

Ferran.

#2

Welcome! Iâ€™ve added code formatting to your post. You can surround code blocks with ````` to make them a bit easier to read.

Your function isnâ€™t doing what you think itâ€™s doing. Take a look at this blog post for details.

#3

A few things. First of all, you likely donâ€™t want to be changing the type of `rj` if this corresponds to a real application. Either have it both start and end as a tuple, or both start and end as an `Array`. It doesnâ€™t make a big difference here, but type-stability will make a difference if youâ€™re not hiding these allocations behind some function barrier.

But more importantly, as @mbauman notes, youâ€™re creating new arrays in a loop.

``````        for i in 1:Np-1
ri  = zeros(3,1);
for j in i+1:Np
rj = (1.0,2.0,3.0);
end
end
``````

That isnâ€™t mutating arrays: itâ€™s creating and binding new ones. What you probably want to do instead is use

``````        for i in 1:Np-1
ri  .= zero(Float64)
for j in i+1:Np
rj .= zero(Float64)
end
end
``````

or `fill!(rj,zero(Float64))`, or any variants of that which mutate instead of making new arrays. Thatâ€™s where your allocations are coming from.

#4

@ChrisRackauckas are you sure about that?
I just tested the `.=` (julia 0.5) version and I get

`0.001230 seconds (20.80 k allocations: 2.236 MB)`

vs the Kb I get if in both ri and rj I just put tuples

`0.000009 seconds (8 allocations: 15.609 KB)`

#5

Youâ€™re testing it wrong. What did you test?

``````function Drift_Force(Wr::Array{Float64},Lbox::Float64)
Np,dim,NW = size(Wr)
F         = zeros(Np,dim,NW)
ri        = zeros(dim,1)
rj        = zeros(dim,1)
rij       = zeros(dim,1)
for iw in 1:NW
for i in 1:Np-1
ri  .= zero(Float64)
for j in i+1:Np
rj .= zero(Float64)
end
end
end
end

WW = rand(64,3,10)
Lbox = 5.75
@time Drift_Force(WW,Lbox)
@time Drift_Force(WW,Lbox)

0.000362 seconds (8 allocations: 15.609 KB)
``````

#6

I tested the exact same code that you wrote above (using `.=`) but with the function `zeros` instead of `zero`. Isnâ€™t your rode replacing `ri` for a single Float instead of an Array of floats?

#7

`zeros` creates an array. `zero` does not. I am mutating the array to have a new value, which is zero. Itâ€™s the same thing as:

``````for i in eachindex(ri)
ri[i] = 0.0
end
``````

The array is never created.

#8

But you might as well write

``````ri  .= 0
``````

It gets promoted to the correct type anyway.

BTW. It might be intentional on your part, but in many cases `zeros(dim, 1)` can be replaced with `zeros(dim)`. An `Nx1` matrix is not quite the same as a length `N` vector, although in most cases the difference doesnâ€™t matter too much.

#9

In this case. There are many cases where the promotion will fail, and so `zero(eltype(ri))` is the only safe way. Those are edge cases though and probably only matter in package/library development.

#10

Even when just filling an array like this?

#11

Example: Unitful numbers will throw an error when converting an integer to a number with units.

#12

Interesting. But `zero(Float64)` seemed a bit like overkill.

To the OP: If you can use `StaticArrays.jl` you might get some serious speed-ups. (You might be able to use it if `dim` is a fixed, relatively small number, that can be decided at compile time.)

#13

Hi again,
thanks for the responses I see your modifications do the trick, although I fail to understand why the zero(Float64) fills the object instead of allocating space. But in any case, and as you can imagine, this code was just a oy. In fact I want to make ri and rj be columns of Wr. If I try

``````function Drift_Force(Wr::Array{Float64},Lbox::Float64)
Np,dim,NW = size(Wr)
F         = zeros(Np,dim,NW)
ri        = zeros(dim,1)
rj        = zeros(dim,1)
rij       = zeros(dim,1)
for iw in 1:NW
for i in 1:Np-1
ri  = Wr[i,:,iw];
for j in i+1:Np
rj = Wr[i,:,iw];
end;
end;
end;
end;
``````

then once again I get

0.002811 seconds (41.59 k allocations: 2.553 MB)

however in the spirit of avoiding array allocations, I could use instead

``````function Drift_Force(Wr::Array{Float64},Lbox::Float64)
Np,dim,NW = size(Wr)
F         = zeros(Np,dim,NW)
ri        = zeros(dim,1)
rj        = zeros(dim,1)
rij       = zeros(dim,1)
for iw in 1:NW
for i in 1:Np-1
ri  = Wr[i,:,iw];
for j in i+1:Np
for k in 1:3
rj[k] = Wr[j,k,iw];
end;
end;
end;
end;
end;
``````

and this time things improve

0.000255 seconds (1.27 k allocations: 94.359 KB)

I could do the same transformation with ri but this is done much less times than the rj assignment, so I donâ€™t care much.

Bottom line is: I like the array manipulation capabilities of Julia but I fail to understand how to use them properly, as I see. How should I transform the code to have as little allocations as in this last version, and with the same speed? I understand I can use @view as follows

``````function Drift_Force(Wr::Array{Float64},Lbox::Float64)
Np,dim,NW = size(Wr)
F         = zeros(Np,dim,NW)
ri        = zeros(dim,1)
rj        = zeros(dim,1)
rij       = zeros(dim,1)
for iw in 1:NW
for i in 1:Np-1
ri  = @view Wr[i,:,iw];
for j in i+1:Np
rj = @view Wr[j,:,iw];
end;
end;
end;
end;
``````

and this time I get

0.001843 seconds (20.80 k allocations: 990.141 KB)

which is much better than the first version, but not close to the element-wise one. What would be then the right way to write it, to achieve the same results?

Ferran,

#14

Code like:

``````rj = zeros(dim,1)
# ...
rj = @view Wr[i,:,iw]
``````

does not mutate `rj` in-place. It completely loses track of whatever it was before. If youâ€™d like to mutate the array, you need to either use indexed assignment (like `rj[:] = â€¦`) or the new `.=` syntax. I highly recommend reading the blog post I linked above.

#15

Again, youâ€™re replacing `rj` instead of mutating it.

``````ri .= @view Wr[i,:,iw]
rj .= @view Wr[j,:,iw]
``````

#16

Hi,
regarding mutating/replacing values, what I canâ€™t figure out is why this

``````rj = @view Wr[j,:,iw]
``````

is replacing instead of mutating. I would even say that

``````rj = Wr[j,:,iw]
``````

should be mutating by default, but Iâ€™m learning that, for some reason I canâ€™t understand, it does not. Anyway, by changing the code in the way suggested

``````function Drift_Force(Wr::Array{Float64},Lbox::Float64)
Np,dim,NW = size(Wr)
F         = zeros(Np,dim,NW)
ri        = zeros(dim,1)
rj        = zeros(dim,1)
rij       = zeros(dim,1)
for iw in 1:NW
for i in 1:Np-1
ri  .= @view Wr[i,:,iw];
for j in i+1:Np
rj .= @view Wr[j,:,iw];
end;
end;
end;
end;
``````

one still gets

0.002208 seconds (41.59 k allocations: 1.284 MB)

which performs worse than the explicit element-wise version I posted above. Similar things happen when I use

``````ri[:]  = @view Wr[i,:,iw];
``````

0.001705 seconds (20.80 k allocations: 990.141 KB)

I still would like to know what would be the version that performs equally well as the elemnt-wise version aboveâ€¦

Thanks again,

Ferran.

#17

As far as why normal assignment in Julia doesnâ€™t mutateâ€¦ I think itâ€™s best to think about each operation independently. `A[i,:,iw]` is just a normal function call. In fact, it is immediately transformed into `getindex(A, i, :, iw)`. This function call has a return value. The thing that the function returns is an array. Itâ€™s an array that contains all the values of the row you selected. The function doesnâ€™t know that itâ€™s being used as the right hand side of a `=` expression. It doesnâ€™t know that the binding on the left hand side had previously been associated with a similarly-shaped array. You could just as easily have written `sum(A[i,:,iw])`, which means that it never gets associated with a name. So it needs a place to store all the values of that row. It needs to allocate its own storage.

Your toy example is still not representative of any real work. In fact, a sufficiently advanced compiler could detect that there are no observable side-effects from those for loops and simply jump straight to returning `nothing`. Thatâ€™s even more likely to occur in the scalar case. It doesnâ€™t look like thatâ€™s what is happening, but itâ€™s something to look out for.

#18

The `.` version did get some performance improvements in v0.6, but it will always have a slight overhead over looping when the arrays are really small. This is because it still has to allocate the views, so if the computation is very tiny, the fact that you make views actually becomes part of the timing. This will never matter for arrays larger than I donâ€™t know size 10? The blog post on .-fusion mentions this (again, on v0.6 where broadcast is slightly faster). In the end, itâ€™s a tradeoff between conveience/readability and a tiny amount of speed. But in large computations, I donâ€™t think youâ€™d notice it very much.

That said, I think there are improvements planned for things like stack-allocated views to make this even less overhead? I donâ€™t know the details there, but the idea I get out of the whole thing is that you only need to loop if you expect tiny arrays.

#19

Just curious - What do you mean by Kb here? Thanks.

#20

But that should be ok here, right? In fact, I would expect not mutating to be faster; since the views do not allocate, youâ€™re just applying the label `rj` to a view. And, in fact, benchmarking seems to bear that out: Dropping the dot gives the same amount of allocations, but a ~40% speedup on my pc, because youâ€™re not copying data into `rj`.