# Writing more optimal Julia code

I am trying to figure out if I have indeed written this small test bit of code in the most efficient way for Julia. The overall performance I am getting is better than what I had with python using numpy arrays. However, I am not sure if I am missing some important functions (or parallelization opportunities).
The basic premise of this code is that there are some particles in a 1D box that diffuse around and reversibly stick to some region.

I ran the following code several times and the best run I got was:

@time f =stickyparticle()
4.960075 seconds (1.95 M allocations: 4.613 GiB, 7.55% gc time)

It seems to me like that last two numbers are fairly high. Is it largely because of the random numbers being generated on every loop?

``````using Distributions
function stickyparticle()
nm = 5000;#number of particles
D = 4*10^6; # diffusiion coefficent
h = 10000; # length of container
totaltime = 300;
dt = 0.01; #time step size
fpart = rand(nm)*h; #initialize particle positions
kon = 5; #Binding Constant
koff = 0.5; #Unbinding Constant
iters =  collect(dt:dt:totaltime); #number of iterations for simulation
s=1000; #Range of sticking zone: x = [0:s]
free = fpart.>0; #Boolean array tracking free particles
bound = .~(free); #Boolean array tracking stuck particles
drnd =  Normal(0,sqrt(2*D*dt)); #Standard deviation for particle diffusion
local wpos::BitArray
local wneg::BitArray
local unbind::BitArray
local bind::BitArray

@fastmath @inbounds  @views for t in iters
freepart = view(fpart,free); #Get only free particles
din = rand(drnd,length(freepart)); #Calculate diffusion of particles
freepart .= (din+freepart); #Update particle positions
wpos =  freepart .> h ; #Find particles that  diffuse beyond the length of the container
freepart[wpos] .-= 2*h; #Reflect particles back
wneg = freepart .< 0; #Find particles that  diffuse beyond second boundary
freepart[wneg] .= -freepart[wneg]; #Reflect particles back
unbind = (rand(length(bound)) .< (dt*koff)) .& bound #Calculate stuck particles that unbind
bind = ((rand(length(free)) .< (dt*kon) ) .& (fpart .<=s)).& free; #Calculate free particles that stick
#Stick and unstick particles
bound[unbind] .= false;
free[unbind] .= true;
bound[bind] .= true;
free[bind] .= false;
end
return fpart
end
``````

What happens if you just write a loop instead of vectorising everything?

4 Likes

Minor note: thereās almost certainly no need to call `collect()` when generating `iters`. That just unnecessarily creates an array.

5 Likes

Hereās a faster implementation that follows @dpsandersā advice:

``````function stickyparticle()
nm = 5000 #number of particles
D = 4e6 # diffusion coefficent
h = 10000. # length of container
dt = 0.01 #time step size
totaltime = 300.
kon = 5. #Binding Constant
koff = 0.5 #Unbinding Constant
s = 1000. #Range of sticking zone: x = [0:s]
drnd =  Normal(0., sqrt(2 * D * dt)) #Standard deviation for particle diffusion

fpart = rand(nm) * h #initialize particle positions
free = fpart .> 0; #Boolean array tracking free particles
for t = dt : dt : totaltime
for i = 1 : nm
if free[i]
fpart[i] += rand(drnd) #Calculate diffusion of particles
if fpart[i] > h #Find particles that  diffuse beyond the length of the container
fpart[i] -= 2 * h #Reflect particles back
end
if fpart[i] < 0 #Find particles that  diffuse beyond second boundary
fpart[i] = -fpart[i] #Reflect particles back
end
if fpart[i] <= s
free[i] = rand() > dt * kon #Stick and unstick particles
end
else # bound
free[i] = rand() < dt * koff
end
end
end
fpart
end
``````

I believe it should do the same thing as your original code, but you should verify that it does the right thing. Itās hard to test because the `rand` calls necessarily have to occur in an order thatās different from the one used in your code, so just setting the initial random seed wonāt result in the exact same behavior.

With `julia -O3 --check-bounds=no`, Iām getting 5.420 seconds and 4.61 GiB of allocations for your code, and 2.256 seconds and 83.25 KiB of allocations (independent of the number of time steps) for mine. I checked this using the BenchmarkTools package, which is an awesome tool.

Why is this faster? Two main reasons:

• your code allocates lots of memory. Both allocating memory and collecting garbage (the inverse of allocation, freeing memory that you no longer need) can be big sources of overhead. A line such as `din = rand(drnd,length(freepart))` allocates (at least) `length(freepart)` `Float64`s every iteration. By writing out the loops, you can get rid of all allocations, apart from the initial setup before the loop. Note by the way that you can use `julia --track-allocation=user` to find which lines allocate if youāre not sure.
• cache locality: CPUs have a limited amount of very fast cache memory available. Data is pulled into that cache memory from slower memory when it is needed, but this takes time. Because of this, CPUs can achieve performance benefits by pulling in not just exactly the memory that is needed right now, but also some additional, nearby memory blocks, because it is likely that those will be needed anyway. You can take advantage of that knowledge while programming (in any language) by accessing (and writing to) memory in a linear fashion, instead of jumping around all over the place (spatial locality), and doing āas much as possibleā with the data while it is loaded in the fast cache (temporal locality). In your code, lines like `freepart = view(fpart,free)` result in a non-contiguous view of memory that, when iterated over, is bad in terms of spatial locality. The fact that you essentially iterate over all the particles multiple times in a single time step is bad for temporal locality; itās better to iterate over them only once.

BTW, using ProfileView, I found out that in the new code, most of the time is being spent in the `rand(:: Distributions.Normal)` call, which is slower than it probably should be. Replacing `rand(drnd)` with `Ļ * randn()`, where `Ļ` is precomputed before the loop as `sqrt(2 * D * dt)`, the runtime is further reduced to 1.761 s.

Iām sure this can be improved further by the way. For example, this should be very easy to parallelize since none of the particles interact with each other.

5 Likes

Swap out the RNG for the Xoroshiro128Plus from RandomNumbers.jl

https://sunoru.github.io/RandomNumbers.jl/latest/man/xorshifts/

I find I get some pretty nifty āfreeā speedups from it.

That looks great, do you know if itās possible to define a setting (e.g. in the .juliarc.jl file) to make this the default RNG for `rand` calls?

Another proposal. I hope that I didnāt implement any mistakesā¦

EDIT: Removed, made a stupid mistakeā¦

As expected I made a mistake in my previous post. Hereās an alternative version, similar to @tkoolenās proposal. However, it might be slightly faster. I noticed that replacing the boolean array āfreeā with a similar array of integers slightly improves performance although the amount of allocated memory is increased.

``````using Distributions

function stickyparticle()
nm = 5000 # number of particles
D = 4*10^6 # diffusion coefficent
h = 10000 # length of container
totaltime = 300.0
dt = 0.01 #time step size
fpart = rand(nm)*h # initialize particle positions
kon = 5.0 # Binding Constant
koff = 0.5 # Unbinding Constant
iters =  dt:dt:totaltime # number of iterations for simulation
s = 1000 # Range of sticking zone: x = [0:s]

free = zeros(Int, nm) # indexing in integer arrays is slightly cheaper
for k in eachindex(fpart)
if fpart[k] > 0.0 # "Boolean" array tracking free particles
free[k] = 1
end
end

drnd =  Normal(0.0, sqrt(2*D*dt)) # Standard deviation for particle diffusion

dtkoff = dt * koff
dtkon = dt * kon

@inbounds for t in iters
for k in 1:nm
if free[k] > 0
din = rand(drnd) # Calculate diffusion of particles
fpart[k] += din # Update particle positions

if fpart[k] > h # Find particles that  diffuse beyond the length of the container
fpart[k] -= 2.0*h # Reflect particles back
# assuming h > 0
elseif fpart[k] < 0.0 # Find particles that diffuse beyond second boundary
fpart[k] *= -1.0 # Reflect particles back
end
if (fpart[k] <= s) # Calculate free particles that stick
if rand() < dtkon
free[k] = 0
end
end
else
if rand() < dtkoff # Calculate stuck particles that unbind
free[k] = 1
end
end
end
end
return fpart
end

``````
1 Like

Wow this is quite a speed up! Well done!
There is a minor error on line 34
`fpart[k] -= 2.0*h # Reflect particles back`
should be
`fpart[k] = 2.0*h - fpart[k] # Reflect particles back`

This fix doesnāt appear to make any difference in performance.

I was also wondering if I could get this to scale with a computer that has multi-core processors. I added in `@parallel for k in 1:nm` on line 28. It is a bit of a strange result, however.

@time stickyparticle();
0.329424 seconds (1.75 M allocations: 82.504 MiB, 33.03% gc time)

A few things confuse me. The first is that the time it spits out is indeed faster than before. However, the actual time it takes to finish in the real world is actually now longer than before. Is there something about starting up and closing down the parallel process that isnāt accounted for in the performance readout? Also, it appears that now we see a lot more time spent on garbage collection than before. Are there other functions I should be using if I want the performance to scale as I increase the number of cores available?

I wonder why you see any increase in performance at all. Performing the same modification you did, the runtime has shown to be far longer than in the initial version (even if āfpartā and āfreeā are initialized as SharedArrays) which doesnāt surprise me: Every iteration of the inner loop takes only little more than 100Āµs why the increased overhead due to scheduling should affect the processing time far more than any possible speedup due to parallelization.
However, there might be an increase in performance if the outer loop was distributed across several workers. Sadly, this is not an option since the particle distribution at each timestep depends on the previous one.

tl;dr: Itās probably better to stick to a serial implementation.

You could also batch the particles.