I am trying to figure out whether I have written this small piece of test code in the most efficient way for Julia. The overall performance I am getting is better than what I had in Python with NumPy arrays. However, I am not sure if I am missing some important functions (or parallelization opportunities).
The basic premise of this code is that there are some particles in a 1D box that diffuse around and reversibly stick to some region.
I ran the following code several times and the best run I got was:
@time f =stickyparticle()
4.960075 seconds (1.95 M allocations: 4.613 GiB, 7.55% gc time)
It seems to me that the last two numbers (allocations and GC time) are fairly high. Is it largely because of the random numbers being generated on every loop iteration?
using Distributions
function stickyparticle()
nm = 5000;#number of particles
D = 4*10^6; # diffusion coefficient
h = 10000; # length of container
totaltime = 300;
dt = 0.01; #time step size
fpart = rand(nm)*h; #initialize particle positions
kon = 5; #Binding Constant
koff = 0.5; #Unbinding Constant
iters = collect(dt:dt:totaltime); #number of iterations for simulation
s=1000; #Range of sticking zone: x = [0:s]
free = fpart.>0; #Boolean array tracking free particles
bound = .~(free); #Boolean array tracking stuck particles
drnd = Normal(0,sqrt(2*D*dt)); #Standard deviation for particle diffusion
local wpos::BitArray
local wneg::BitArray
local unbind::BitArray
local bind::BitArray
@fastmath @inbounds @views for t in iters
freepart = view(fpart,free); #Get only free particles
din = rand(drnd,length(freepart)); #Calculate diffusion of particles
freepart .= (din+freepart); #Update particle positions
wpos = freepart .> h ; #Find particles that diffuse beyond the length of the container
freepart[wpos] .-= 2*h; #Reflect particles back
wneg = freepart .< 0; #Find particles that diffuse beyond second boundary
freepart[wneg] .= -freepart[wneg]; #Reflect particles back
unbind = (rand(length(bound)) .< (dt*koff)) .& bound #Calculate stuck particles that unbind
bind = ((rand(length(free)) .< (dt*kon) ) .& (fpart .<=s)).& free; #Calculate free particles that stick
#Stick and unstick particles
bound[unbind] .= false;
free[unbind] .= true;
bound[bind] .= true;
free[bind] .= false;
end
return fpart
end
Here's a faster implementation that follows @dpsanders' advice:
function stickyparticle()
nm = 5000 #number of particles
D = 4e6 # diffusion coefficient
h = 10000. # length of container
dt = 0.01 #time step size
totaltime = 300.
kon = 5. #Binding Constant
koff = 0.5 #Unbinding Constant
s = 1000. #Range of sticking zone: x = [0:s]
drnd = Normal(0., sqrt(2 * D * dt)) #Standard deviation for particle diffusion
fpart = rand(nm) * h #initialize particle positions
free = fpart .> 0; #Boolean array tracking free particles
for t = dt : dt : totaltime
for i = 1 : nm
if free[i]
fpart[i] += rand(drnd) #Calculate diffusion of particles
if fpart[i] > h #Find particles that diffuse beyond the length of the container
fpart[i] -= 2 * h #Reflect particles back
end
if fpart[i] < 0 #Find particles that diffuse beyond second boundary
fpart[i] = -fpart[i] #Reflect particles back
end
if fpart[i] <= s
free[i] = rand() > dt * kon #Stick and unstick particles
end
else # bound
free[i] = rand() < dt * koff
end
end
end
fpart
end
I believe it should do the same thing as your original code, but you should verify that it does the right thing. It's hard to test because the rand calls necessarily have to occur in an order that's different from the one used in your code, so just setting the initial random seed won't result in the exact same behavior.
With julia -O3 --check-bounds=no, I'm getting 5.420 seconds and 4.61 GiB of allocations for your code, and 2.256 seconds and 83.25 KiB of allocations (independent of the number of time steps) for mine. I checked this using the BenchmarkTools package, which is an awesome tool.
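If you haven't used it yet, the basic pattern is simply this (illustrative; @btime runs the function several times and reports the minimum time together with the allocations):
using BenchmarkTools
@btime stickyparticle()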
Why is this faster? Two main reasons:
your code allocates lots of memory. Both allocating memory and collecting garbage (the inverse of allocation: freeing memory that you no longer need) can be big sources of overhead. A line such as din = rand(drnd, length(freepart)) allocates (at least) length(freepart) Float64s every iteration. By writing out the loops, you can get rid of all allocations apart from the initial setup before the loop. Note, by the way, that you can use julia --track-allocation=user to find which lines allocate if you're not sure.
cache locality: CPUs have a limited amount of very fast cache memory available. Data is pulled into that cache from slower memory when it is needed, but this takes time. Because of this, CPUs can achieve performance benefits by pulling in not just exactly the memory that is needed right now, but also some additional, nearby memory blocks, because it is likely that those will be needed anyway. You can take advantage of that knowledge while programming (in any language) by accessing (and writing to) memory in a linear fashion instead of jumping around all over the place (spatial locality), and by doing "as much as possible" with the data while it is loaded in the fast cache (temporal locality). In your code, lines like freepart = view(fpart, free) result in a non-contiguous view of memory that, when iterated over, is bad in terms of spatial locality. The fact that you essentially iterate over all the particles multiple times in a single time step is bad for temporal locality; it's better to iterate over them only once.
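To make both points concrete, here is a toy comparison (x and mask are made-up illustration variables, not part of your simulation):
x = rand(1000)
mask = x .> 0.5
# Vectorized style: every evaluation allocates fresh temporary arrays
# (rand(count(mask)) creates a new vector, and the masked indexing creates temporaries too).
x[mask] .+= rand(count(mask))
# Loop style: no allocations inside the loop, and x is walked once, in order (spatial locality),
# doing all the work for each element while it sits in cache (temporal locality).
for i in eachindex(x)
    if mask[i]
        x[i] += rand()
    end
end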
BTW, using ProfileView, I found out that in the new code most of the time is being spent in the rand(::Distributions.Normal) call, which is slower than it probably should be. Replacing rand(drnd) with σ * randn(), where σ is precomputed before the loop as sqrt(2 * D * dt), reduces the runtime further to 1.761 s.
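Concretely, that replacement amounts to two small changes in the function above (same variable names):
σ = sqrt(2 * D * dt)      # precomputed once, before the time loop
fpart[i] += σ * randn()   # inside the inner loop, instead of fpart[i] += rand(drnd)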
I'm sure this can be improved further, by the way. For example, this should be very easy to parallelize, since none of the particles interact with each other.
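For instance, on a recent Julia (1.x, started with several threads, e.g. JULIA_NUM_THREADS=4), something along these lines should work. This is only a sketch: the function name is made up, I use one RNG per thread, and free is a Vector{Bool} rather than a BitArray so that concurrent writes to different elements are safe.
using Random

function stickyparticle_threaded()   # hypothetical name, not from the code above
    nm = 5000
    D = 4e6
    h = 10000.0
    dt = 0.01
    totaltime = 300.0
    kon = 5.0
    koff = 0.5
    s = 1000.0
    σ = sqrt(2 * D * dt)
    fpart = rand(nm) * h
    free = fill(true, nm)            # Vector{Bool}: element writes from different threads don't race
    rngs = [MersenneTwister(i) for i in 1:Threads.nthreads()]  # one RNG per thread
    for t = dt:dt:totaltime
        Threads.@threads for i = 1:nm
            rng = rngs[Threads.threadid()]
            if free[i]
                fpart[i] += σ * randn(rng)
                if fpart[i] > h
                    fpart[i] = 2h - fpart[i]
                elseif fpart[i] < 0
                    fpart[i] = -fpart[i]
                end
                if fpart[i] <= s
                    free[i] = rand(rng) > dt * kon
                end
            else
                free[i] = rand(rng) < dt * koff
            end
        end
    end
    fpart
end
Whether this actually helps will of course depend on how the per-particle work compares with the threading overhead.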
As expected, I made a mistake in my previous post. Here's an alternative version, similar to @tkoolen's proposal, but possibly slightly faster. I noticed that replacing the boolean array "free" with a similar array of integers slightly improves performance, although the amount of allocated memory increases.
using Distributions
function stickyparticle()
nm = 5000 # number of particles
D = 4*10^6 # diffusion coefficient
h = 10000 # length of container
totaltime = 300.0
dt = 0.01 #time step size
fpart = rand(nm)*h # initialize particle positions
kon = 5.0 # Binding Constant
koff = 0.5 # Unbinding Constant
iters = dt:dt:totaltime # number of iterations for simulation
s = 1000 # Range of sticking zone: x = [0:s]
free = zeros(Int, nm) # indexing in integer arrays is slightly cheaper
for k in eachindex(fpart)
if fpart[k] > 0.0 # "Boolean" array tracking free particles
free[k] = 1
end
end
drnd = Normal(0.0, sqrt(2*D*dt)) # Standard deviation for particle diffusion
dtkoff = dt * koff
dtkon = dt * kon
@inbounds for t in iters
for k in 1:nm
if free[k] > 0
din = rand(drnd) # Calculate diffusion of particles
fpart[k] += din # Update particle positions
if fpart[k] > h # Find particles that diffuse beyond the length of the container
fpart[k] -= 2.0*h # Reflect particles back
# assuming h > 0
elseif fpart[k] < 0.0 # Find particles that diffuse beyond second boundary
fpart[k] *= -1.0 # Reflect particles back
end
if (fpart[k] <= s) # Calculate free particles that stick
if rand() < dtkon
free[k] = 0
end
end
else
if rand() < dtkoff # Calculate stuck particles that unbind
free[k] = 1
end
end
end
end
return fpart
end
Wow this is quite a speed up! Well done!
There is a minor error on line 34: fpart[k] -= 2.0*h # Reflect particles back
should be fpart[k] = 2.0*h - fpart[k] # Reflect particles back
This fix doesn't appear to make any difference in performance.
I was also wondering whether I could get this to scale on a computer with multiple cores, so I added @parallel for k in 1:nm on line 28. The result is a bit strange, however.
A few things confuse me. The first is that the time it reports is indeed faster than before, yet the actual wall-clock time to finish is now longer. Is there something about starting up and shutting down the parallel processes that isn't accounted for in the performance readout? Also, it appears that a lot more time is now spent on garbage collection than before. Are there other functions I should be using if I want the performance to scale as I increase the number of cores available?
I wonder why you see any increase in performance at all. Performing the same modification you did, the runtime turned out to be far longer than in the initial version (even if "fpart" and "free" are initialized as SharedArrays), which doesn't surprise me: every iteration of the inner loop takes only a little more than 100 µs, so the increased overhead due to scheduling affects the processing time far more than any possible speedup from parallelization.
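For reference, the variant I timed was along these lines (a reconstructed sketch, so treat the details as approximate; Julia 0.6-era @parallel syntax, the function name is just for this post, and I use σ * randn() here instead of rand(drnd) so the workers don't need Distributions):
addprocs(4)                          # e.g. 4 worker processes

function stickyparticle_parallel()   # hypothetical name
    nm = 5000
    D = 4e6
    h = 10000.0
    dt = 0.01
    totaltime = 300.0
    kon = 5.0
    koff = 0.5
    s = 1000.0
    σ = sqrt(2 * D * dt)
    dtkon = dt * kon
    dtkoff = dt * koff
    fpart = SharedArray{Float64}(nm)  # shared so the workers can write to it
    fpart .= rand(nm) .* h
    free = SharedArray{Int}(nm)
    free .= 1
    for t = dt:dt:totaltime
        @sync @parallel for k = 1:nm  # @sync waits for all workers before the next time step
            if free[k] > 0
                fpart[k] += σ * randn()
                if fpart[k] > h
                    fpart[k] = 2.0*h - fpart[k]
                elseif fpart[k] < 0.0
                    fpart[k] = -fpart[k]
                end
                if fpart[k] <= s && rand() < dtkon
                    free[k] = 0
                end
            elseif rand() < dtkoff
                free[k] = 1
            end
        end
    end
    return fpart
end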
However, there might be an increase in performance if the outer loop was distributed across several workers. Sadly, this is not an option since the particle distribution at each timestep depends on the previous one.
tl;dr: It's probably better to stick to a serial implementation.