Writing more optimal Julia code

I am trying to figure out if I have indeed written this small test bit of code in the most efficient way for Julia. The overall performance I am getting is better than what I had with python using numpy arrays. However, I am not sure if I am missing some important functions (or parallelization opportunities).
The basic premise of this code is that there are some particles in a 1D box that diffuse around and reversibly stick to some region.

I ran the following code several times and the best run I got was:

@time f = stickyparticle()
4.960075 seconds (1.95 M allocations: 4.613 GiB, 7.55% gc time)

It seems to me like the last two numbers are fairly high. Is it largely because of the random numbers being generated in every loop iteration?

using Distributions
function stickyparticle()
  nm = 5000;#number of particles
  D = 4*10^6; # diffusion coefficient
  h = 10000; # length of container
  totaltime = 300;
  dt = 0.01; #time step size
  fpart = rand(nm)*h; #initialize particle positions
  kon = 5; #Binding Constant
  koff = 0.5; #Unbinding Constant
  iters =  collect(dt:dt:totaltime); #number of iterations for simulation
  s=1000; #Range of sticking zone: x = [0:s]
  free = fpart.>0; #Boolean array tracking free particles
  bound = .~(free); #Boolean array tracking stuck particles
  drnd =  Normal(0,sqrt(2*D*dt)); #Standard deviation for particle diffusion
  local wpos::BitArray
  local wneg::BitArray
  local unbind::BitArray
  local bind::BitArray

  @fastmath @inbounds @views for t in iters
    freepart = view(fpart,free); #Get only free particles
    din = rand(drnd,length(freepart)); #Calculate diffusion of particles
    freepart .= (din+freepart); #Update particle positions
    wpos =  freepart .> h ; #Find particles that  diffuse beyond the length of the container
    freepart[wpos] .-= 2*h; #Reflect particles back
    wneg = freepart .< 0; #Find particles that  diffuse beyond second boundary
    freepart[wneg] .= -freepart[wneg]; #Reflect particles back
    unbind = (rand(length(bound)) .< (dt*koff)) .& bound #Calculate stuck particles that unbind
    bind = ((rand(length(free)) .< (dt*kon) ) .& (fpart .<=s)).& free; #Calculate free particles that stick
    #Stick and unstick particles
    bound[unbind] .= false;
    free[unbind] .= true;
    bound[bind] .= true;
    free[bind] .= false;
  end
  return fpart
end

What happens if you just write a loop instead of vectorising everything?

Minor note: there's almost certainly no need to call collect() when generating iters; that just unnecessarily creates an array.
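
For illustration (my own numbers, using the constants from the code above): a range is a lazy object whose size does not depend on its length, whereas collect materializes every time point into a Vector{Float64}.

dt = 0.01; totaltime = 300
iters = dt:dt:totaltime      # lazy range: a handful of bytes, regardless of length
length(iters)                # 30000 time points
sizeof(collect(iters))       # ≈ 240_000 bytes: every point stored explicitly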

Here's a faster implementation that follows @dpsanders' advice:

function stickyparticle()
    nm = 5000 #number of particles
    D = 4e6 # diffusion coefficient
    h = 10000. # length of container
    dt = 0.01 #time step size
    totaltime = 300.
    kon = 5. #Binding Constant
    koff = 0.5 #Unbinding Constant
    s = 1000. #Range of sticking zone: x = [0:s]
    drnd =  Normal(0., sqrt(2 * D * dt)) #Standard deviation for particle diffusion

    fpart = rand(nm) * h #initialize particle positions
    free = fpart .> 0; #Boolean array tracking free particles
    for t = dt : dt : totaltime
        for i = 1 : nm
            if free[i]
                fpart[i] += rand(drnd) #Calculate diffusion of particles
                if fpart[i] > h #Find particles that  diffuse beyond the length of the container
                    fpart[i] -= 2 * h #Reflect particles back
                end
                if fpart[i] < 0 #Find particles that  diffuse beyond second boundary
                    fpart[i] = -fpart[i] #Reflect particles back
                end
                if fpart[i] <= s
                    free[i] = rand() > dt * kon #Stick and unstick particles
                end
            else # bound
                free[i] = rand() < dt * koff 
            end
        end
    end
    fpart
end

I believe it should do the same thing as your original code, but you should verify that it does the right thing. It's hard to test because the rand calls necessarily have to occur in an order that's different from the one used in your code, so just setting the initial random seed won't result in the exact same behavior.

With julia -O3 --check-bounds=no, I'm getting 5.420 seconds and 4.61 GiB of allocations for your code, and 2.256 seconds and 83.25 KiB of allocations (independent of the number of time steps) for mine. I checked this using the BenchmarkTools package, which is an awesome tool.
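
For anyone who has not used it, the basic pattern looks something like this (a minimal sketch; the sample/time limits are only there to keep the full statistics run short):

using BenchmarkTools
@btime stickyparticle();                            # minimum time and allocations over repeated runs
@benchmark stickyparticle() samples=5 seconds=30    # full timing statistics, capped at 5 samples / 30 s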

Why is this faster? Two main reasons:

  • Your code allocates lots of memory. Both allocating memory and collecting garbage (the inverse of allocation: freeing memory that you no longer need) can be big sources of overhead. A line such as din = rand(drnd,length(freepart)) allocates (at least) length(freepart) Float64s every iteration. By writing out the loops, you can get rid of all allocations apart from the initial setup before the loop; see the small illustration after this list. Note, by the way, that you can use julia --track-allocation=user to find which lines allocate if you're not sure.
  • Cache locality: CPUs have a limited amount of very fast cache memory. Data is pulled into that cache from slower memory when it is needed, but this takes time. Because of this, CPUs can gain performance by pulling in not just exactly the memory that is needed right now, but also some additional, nearby memory blocks, because those are likely to be needed soon anyway. You can take advantage of this knowledge while programming (in any language) by accessing (and writing to) memory in a linear fashion instead of jumping around all over the place (spatial locality), and by doing 'as much as possible' with the data while it is loaded in the fast cache (temporal locality). In your code, lines like freepart = view(fpart,free) result in a non-contiguous view of memory that, when iterated over, is bad for spatial locality. The fact that you essentially iterate over all the particles multiple times in a single time step is bad for temporal locality; it's better to iterate over them only once.
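
To make the first point concrete, here is a small self-contained sketch of mine (not code from the thread) comparing a fresh rand(drnd, n) call against filling a preallocated buffer in place with rand!, and against scalar draws as in the loop version. The byte counts are approximate; run each line twice so that compilation is not counted.

using Distributions, Random

const drnd = Normal(0.0, sqrt(2 * 4e6 * 0.01))  # same distribution as in the simulation
const buf  = Vector{Float64}(undef, 5000)       # allocated once, before any time loop

@allocated rand(drnd, 5000)   # ≈ 40 kB: a fresh 5000-element array on every call
@allocated rand!(drnd, buf)   # essentially zero: samples are written into the existing buffer
@allocated rand(drnd)         # essentially zero: a single scalar draw, as in the loop version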

BTW, using ProfileView, I found out that in the new code, most of the time is being spent in the rand(::Distributions.Normal) call, which is slower than it probably should be. Replacing rand(drnd) with σ * randn(), where σ is precomputed before the loop as sqrt(2 * D * dt), the runtime is further reduced to 1.761 s.
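
To see that difference in isolation, here is a small benchmark sketch of mine (exact timings are machine-dependent); both expressions draw from the same normal distribution with standard deviation σ:

using Distributions, BenchmarkTools

D, dt = 4e6, 0.01
σ = sqrt(2 * D * dt)
drnd = Normal(0.0, σ)

@btime rand($drnd)    # sample through the Distributions machinery
@btime $σ * randn()   # plain Base randn scaled by σ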

I'm sure this can be improved further, by the way. For example, this should be very easy to parallelize, since none of the particles interact with each other.

Swap out the RNG for the Xoroshiro128Plus from RandomNumbers.jl

https://sunoru.github.io/RandomNumbers.jl/latest/man/xorshifts/

I find I get some pretty nifty "free" speedups from it.
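
For reference, plugging it in is just a matter of constructing the generator and passing it explicitly to rand/randn (a sketch based on the linked docs; the seed value is arbitrary):

using RandomNumbers.Xorshifts

rng = Xoroshiro128Plus(0x1234567890abcdef)  # explicitly seeded generator
u = rand(rng)                               # uniform draw from the custom RNG
z = randn(rng)                              # normal draw; multiply by σ as in the posts above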

That looks great. Do you know if it's possible to define a setting (e.g. in the .juliarc.jl file) to make this the default RNG for rand calls?

Another proposal. I hope that I didn't introduce any mistakes…

EDIT: Removed, made a stupid mistake…

As expected, I made a mistake in my previous post. Here's an alternative version, similar to @tkoolen's proposal, but it might be slightly faster. I noticed that replacing the boolean array "free" with a similar array of integers slightly improves performance, although the amount of allocated memory is increased.

using Distributions

function stickyparticle()
    nm = 5000 # number of particles
    D = 4*10^6 # diffusion coefficient
    h = 10000 # length of container
    totaltime = 300.0
    dt = 0.01 #time step size
    fpart = rand(nm)*h # initialize particle positions
    kon = 5.0 # Binding Constant
    koff = 0.5 # Unbinding Constant
    iters =  dt:dt:totaltime # number of iterations for simulation
    s = 1000 # Range of sticking zone: x = [0:s]

    free = zeros(Int, nm) # indexing in integer arrays is slightly cheaper
    for k in eachindex(fpart)
        if fpart[k] > 0.0 # "Boolean" array tracking free particles
            free[k] = 1
        end
    end

    drnd =  Normal(0.0, sqrt(2*D*dt)) # Standard deviation for particle diffusion

    dtkoff = dt * koff
    dtkon = dt * kon

    @inbounds for t in iters
        for k in 1:nm
            if free[k] > 0
                din = rand(drnd) # Calculate diffusion of particles
                fpart[k] += din # Update particle positions

                if fpart[k] > h # Find particles that  diffuse beyond the length of the container
                    fpart[k] -= 2.0*h # Reflect particles back
                # assuming h > 0
                elseif fpart[k] < 0.0 # Find particles that diffuse beyond second boundary
                    fpart[k] *= -1.0 # Reflect particles back
                end
                if (fpart[k] <= s) # Calculate free particles that stick
                    if rand() < dtkon
                        free[k] = 0
                    end
                end
            else
                if rand() < dtkoff # Calculate stuck particles that unbind
                    free[k] = 1
                end
            end
        end
    end
    return fpart
end

Wow, this is quite a speed-up! Well done!
There is a minor error in the reflection at the upper boundary (line 34 of your code):
fpart[k] -= 2.0*h # Reflect particles back
should be
fpart[k] = 2.0*h - fpart[k] # Reflect particles back

This fix doesn't appear to make any difference in performance.

I was also wondering if I could get this to scale on a machine with multiple cores. I added @parallel in front of for k in 1:nm on line 28. The result is a bit strange, however.

@time stickyparticle();
0.329424 seconds (1.75 M allocations: 82.504 MiB, 33.03% gc time)

A few things confuse me. The first is that the time it reports is indeed faster than before. However, the actual wall-clock time it takes to finish is now longer than before. Is there something about starting up and shutting down the parallel workers that isn't accounted for in the timing output? Also, it appears that a lot more time is now spent on garbage collection than before. Are there other functions I should be using if I want the performance to scale as I increase the number of cores available?

I wonder why you see any increase in performance at all. Performing the same modification you did, I found the runtime to be far longer than in the serial version (even if "fpart" and "free" are initialized as SharedArrays), which doesn't surprise me: each full pass over the inner loop takes only a little more than 100 µs, so the scheduling overhead affects the processing time far more than any possible speedup from parallelization.
However, there might be a performance gain if the outer loop were distributed across several workers. Sadly, this is not an option, since the particle distribution at each time step depends on the previous one.

tl;dr: It's probably better to stick to a serial implementation.

You could also batch the particles.
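
One way to batch them is sketched below for current Julia (≥ 1.7, where the default RNG is task-local and therefore safe to use from threads), using threads rather than the distributed @parallel discussed above. Since the particles are independent, each task can carry its own particles through all time steps without any synchronization. The function name and keyword defaults are mine; free is a Vector{Bool} rather than a BitArray because concurrent writes to distinct elements of a BitArray are not thread-safe.

function stickyparticle_batched(; nm = 5000, D = 4e6, h = 10_000.0, dt = 0.01,
                                  totaltime = 300.0, kon = 5.0, koff = 0.5, s = 1000.0)
    σ = sqrt(2 * D * dt)                 # standard deviation of one diffusion step
    nsteps = round(Int, totaltime / dt)
    fpart = rand(nm) .* h                # particle positions
    free = fill(true, nm)                # Vector{Bool}: per-index writes are thread-safe
    Threads.@threads for i in 1:nm       # each task advances one particle through all steps
        x = fpart[i]
        isfree = free[i]
        for _ in 1:nsteps
            if isfree
                x += σ * randn()                                # diffusion
                x > h && (x = 2h - x)                           # reflect at upper boundary (with the fix above)
                x < 0 && (x = -x)                               # reflect at lower boundary
                x <= s && rand() < dt * kon && (isfree = false) # stick
            elseif rand() < dt * koff
                isfree = true                                   # unstick
            end
        end
        fpart[i] = x
        free[i] = isfree
    end
    return fpart
end

# Start Julia with several threads, e.g. `julia -t 4`, for this to run in parallel.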