Simplifying simulation by clusters

This is partly a question asking for recommendations on an algorithm, ideally something that is implemented some Julia package, or can be done simply using one.

I am simulating a process in \mathbb{R}^3 (I can constrain them to a box, so think [0,1]^3 if that helps), with a number of agents. They split following different paths, and thus their number can quickly explode. I only care about their distribution (their position captures all their state), so I would like to replace “close” points with their means and continue the calculation with a lower number of points.

To make thinks concrete, consider the MWE

using StaticArrays
l(x) = 1 / (1 + exp(-x))        # logistic, to keep samples in box
n = 1000
xs = SVector.(l.(randn(n)), l.(randn(n) .+ 0.1), l.(randn(n)) ./ 2)
ys = sort(vcat([x => 0.2 for x in xs], # weighted, pretend to split
               [x .* 0.9 .+ 0.5 * 0.1 => 0.8 for x in xs]),
          by = first)

I would like to find clusters of “close” points in ys, remove them, and replace them with their mean and added weight, so that I end up with, say, 100–500 points that I can cope with more easily.

“Close” can be any reasonable metric that makes the exercise simple.

My fallback solution is to repeatedly slice up the box along coordinates by using weighted medians (not unlike a decision tree). Eg

function median_split(by, x)
    # quick and dirty example, please don't focus on making this faster
    y = sort(x, by = by ∘ first)
    cum_weight = cumsum(last.(y))
    half_weight = cum_weight[end] / 2
    i = findfirst(x -> x > half_weight, cum_weight)
    y[1:i], y[(i+1):end]

median_split(x -> x[2], ys)     # split on second coordinate

Any advice (including vague ideas, references to papers) is appreciated.

In fluid simulation there are usually two approaches, particle based or grid based. There are also hybrids, where one switches from grid based inside a medium, to particle based near the interface to another medium. Maybe some inspiration to be drawn from there?
Related video on the topic


Do you care about the total distribution or only some subset of the agents that satisfy some condition?

The whole problem works like this, conceptually.

Individuals move in a space (a, i)_t \in \mathbb{R}^3 \times {1, \dots, N}. I have an initial discrete distribution over (a, i)_0.

The law of motion for the a part is

a' = f(a, i)

As for the i, it is stochastic, with a distribution

i' \sim g(a, i, \theta)

where \theta are parameters to the problem I am estimating. And I care about linear functions of

h(a, i, t, \theta)

(eg a sum/integral over everything, but also split by i and/or t).

It is really important to keep everything differentiable in \theta. Naturally it is, but there are about 6 million combinations by the time I am done simulating. I am estimating \theta, so I prefer sub-minute runtimes for the whole thing.

My initial idea was to pretend an equi-distributed g over i (or fix \theta), explore the space for a using this to see where I would bin, then approximate the f part using bins.

We have a running version in Julia (will be public when we submit the paper), it is just that the approximation part is very inaccurate (but differentiably inaccurate, which is a big win :wink:).

This could be yet another instance of engineers/physicists being light years ahead of economists so I want to read up on this. What would be a good crash course in the grid based methods? Is there a tutorial paper or a textbook? I can deal with the math and (classical) physics.

It’s not my area, but I’ve heard of this kind of idea by the name “sequential importance resampling”, so maybe those keywords could help find some literature on it.

1 Like

Unfortunately, my only exposure to this is the two-minute papers channel on YouTube so I can’t be much help regarding references. My first thought was actually to look at heat transfer, and discretization of PDEs in general.

Like Eric said though, it does have some similarities with sequential Monte-Carlo as well.

1 Like

Will a cell list approach be helpful for the task?

That’s a standard technique to facilitate neighbor search in molecular simulations, then the neighbors lists can be used to search clusters of points (using Stillinger’s criterion?)

The objects I am keeping track of do not interact, so I am not sure.

The interaction isn’t implied, the cell lists are only used to spped up finding all pairs (i, j) of particles in [0; L]^3 such that ||\textbf{r}_i - \textbf{r}_j|| < r_0.

And I remembered this article with an implementation, among others, of gravitational simulation with an octree space partitioning, which should be useful for finding clusters.

1 Like

As mentioned by @ericphanson and @baggepinnen, the problem of distances of your agents does sounds like a Particle filter problem.

Maybe take a look at this paper, specifically section 3 about reducing the variance in the weight. i find it very easy to approach. It is targeted at geophysical applications but it might be relevent.

van Leeuwen, P.J., 2009: Particle Filtering in Geophysical Systems. Mon. Wea. Rev., 137 , 4089–4114,

1 Like

would a markov model be of any use?
you could think about discretizing your space, and modeling the probabilities of an agent moving from a particular cell into one of the neighbouring cells.
i think this can work nicely since you have an equation that directly gives a state transition

(a',i')\sim T(a,i;\theta)

the memory cost for this approach would scale cubically w.r.t the lattice size, and the whole algorithm will surely be differentiable by \theta.

1 Like

made a very toyish working example of what i mean:

core function, keep in mind i coded it this morning, and the code is not polished in any way

module TestMarkov 


 function simulate(θ)
        for i in 1:100000
            P=approx_fpos((X,Y),S)  # find approximate position of the state inside the lattice
            L[P]+=1  #accumulate probabilistic knowledge
            S=T(S,θ)  # EVOLVE STATE
            if i%1000 == 0
                # after some iterations interrupt this trajectory and start a new one according to the density in L
                # resampling from the knowledge of L
                #or just reset state


this are two runs of the simulation with different theta, particles which get too near are accumulated in the same cell, so the clustering is automatic, and the nice thing is that where the clustering is low, you can look at individual trajectories, without any need to keep track of each individual trajectory, indeed in the code at any point in time the only thing in memory regarding particles is one single state.

1 Like

Could you try to explain this figure in your link?

I am struggling quite a bit to get it, but I understand it as having a “header” which tells me there is four cells, and for example for cell 1, the entry is “E” i.e. empty, since n o particles are stored in this cell. But for cell 0, it says “4”, and then lscl[4] = 2, which makes sense since cell 0 holds particle 2. But how do I then find particle 0 and 4 then too?

Kind regards

head[0] is 4, meaning that the particle number 4 is in cell 0.
Then, we go to lscl[4] and get 2, then to lscl[2] and get 0, then to lscl[0] and get -1, meaning that there are no more particles in the cell.
The chains for other cells are
6 → 5 → -1
7 → 3 → 1 → -1

So now comes my follow up question, “how do I know before hand that I need to go to lscl[2] = 0?”

For example taking the chain for cell 0, I see clearly that I can find particle 4 and 2, due to the header and the index lscl[4], but how do I find particle 0, at lscl[2]? How do I know that I have to move to lscl[2]?

This is primarily what bothers me, and it seems simple but for some reason I cannot grasp it - maybe I am confusing construction of the list with using it.

Kind regards

head[0] is 4, meaning that particle number 4 is in cell 0 and the number of the next particle is in lscl[4]. From there, you proceed with the same logic: lscl[4] is 2, so particle 2 is in cell and the next particle number is in lscl[2] etc. until you hit EMPTY.


Ah incredible! It finally makes sense now. I can see now why this is a very smart approach since I have to store very little in memory and if used properly it reduces complexity to O(N), without understanding too much what that actually means.

I will try to implement the list constructor for this example now, I might ask more if some issue pops up.

Kind regards

So I managed to get the cell list algorithm to work and here is a showcase:


Where altering colors denote connection a specific cell. My question is now how to go about selecting the correct cells?

Say I want to look at cell zero, then I care about these cells (red circle):

I can’t figure out a robust way to ensure that no matter where I am in my “grid” that I will only interact with adjacent and diagonal elements - and in the cases where no elements exist (shown by red x), I just want to not include it.

I’ve been looking through the documnt and they mention something about constructing a vector cell index, but I can’t really see how to go about this.

Kind regards

1 Like

And for anyone interested this is my implementation:

using Plots
using LoopVectorization
# Data from example
#data =
#0.234	     0.318
#0.734	     0.811
#0.393	     0.396
#0.649	     0.604
#0.359	     0.147
# 0.68	     0.258
#0.814	     0.106
#0.882	     0.691

N = 2^13

pg = fill(tuple(0.0,0.0,0.0),N)

for i = 1:N
    pg[i] = map(rand, (Float64,Float64,Float64))
#pg = map(x-> x .* -1,pg)

rc = [0.25;0.25]
# The total length of the simulation domain
La = [1;1]

function CellList2D(pg,rc,La)
    # Calculate the amount of cells in each direction
    Lca = [div(La[1],rc[1]); div(La[2],rc[2])]
    # Convert it to Int64
    Lca = convert(Array{Int64,1},Lca)
    # Number of cells in total
    nCell = Lca[1] * Lca[2]

    np = length(pg)
    # Preallocate header
    head = zeros(Int64,nCell)
    lscl = zeros(Int64,np)
    for i = 1:np
        # Calculate serial index
        c = div(pg[i][1],rc[1])*Lca[2] + div(pg[i][3],rc[2])
        # Convert to Julia indexing from C
        c = convert(Int64,c) + 1
        # Input serial index to header
        lscl[i] = head[c];
        # In header insert the last value, to keep track correctly
        head[c] = i;
    return head,lscl

head,lscl = @time CellList2D(pg,rc,La)

a = zeros(Int64,0)
for i in head
    while i != 0
        i = lscl[i]
    Plots.display(scatter!(first.(pg[a]),last.(pg[a]),aspect_ratio=0.75,xlims=(0,1),ylims=(0,1),xticks=[0,0.25,0.5,0.75,1], yticks=[0,0.25,0.5,0.75,1],show=true,legend=:outertopright))
    global a = zeros(Int64,0)

With performance of:

@benchmark CellList2D($pg,$rc,$La)
  memory estimate:  64.50 KiB
  allocs estimate:  6
  minimum time:     236.100 μs (0.00% GC)
  median time:      246.201 μs (0.00% GC)
  mean time:        254.215 μs (1.24% GC)
  maximum time:     5.767 ms (94.38% GC)
  samples:          10000
  evals/sample:     1

For N = 2^13