This is partly a question asking for recommendations on an algorithm, ideally something that is implemented some Julia package, or can be done simply using one.
I am simulating a process in \mathbb{R}^3 (I can constrain them to a box, so think [0,1]^3 if that helps), with a number of agents. They split following different paths, and thus their number can quickly explode. I only care about their distribution (their position captures all their state), so I would like to replace “close” points with their means and continue the calculation with a lower number of points.
To make thinks concrete, consider the MWE
using StaticArrays
l(x) = 1 / (1 + exp(-x)) # logistic, to keep samples in box
n = 1000
xs = SVector.(l.(randn(n)), l.(randn(n) .+ 0.1), l.(randn(n)) ./ 2)
ys = sort(vcat([x => 0.2 for x in xs], # weighted, pretend to split
[x .* 0.9 .+ 0.5 * 0.1 => 0.8 for x in xs]),
by = first)
I would like to find clusters of “close” points in ys
, remove them, and replace them with their mean and added weight, so that I end up with, say, 100–500 points that I can cope with more easily.
“Close” can be any reasonable metric that makes the exercise simple.
My fallback solution is to repeatedly slice up the box along coordinates by using weighted medians (not unlike a decision tree). Eg
function median_split(by, x)
# quick and dirty example, please don't focus on making this faster
y = sort(x, by = by ∘ first)
cum_weight = cumsum(last.(y))
half_weight = cum_weight[end] / 2
i = findfirst(x -> x > half_weight, cum_weight)
y[1:i], y[(i+1):end]
end
median_split(x -> x[2], ys) # split on second coordinate
Any advice (including vague ideas, references to papers) is appreciated.