How to cluster points with a maximum cluster size? Or find points with maximum diversity?

Hi all,
Used to do Data Science / Analytics in a previous role but haven’t done much (any) over the past 3 years. At work, I’d offload this problem to our Data Science team, but they don’t have the bandwidth and I have a little free time, so I thought this would be a good opportunity to refresh my analytics and Julia skills.

Problem: We have a dataset of points (lat / longs) and we need to create groups or clusters of these points where each group contains no more than 200 points. Additionally, the final selection of groups should be selected that maximizes the distance between the center of each group. (This is to minimize the chance of any group overlapping.) As an example, say we have 2000 lat/long points. This creates 10 groups. If we decide to use 5 of these groups, the 5 groups selected should be as far away from each other as possible.

My first thought was to do some sort of constrained K-Means that limits each cluster to 200, but this seemed a bit overly complicated. My next thought was to assign points randomly as the starting point for groups and then do a KNN model and find the nearest 200 points to create groups. This seems a lot simpler but I’m not sure of the best way to then find which set of groups have the maximum distance between each other.

Thank you and open to your thoughts!

(This is to minimize the chance of any group overlapping.)

What exactly do you mean by overlapping?

See my question from a few weeks back Balanced clustering or assignment of 3D points in space. Let me know if it can be of help (the code is not yet ready to push)

1 Like

Good question. So additional context here is this is a marketing problem. The lat/long points are just representations of people we can target digitally but the platform we use doesn’t necessarily target people that “live” by one of the lat/long points but rather “live by or have recently visited” a lat/long point. The reason why this is important is because if there are groups of points next to each other then it’s very possible that one person would end up in both groups (e.g. they have traveled to multiple lat/long points). These groups would “bid” against each other, thus raising the cost of advertising.

Thanks, this did catch my eye when I was searching around, but I’m still not sure the best way to select a subset of the clusters / groups that would maximize the total distance between clusters. I could brute force it with a distance matrix but kind of thought there might be an easier or more efficient way.

1 Like

given what you wrote in your first post

you could have a look at the BKM++ algorithm that enforces no overlapping as much as possible.

Maybe you could utilize k-medoids? Step 1 would be to find the optimal centers (where the distance between the centers is greatest). Step 2 would then be to use those centers as pre-defined medoids and then perform the clustering around those medoids to minimize the total within-cluster distances. You would have to choose the number of clusters up front though.

Hi all,
Thanks for your replies thus far. After doing some more research, I think maybe I’ve described the problem incorrectly or inefficiently. Let me try again:

I’m looking for a way to select a subset of points in such a way that maximizes the diversity / entropy among the selected points. For example, if I had a set of 100 points (on x,y axis) and wanted to select a subset of 4 of them, the 4 selected would probably be the 4 points in the “corners” furthest away from each other. This problem becomes increasingly more difficult the more points or subsets required and this is where I’m struggling.

After finding the the points that maximize the diversity amongst the subset, I can use KNN to find the next closest N points to create my larger “groups” so that each one contains no more than 200 points (as described in the original post).

I believe what I’m describing is the maximum diversity problem, but I haven’t come across this implemented in an Julia packages or a suggestion on an easy proxy to it. Perhaps this problem is a bit niche.

Thanks for your help

Yes - the best solution is going to depend very much on the total number of points, as well as the number that you want to select. Choosing 4 points from 100 (where order doesn’t matter and you can’t choose the same point more than once) already results in 3,921,225 possibilities.

julia> binomial(100,4)
3921225

If you have 2000 points and you want to choose 10 of them:

julia> binomial(BigInt(2000),BigInt(10))
275898785946005613288829800

I would bet that you’re going to end up needing to find some heuristic or approximate algorithm that produces a result you can stomach as opposed to searching for the optimal solution, but there is plenty of expertise in this domain here in the Julia community, so hopefully someone else can chime in.

I threw together a little JuMP.jl model just for grins and giggles and it solves choosing 3 points from 10 very quickly (as you would expect) but then when I ask it to find 10 points out of 60, it looks like it would take a couple of hours at least.

1 Like

This problem objective is to find a subset of k indices p so that sum(D[p,p]) is maximum, where D is the all-to-all distances of your points.

Whenever I am desperate and I have no idea what I am doing, I lean on Genetic Optimization.

Just start with a pool of 50 random p “genes”. Then,

  • “Mate” random pairs where you get half of the indices from one partner and half from another.
  • “Mutate” randomly some genes where you replace a single index with another one.

In the above always make sure you have valid genes (values 1:n without repeated values). Say now you have a pool of 100 such genes. Evaluate your objective function on them and keep the best 50 for the next generation.

Repeat. After a few generations you’ll stop improving and probably the best of that generation will be a good solution.

Thanks for the reply… When I read your post i thought it was a good idea but either I fully don’t understand or it’s not practical in this problem.

If I have 1000 points and need the best 50 so that the sum of these 50 is maximum, I would:

  1. choose 50 points at random
  2. sum the distance between all these 50 points
  3. choose another 50 points, repeat step 2, and evaluate whether this 50 is better than the original 50

I don’t think I can “mate” half of these two random selections together as there is no “best 25 points” within each selection that I can combine together as it’s the sum of all 50 together.

Of course, if I wanted the best 50, I could choose 25 points at random, then combine the 25s together and calculate the total distance of the 50, but I don’t see how that is any more efficient than just choosing 50 at random to begin with.

What about hclust(D, :complete) from Clustering.jl? The :complete cluster linkage function might give results that are acceptable and it’s very fast so it seems worth a shot…

This sounds like the kcenters problem https://en.m.wikipedia.org/wiki/Metric_k-center

The SimilaritySearch package implements the farthest first traversal algorithm.