How to identify clusters of values in a dataframe (or grouped dataframe)

I am trying to identify clusters of values within a dataframe (it’s actually grouped, but I’m handling it in sections). The data describes road defects, so the real dataset is large. I know that isn’t relevant to the question, but it gives some idea of the volume involved.

# Create a DataFrame containing several groups of the target value
using DataFrames

data = Dict(
    "A" => [1, 7, 3, 4, 7, 7],
    "B" => [5, 6, 1, 8, 9, 9],
    "C" => [7, 10, 10, 11, 7, 7],
    "D" => [13, 13, 14, 15, 7, 7],
    "E" => [7, 10, 10, 11, 7, 7],
)
df = DataFrame(data)

The values I need to identify in this case are the clusters of the value 7: column A, rows 5 and 6, and columns C, D and E, rows 5 and 6.

6×5 DataFrame
 Row │ A      B      C      D      E
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      5      7     13      7
   2 │     7      6     10     13     10
   3 │     3      1     10     14     10
   4 │     4      8     11     15     11
   5 │     7      9      7      7      7
   6 │     7      9      7      7      7

The value to find for the cluster will change and there may be any number of clusters.

I can find where the number 7 occurs by using enumerate along the rows and columns, but I’m looking for something easier, because that result gives only the rows and columns, not the clusters they belong to.
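For reference, the position lookup described above can be sketched without explicit loops. This is only an illustration: it assumes the data has been converted to a plain matrix (for a DataFrame, `Matrix(df)` would do that), and the names `m`, `hits` and `positions` are made up for the example.

```julia
# Same values as the example DataFrame, written as a matrix
# (rows 1..6, columns A..E become columns 1..5).
m = [1 5  7 13  7;
     7 6 10 13 10;
     3 1 10 14 10;
     4 8 11 15 11;
     7 9  7  7  7;
     7 9  7  7  7]

# findall with a predicate returns a CartesianIndex for every matching cell,
# in column-major order.
hits = findall(==(7), m)

# Unpack into (row, col) pairs; these still carry no cluster information.
positions = [(row = c[1], col = c[2]) for c in hits]
```

This gives the same information as the enumerate approach, a flat list of matching cells, so the grouping into clusters still has to be done separately.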

The output/return value should be a cluster number and its position, as I need to calculate the maximum width and height of each cluster.
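One way to frame the requested output is connected-component labelling: cells equal to the target value that touch each other get the same cluster number. The sketch below is a minimal base-Julia version under several assumptions that are not stated in the question: cells count as neighbours only horizontally and vertically (4-connectivity), isolated single cells are not treated as clusters, and the data is available as a matrix (e.g. via `Matrix(df)`). The function names `label_clusters` and `cluster_extents` and the `minsize` keyword are hypothetical.

```julia
# Label 4-connected groups of cells equal to `target` using a flood fill.
# Returns a label matrix (0 = not part of any group) and the number of groups.
function label_clusters(m::AbstractMatrix, target)
    labels = zeros(Int, size(m))
    nclusters = 0
    cells = CartesianIndices(m)
    steps = (CartesianIndex(-1, 0), CartesianIndex(1, 0),
             CartesianIndex(0, -1), CartesianIndex(0, 1))
    for start in cells
        # Skip cells that do not match or are already labelled.
        (m[start] == target && labels[start] == 0) || continue
        nclusters += 1
        labels[start] = nclusters
        stack = [start]
        while !isempty(stack)
            cell = pop!(stack)
            for step in steps
                next = cell + step
                next in cells || continue   # stay inside the matrix
                if m[next] == target && labels[next] == 0
                    labels[next] = nclusters
                    push!(stack, next)
                end
            end
        end
    end
    return labels, nclusters
end

# Bounding box of every group with at least `minsize` cells.
function cluster_extents(labels, nclusters; minsize = 2)
    result = NamedTuple[]
    for id in 1:nclusters
        members = findall(==(id), labels)
        length(members) >= minsize || continue
        rows = extrema(c[1] for c in members)
        cols = extrema(c[2] for c in members)
        push!(result, (id = id,
                       rows = rows[1]:rows[2],
                       cols = cols[1]:cols[2],
                       height = rows[2] - rows[1] + 1,
                       width = cols[2] - cols[1] + 1))
    end
    return result
end

# Example data from the question as a matrix (columns A..E are 1..5).
m = [1 5  7 13  7;
     7 6 10 13 10;
     3 1 10 14 10;
     4 8 11 15 11;
     7 9  7  7  7;
     7 9  7  7  7]

labels, n = label_clusters(m, 7)
extents = cluster_extents(labels, n)
```

On the example data this should report two clusters: rows 5 to 6 in column 1 (A), with height 2 and width 1, and rows 5 to 6 in columns 3 to 5 (C to E), with height 2 and width 3. The lone 7s in rows 1 and 2 are dropped by `minsize = 2`. If diagonal contact should also join cells, the four diagonal offsets could be added to `steps`.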

Can you elaborate on why Clustering.jl algorithms can’t be used directly? Or geostatistical clustering algorithms?

What outcome would you expect in this case?
