I am trying to identify clusters of values within a DataFrame (it's actually grouped, but I'm handling it in sections). The data describes road defects, so the real dataset is large; that isn't directly relevant here, but it gives some idea of the volume involved.
using DataFrames

# Create a DataFrame in which the target number appears in several places
data = Dict(
"A" => [1, 7, 3, 4, 7, 7],
"B" => [5, 6, 1, 8, 9, 9],
"C" => [7, 10, 10, 11, 7, 7],
"D" => [13, 13, 14, 15, 7, 7],
"E" => [7, 10, 10, 11, 7, 7],
)
df = DataFrame(data)
The values I need to identify in this case are the clusters of the number 7. That is, column A in rows 5 and 6, and columns C, D and E in rows 5 and 6.
6×5 DataFrame
 Row │ A      B      C      D      E
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      5      7     13      7
   2 │     7      6     10     13     10
   3 │     3      1     10     14     10
   4 │     4      8     11     15     11
   5 │     7      9      7      7      7
   6 │     7      9      7      7      7
The value to find for the cluster will change and there may be any number of clusters.
I can find where the number 7 occurs by using enumerate along the rows and columns, but that only gives me the rows and columns, not the clusters they belong to, so I'm looking for something easier.
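For illustration, here is a minimal sketch of that kind of lookup using `findall` rather than explicit `enumerate` loops. It assumes the data can be treated as a plain matrix (for example via `Matrix(df)`); the names `m`, `target`, `hits` and `positions` are only illustrative.

```julia
# Same values as the example table; with DataFrames loaded,
# Matrix(df) should give an equivalent matrix (assumption).
m = [ 1  5  7 13  7;
      7  6 10 13 10;
      3  1 10 14 10;
      4  8 11 15 11;
      7  9  7  7  7;
      7  9  7  7  7]

target = 7

# All positions holding the target, as CartesianIndex values
# in column-major order (down column 1, then column 2, ...).
hits = findall(==(target), m)

# The same positions as (row, col) named tuples.
positions = [(row = I[1], col = I[2]) for I in hits]
```

This also returns the isolated 7s (for example row 2 of column A), and nothing in the result says which hits are adjacent to each other.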
The output/return value should be a cluster number and its position, as I need to calculate the maximum width and height of each cluster.
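To make the desired output concrete, here is a rough sketch of one possible approach: a flood fill that labels connected cells. It rests on several assumptions that are not part of the data itself: cells count as connected only when they touch horizontally or vertically (4-connectivity), a cluster needs at least two cells (`minsize`), and the width and height are those of the cluster's bounding box. The function name `find_clusters` and the field names are hypothetical.

```julia
# Same values as the example table; Matrix(df) should give this (assumption).
m = [ 1  5  7 13  7;
      7  6 10 13 10;
      3  1 10 14 10;
      4  8 11 15 11;
      7  9  7  7  7;
      7  9  7  7  7]

# Hypothetical helper: label 4-connected groups of `target` in `m`.
function find_clusters(m::AbstractMatrix, target; minsize = 2)
    inds = CartesianIndices(m)
    visited = falses(size(m))
    steps = (CartesianIndex(1, 0), CartesianIndex(-1, 0),
             CartesianIndex(0, 1), CartesianIndex(0, -1))
    clusters = NamedTuple[]
    for start in inds
        (m[start] == target && !visited[start]) || continue
        # Flood fill from this unvisited target cell.
        cells = CartesianIndex{2}[]
        stack = [start]
        visited[start] = true
        while !isempty(stack)
            I = pop!(stack)
            push!(cells, I)
            for d in steps
                J = I + d
                if J in inds && !visited[J] && m[J] == target
                    visited[J] = true
                    push!(stack, J)
                end
            end
        end
        length(cells) >= minsize || continue   # skip isolated values
        rows = extrema(I[1] for I in cells)    # (first row, last row)
        cols = extrema(I[2] for I in cells)    # (first col, last col)
        push!(clusters, (id = length(clusters) + 1, cells = cells,
                         rows = rows, cols = cols,
                         height = rows[2] - rows[1] + 1,
                         width = cols[2] - cols[1] + 1))
    end
    return clusters
end

clusters = find_clusters(m, 7)
for c in clusters
    println("cluster ", c.id, ": rows ", c.rows, ", cols ", c.cols,
            ", height ", c.height, ", width ", c.width)
end
```

For the example data this yields two clusters: one in column A (rows 5 and 6), and one spanning columns C to E (rows 5 and 6). Setting `minsize = 1` would also report the isolated 7s as single-cell clusters.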