What data structure do I need? (and is it implemented in Julia?)

I have a collection of coordinates which is a fixed set. I collect these into groups using some arbitrary clustering scheme. Consequently I now hold some collection of arrays. I need to be able to merge groups together, or split groups into smaller groups.

Obviously this is Dict like behaviour. I require that keys are integers which count from 1 -> n, with no gaps. Splits or merges may happen many, many times, so if a split occurs, I can’t simply delete a key and replace it with two new (higher valued) keys, as this will lead to a growing maximum value of a key, and consequently an eventual overflow.

What data structure has this property? Essentially each time I merge groups I want to permit a gap in the sequence of keys to form, and each time I split a group I want to preferentially fill any existing gaps before creating new key(s). Obviously I can implement this with a queue to which deleted keys are added, and drawn from preferentially before generating new keys, but this is both clumsy and probably reinventing the wheel.

Any advice would be hugely helpful, as the last thing I need is some weird custom behaviour happening behind the scenes.

How about you have your groups::Dict{Int,Whatever} but also have an unused_keys::Vector{Int}. When you do a merge, push! the result onto unused_keys. When you do a split, pop! a key if unused_keys is not empty. I’m not sure if there’s a data structure like this already implemented.

2 Likes

Are your coordinates of some given dimensionality e.g. 2D, 3D, 4D, or is the dimensionality always one of some few e.g. 2D or 3D?
Are the values of each coordinate always of some type or some supertype e.g. Ints, Floats32s, Int64s and Float64s or wider e.g. Complex, Quaternion?

Coordinates are always fixed in dimensionality for a given data structure. This is usually 2D or 3D but I guess it’s generalisable to kD. Elsewhere I’m representing these with SVectors, but it seems the payload is secondary to the question?

There may be a particular structure that solves this, but a Vector{MyGroupType} in conjunction with an unused_keys queue like cjdoris suggested is my intuition. Vector gives you Dict-like behavior for the “keys” as indices, and you can extend the vector as needed with resize!.

1 Like

This was the solution I hoped to avoid, but I guess it’s the sensible solution after all. With a little bit of tinkering, I ended up with:

function merge!(groups, unused, candidates)
    min = minimum(candidates)
    sd = setdiff(candidates, min)
    for c in sd
        append!(groups[min], groups[c])
        empty!(groups[c])
    end
    append!(unused, sd)
end

#split! could use an array of coordinates :newgroup to describe the new split clusters
#logic is a bit yikes
function split!(groups, unused, candidate, newgroup)
    len = length(newgroup)
    ulen = length(unused)
    if isempty(unused)
        k = length(groups)
        for i in len
            groups[k+i] = newgroup[i]
        end
    elseif len < ulen
        push!(unused, candidate)
        foreach(((k, v), ) -> groups[k] = v, zip(unused, newgroup))
        splice!(unused, ulen-len+1:ulen+1)
    elseif len == ulen
        push!(unused, candidate)
        foreach(((k, v), ) -> groups[k] = v, zip(unused, newgroup))
        empty!(unused)
    else
        push!(unused, candidate)
        foreach(((k, v), ) -> groups[k] = v, zip(unused, newgroup))
        empty!(unused)
        k = length(groups)
        for i in len
            groups[k+i] = newgroup[i]
        end
    end
end

clusts #Dict{Int64, Vector{SVector{2, Int64}}}
uk = Int[]

#merge cluster 3 into 2, add key 3 to :uk
merge!(clusts, uk, (2,3)) 

# split clusts[10] to contain the values from clusts[5] and clusts[6]
# this will prefrentially write to clusts[3] as it was freed by the
# prior operation, with the second cluster written to the candidate clust[10]
split!(clusts, uk, 10, (clusts[5], clusts[6])) #dummy values

Needs a bit of testing and maybe a performance pass, but that seems to be the idea everyone suggested. Any e.g. type stability improving suggestions very welcome.

Edit: After some trimming to simplify the big messy split! function, we get this, which seems to behave for arbitrary collections.

function merge!(groups, unused, candidates)
    min = minimum(candidates)
    sd = setdiff(candidates, min)
    for c in sd
        append!(groups[min], groups[c])
        empty!(groups[c])
    end
    append!(unused, sd)
end

function split!(groups, unused, candidate, newgroup)
    len = length(newgroup)
    push!(unused, candidate)
    ulen = length(unused)
    foreach(((k, v), ) -> groups[k] = v, zip(unused, Base.Iterators.take(newgroup, ulen)))
    splice!(unused, max(ulen-len+1,1):ulen)
    if len > ulen
        k = length(groups)
        for i in ulen+1:len
            groups[k+i] = newgroup[i]
        end
    end
end