Permutations of a vector that retain the vector "structure"

Is there a built-in function that returns the permutations of a vector that are based on the unique elements of the vector, i.e. where the “structure” of the vector is retained ?

For example given y = ["a","a","a","b","b","c","c","c"], it would return ["a","a","a","b","b","c","c","c"], ["b","b","b","a","a","c","c","c"], ["a","a","a","c","c","b","b","b"], etc., but not for example ["a","a","b","a","b","c","c","c"].

I did implemented it in the following function, but I wonder if there is a better or built-in approach:

function getPermutations(v::AbstractArray{T,1};keepStructure=false) where {T}
    if !keepStructure
        return Combinatorics.permutations(v)
    else
        classes       = unique(v)
        nCl           = length(classes)
        N             = size(v,1)
        pSet          = Combinatorics.permutations(1:nCl)
        nP            = length(pSet)
        vPermutations = fill(similar(v),nP)
        vOrigIdx      = [findfirst(x -> x == v[i] , classes) for i in 1:N]
        for (pIdx,perm) in enumerate(pSet)
            vPermutations[pIdx] = classes[perm[vOrigIdx]] # permuted specific version
        end
        return vPermutations
    end
end
julia> yp = collect(getPermutations(y))
40320-element Vector{Vector{String}}:
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "c", "b", "c", "c"]
 ["a", "a", "a", "b", "c", "b", "c", "c"]
 ⋮
 ["c", "c", "c", "b", "a", "b", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]

julia> yp = getPermutations(y,keepStructure=true)
6-element Vector{Vector{String}}:
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "c", "c", "b", "b", "b"]
 ["b", "b", "b", "a", "a", "c", "c", "c"]
 ["b", "b", "b", "c", "c", "a", "a", "a"]
 ["c", "c", "c", "a", "a", "b", "b", "b"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]

(ps: the fact I do not collect on the whole permutation is intentional… it can become huge very quickly!)

Actually, why some permutations are reported multiple times ? Is it an overflow issue ?

julia> y = ["a","a","a","b","b","c","c","c"]
8-element Vector{String}:
 "a"
 "a"
 "a"
 "b"
 "b"
 "c"
 "c"
 "c"

julia> yp = collect(Combinatorics.permutations(y))
40320-element Vector{Vector{String}}:
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "b", "c", "c", "c"]
 ["a", "a", "a", "b", "c", "b", "c", "c"]
 ["a", "a", "a", "b", "c", "b", "c", "c"]
 ["a", "a", "a", "b", "c", "c", "b", "c"]
 ⋮
 ["c", "c", "c", "b", "a", "a", "b", "a"]
 ["c", "c", "c", "b", "a", "b", "a", "a"]
 ["c", "c", "c", "b", "a", "b", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]
 ["c", "c", "c", "b", "b", "a", "a", "a"]

EDIT: I got it, permutations don’t look what there is inside the vector, so for example the first “a” is different than the second “a”…

1 Like

Is ["a", "b", "a"] a valid input vector? What should the output be?
Assuming each unique element can only appear in consecutive indexes, this is easier and probably faster (didn’t benchmark):

using StatsBase
using Combinatorics

function run_permutations(v)
    vals, lens = rle(v)
    [inverse_rle(vals[k], lens) for k in permutations(eachindex(vals))]
end

If your vectors are long, you may want to encode the result more compactly using RLEVectors.jl

No, I don’t make this assumption.

For example I may have:

8-element Vector{String}:
 "a"
 "a"
 "a"
 "b"
 "b"
 "c"
 "c"
 "a"

julia> getPermutations(y2;keepStructure=true)
6-element Vector{Vector{String}}:
 ["a", "a", "a", "b", "b", "c", "c", "a"]
 ["a", "a", "a", "c", "c", "b", "b", "a"]
 ["b", "b", "b", "a", "a", "c", "c", "b"]
 ["b", "b", "b", "c", "c", "a", "a", "b"]
 ["c", "c", "c", "a", "a", "b", "b", "c"]
 ["c", "c", "c", "b", "b", "a", "a", "c"]

I need it for evaluating clusters (and in general unsupervised classification algorithms) accuracy when the true labels are actually known (but unused by the algorithm). A unsupervised algorithm knows nothing about the “label”, so what I am using is the maximum accuracy whatever is the way we call the label themselves, to judge the algorithm ability to learn the structure of my data.
The actual implementation is here. I was wondering if there was a built-in method for returning only permutations that keep the same structure of the original vector data.

Or maybe there is a totally different way to compare the structure of multi-class vectors ??

It might make sense to pass a permutation on labels as a parameter to your accuracy function, so you can go over possible label permutations without actually creating the label-permuted vectors.
Something like this (assuming y and already contain class indexes rather than arbitrary identifiers)

accuracy(y,ŷ,perm) = mean(y .== perm[ŷ])
function best_perm(y,ŷ)
    nclasses = length(unique(y))
    accuracies = [accuracy(y, ŷ, perm) for perm in permutations(1:nclasses)]
    i = argmax(accuracies)
    nthperm(1:nclasses, i), accuracies[i]
end

1 Like

I’ve had a similar problem (well, I stumbled across a similar problem in my code and I thought about it for a while, but I didn’t code something up yet that I could just give you).

I don’t have a totally different approach, but I think I can improve your runtime by a fair bit:

You only need to make 1 iteration over zip(truth, prediction), build a coincidence matrix and then you can just permute the rows of that and recalculate the accuracy without materializing all those permutations. A little more elaborate:

Let k be the number of classes, make a k×k matrix where your first dimension corresponds to the class found in the prediction vector and the second dimension corresponds to the class found in the truth vector. Initialize with all zeros, iterate over zip(truth, prediction) and increment the coincidence matrix at M[predicted_class, true_class] by 1 every time. (need to convert classes to Ints first) Now you have a matrix for which the sum of the diagonal divided by the number of entries is the accuracy, and when you permute the rows of that you get the accuracy for the corresponding permutation, but it’s vastly cheaper to calculate if the number of classes is lower than the length of your vector.

Oh and I read somewhere that this corresponds to an assignment problem, so if you happen to have an algorithm to efficiently solve those, you can probably just stick this matrix (I think negated) into it and solve for the best permutation directly.

1 Like

For the future and because I kinda have to do it anyways:

julia> using OrderedCollections: LittleDict, freeze

julia> using Combinatorics: permutations

julia> function map_to_ints(uniquevals)
           ld = LittleDict(uniquevals, 1:length(uniquevals))
           dl = LittleDict(1:length(uniquevals), uniquevals) # reverse mapping
           return (freeze(ld), freeze(dl))
       end
map_to_ints (generic function with 1 method)

julia> function fill_M!(M, imap, truths, predictions)
           foreach(
               ((tru, pre),) -> M[imap[tru], imap[pre]] += 1,
               zip(truths, predictions)
           )
       end
fill_M! (generic function with 1 method)

julia> function permuted_accuracy(M, perm, n)
           sum(M[i,j] for (i,j) in zip(1:length(perm), perm)) / n
       end
permuted_accuracy (generic function with 2 methods)

julia> function best_accuracy(truths, predictions)
           @assert length(truths) == length(predictions)
           n = length(truths)
           classes = unique(truths) # predictions should not contain something that's not in the target I guess
           l = length(classes)
           indexmap, reversemap = map_to_ints(classes)
           M = zeros(Int, (l,l))
           fill_M!(M, indexmap, truths, predictions)
           perms = permutations(1:l)
           bestaccuracy = -Inf
           bestperm = nothing
           for p in perms
               acc = permuted_accuracy(M, p, n)
               if acc > bestaccuracy
                   bestaccuracy = acc
                   bestperm = p
               end
           end
           expressivebestperm = Dict([reversemap[orig] => reversemap[matched] for (orig, matched) in zip(1:l, bestperm)])
           return (bestaccuracy, expressivebestperm)
       end
best_accuracy (generic function with 1 method)
julia> best_accuracy([1, 1, 1, 2, 2, 3, 1, 3],
                     [1, 1, 1, 3, 3, 2, 1, 3])
(0.875, Dict(2 => 3, 3 => 2, 1 => 1))

using Random, BenchmarkTools

julia> R = ("a", "b", "c", "d", "e")
("a", "b", "c", "d", "e")

julia> n = 10_000_000
10000000

julia> @benchmark best_accuracy(A, B) setup = begin A = rand(R, n); B = rand(R, n) end
BenchmarkTools.Trial:
  memory estimate:  38.75 KiB
  allocs estimate:  421
  --------------
  minimum time:     741.684 ms (0.00% GC)
  median time:      858.165 ms (0.00% GC)
  mean time:        879.968 ms (0.00% GC)
  maximum time:     1.061 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1

julia> R = ("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")

julia> @benchmark best_accuracy(A, B) setup = begin A = rand(R, n); B = rand(R, n) end
BenchmarkTools.Trial:
  memory estimate:  1.19 GiB
  allocs estimate:  10886467
  --------------
  minimum time:     2.688 s (16.65% GC)
  median time:      2.704 s (13.32% GC)
  mean time:        2.704 s (13.32% GC)
  maximum time:     2.719 s (10.02% GC)
  --------------
  samples:          2
  evals/sample:     1

Not perfect as it scales kinda badly with the number of classes, but it’s okay I think. No guarantee for correctness ^^ Hope it helps, maybe you can come up with an even better version

1 Like