Sort vector by frequency

I am trying to sort a vector of strings by the frequency of tis values.

For instance:

x = ["a", "b", "b", "c", "c", "c"]

Since there are 3 "c", 2 "b" and 1 "a", I would like to get a vector of uniques in that order:

["c", "b", "a"]

Currently, I’ve managed to count the existing values:

using StatsBase

x = ["a", "b", "b", "c", "c", "c"]
string_count = StatsBase.countmap(x)
Dict("c"=>3,"b"=>2,"a"=>1)

This returns a dict with frequencies (well, the number of each element), but I am stuck at trying to transform this into a sorted vector of uniques…

I believe one could extract the keys and values as two columns of a dataframe, sort this dataframe by the number and then extract the column of values, but it seems a bit inefficient…

2 Likes

You can obtain the keys and values from a Dict like so

keys1 = [k for k in keys(string_count)]
sortperm_vals = sortperm([v for v in values(string_count)])

strings_sorted_by_freq = keys1[sortperm_vals]
1 Like
collect(keys(StatsBase.countmap(x)))

but maybe countmap does not guarantee the order (?), so to sort explicitly, you can:

collect from Dict to Array of Pairs, sort by anon function returning count descending, get first. element from each Pair returned.

Like this:

string_count = StatsBase.countmap(x)

sortedvals =first.(sort(collect(string_count), by = e -> e[2], rev=true))
2 Likes

Slightly cleaner:

using StatsBase

x = ["a", "b", "b", "c", "c", "c"]
string_count = StatsBase.countmap(x)
keys(sort(string_count, by = last, rev=true))
6 Likes

Or using FreqTables.jl:

julia> sort(freqtable(x))
3-element Named Array{Int64,1}
Dim1  │ 
──────┼──
a     │ 1
b     │ 2
c     │ 3
3 Likes

How do I get “a” from an array?

julia> z = sort(freqtable(x))
3-element Named Vector{Int64}
Dim1  │
──────┼──
a     │ 1
b     │ 2
c     │ 3

julia> z[1]
1

@Rafael_Brus, try this:

ft = sort(freqtable(x))
names(ft,1)[1]
1 Like

Here is a solution without dependencies. Credits to @rvasil for noting the keyword arguments to sort.

function count_unique(V::AbstractVector{T}) where T
    U = unique(V)
    l = length(U)
    counts = Dict{T,Int}(zip(U, zeros(l)))
    for v in V
        counts[v] += 1
    end
    return counts
end

function frequency_sort(V::AbstractVector)
    counts = count_unique(V)
    sorted = sort(collect(counts); by=last, rev=true)
    return first.(sorted)
end

Benchmarks (Julia 1.8-rc1):

julia> using BenchmarkTools

julia> @btime frequency_sort(rand(1:100, 1_000));
  30.355 μs (28 allocations: 24.55 KiB)

julia> @btime frequency_sort(rand(1:100, 100_000));
  2.374 ms (29 allocations: 797.91 KiB)

julia>  @btime frequency_sort(rand(1:1000, 100_000));
  2.423 ms (38 allocations: 942.39 KiB)
1 Like

Cleaner perhaps, but note that it will not get the correct results for input:

x = ["a", "b", "b", "c", "c", "c", "d"]