I want to sort elements by frequency . For example let
x = ["a", "d", "d", "c", "c", "c", "f", "b", "b" ,"b", "b"]
I want to have output sorted as
x = ["b", "b" ,"b", "b","c", "c", "c, "d", "d", "a", "f"]
where we have 4 b
, 3 c
, 2d
, 1a
and 1f
.
You can use StatsBase
:
using StatsBase
function to_sort_x(x)
counts = countmap(x)
sorted_x = sort(x, by=x -> counts[x], rev=true)
return sorted_x
end
sorted_x = to_sort_x(x)
If you’re using DataFrames
, you can also do this
using DataFrames
x = ["a", "d", "d", "c", "c", "c", "f", "b", "b" ,"b", "b"]
df = DataFrame(x = x)
gdf = groupby(df, :x)
transform!(gdf, nrow => :count)
sort!(df, :count, rev=true)
sorted_x = copy(df.x)
2 Likes
Also using StatsBase, but with rle
and its inverse instead of countmap
:
function sort_by_count(x)
v, l = rle(x)
idxs = sortperm(l; rev=true)
return @views inverse_rle(v[idxs], l[idxs])
end
Some minimal benchmarking suggests that this performs a few times faster than the countmap
based method.
2 Likes
@digital_carver
Nice! I think you need to sort the incoming x
otherwise
the run length encoding won’t aggregate non-adjacent values.