Consideirng a DataFrame
using DataFrames
using Random: randstring
M = 100_000_000
str_base = [randstring(8) for i in 1:1_000_000]
df = DataFrame(int = rand(Int32, M), float=rand(M), str = rand(str_base, M))
@time sort!(df, :int);
# 80s on my machine
@time sort!(df, :str);
# 170son my machine
using CSV
CSV.write("tmp.csv", df)
The same operation using R’s data.table is like 3s
library(data.table)
df = fread("tmp.csv")
setkey(df, "int")
# 3s
setkey(df, "str")
# 25s
So based on this the performance of data.table is still much better.
Now the sort!
algorithm is really simple which I can replicate here
]add https://github.com/xiaodaigh/SortingLab.jl
using SortingLab
using Base.Threads: @spawn
function another_sort!(df, col)
@time ordering = fsortperm(df[!, col])
channel_lock = Channel{Bool}(length(names(df)))
for c in names(df)
@spawn begin
v = df[!, c]
@inbounds v = v[ordering]
put!(channel_lock, true)
end
end
for _ in names(df)
take!(channel_lock)
end
df
end
@time another_sort!(df, :int); # sortperm is 10s total 12s~18s
@time another_sort!(df, :str); # sortperm is 10s total 12s~18s
You can see that (f)sortperm
takes 10s. So using a more optimise sortperm
like SortingLab.fsortperm
can get much better results already.
The solution seems to be about finding a more efficient sortperm
. For a start, perhaps adapting SortingLab.fsortperm
would be a good start.