I had this problem where I need to dedup a dataframe using a column, but there is a timestamp column for all the entries. If two entries have the same data, but different timestamp, I would want ot keep only the one with the latest timestamp. After this dedup there shouldn’t any dups in the dedup column.
Obviously unique wouldn’t work due to timestamp and other columns. This is how i solved it with DataFrames.jl
using DataFrames
data=DataFrame(dedup = rand(1:8, 1000), timestamp = rand(1:8, 1000), othervals = rand(1000))
using Chain
@chain data begin
groupby(:dedup)
combine(subdf -> begin
sort(subdf, :timestamp, rev=true)[1, :] # could've used partialsort but this is easier to read
end)
end
which I thought was a very nice and readable way to do it.
How do you guys do it? Any examples from other languages? I think data.table nor dplyr is as nice. And don’t get me started on pandas or spark… Can’t rule out i am just bad at pandas and spark though
Couldn’t be easier with Base Julia Tables as well (:
using SplitApplyCombine, Tables, DataPipes
tbl = (dedup = rand(1:8, 1000), timestamp = rand(1:8, 1000), othervals = rand(1000)) |> rowtable
# closest to your solution:
@p begin
tbl
group(_.dedup)
map() do subtbl
@p subtbl |> sort(by=_.timestamp) |> last
end
rowtable
end
# write the short lambda inline:
@p begin
tbl
group(_.dedup)
map(sort(_, by=x->x.timestamp) |> last)
rowtable
end
# same without macros, less clear:
map(
subtbl -> sort(subtbl, by=x->x.timestamp)[end],
group(x -> x.dedup, tbl)
) |> rowtable
@p begin
tbl
sort(by=_.timestamp, rev=true)
unique(_.dedup)
end
Its docs say:
Return an array containing only the unique elements of collection itr , as determined by isequal, in the order that the first of each set of equivalent elements originally appears.