A nice use case for DataFrames.jl - flexible dedup

I had this problem where I need to dedup a dataframe using a column, but there is a timestamp column for all the entries. If two entries have the same data, but different timestamp, I would want ot keep only the one with the latest timestamp. After this dedup there shouldn’t any dups in the dedup column.

Obviously unique wouldn’t work due to timestamp and other columns. This is how i solved it with DataFrames.jl

using DataFrames

data=DataFrame(dedup = rand(1:8, 1000), timestamp = rand(1:8, 1000), othervals = rand(1000))

using Chain
@chain data begin
  groupby(:dedup)
  combine(subdf -> begin
    sort(subdf, :timestamp, rev=true)[1, :] # could've used partialsort but this is easier to read
  end)
end

which I thought was a very nice and readable way to do it.

How do you guys do it? Any examples from other languages? I think data.table nor dplyr is as nice. And don’t get me started on pandas or spark… Can’t rule out i am just bad at pandas and spark though :slight_smile:

4 Likes

Couldn’t be easier with Base Julia Tables as well (:

using SplitApplyCombine, Tables, DataPipes

tbl = (dedup = rand(1:8, 1000), timestamp = rand(1:8, 1000), othervals = rand(1000)) |> rowtable

# closest to your solution:
@p begin
	tbl
	group(_.dedup)
	map() do subtbl
		@p subtbl |> sort(by=_.timestamp) |> last
	end
	rowtable
end

# write the short lambda inline:
@p begin
	tbl
	group(_.dedup)
	map(sort(_, by=x->x.timestamp) |> last)
	rowtable
end

# same without macros, less clear:
map(
	subtbl -> sort(subtbl, by=x->x.timestamp)[end],
	group(x -> x.dedup, tbl)
) |> rowtable
2 Likes

Still essentially a group by approach. Wondering if there are non group by approaches

Of course, one could just use unique:

@p begin
	tbl
	sort(by=_.timestamp, rev=true)
	unique(_.dedup)
end

Its docs say:

Return an array containing only the unique elements of collection itr , as determined by isequal, in the order that the first of each set of equivalent elements originally appears.

1 Like

If you need performance argmin instead of sort would be faster.

3 Likes