A nice use case for DataFrames.jl - flexible dedup

xiaodai · July 16, 2021, 1:59am

I had a problem where I needed to dedup a dataframe on one column, but every entry also has a timestamp column. If two entries have the same data but different timestamps, I want to keep only the one with the latest timestamp. After this dedup there shouldn’t be any dups in the dedup column.

Obviously unique wouldn’t work on its own, because the timestamp and other columns differ between rows. This is how I solved it with DataFrames.jl:

using DataFrames

data=DataFrame(dedup = rand(1:8, 1000), timestamp = rand(1:8, 1000), othervals = rand(1000))

using Chain
@chain data begin
  groupby(:dedup)
  combine(subdf -> begin
    sort(subdf, :timestamp, rev=true)[1, :] # could've used partialsort but this is easier to read
  end)
end

which I thought was a very nice and readable way to do it.

How do you guys do it? Any examples from other languages? I don’t think data.table or dplyr is as nice. And don’t get me started on pandas or spark… Can’t rule out that I am just bad at pandas and spark, though.

aplavin · July 16, 2021, 7:04am

Couldn’t be easier with Base Julia Tables as well (:

using SplitApplyCombine, Tables, DataPipes

tbl = (dedup = rand(1:8, 1000), timestamp = rand(1:8, 1000), othervals = rand(1000)) |> rowtable

# closest to your solution:
@p begin
	tbl
	group(_.dedup)
	map() do subtbl
		@p subtbl |> sort(by=_.timestamp) |> last
	end
	rowtable
end

# write the short lambda inline:
@p begin
	tbl
	group(_.dedup)
	map(sort(_, by=x->x.timestamp) |> last)
	rowtable
end

# same without macros, less clear:
map(
	subtbl -> sort(subtbl, by=x->x.timestamp)[end],
	group(x -> x.dedup, tbl)
) |> rowtable

xiaodai · July 16, 2021, 7:21am

That’s still essentially a group-by approach. I’m wondering if there are non-group-by approaches.

aplavin · July 16, 2021, 7:26am

Of course, one could just use unique:

@p begin
	tbl
	sort(by=_.timestamp, rev=true)
	unique(_.dedup)
end

Its docs say:

Return an array containing only the unique elements of collection itr, as determined by isequal, in the order that the first of each set of equivalent elements originally appears.
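
The same idea can be sketched in plain Base Julia, with no packages. This relies on the `unique(f, itr)` method, which keeps the first element for each distinct value of `f`. The small `rows` table is made up for illustration.

```julia
# Hypothetical toy data: a vector of NamedTuples standing in for the table.
rows = [
    (dedup = 1, timestamp = 3, othervals = 0.1),
    (dedup = 1, timestamp = 7, othervals = 0.2),
    (dedup = 2, timestamp = 5, othervals = 0.3),
    (dedup = 2, timestamp = 2, othervals = 0.4),
    (dedup = 2, timestamp = 4, othervals = 0.5),
]

# Put the newest entries first ...
sorted = sort(rows, by = r -> r.timestamp, rev = true)

# ... so that unique, which keeps the first occurrence per key,
# retains the row with the latest timestamp for each dedup value.
latest = unique(r -> r.dedup, sorted)
```

Because the result follows the order of first appearance in the sorted input, the output is ordered by descending timestamp rather than by the dedup key.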

bkamins · July 16, 2021, 7:53am

If you need performance, argmax instead of sort would be faster, since it finds the latest timestamp in each group without sorting the whole group.
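
A minimal sketch of the sort-free variant in DataFrames.jl, assuming the goal is still the latest timestamp per group (so `argmax` is the relevant function; `argmin` would pick the earliest). The toy data is made up for illustration.

```julia
using DataFrames

# Hypothetical toy data with the same column names as in the thread.
data = DataFrame(
    dedup = [1, 1, 2, 2, 2],
    timestamp = [3, 7, 5, 2, 4],
    othervals = [0.1, 0.2, 0.3, 0.4, 0.5],
)

# For each group, take the row at the index of the largest timestamp.
# This is a single linear scan per group instead of a sort.
latest = combine(groupby(data, :dedup)) do subdf
    subdf[argmax(subdf.timestamp), :]
end
```

If a group contains several rows sharing the maximum timestamp, `argmax` returns the index of the first one, so exactly one row per group is kept.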
