How would I sort user transactions by date, and only keep the N most recent entries

Hi how’s it going?

I have a dataframe where I have unique user entries and the date as one of the features. I know how to sort by date, but I’m having difficulty extracting say the last 5 entries per user. In Pandas there’s a GroupBy Nth method that does the trick.

Any help would be greatly appreciated!

You can do things like map(x->x[(end-4):end,:],groupby(df,:user)) to get them separately or wrap that in a DataFrame() call to get the result as one.

This requires DataFramesMeta.jl I believe (but you should definitely be using that package anyway).

This assumes df was sorted beforehand.

One solution (for the last two entries) would be:


julia> using Random, Dates, DataFrames

julia> df = DataFrame(name = rand(["joe","jill"],11),
                      time = shuffle(Date(now()):Day(1):Date(now()+Day(10))))
11×2 DataFrame
│ Row │ name   │ time       │
│     │ String │ Date       │
├─────┼────────┼────────────┤
│ 1   │ jill   │ 2020-02-13 │
│ 2   │ jill   │ 2020-02-06 │
│ 3   │ joe    │ 2020-02-10 │
│ 4   │ jill   │ 2020-02-09 │
│ 5   │ joe    │ 2020-02-16 │
│ 6   │ joe    │ 2020-02-11 │
│ 7   │ joe    │ 2020-02-14 │
│ 8   │ jill   │ 2020-02-15 │
│ 9   │ joe    │ 2020-02-07 │
│ 10  │ joe    │ 2020-02-12 │
│ 11  │ joe    │ 2020-02-08 │

julia> by(df,:name) do group
       sort(group,:time)[end-2:end,:]
       end
6×2 DataFrame
│ Row │ name   │ time       │
│     │ String │ Date       │
├─────┼────────┼────────────┤
│ 1   │ jill   │ 2020-02-09 │
│ 2   │ jill   │ 2020-02-13 │
│ 3   │ jill   │ 2020-02-15 │
│ 4   │ joe    │ 2020-02-12 │
│ 5   │ joe    │ 2020-02-14 │
│ 6   │ joe    │ 2020-02-16 │

You can also probably use Query to do this, but I find the documentation for @mapmany (which you would need to use, along with @groupby and @orderby) to be so opaque that I can’t figure out how to do it right now. The Query solution might run faster, but I don’t know for sure.

Thanks, now I’m facing the issue where some groups have less than 5 users and I’m getting a bounds error. How would I add a conditional to this? (I’m brand new to Julia, always worked in Python/Pandas)

Well that will sort of break us away from the one-liner function but no worries.

function myselect(x)
    N = size(x,1)
    if N <= 5
        return x    # alternatively, you could return nothing if you didn't want these cases
    else
        return x[(end-4):end,:]
    end
end
map(myselect,groupby(df,:user))

There are probably better ways to do all of this but the great thing about Julia is that you can just freely express what you want to do and it will (almost surely) work.

1 Like
by(sdf -> last(sort(sdf, :time), 5), df, :user)
5 Likes

See, I told you there was a better way. :laughing:

But the key point is what you have commented - most of the time if you do not know the library in detail it is easy enough to implement a solution by hand :smile:.

1 Like

Thanks so much guys! Very refreshing how helpful and friendly the community is here.

1 Like