I have a dataframe where I have unique user entries and the date as one of the features. I know how to sort by date, but I’m having difficulty extracting say the last 5 entries per user. In Pandas there’s a GroupBy Nth method that does the trick.
julia> using Random, Dates, DataFrames
julia> df = DataFrame(name = rand(["joe","jill"],11),
time = shuffle(Date(now()):Day(1):Date(now()+Day(10))))
11×2 DataFrame
│ Row │ name │ time │
│ │ String │ Date │
├─────┼────────┼────────────┤
│ 1 │ jill │ 2020-02-13 │
│ 2 │ jill │ 2020-02-06 │
│ 3 │ joe │ 2020-02-10 │
│ 4 │ jill │ 2020-02-09 │
│ 5 │ joe │ 2020-02-16 │
│ 6 │ joe │ 2020-02-11 │
│ 7 │ joe │ 2020-02-14 │
│ 8 │ jill │ 2020-02-15 │
│ 9 │ joe │ 2020-02-07 │
│ 10 │ joe │ 2020-02-12 │
│ 11 │ joe │ 2020-02-08 │
julia> by(df,:name) do group
sort(group,:time)[end-2:end,:]
end
6×2 DataFrame
│ Row │ name │ time │
│ │ String │ Date │
├─────┼────────┼────────────┤
│ 1 │ jill │ 2020-02-09 │
│ 2 │ jill │ 2020-02-13 │
│ 3 │ jill │ 2020-02-15 │
│ 4 │ joe │ 2020-02-12 │
│ 5 │ joe │ 2020-02-14 │
│ 6 │ joe │ 2020-02-16 │
You can also probably use Query to do this, but I find the documentation for @mapmany (which you would need to use, along with @groupby and @orderby) to be so opaque that I can’t figure out how to do it right now. The Query solution might run faster, but I don’t know for sure.
Thanks, now I’m facing the issue where some groups have less than 5 users and I’m getting a bounds error. How would I add a conditional to this? (I’m brand new to Julia, always worked in Python/Pandas)
Well that will sort of break us away from the one-liner function but no worries.
function myselect(x)
N = size(x,1)
if N <= 5
return x # alternatively, you could return nothing if you didn't want these cases
else
return x[(end-4):end,:]
end
end
map(myselect,groupby(df,:user))
There are probably better ways to do all of this but the great thing about Julia is that you can just freely express what you want to do and it will (almost surely) work.
But the key point is what you have commented - most of the time if you do not know the library in detail it is easy enough to implement a solution by hand .