I would be happy to help, to the extent possible, but I’m still in the process of understanding the interfaces, let alone the internals of DataFrames.jl. I’m also a few months away from my tenure review, so perhaps not the most reliable contributor…
Suppose you have unbalanced panel on hourly wages of workers (indexed by :worker
) employed at firms (:firm
) in a particular year (:year
):
julia> a = DataFrame(firm = [1,1,1,1,2,2,2,2], worker = [1,1,2,2,3,3,4,4], year = [2000,2001,2000,2001,2000,2001,2000,2001], wage = 20 .+ 20.0 .* rand(Float64,8))
allowmissing!(a,[:wage])
a[[2,3,6,8],:wage] .= missing
8×4 DataFrame
│ Row │ firm │ worker │ year │ wage │
│ │ Int64 │ Int64 │ Int64 │ Float64⍰ │
├─────┼───────┼────────┼───────┼──────────┤
│ 1 │ 1 │ 1 │ 2000 │ 38.3329 │
│ 2 │ 1 │ 1 │ 2001 │ missing │
│ 3 │ 1 │ 2 │ 2000 │ missing │
│ 4 │ 1 │ 2 │ 2001 │ 30.5861 │
│ 5 │ 2 │ 3 │ 2000 │ 35.4777 │
│ 6 │ 2 │ 3 │ 2001 │ missing │
│ 7 │ 2 │ 4 │ 2000 │ 35.0093 │
│ 8 │ 2 │ 4 │ 2001 │ 25.7551 │
You would like to calculate average wages paid by each firm; first for all years, then separately for the years 2000 and 2001.
In Stata, that’s simply
bysort firm: egen averageWage = mean(wage)
bysort firm year: egen averageWageInYear = mean(wage)
In doing so, Stata ignores missing values in the means, which is what you probably want to do.
Using DataFrames.jl and DataFramesMeta.jl in Julia, I naively start with:
julia> gd = groupby(a, [:firm])
@transform(gd, averageWage = mean(:wage))
8×5 DataFrame
│ Row │ firm │ worker │ year │ wage │ averageWage │
│ │ Int64 │ Int64 │ Int64 │ Float64⍰ │ Missing │
├─────┼───────┼────────┼───────┼──────────┼─────────────┤
│ 1 │ 1 │ 1 │ 2000 │ 38.3329 │ missing │
│ 2 │ 1 │ 1 │ 2001 │ missing │ missing │
│ 3 │ 1 │ 2 │ 2000 │ missing │ missing │
│ 4 │ 1 │ 2 │ 2001 │ 30.5861 │ missing │
│ 5 │ 2 │ 3 │ 2000 │ 35.4777 │ missing │
│ 6 │ 2 │ 3 │ 2001 │ missing │ missing │
│ 7 │ 2 │ 4 │ 2000 │ 35.0093 │ missing │
│ 8 │ 2 │ 4 │ 2001 │ 25.7551 │ missing │
Eh. Not what you want. Using skipmissing
gives you what you want:
gd = groupby(a, [:firm])
@transform(gd, averageWage = mean(skipmissing(:wage)))
8×5 DataFrame
│ Row │ firm │ worker │ year │ wage │ averageWage │
│ │ Int64 │ Int64 │ Int64 │ Float64⍰ │ Float64 │
├─────┼───────┼────────┼───────┼──────────┼─────────────┤
│ 1 │ 1 │ 1 │ 2000 │ 38.3329 │ 34.4595 │
│ 2 │ 1 │ 1 │ 2001 │ missing │ 34.4595 │
│ 3 │ 1 │ 2 │ 2000 │ missing │ 34.4595 │
│ 4 │ 1 │ 2 │ 2001 │ 30.5861 │ 34.4595 │
│ 5 │ 2 │ 3 │ 2000 │ 35.4777 │ 32.0807 │
│ 6 │ 2 │ 3 │ 2001 │ missing │ 32.0807 │
│ 7 │ 2 │ 4 │ 2000 │ 35.0093 │ 32.0807 │
│ 8 │ 2 │ 4 │ 2001 │ 25.7551 │ 32.0807 │
That works, but compared to the Stata version it requires two lines and a function from DataFramesMeta that is scantly documented. The first function someone would look for is probably by
from DataFrames.jl, which even has an option skipmissing
, but that unfortunately does something completely different.
Now suppose you want to calculate the increase in hourly wages for each worker. In our data here, that’s only defined for worker 4. In Stata you do
bysort worker (year): gen wageIncrease = wage[_n] - wage[_n-1]
In Julia, the best solution I came up with is:
a[!,:wageIncrease] = missings(Float64,size(a,1))
sort!(a, [:worker, :year])
f = _df -> @with _df begin
for i = 2:size(_df,1)
:wageIncrease[i] = :wage[i] - :wage[i-1]
end
_df
end
by(a, [:worker], f)
8×6 DataFrame
│ Row │ worker │ firm │ worker_1 │ year │ wage │ wageIncrease │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Float64⍰ │ Float64⍰ │
├─────┼────────┼───────┼──────────┼───────┼──────────┼──────────────┤
│ 1 │ 1 │ 1 │ 1 │ 2000 │ 25.0556 │ missing │
│ 2 │ 1 │ 1 │ 1 │ 2001 │ missing │ missing │
│ 3 │ 2 │ 1 │ 2 │ 2000 │ missing │ missing │
│ 4 │ 2 │ 1 │ 2 │ 2001 │ 32.9265 │ missing │
│ 5 │ 3 │ 2 │ 3 │ 2000 │ 30.7823 │ missing │
│ 6 │ 3 │ 2 │ 3 │ 2001 │ missing │ missing │
│ 7 │ 4 │ 2 │ 4 │ 2000 │ 31.2793 │ missing │
│ 8 │ 4 │ 2 │ 4 │ 2001 │ 32.7701 │ 1.49086 │
I’m sure there must be an easier way to achieve the same. The above is nice because it allows for relatively easy way of addition conditions to the assignment operation.
In this case the presence and propagation of missing
s is not a problem.
Overall, I think my grouped operations tend to fall into one of two categories: either I want to do operations on the whole vector of variables in each group, such as in the example with mean()
above, or it’s a row-wise operation. I cannot think of an example in the latter case where the presence of missing
s would make my code above not do what I want, and propagation of the missings
is fine. In the former case, it’s tricky. Suppose you want to calculate correlations of two variables within groups. Then you’d want to limit yourself to rows where both variables are present (completecases
). In other cases it’s again different.
Bottom line: the current way of doing it requires skipmissing()
and completecases()
in the right places on vector-based operations-- which is not the end of the world, but requires (1) awareness of the fact that missing
s spread faster than the Coronavirus in vector-based operations, (2) more code.
Stata is, like with many other things, somewhat inconsistent: some egen
operations (the closest equivalent of split-apply-combine with vector-valued operations) work only on rows where all variables are non-missing (e.g. cor()
), others on where at least one variable is nonmissing (rowtotal()
). On vector-valued operations that use one vector only (total()
, mean()
, etc) I think it’s equivalent to enclosing the variable in a skipmissing()
.
I have not thought carefully enough about this issue to suggest a solution. My intuition (and what I plan to implement in my package) is that a separation of split-apply-combine for row-wise operations on scalars and for vector-valued operations is useful; also because it helps the user to know whether she needs to use broadcasting or not.