I’m struggling to figure this out. I have some data that look like this:
using DataFrames
using DataFramesMeta
using StatsBase
df = DataFrame(a=[1,1,2,2,3,3,4,4,5,5,6,6], b=rand(vcat(1:3, missing), 12), c=rand(10:50, 12))
and I need to compute a weighted median of :b for each :a in the presence of missing values. I would compute the non-weighted version like so:
@chain df begin
    @by(:a,
        median_b = StatsBase.median(skipmissing(:b))
    )
end
I’ve tried several ways to compute the weighted median, but without any success. Here are a couple of examples:
@chain df begin
    @by(:a,
        median_b = StatsBase.median(
            skipmissing(:b),
            pweights(Array{Int64,1}(getindex(:c, map(x -> !ismissing(x), :b))))
        )
    )
end
# produces a MethodError
@chain df begin
    @by(:a,
        median_b = StatsBase.median(
            skipmissing(:b),
            pweights(:c[:b .!== missing])
        )
    )
end
# also produces a MethodError
This always happens. I banged my head against the wall for half an hour trying to figure this out, and then 5 minutes after posting here I figured it out:
@chain df begin
    @by(:a,
        median_b = StatsBase.median(
            collect(skipmissing(:b)),
            pweights(:c[:b .!== missing])
        )
    )
end
I just didn’t think long enough about why I was getting a MethodError (or take the time to read the error message carefully). The call to skipmissing results in a ::Base.SkipMissing{SubArray{Union{Missing, Int64} rather than an array, so you just have to collect it to get an Array{Int64,1}.
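To see why collect makes the difference, here is a minimal sketch outside of any DataFrame. The vectors b and c are made-up illustration data, not the data from the question. The point is only that skipmissing returns a lazy wrapper rather than an array, and that the weights have to be filtered with the same mask so both inputs end up the same length.

```julia
using StatsBase

# Hypothetical toy data: values with one missing entry, plus one weight per value.
b = [1, missing, 3, 2]
c = [10, 20, 30, 40]

sm = skipmissing(b)
is_array = sm isa AbstractArray   # false: SkipMissing is an iterator, not an array

vals = collect(sm)                # materialize the non-missing values into a Vector{Int64}
keep = .!ismissing.(b)            # Boolean mask of the non-missing positions
w = pweights(c[keep])             # weights restricted to those same positions

# Weighted median over the non-missing entries only.
wm = StatsBase.median(vals, w)
```

Passing sm directly instead of vals is what triggers the MethodError, since the weighted method is defined for arrays.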
This is something we want to make easier! See a stale PR here.
As an aside, note that with more recent versions of DataFramesMeta, you can use begin ... end blocks, with transformations on separate lines, to avoid ugly parentheses and commas (and to make macro-flags like @byrow and @passmissing easier to use).
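A sketch of what the block form could look like for the weighted median, assuming a DataFramesMeta version recent enough to support it (there, new column names on the left-hand side are written as symbols, e.g. :median_b). The DataFrame is a small made-up one so the snippet is self-contained.

```julia
using DataFrames
using DataFramesMeta
using StatsBase

# Hypothetical data with the same column layout as in the question.
df = DataFrame(a = [1, 1, 1, 2, 2, 2],
               b = [1, missing, 3, 2, 2, missing],
               c = [10, 20, 30, 40, 50, 60])

result = @chain df begin
    @by :a begin
        # Block form: one transformation per line, no wrapping parentheses.
        :median_b = StatsBase.median(
            collect(skipmissing(:b)),
            pweights(:c[:b .!== missing]),
        )
    end
end
```

Further summaries per group can be added as extra lines inside the same begin ... end block.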
Alternatively, in general you can add dropmissing(:b, view=true) as a first step in the chain (view=true avoids allocations; otherwise just use dropmissing(:b)), and things should simplify.
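A sketch of that approach on made-up data. It uses the plain dropmissing(:b) form. My understanding is that this also narrows the element type of :b so that it no longer allows missing, which is what lets the weighted median take the column directly. With view=true the column type is left as it is, so the collect(skipmissing(...)) step may still be needed there; treat that as an assumption to check against your DataFrames version.

```julia
using DataFrames
using DataFramesMeta
using StatsBase

# Hypothetical data with the same column layout as in the question.
df = DataFrame(a = [1, 1, 1, 2, 2, 2],
               b = [1, missing, 3, 2, 2, missing],
               c = [10, 20, 30, 40, 50, 60])

result = @chain df begin
    dropmissing(:b)   # drop rows where :b is missing before grouping
    @by(:a, :median_b = StatsBase.median(:b, pweights(:c)))
end
```

Because the rows are dropped up front, :b and :c stay aligned and no mask is needed for the weights.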