DataFramesMeta computing weighted median within @by

I’m struggling to figure this out. I have some data that look like this:

using DataFrames
using DataFramesMeta
using StatsBase

df = DataFrame(a=[1,1,2,2,3,3,4,4,5,5,6,6], b=rand(vcat(1:3, missing), 12), c=rand(10:50, 12))


and I need to compute a weighted median of :b for each :a in the presence of missing values. I would compute the non-weighted version like so:

@chain df begin
	@by(:a,
		median_b = StatsBase.median(skipmissing(:b))
	)
end

I’ve tried several ways to compute the weighted median, but without any success. Here are a couple of examples:

@chain df begin
	@by(:a,
		median_b = StatsBase.median(
			skipmissing(:b),
			pweights(Array{Int64,1}(getindex(:c, map(x -> !ismissing(x), :b))))
		)
	)
end

# produces a MethodError

@chain df begin
	@by(:a,
		median_b = StatsBase.median(
			skipmissing(:b),
			pweights(:c[:b .!== missing])
		)
	)
end

# also produces a MethodError

This always happens. I banged my head against the wall for half an hour trying to figure this out, and then 5 minutes after posting here I figured it out:

@chain df begin
	@by(:a,
		median_b = StatsBase.median(
			collect(skipmissing(:b)),
			pweights(:c[:b .!== missing])
		)
	)
end

I just didn’t think long enough about why I was getting a MethodError (or take the time to read the error message carefully :relaxed:). The call to skipmissing returns a lazy iterator of type ::Base.SkipMissing{SubArray{Union{Missing, Int64} rather than an array, and the weighted median method expects an array, so you just have to collect it to get an Array{Int64,1}.
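To make the reason for the MethodError concrete, here is a minimal sketch outside of DataFramesMeta. The vectors b and c are made-up stand-ins for one group's :b and :c columns, and keep, vals, and wts are names chosen just for this illustration.

```julia
using StatsBase

# Hypothetical data standing in for one group's :b and :c columns
b = [1, missing, 3, 2]
c = [10, 20, 30, 40]

# skipmissing returns a lazy iterator, not an array
lazy = skipmissing(b)
lazy_is_array = lazy isa AbstractArray   # false

# Materialise the non-missing values and subset the weights to the same rows
keep = .!ismissing.(b)
vals = collect(lazy)          # Vector{Int64}
wts  = pweights(c[keep])      # one weight per remaining value

# Both arguments are now array-like, so the weighted method applies
wmed = median(vals, wts)
```

The important part is that vals and wts end up the same length and that vals no longer carries Missing in its element type.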


This is something we want to make easier! See a stale PR here.

As an aside, note that with more recent versions of DataFramesMeta you can use begin ... end blocks, with each transformation on its own line, to avoid ugly parentheses and commas (and to use macro-flags like @byrow and @passmissing more easily).
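For illustration, the working solution might look like this in block form. This is only a sketch: it assumes a recent DataFramesMeta in which new columns are written with a leading colon (:median_b = ...), and the small df here is made-up data, not the random data from the question.

```julia
using DataFrames, DataFramesMeta, StatsBase

# Small deterministic example (hypothetical values)
df = DataFrame(a = [1, 1, 1, 2, 2, 2],
               b = [1, missing, 3, 2, 2, missing],
               c = [10, 20, 30, 40, 50, 60])

# Block syntax: one transformation per line, no trailing commas
result = @by df :a begin
    :median_b = median(collect(skipmissing(:b)), pweights(:c[.!ismissing.(:b)]))
end
```

The same block can be used inside @chain by dropping the df argument.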


Alternatively, in general you can add dropmissing(:b, view=true) as a first step in the chain (view=true avoids allocations; plain dropmissing(:b) also works), and things should simplify.
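A rough sketch of that approach, again with made-up data. It uses the copying form dropmissing(:b) rather than view=true: as far as I understand, the view variant leaves the column's element type as Union{Missing, Int64} by default, which the weighted median method may still reject, while the copying form narrows it to Int64.

```julia
using DataFrames, DataFramesMeta, StatsBase

# Small deterministic example (hypothetical values)
df = DataFrame(a = [1, 1, 1, 2, 2, 2],
               b = [1, missing, 3, 2, 2, missing],
               c = [10, 20, 30, 40, 50, 60])

result = @chain df begin
    # Drop rows where :b is missing; :b and :c stay aligned row by row
    dropmissing(:b)
    # No skipmissing or manual weight subsetting needed any more
    @by(:a, :median_b = median(:b, pweights(:c)))
end
```

Because the rows are dropped up front, the weights no longer need to be subset by hand inside @by.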
