DataFramesMeta computing weighted median within @by

mthelm85 · August 27, 2021, 3:58pm

I’m struggling to figure this out. I have some data that look like this:

using DataFrames
using DataFramesMeta
using StatsBase

df = DataFrame(a=[1,1,2,2,3,3,4,4,5,5,6,6], b=rand(vcat(1:3, missing), 12), c=rand(10:50, 12))

and I need to compute a weighted median of :b for each :a in the presence of missing values. I would compute the non-weighted version like so:

@chain df begin
	@by(:a,
		median_b = StatsBase.median(skipmissing(:b))
	)
end

I’ve tried several ways to compute the weighted median, without any success. Here are a couple of examples:

@chain df begin
	@by(:a,
		median_b = StatsBase.median(
			skipmissing(:b),
			pweights(Array{Int64,1}(getindex(:c, map(x -> !ismissing(x), :b))))
		)
	)
end

# produces a MethodError

@chain df begin
	@by(:a,
		median_b = StatsBase.median(
			skipmissing(:b),
			pweights(:c[:b .!== missing])
		)
	)
end

# also produces a MethodError

mthelm85 · August 27, 2021, 4:06pm

This always happens. I banged my head against the wall for half an hour trying to figure this out, and then 5 minutes after posting here I figured it out:

@chain df begin
	@by(:a,
		median_b = StatsBase.median(
			collect(skipmissing(:b)),
			pweights(:c[:b .!== missing])
		)
	)
end

I just didn’t think long enough about why I was getting a MethodError (or take the time to read the error message carefully). The call to skipmissing returns a Base.SkipMissing{SubArray{Union{Missing, Int64}, …}}, which is an iterator rather than a vector, so you just have to collect it to get an Array{Int64,1}, which the weighted median method accepts.
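To make the type issue concrete, here is a minimal sketch outside of any DataFrame. The vectors `b` and `c` are made-up illustration data, not the data from the question. It shows that `skipmissing` returns a lazy iterator that is not an `AbstractVector`, while `collect` turns it into a plain `Vector{Int64}` that can be paired with a weights vector of the same length.

```julia
using StatsBase

# Hypothetical toy data standing in for one group of :b and :c
b = [1, missing, 3, 2]
c = [10, 20, 30, 40]

sm = skipmissing(b)
println(sm isa AbstractVector)   # SkipMissing is an iterator, not a vector

vals = collect(sm)               # materialize the non-missing values
println(typeof(vals))

# Keep only the weights whose corresponding value is not missing
w = pweights(c[.!ismissing.(b)])

m = StatsBase.median(vals, w)
println(m)
```

The mask `.!ismissing.(b)` does the same job as `:b .!== missing` in the posts above; either form should work.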

pdeffebach · August 27, 2021, 8:18pm

This is something we want to make easier! See a stale PR here.

As an aside, note that with more recent versions of DataFramesMeta, you can use begin ... end blocks, with transformations on separate lines, to avoid ugly parentheses and commas (and to use macro-flags like @byrow and @passmissing more easily).
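A rough sketch of what the block form might look like for this problem, assuming a DataFramesMeta version that supports `begin ... end` blocks and the `:new_column = ...` left-hand-side syntax. The small `df` here is made-up deterministic data, used only so the result is reproducible.

```julia
using DataFrames
using DataFramesMeta
using StatsBase

# Hypothetical deterministic data (the original question uses rand)
df = DataFrame(
    a = [1, 1, 1, 2, 2, 2],
    b = [1, missing, 3, 2, 2, missing],
    c = [10, 20, 30, 40, 50, 60],
)

result = @chain df begin
    @by :a begin
        # collect the non-missing values and keep the matching weights
        :median_b = StatsBase.median(
            collect(skipmissing(:b)),
            pweights(:c[.!ismissing.(:b)]),
        )
    end
end

println(result)
```

Each line inside the block is one transformation, so further summaries can be added on their own lines without extra commas.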

bkamins · August 27, 2021, 8:39pm

Alternatively, in general you can add dropmissing(:b, view=true) as the first step in the chain (view=true avoids allocations; otherwise just use dropmissing(:b)), and things should simplify.
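A minimal sketch of that suggestion, using the copying form `dropmissing(:b)` and made-up deterministic data. As far as I understand, `dropmissing` without `view=true` also narrows the element type of `:b` to `Int64`, which is what lets the weighted median be called directly. With `view=true` the column would likely keep its `Union{Missing, Int64}` element type and might still need converting, so treat that variant as untested here.

```julia
using DataFrames
using DataFramesMeta
using StatsBase

# Hypothetical deterministic data (the original question uses rand)
df = DataFrame(
    a = [1, 1, 1, 2, 2, 2],
    b = [1, missing, 3, 2, 2, missing],
    c = [10, 20, 30, 40, 50, 60],
)

result = @chain df begin
    dropmissing(:b)          # drop rows where :b is missing, before grouping
    @by :a begin
        # no skipmissing or masking needed once the rows are gone
        :median_b = StatsBase.median(:b, pweights(:c))
    end
end

println(result)
```

Because whole rows are dropped up front, the values and weights stay aligned automatically.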
