Creating a DataFrame column as the running mean of another column

StatisticalMouse · January 23, 2022, 8:33am

The original code for this is a Pluto notebook, and the code in this post is supposed to be three notebook cells. The code works, and I’m pretty sure there’s a nicer syntax for the cell three, but I just can’t seem to figure out what it is.

using DataFrames, CSV, Chain, Dates, RollingFunctions

longdf = DataFrame(CSV.File(download("https://raw.githubusercontent.com/StatisticalMice/finland-covid-19/main/fact_epirapo_covid19case.csv")))

begin
	df = @chain longdf begin
		unstack(:Mittari, :val)
		select(:Aika => :Date, "Tapausten lukumäärä" => :Cases, "Testausmäärä" => :Tests, "Kuolemantapausten lukumäärä" => :Deaths)
		subset(:Date => d -> d.>= Date(2021, 8, 1))
		disallowmissing()
	end
	df.CasesRunMean = runmean(df.Cases, 5)
	df
end

bkamins · January 23, 2022, 7:17pm

With vanilla DataFrames.jl it would be in your chain:

transform!(:Cases => (x -> runmean(x, 5)) => :CasesRunMean)

or

@aside _.CasesRunMean = runmean(_.Cases, 5)

With e.g. DataFramesMeta.jl it would be:

@transform!(:CasesRunMean = runmean(:Cases, 5))

(not tested as your code is not reproducible, so please comment if it does not work )

and

subset(:Date => d -> d.>= Date(2021, 8, 1))

can be written as

subset(:Date => ByRow(>=(Date(2021, 8, 1)))

or in DataFramesMeta.jl

@rsubset(:Date >= (Date(2021, 8, 1))

rafael.guerra · January 23, 2022, 7:23pm

@StatisticalMouse, the code already looks nice (except possibly for the Finish words we do not understand…).

As a side note, there doesn’t seem to be a big payoff here in using @chain? The following plain code produces the same result:

dg = unstack(longdf,:Mittari, :val)
select!(dg, :Aika => :Date, "Tapausten lukumäärä" => :Cases, "Testausmäärä" => :Tests, "Kuolemantapausten lukumäärä" => :Deaths)
subset!(dg, :Date => d -> d.>= Date(2021, 8, 1))
dg.CasesRunMean = runmean(dg.Cases, 5)
dg == df    # true

bkamins · January 23, 2022, 7:46pm

As a side note: it is more efficient as it uses select! and subset! which do less copying . We can safely do this as unstack allocates a new data frame so there is no risk of mutating the source.

StatisticalMouse · January 23, 2022, 7:50pm

Hmm, thanks, where to start replying… From the least important maybe.

The Finnish words are from the gov agency CSV. ‘Määrä’ and ‘lukumäärä’ both mean ‘number of’, and otherwise it should be a direct translation. You’re welcome.

When I started writing this yesterday evening it was like any other time I’ve wanted to do something with DataFrames, I was unable to decide which combination of packages to use.

I like using @chain because it uses begin and end, and then I don’t need to add them because Pluto requires them. It just feels cleaner. Except of course in this case I needed the extra pair.

I did make an attempt to make it reproducible, but of course I’ve forgotten to post the link to the repo itself. https://github.com/StatisticalMice/finland-covid-19

I did some versions with DataFramesMeta.jl and DataFrameMacros.jl. Both seem to have benefits. In the end I wanted just the base package as I can’t choose. I’ll fix the code tomorrow, it’s late.

StatisticalMouse · January 24, 2022, 7:35pm

I settled for this version as it seems cleanest to me.

df = @chain longdf begin
	unstack(:Mittari, :val)
	select(:Aika => :Date, "Tapausten lukumäärä" => :Cases, "Testausmäärä" => :Tests, "Kuolemantapausten lukumäärä" => :Deaths)
	@aside _.CasesRunMean = runmean(_.Cases, 5)
	@aside _.TestsRunMean = runmean(_.Tests, 5)
	subset(:Date => ByRow(>=(Date(2021, 8, 1))))
	disallowmissing()
end

Topic		Replies	Views
How to compute a "cumulative" in a dataframe (without a for loop) Data question , dataframes	44	9516	September 11, 2021
DataFrame creating a rowmean for 95 columns General Usage question , dataframes	6	483	March 7, 2021
Row-wise mean of columns in a DataFrame Data	4	1709	August 13, 2021
DataFramesMeta.jl version 0.11.0 Release Package Announcements dataframesmeta	0	531	April 18, 2022
DataFramesMeta.jl and the state of the DataFrames ecosystem Data	36	4022	April 24, 2020

Creating a DataFrame column as the running mean of another column

Related topics