Creating a DataFrame column as the running mean of another column

The original code for this is a Pluto notebook, and the code in this post is supposed to be three notebook cells. The code works, and I’m pretty sure there’s a nicer syntax for the cell three, but I just can’t seem to figure out what it is.

using DataFrames, CSV, Chain, Dates, RollingFunctions
longdf = DataFrame(CSV.File(download("https://raw.githubusercontent.com/StatisticalMice/finland-covid-19/main/fact_epirapo_covid19case.csv")))
begin
	df = @chain longdf begin
		unstack(:Mittari, :val)
		select(:Aika => :Date, "Tapausten lukumäärä" => :Cases, "Testausmäärä" => :Tests, "Kuolemantapausten lukumäärä" => :Deaths)
		subset(:Date => d -> d.>= Date(2021, 8, 1))
		disallowmissing()
	end
	df.CasesRunMean = runmean(df.Cases, 5)
	df
end

2 Likes

With vanilla DataFrames.jl it would be in your chain:

transform!(:Cases => (x -> runmean(x, 5)) => :CasesRunMean)

or

@aside _.CasesRunMean = runmean(_.Cases, 5)

With e.g. DataFramesMeta.jl it would be:

@transform!(:CasesRunMean = runmean(:Cases, 5))

(not tested as your code is not reproducible, so please comment if it does not work :smile:)


and

subset(:Date => d -> d.>= Date(2021, 8, 1))

can be written as

subset(:Date => ByRow(>=(Date(2021, 8, 1)))

or in DataFramesMeta.jl

@rsubset(:Date >= (Date(2021, 8, 1))
1 Like

@StatisticalMouse, the code already looks nice (except possibly for the Finish words we do not understand…).

As a side note, there doesn’t seem to be a big payoff here in using @chain? The following plain code produces the same result:

dg = unstack(longdf,:Mittari, :val)
select!(dg, :Aika => :Date, "Tapausten lukumäärä" => :Cases, "Testausmäärä" => :Tests, "Kuolemantapausten lukumäärä" => :Deaths)
subset!(dg, :Date => d -> d.>= Date(2021, 8, 1))
dg.CasesRunMean = runmean(dg.Cases, 5)
dg == df    # true

As a side note: it is more efficient as it uses select! and subset! which do less copying :smile:. We can safely do this as unstack allocates a new data frame so there is no risk of mutating the source.

1 Like

Hmm, thanks, where to start replying… From the least important maybe. :smile:

The Finnish words are from the gov agency CSV. ‘Määrä’ and ‘lukumäärä’ both mean ‘number of’, and otherwise it should be a direct translation. You’re welcome. :smiley:

When I started writing this yesterday evening it was like any other time I’ve wanted to do something with DataFrames, I was unable to decide which combination of packages to use.

I like using @chain because it uses begin and end, and then I don’t need to add them because Pluto requires them. It just feels cleaner. Except of course in this case I needed the extra pair.

I did make an attempt to make it reproducible, but of course I’ve forgotten to post the link to the repo itself. https://github.com/StatisticalMice/finland-covid-19

I did some versions with DataFramesMeta.jl and DataFrameMacros.jl. Both seem to have benefits. In the end I wanted just the base package as I can’t choose. I’ll fix the code tomorrow, it’s late.

I settled for this version as it seems cleanest to me.

df = @chain longdf begin
	unstack(:Mittari, :val)
	select(:Aika => :Date, "Tapausten lukumäärä" => :Cases, "Testausmäärä" => :Tests, "Kuolemantapausten lukumäärä" => :Deaths)
	@aside _.CasesRunMean = runmean(_.Cases, 5)
	@aside _.TestsRunMean = runmean(_.Tests, 5)
	subset(:Date => ByRow(>=(Date(2021, 8, 1))))
	disallowmissing()
end
1 Like