I’m trying to write a macro that gives a user a convenient way to compute certain measures/statistics across different levels of aggregation. For example, there are roughly 3,200 counties in the USA and we have different data sets with interesting information at the county level. Counties can be aggregated in a variety of ways: to the state level, to the region level, or to any other arbitrary level (districts, along political lines, etc.).
What I would like is for the user to bring a data set with county-level data and write a function that computes some measure of interest for a grouping of counties. I would then provide macros that the user can pass their function to, e.g., @by_state or @by_district, etc.
Here’s an MWE. Let’s say I have a DataFrame like this:
using DataFrames

df = DataFrame(
    county_id=1:10,
    val=rand(10)
)
10×2 DataFrame
 Row │ county_id  val
     │ Int64      Float64
─────┼─────────────────────
   1 │         1  0.556378
   2 │         2  0.943081
   3 │         3  0.689384
   4 │         4  0.289988
   5 │         5  0.872941
   6 │         6  0.862853
   7 │         7  0.786719
   8 │         8  0.658907
   9 │         9  0.815363
  10 │        10  0.721837
I want to define a function that will compute some statistic of interest for a group of counties:
function some_stat(data; counties=[])
    # Keep only the rows that belong to the requested counties
    s = filter(row -> row.county_id in counties, data)
    # Mean of val over those rows
    sum(s.val) / size(s, 1)
end
some_stat(df; counties=[1, 3])
0.6228810924276855
Now, let’s say I want to compute this statistic of interest for each of three groups: A, B, and C:
groups = Dict("A" => [1,2], "B" => [3,4], "C" => [5,6,7,8,9,10])
Here’s one way I can do it:
[k => some_stat(df; counties=v) for (k, v) in groups]
3-element Vector{Pair{String, Float64}}:
"B" => 0.4896859648435802
"A" => 0.7497293978863859
"C" => 0.7864367044382828
I have lots of different groupings for my counties, and I don’t want the user to have to worry about maintaining these groupings (which change from time to time). Also, the data that the user brings may not always be a DataFrame, and it may not even be tabular. What I want is to provide a package with macros that let them do something like this:
@by_group some_stat(df; counties)
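For comparison, the same behavior can be sketched without a macro as a plain higher-order function. This is only a minimal sketch: the helper name `by_group`, the explicit `groups` argument, the toy `vals` dictionary, and `mean_val` are all hypothetical and chosen for illustration; the data is deliberately non-tabular to show that nothing here depends on DataFrames.

```julia
# Hypothetical helper: call `f` once per group, passing the group's county ids
# through the `counties` keyword argument.
function by_group(f, data, groups)
    return Dict(name => f(data; counties = ids) for (name, ids) in groups)
end

# Toy non-tabular data: county_id => value (made-up numbers).
vals = Dict(i => Float64(i) for i in 1:10)
groups = Dict("A" => [1, 2], "B" => [3, 4], "C" => [5, 6, 7, 8, 9, 10])

# Any function that accepts a `counties` keyword can be passed in.
mean_val(data; counties) = sum(data[c] for c in counties) / length(counties)

result = by_group(mean_val, vals, groups)
```

In a package, `groups` could default to a maintained constant so the user never has to supply it; the macro form would mainly add the call-like syntax on top of this.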
The macro should accept an arbitrary function that has a counties keyword argument, and then call the user-provided function for each grouping of counties. I tried this:
macro by_group(original_function)
    quote
        [k => $original_function(args...; counties=v) for (k,v) in groups]
    end
end
but it throws this error:
ERROR: AbstractDataFrame is not iterable. Use eachrow(df) to get a row iterator or eachcol(df) to get a column iterator
I don’t understand what’s causing the error…
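One detail that may be relevant: a macro receives the expression it is applied to, so in the attempt above `original_function` is the whole call `some_stat(df; counties)`, not the function. Interpolating it and appending `(args...; counties=v)` therefore produces a call on the result of a call, and the unescaped names `args` and `groups` are looked up as globals in the module that defines the macro. Below is a minimal sketch of a version that takes the call expression apart instead. It assumes the groupings live in a package-level constant (a hypothetical `GROUPS`), that the `; counties` in the call is only a placeholder, and that any other keyword arguments can be ignored; `vals` and `mean_val` are made-up stand-ins for user data and a user function.

```julia
# Hypothetical package-level groupings (made up for this sketch).
const GROUPS = Dict("A" => [1, 2], "B" => [3, 4], "C" => [5, 6, 7, 8, 9, 10])

macro by_group(call)
    # The macro sees an expression such as `some_stat(df; counties)`, not a function.
    Meta.isexpr(call, :call) || error("@by_group expects a call like f(data; counties)")
    # Function name, escaped so it is resolved in the caller's scope.
    f = esc(call.args[1])
    # Keep the positional arguments; drop the `; counties` placeholder block.
    positional = [esc(a) for a in call.args[2:end] if !Meta.isexpr(a, :parameters)]
    # GROUPS is not escaped, so it refers to the constant in the defining module.
    return :(Dict(k => $f($(positional...); counties = v) for (k, v) in GROUPS))
end

# Toy non-tabular data: county_id => value (made-up numbers).
vals = Dict(i => Float64(i) for i in 1:10)
mean_val(data; counties) = sum(data[c] for c in counties) / length(counties)

result = @by_group mean_val(vals; counties)
```

Running `@macroexpand @by_group mean_val(vals; counties)` is a quick way to check what the macro actually generates.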