Macro that takes a function and calls it repeatedly with different data

I’m trying to write a macro that provides a convenient way for a user to compute certain measures/statistics across different levels of aggregation. For example, there are roughly 3,200 counties in the USA and we have different data sets with interesting information at the county level. Counties can be aggregated a variety of different ways; we can aggregate them to the state level, to the region level, or to any other arbitrary level (districts, along political lines, etc.).

What I would like is for the user to bring a data set with county-level data and write a function that computes some measure of interest for a grouping of counties. I would then provide macros that the user passes their function to, e.g., @by_state or @by_district.

Here’s a MWE. Let’s say I have a DataFrame like this:

df = DataFrame(
    county_id=1:10,
    val=rand(10)
)

10×2 DataFrame
 Row │ county_id  val      
     │ Int64      Float64
─────┼─────────────────────
   1 │         1  0.556378
   2 │         2  0.943081
   3 │         3  0.689384
   4 │         4  0.289988
   5 │         5  0.872941
   6 │         6  0.862853
   7 │         7  0.786719
   8 │         8  0.658907
   9 │         9  0.815363
  10 │        10  0.721837

I want to define a function that will compute some statistic of interest for a group of counties:

function some_stat(data; counties=[])
    s = filter(row -> row.county_id in counties, data)
    sum(s.val) / size(s, 1)
end

some_stat(df; counties=[1, 3])

0.6228810924276855

Now, let’s say I want to compute this statistic of interest for each of three groups: A, B, and C:

groups = Dict("A" => [1,2], "B" => [3,4], "C" => [5,6,7,8,9,10])

Here’s one way I can do it:

[k => some_stat(df; counties=v) for (k, v) in groups]

3-element Vector{Pair{String, Float64}}:
 "B" => 0.4896859648435802
 "A" => 0.7497293978863859
 "C" => 0.7864367044382828

I have lots of different groupings for my counties, and I don’t want the user to have to worry about maintaining these groupings (which change from time to time). Also, the data that the user brings may not always be a DataFrame and it may not even be tabular. What I want is to provide a package that provides macros that allow them to do something like this:

@by_group some_stat(df; counties)

The macro should accept an arbitrary function that has a counties keyword argument and then call the user-provided function for each grouping of counties. I tried this:

macro by_group(original_function)
    quote
        [k => $original_function(args...; counties=v) for (k,v) in groups]
    end
end

but it throws this error:

ERROR: AbstractDataFrame is not iterable. Use eachrow(df) to get a row iterator or eachcol(df) to get a column iterator

I don’t understand what’s causing the error…

At a high level: it’s not obvious why you need a macro. Why not tell them to pass the function directly?

Your non-macro solution seems fine.


So in this toy example, I would agree. In my real problem though, there are many different groupings and they’re not always going to be dictionaries. My goal is for the user to not have to worry about any of those details and to just pass their function to @by_groupsA or @by_groupsB. They just need to know that grouping A and B exist, but the macros handle the details of dealing with the groupings. Does that make sense?

Macros work on expressions parsed from text, so they know very little about the details of the groupings. They only see expressions, symbols, and a few literals, not values and types in general. It’s more likely the transformed expression that will deal with the details, and that can itself be a function; there’s no need to transform it from another function call.
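To make that concrete, here is a minimal sketch. The macro name `@describe` is made up for illustration; it only reports what kind of object the macro receives and never evaluates the call.

```julia
# Hypothetical macro, for illustration only: it describes the expression it
# receives at expansion time instead of evaluating it.
macro describe(ex)
    # `ex` is an Expr here, not the value returned by some_stat
    return string(typeof(ex), " with head :", ex.head)
end

# Neither some_stat, df, nor counties has to exist for this line to run,
# because the macro only ever sees the parsed syntax.
desc = @describe some_stat(df; counties)
println(desc)
```

Since the macro sees syntax rather than the DataFrame or the groupings, any group-specific logic still has to live in ordinary code that runs later.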

If you really can’t refactor to a consistent type, you should find or make consistent methods to interact with them. Different methods of a function can handle the details of different types, but their calls will look the same.
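As a rough sketch of what I mean, assuming two possible grouping representations (a Dict and a vector of pairs): the names `eachgroup`, `by_groups`, and `toy_mean` are invented for this example, and a plain vector stands in for the data.

```julia
# Hypothetical accessor: one method per grouping type, same call everywhere.
eachgroup(g::AbstractDict) = pairs(g)
eachgroup(g::AbstractVector{<:Pair}) = g

# Generic driver: takes the function itself, the grouping, and the data.
by_groups(f, grouping, args...) =
    [k => f(args...; counties=v) for (k, v) in eachgroup(grouping)]

# Toy statistic over a plain vector indexed by county id (assumed data layout)
toy_mean(vals; counties=Int[]) = sum(vals[counties]) / length(counties)

vals = [10.0, 20.0, 30.0, 40.0]
as_dict = Dict("A" => [1, 2], "B" => [3, 4])
as_vec = ["A" => [1, 2], "B" => [3, 4]]

from_dict = Dict(by_groups(toy_mean, as_dict, vals))
from_vec = Dict(by_groups(toy_mean, as_vec, vals))
```

Both calls look identical to the user; only the `eachgroup` method that gets selected differs.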


Can you provide an example? Let’s say I have created a consistent structure for the groupings, I’ll use Dicts in this case:

groupsA = Dict("A" => [1,2], "B" => [3,4], "C" => [5,6,7,8,9,10])
groupsB = Dict("Low" => [1,2,3], "Medium" => [4,5,6,7], "High" => [8,9,10])

If I try to write a function, I get the exact same error:

function by_groupsA(f)
    [k => f(args...; counties=v) for (k,v) in groupsA]
end

by_groupsA(some_stat(df; counties))

ERROR: AbstractDataFrame is not iterable. Use eachrow(df) to get a row iterator or eachcol(df) to get a column iterator

One of the goals is to have the user be able to provide any function that outputs a scalar value based on some data source and a grouping of counties, and then the package will provide all groupings, so the user just needs to know that groupsA and groupsB are available, but doesn’t need to worry about importing or managing the groupings.

There is quite a lot going wrong there. For one, I don’t get that error; the call some_stat(df; counties) should fail at the incorrect keyword argument counties. Compare with your earlier line some_stat(df; counties=[1, 3]). Is that really the entire code you’re running? What is the full stacktrace of that error?

Beyond that, you’re not doing a higher order function call correctly. by_groupsA(f) seems intended to take a function f, but you provide some_stat(df; counties), which if fixed gives a Float64 value. by_groupsA does not take groupsA and args as arguments and will attempt to find them in the global scope; chances are you’d rather pass df and groupsA in as arguments to by_groupsA.

Not without a comprehensive list of the different tables and groupings and a description of what you want to do with them. So far it’s just been filtering and column access of DataFrames and iterating Dictionaries.


I’ve provided a full MWE. As long as you’re using DataFrames, it should be reproducible.

That’s what I’ve provided…it’s a MWE…I want to get it working and then I can apply the solution to my real problem. Here’s a slightly more thorough example, all in one block:

using DataFrames

df1 = DataFrame(
    county_id=1:10,
    val=rand(10)
)

df2 = DataFrame(
    county_id=rand(1:5, 10),
    year=rand(2017:2018, 10),
    val=rand(10)
)

function some_stat1(data; counties=[])
    s = filter(row -> row.county_id in counties, eachrow(data))
    sum(s.val) / size(s, 1)
end

function some_stat2(data, year; counties=[])
    s = filter(row -> row.county_id in counties && row.year == year, eachrow(data))
    sum(s.val) / size(s, 1)
end

# ideal scenario (can't figure out how to do it):

@by_groupsA some_stat1(df1)
@by_groupsB some_stat2(df2, 2017)

# or

some_stat1(df1) |> by_groupsA

# or

by_groupsA(some_stat1, df1) # you get the idea

# what works:
groupsA = Dict("A" => [1,2], "B" => [3,4], "C" => [5,6,7,8,9,10])
groupsB = Dict("Low" => [1,2,3], "Medium" => [4,5,6,7], "High" => [8,9,10])

stat1_groupsA = [k => some_stat1(df1; counties=v) for (k,v) in groupsA]
stat1_groupsB = [k => some_stat1(df1; counties=v) for (k,v) in groupsB]
stat2_groupsA = [k => some_stat2(df2, 2017; counties=v) for (k,v) in groupsA]
stat2_groupsB = [k => some_stat2(df2, 2017; counties=v) for (k,v) in groupsB]

The groupings are used in many areas in our work, so it makes sense to keep them separate somewhere and I think it would be nice to provide some convenience functions/macros/whatevers to be able to do what I’m showing above.

Not sure if this is what you want.


using DataFrames

# Information
df = DataFrame(
    county_id=1:10,
    val=rand(10)
)

# Your groups
groups = Dict("A" => [1,2], "B" => [3,4], "C" => [5,6,7,8,9,10])

# Now I'd identify the groups in the dataframe:
# Dict where the key is an element of county_id and the value is the group name
dict_groups = Dict(el => k for (k, group) in pairs(groups) for el in group)

# Create gr column in df, based on the county_id column using the dictionary
df.gr = [get(dict_groups, v, missing) for v in df.county_id]

You now have

julia> df
10×3 DataFrame
 Row │ county_id  val        gr     
     │ Int64      Float64    String
─────┼──────────────────────────────
   1 │         1  0.79117    A
   2 │         2  0.387188   A
   3 │         3  0.225699   B
   4 │         4  0.935953   B
   5 │         5  0.54986    C
   6 │         6  0.28277    C
   7 │         7  0.636189   C
   8 │         8  0.0735388  C
   9 │         9  0.249462   C
  10 │        10  0.623065   C

And you can take some statistic for a group in this way.

# Example of how to use it
using Statistics  # provides mean

# group df by A, B, C
dfg = groupby(df, :gr)

# example of computing mean of "val" for group A
target_group = dfg[("A",)]
stats = mean(target_group.val)
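
If you want the statistic for every group at once rather than one group at a time, `combine` on the grouped data frame should do it. A minimal sketch, using fixed toy values instead of `rand` so the result is easy to check (the column name `mean_val` is just my choice):

```julia
using DataFrames, Statistics

# Toy data with fixed values (assumed for illustration)
df = DataFrame(county_id = 1:6, val = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0])
groups = Dict("A" => [1, 2], "B" => [3, 4], "C" => [5, 6])

# Invert the grouping: county_id => group name
county_to_group = Dict(c => name for (name, ids) in groups for c in ids)
df.gr = [get(county_to_group, c, missing) for c in df.county_id]

# One row per group, with the mean of val in a new column
by_group = combine(groupby(df, :gr), :val => mean => :mean_val)
```

This only covers statistics that can be expressed per group on a table, so it may not fit if the data is not tabular.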


Your MWE still only has DataFrame dfs and Dict groups, so there doesn’t appear to be a need for multimethods or metaprogramming. You can just pass them as arguments.

# stat must be callable, and have a method with a keyword counties
function statbygroups(stat, groups::AbstractDict, df::AbstractDataFrame, moreargs...)
  [k => stat(df, moreargs...; counties=v) for (k,v) in groups]
end

stat1_groupsA = statbygroups(some_stat1, groupsA, df1)
stat1_groupsB = statbygroups(some_stat1, groupsB, df1)
stat2_groupsA = statbygroups(some_stat2, groupsA, df2, 2017)
stat2_groupsB = statbygroups(some_stat2, groupsB, df2, 2017)

If groupsA and groupsB are special so it’s worth not passing them as arguments in the call:

struct Statby{D<:AbstractDict}
  groups::D
end

function (s::Statby)(stat, df::AbstractDataFrame, moreargs...)
  statbygroups(stat, s.groups, df, moreargs...)
end

const statbygroupsA = Statby(groupsA)
const statbygroupsB = Statby(groupsB)

stat1_groupsA = statbygroupsA(some_stat1, df1)
stat1_groupsB = statbygroupsB(some_stat1, df1)
stat2_groupsA = statbygroupsA(some_stat2, df2, 2017)
stat2_groupsB = statbygroupsB(some_stat2, df2, 2017)
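
Since you mentioned the data may not always be a DataFrame, the same pattern should work with the type annotation on the data dropped. A sketch under that assumption; `GroupRunner`, `by_toy_groups`, and `total` are hypothetical names, and a plain vector indexed by county id stands in for non-tabular data:

```julia
# Hypothetical variant of the callable wrapper with no restriction on the data type.
struct GroupRunner{G}
    groups::G
end

# `data` can be anything the user's stat function understands
(r::GroupRunner)(stat, data, moreargs...) =
    [k => stat(data, moreargs...; counties=v) for (k, v) in r.groups]

by_toy_groups = GroupRunner(Dict("Low" => [1, 2], "High" => [3, 4]))

# Non-tabular data: a plain vector indexed by county id (toy values)
vals = [1.0, 2.0, 3.0, 4.0]
total(data; counties) = sum(data[counties])

result = Dict(by_toy_groups(total, vals))
```

The user still only needs to know that the grouping object exists; how the data is filtered by `counties` stays inside their own function.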