Iterate fitting linear model and extract coefficients over grouped DataFrame

guysutton · July 26, 2022, 11:41am

I am trying to iteratively fit many models (e.g. a linear model [GLM.jl] or mixed model [MixedModels.jl]) to a DataFrame, whereby a new model is fit to each level of a grouping column. For example: if I wanted to fit one model to the data for rep1 and another model to rep2 from the grp column…

# Load packages 
using DataFrames, GLM, StatsBase

# Simulate fake data
data = DataFrame(
    # Response variable
    y =[1, 2, 3, 4, 2, 4, 7, 8], 
    # Predictor variable 
    x=["A1", "A2", "A1", "A2", "A1", "A2", "A1", "A2",], 
    # Grouping variable 
    grp = ["rep1", "rep1", "rep1", "rep1", "rep2", "rep2", "rep2", "rep2"]
    );
data

I can fit a model to the full dataset.

GLM.lm(@formula(y ~ x), data)

But, I can’t seem to do this for each level of grp.

# One approach
grouped_data = groupby(data, [:grp])
for grp in grouped_data
    result = GLM.lm(@formula(y ~ x), grouped_data)
end

# Another approach 
results = [lm((@eval @formula(y ~ x)), grouped_data) for grp in grouped_data]

Both approaches throw the following error:

ArgumentError: expected data in a Table, got GroupedDataFrame{DataFrame}

I realise that this error is saying that the data input (grouped_data) is not in the correct format to be readable by the lm function call. I am not sure how to pass the grouped data as a table though? Maybe storing the grouped datasets in a list and iterating over the list would be better?

I have seen this [How to save results for each outcome? - #10 by nilshg] on how to iterate over different columns. However, I can’t seem to figure out how to translate this to iteratve over groups in a column.

Ultimately, I would like to produce a table with co-efficients from the fitted models, e.g.

results
2×2 DataFrame
 Row │ grp         beta1             
         │ Symbol  Float64     
─── ┼────────────
   1    │ rep1       5.17577   
   2    │ rep2       0.97570

Any advice or resources would be much appreciated.

bkamins · July 26, 2022, 12:00pm

This is covered in my tutorial that will start in 2 hours from now. See around cell [82] in the notebook (you would then need to change this code a bit to get what you want, but I hope you will see the idea how to do it).

bkamins · July 26, 2022, 12:02pm

The link to the tutorial is A Complete Guide to Efficient Transformations of data frames | JuliaCon 2022

guysutton · July 27, 2022, 10:37am

Thanks for the resources, @bkamins. This was really helpful (and the live session was excellent, thanks).

The solution for anyone who comes across this thread:

# Load packages 
using DataFrames, DataFramesMeta, GLM, StatsBase

# Simulate fake data
data = DataFrame(
    # Response variable
    y = [1, 2, 3, 4, 2, 4, 7, 8], 
    # Predictor variable 
    x = ["A1", "A2", "A1", "A2", "A1", "A2", "A1", "A2",], 
    # Grouping variable 
    grp = ["rep1", "rep1", "rep1", "rep1", "rep2", "rep2", "rep2", "rep2"]
    );
data

# Code is adapted from Bogumił Kamiński's excellent tutorial and his Discourse post
# URL: https://discourse.julialang.org/t/iterate-fitting-linear-model-and-extract-coefficients-over-grouped-dataframe/84814

# Analysis (store the fitted model coefficients in a variable called 'results') 
results = 

    # Combine allows us to process the dataframe (passed here as the first argument to groupby), and group
    # the dataframe by the different elements of the `grp` column 
    # R code: data %>% dplyr::group_by(grp)
    # The 'do sdf' part seems to be like lapply(), and 'sdf' seems to be a placeholder for our grouped dataframe?
    combine(groupby(data, :grp)) do sdf
    
    # Specify the model formula to be passed to the model fitting function
    # R code: model_formula = "y ~ x" 
    model_formula = Term(:y) ~ Term(:x)

    # Pass the grouped data to the model fitting function
    # - Fits a seperate model to each unique element in the `grp` column 
    # R code: model <- lm(model_formula, data = .)
    # - Don't forget that we need to pass the grouped dataframe (`sdf`) to the function, not the original data 
    model = GLM.lm(model_formula, sdf)

    # Extract a summary table of all the model co-efficients 
    # R code: Nothing exactly the same, but similar to calling: broom::broom(model)
    return DataFrame(coeftable(model))

end # End the function call

bkamins · July 27, 2022, 10:45am

Nice,

For your reference, in my talk I presented:

model_formula = Term(:y) ~ Term(:x)

as an advanced method to specify model formula.

In such simple case as this one the following is enough:

julia> model_formula = @formula(y~x)
FormulaTerm
Response:
  y(unknown)
Predictors:
  x(unknown)

Topic		Replies	Views
How to create a linear regression model for each of the group General Usage question	4	643	June 26, 2021
Iteration over 10 columns to calculate LinearRegression Machine Learning	4	377	March 27, 2021
Error when using fit(LinearModel,...) through GLM General Usage glm	1	585	August 19, 2021
Iterating over (key, value) with grouped DataFrame General Usage data , dataframes	8	6050	October 18, 2022
Univariate linear regression with each covariate of dataframe Statistics dataframes , regression	5	804	October 15, 2019

Iterate fitting linear model and extract coefficients over grouped DataFrame

Related topics