How to fit a GLM to all (unnamed) features of arbitrary design matrix?

murphyk · February 6, 2019, 1:13am

I want to call the GLM.jl library as a blackbox, which means I want to use a formula of the form “y ~ .”, where y is the target column of my dataframe, and . is all the other features.
Specifically, what I would like to write is this:

function logreg_binary_fit_glm(X, y)
	D, N = size(X)
	Xt = permutedims(X)
	df = DataFrame(Xt)
	df[:y] = y
	# regress on all inputs :x1 ... :xd
	model = glm(@formula(y ~.), df, Binomial(), LogitLink())
	return coef(model)
end

Unfortunately dot syntax is not supported by the latest StatsModels v0.5, and the solution proposed here does not work.
It is easy to generate the string “y ~ x1 + x2 + … xD” where D is the number of features (eg using the code below).

function make_formula_all_features(df, target)
	col_symbols = Set(names(df))
	feature_symbols = setdiff(col_symbols, Set([target]))
	feature_strings = [string(s) for s in feature_symbols]
	all_features = join(feature_strings, " + " )
	formula = string(target) * " ~ " * all_features
	return formula
end

However - I don’t know how to convert this to an expression to pass to the @formula macro, and applying Meta.parse to the output does not work:

n = 10
df = DataFrame(X1=randn(n), X2=randn(n), Y=randn(n))
ols = lm(@formula(Y ~ X1 + X2), df) # works
formula = make_formula_all_features(df, :Y) # "Y ~ X2 + X1"
f_expr = Meta.parse(formula) # :(Y ~ X2 + X1)
ols2 = lm(@formula(f_expr), df) # ERROR: type Symbol has no field head

Any ideas?

Nosferican · February 6, 2019, 4:43am

Based on example, can’t you just

glm(X::AbstractMatrix, y::AbstractVector, Binomial()) |>
    coef

?
You can pass a matrix and a vector directly if you so which.

For example,

using Random
Random.seed!(0);
X = hcat(ones(100), rand(100, 2));
y = rand(0:1, 100);
fit(GeneralizedLinearModel, X, y, Binomial()) |>
    coef

kmmate · February 6, 2019, 9:18pm

First name your columns, e.g.

using DataFrames, GLM
df_y = DataFrame(y, [:yname])  # or DataFrame(y[:, :], [:yname]) if y is Array{T, 1}
df_x_names = [Symbol("x$i") for i in 1:size(x)[2]]
df_x = DataFrame(x, df_x_names)
df = hcat(df_y, df_x)

and then try the following workaround to, for e.g., construct a logit with main effects

lhs = :yname
rhs = Expr(:call, :+, df_x_names...)
logit = glm(@eval(@formula($(lhs) ~ $(rhs))), df, Bernoulli(), LogitLink())

Topic		Replies	Views
How to get a GLM where formula is programmatically generated New to Julia dataframes , glm	3	1309	September 1, 2020
How to define a formula with a vector of string names for GLM.jl? General Usage	3	1255	August 14, 2020
Stepwise logistic regress - GLM - non-callable --> callable (Non-call expression encountered) Optimization (Mathematical) regression , glm	7	1273	October 10, 2018
GLM.jl with unknown column names Statistics statistics , regression , glm	4	1874	February 19, 2019
Extract @formula / model formula from GLM object General Usage glm	2	533	October 8, 2021

How to fit a GLM to all (unnamed) features of arbitrary design matrix?

Related topics