How to fit a GLM to all (unnamed) features of arbitrary design matrix?


#1

I want to call the GLM.jl library as a blackbox, which means I want to use a formula of the form “y ~ .”, where y is the target column of my dataframe, and . is all the other features.
Specifically, what I would like to write is this:

function logreg_binary_fit_glm(X, y)
	D, N = size(X)
	Xt = permutedims(X)
	df = DataFrame(Xt)
	df[:y] = y
	# regress on all inputs :x1 ... :xd
	model = glm(@formula(y ~.), df, Binomial(), LogitLink())
	return coef(model)
end

Unfortunately dot syntax is not supported by the latest StatsModels v0.5, and the solution proposed here does not work.
It is easy to generate the string “y ~ x1 + x2 + … xD” where D is the number of features (eg using the code below).

function make_formula_all_features(df, target)
	col_symbols = Set(names(df))
	feature_symbols = setdiff(col_symbols, Set([target]))
	feature_strings = [string(s) for s in feature_symbols]
	all_features = join(feature_strings, " + " )
	formula = string(target) * " ~ " * all_features
	return formula
end

However - I don’t know how to convert this to an expression to pass to the @formula macro, and applying Meta.parse to the output does not work:

n = 10
df = DataFrame(X1=randn(n), X2=randn(n), Y=randn(n))
ols = lm(@formula(Y ~ X1 + X2), df) # works
formula = make_formula_all_features(df, :Y) # "Y ~ X2 + X1"
f_expr = Meta.parse(formula) # :(Y ~ X2 + X1)
ols2 = lm(@formula(f_expr), df) # ERROR: type Symbol has no field head

Any ideas?


#2

Based on example, can’t you just

glm(X::AbstractMatrix, y::AbstractVector, Binomial()) |>
    coef

?
You can pass a matrix and a vector directly if you so which.

For example,

using Random
Random.seed!(0);
X = hcat(ones(100), rand(100, 2));
y = rand(0:1, 100);
fit(GeneralizedLinearModel, X, y, Binomial()) |>
    coef

#3

First name your columns, e.g.

using DataFrames, GLM
df_y = DataFrame(y, [:yname])  # or DataFrame(y[:, :], [:yname]) if y is Array{T, 1}
df_x_names = [Symbol("x$i") for i in 1:size(x)[2]]
df_x = DataFrame(x, df_x_names)
df = hcat(df_y, df_x)

and then try the following workaround to, for e.g., construct a logit with main effects

lhs = :yname
rhs = Expr(:call, :+, df_x_names...)
logit = glm(@eval(@formula($(lhs) ~ $(rhs))), df, Bernoulli(), LogitLink())