Your second question is a bit easier, so I’ll answer it first. You can combine Terms with + (or sum), like
julia> sum(term.([:x1, :x2, :x3, :x4])) |> string
"x1 + x2 + x3 + x4"
If you really need to use an Expr, then you’ll have to do some parsing of it yourself, or use @formula.
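To illustrate what "parsing it yourself" might look like, here is a minimal sketch. The helper name expr_to_term is hypothetical (it is not part of StatsModels), and it only handles symbols, integer literals, and + calls; anything else, such as interactions or function calls, would need additional cases.

```julia
using StatsModels

# Hypothetical helper, not a StatsModels function: walks a quoted
# expression and builds terms out of symbols, integers, and `+` calls.
expr_to_term(s::Symbol) = term(s)
expr_to_term(n::Integer) = term(n)
function expr_to_term(ex::Expr)
    if ex.head == :call && ex.args[1] == :+
        # args[2:end] are the operands of `+`; combine them with `sum`
        return sum(expr_to_term.(ex.args[2:end]))
    end
    error("unsupported expression: $ex")
end

rhs = expr_to_term(:(x1 + x2 + x3))

# Combine with a response term to get a full formula
f = term(:y) ~ rhs
```

The resulting f can then be passed through schema/apply_schema in the same way as a formula created with @formula.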
For your first question, you’ll have to do the filtering yourself. You can convert those entries to missing in the underlying data source and they’ll be dropped automatically, or you can post-filter the generated model matrix (but, as you said, that will require holding separate copies of the data). If you’re really worried about memory usage, you could always stream your data row-wise, call modelcols one row at a time, and re-assemble the results on the other side. Something like:
julia> using StatsModels, Tables, DataFrames
julia> df = DataFrame(y = rand(10), x = rand(Bool, 10) .* rand(10))
10×2 DataFrame
│ Row │ y │ x │
│ │ Float64 │ Float64 │
├─────┼───────────┼──────────┤
│ 1 │ 0.209279 │ 0.0 │
│ 2 │ 0.424313 │ 0.0 │
│ 3 │ 0.355565 │ 0.372775 │
│ 4 │ 0.298203 │ 0.0 │
│ 5 │ 0.322313 │ 0.0 │
│ 6 │ 0.0742158 │ 0.49498 │
│ 7 │ 0.55074 │ 0.0 │
│ 8 │ 0.747198 │ 0.990947 │
│ 9 │ 0.225106 │ 0.21548 │
│ 10 │ 0.280559 │ 0.363834 │
julia> f = @formula(y ~ x + log(x))
FormulaTerm
Response:
y(unknown)
Predictors:
x(unknown)
(x)->log(x)
julia> # use StatisticalModel as the context to get the intercept term
f = apply_schema(f, schema(f, df), StatisticalModel)
FormulaTerm
Response:
y(continuous)
Predictors:
1
x(continuous)
(x)->log(x)
julia> y, X = modelcols(f, df)
([0.209279, 0.424313, 0.355565, 0.298203, 0.322313, 0.0742158, 0.55074, 0.747198, 0.225106, 0.280559], [1.0 0.0 -Inf; 1.0 0.0 -Inf; … ; 1.0 0.21548 -1.53489; 1.0 0.363834 -1.01106])
julia> X
10×3 Array{Float64,2}:
1.0 0.0 -Inf
1.0 0.0 -Inf
1.0 0.372775 -0.986781
1.0 0.0 -Inf
1.0 0.0 -Inf
1.0 0.49498 -0.703238
1.0 0.0 -Inf
1.0 0.990947 -0.00909459
1.0 0.21548 -1.53489
1.0 0.363834 -1.01106
julia> # method 1: using Iterators and reduce:
allfinite(x) = all(isfinite, Iterators.flatten(x))
allfinite (generic function with 1 method)
julia> y3, X3 = reduce((xy, xyr) -> (append!(xy[1], xyr[1]), append!(xy[2], xyr[2])),
                       Iterators.filter(allfinite, modelcols(f, row) for row in rowtable(df)),
                       init = (Float64[], Float64[]))
([0.355565, 0.0742158, 0.747198, 0.225106, 0.280559], [1.0, 0.372775, -0.986781, 1.0, 0.49498, -0.703238, 1.0, 0.990947, -0.00909459, 1.0, 0.21548, -1.53489, 1.0, 0.363834, -1.01106])
julia> X3 = reshape(X3, 3, :)'
5×3 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
1.0 0.372775 -0.986781
1.0 0.49498 -0.703238
1.0 0.990947 -0.00909459
1.0 0.21548 -1.53489
1.0 0.363834 -1.01106
julia> # method 2: using a for loop:
y2, X2 = zeros(0), zeros(0)
(Float64[], Float64[])
julia> for row in rowtable(df)
           yr, Xr = modelcols(f, row)
           if allfinite((yr, Xr))
               append!(y2, yr)
               append!(X2, Xr)
           end
       end
julia> X2 = reshape(X2, 3, :)'
5×3 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
1.0 0.372775 -0.986781
1.0 0.49498 -0.703238
1.0 0.990947 -0.00909459
1.0 0.21548 -1.53489
1.0 0.363834 -1.01106
Unfortunately, as @oxinabox pointed out on Slack, this way you’re stuck either using the lazy Adjoint or using permutedims, which copies. Julia matrices are laid out column-major in memory, so you can’t build up an observations-as-rows matrix one row at a time unless you know the size ahead of time.
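If knowing the size ahead of time is the only obstacle, one possible workaround is a two-pass approach: count the rows that survive the filter first, then fill a preallocated matrix. The sketch below uses only Base Julia. rowcols is a hypothetical stand-in for modelcols(f, row), and collect_finite is likewise a made-up name. It calls rowcols twice per row, so it trades extra computation for avoiding both the Adjoint wrapper and the permutedims copy.

```julia
# Hypothetical stand-in for `modelcols(f, row)` with y ~ 1 + x + log(x):
# returns the response value and one row of predictors.
rowcols(r) = (r.y, [1.0, r.x, log(r.x)])

allfinite(x) = all(isfinite, Iterators.flatten(x))

function collect_finite(rows, ncols)
    # Pass 1: count the rows that pass the filter so the size is known
    n = count(r -> allfinite(rowcols(r)), rows)
    y = Vector{Float64}(undef, n)
    X = Matrix{Float64}(undef, n, ncols)
    # Pass 2: write each surviving row straight into the preallocated matrix
    i = 0
    for r in rows
        yr, Xr = rowcols(r)
        allfinite((yr, Xr)) || continue
        i += 1
        y[i] = yr
        X[i, :] = Xr
    end
    return y, X
end

# Made-up example data; the first row has log(0.0) == -Inf and is dropped
rows = [(y = 0.2, x = 0.0), (y = 0.4, x = 0.5), (y = 0.7, x = 2.0)]
y, X = collect_finite(rows, 3)
```

Here X is a plain Matrix{Float64} with observations as rows, at the cost of evaluating every row twice.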