Hello.
What’s the difference between using StatLearn and LinReg for linear regression?
fit!(StatLearn(5, MSPI()), (x, y))
fit!(LinReg(), (x,y))
Both are supposed to be able to “Fit a model that is linear in the parameters.”
What method is internally used when you use LinRegBuilder?
Can I choose an algorithm different from SGD?
I’m looking for the fastest method to fit regression models on large datasets that don’t fit in memory. What is my best option? And what about logistic regression? Should I use Flux.jl instead?
OnlineStats docs have this example:
x = randn(100_000, 10)
y = x * linspace(-1, 1, 10) + randn(100_000)
o = StatLearn(10, .5 * L2DistLoss(), L1Penalty(), fill(.1, 10), SGD())
s = Series(o)
fit!(s, x, y)
But my real data is in a CSV file and it’s very big, sometimes larger than memory. How can I tell fit! to read the data directly from disk without first loading everything into memory?
Imagine I create this data and save it to a file called mydata.csv:
N=30000
x1 = repeat(1:N, outer=N)
x2 = repeat(1:N, inner=N)
x3 = sqrt.(repeat(1:N^2))
x1x2 = x1 .* x2
gg = repeat(1:5,inner=div(N^2,5))
y = 1 .- 2*x1 + 3*x2 + 0.5*x1x2 + rand(N^2) + x3.*rand(N^2)
data = DataFrame(x1=x1, x2=x2, x3=x3, x1x2=x1x2, y=y,gg=gg)
categorical!(data, :gg)
And I want to fit it as if I were using the command
fit(LinearModel, @formula(y ~ x1+x2+x3+x1x2+gg), data)
but I want to do it by reading the data directly from mydata.csv.
How would you do it? The data doesn’t fit in memory, and I don’t know how to create a stream (or whatever it is called).
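For concreteness, here is a minimal sketch of what fitting from disk could look like using only the Julia standard library. It is not the OnlineStats API and not necessarily what LinReg does internally. It reads mydata.csv one line at a time, accumulates X'X and X'y, and solves the normal equations at the end, so only one row is in memory at any moment. The helper names `design_row` and `fit_streaming` are hypothetical. The sketch assumes the columns are stored in the order x1, x2, x3, x1x2, y, gg, that gg is written as an integer from 1 to 5, and that the file has a header line and no quoted or missing fields.

```julia
using LinearAlgebra

# Hypothetical helper: build one design row for y ~ x1 + x2 + x3 + x1x2 + gg.
# gg (assumed to be an integer in 1:5) is dummy-coded against level 1.
function design_row(x1, x2, x3, x1x2, gg)
    row = zeros(9)                 # intercept + 4 numeric columns + 4 dummies
    row[1] = 1.0
    row[2] = x1
    row[3] = x2
    row[4] = x3
    row[5] = x1x2
    if gg > 1
        row[4 + gg] = 1.0          # gg = 2..5 maps to columns 6..9
    end
    return row
end

# Hypothetical helper: stream the file once and accumulate X'X and X'y.
# Assumed column order in the file: x1, x2, x3, x1x2, y, gg.
function fit_streaming(path::AbstractString)
    p = 9
    XtX = zeros(p, p)
    Xty = zeros(p)
    open(path) do io
        readline(io)               # skip the header line
        for line in eachline(io)
            isempty(line) && continue
            fields = split(line, ',')
            vals = parse.(Float64, fields[1:5])
            gg = parse(Int, fields[6])
            row = design_row(vals[1], vals[2], vals[3], vals[4], gg)
            XtX .+= row .* row'    # rank-one update of the Gram matrix
            Xty .+= vals[5] .* row
        end
    end
    # Solve the normal equations (X'X) * beta = X'y.
    return Symmetric(XtX) \ Xty
end
```

Calling `beta = fit_streaming("mydata.csv")` would return the nine coefficients in the order intercept, x1, x2, x3, x1x2, then the dummies for gg = 2 to 5. Memory use depends on the number of columns, not the number of rows. One caveat: solving the normal equations can lose accuracy when the columns differ in scale by many orders of magnitude, as x1 and x1x2 do here, so rescaling the columns first may be worth considering.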