OnlineStats: StatLearn vs LinReg vs other options for linear regression on large datasets

#1

Hello.

What’s the difference between using StatLearn and LinReg for linear regression?

fit!(StatLearn(5, MSPI()), (x, y))
fit!(LinReg(), (x, y))

both are supposed to be able to “Fit a model that is linear in the parameters.”

What method is internally used when you use LinRegBuilder?
Can I choose an algorithm different from SGD?

I’m looking for the fastest method for fitting regression models on large datasets that don’t fit in memory. What is my best option? And what about logistic regression? Should I use Flux.jl instead?

OnlineStats docs have this example:

x = randn(100_000, 10)
y = x * range(-1, 1, length=10) + randn(100_000)
o = StatLearn(10, .5 * L2DistLoss(), L1Penalty(), fill(.1, 10), SGD())
s = Series(o)
fit!(s, x, y)

But my real data is in a CSV file and it’s very big, sometimes larger than memory. How can I tell fit! to use the data directly from disk without first loading everything into memory?

Imagine I create this data and I save it on a file called mydata.csv:

using DataFrames

N = 30_000
x1 = repeat(1:N, outer=N)
x2 = repeat(1:N, inner=N)
x3 = sqrt.(1:N^2)
x1x2 = x1 .* x2
gg = repeat(1:5, inner=div(N^2, 5))
y = 1 .- 2 .* x1 .+ 3 .* x2 .+ 0.5 .* x1x2 .+ rand(N^2) .+ x3 .* rand(N^2)
data = DataFrame(x1=x1, x2=x2, x3=x3, x1x2=x1x2, y=y, gg=gg)
categorical!(data, :gg)

And I want to fit it as if I were using the command

fit(LinearModel, @formula(y ~ x1+x2+x3+x1x2+gg), data)

but I want to do it loading the data directly from mydata.csv. How would you do it? The data doesn’t fit in memory, and I don’t know how to create a stream (or whatever it’s called).


#2
  • LinReg is exact regression.

  • Everything StatLearn does is approximate.

  • LinRegBuilder is a more general version of LinReg.

  • StatLearn has many algorithm options: SGD, ADAGRAD, ADAM, ADAMAX, RMSPROP, MSPI, …

  • StatLearn will be faster, LinReg will be more correct. For linear regression I would use LinReg. For logistic regression your only option (in OnlineStats) is StatLearn(p, LogitMarginLoss()).

  • You can repeatedly fit! an OnlineStats object on new batches of data, but OnlineStats is agnostic on how you get your data into Julia. It doesn’t have helpers for streaming a CSV file.

  • If you want to iterate through the rows of a CSV file one by one without loading it into memory, see CSV.File.
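To make the last two points concrete, here is a minimal sketch combining them: iterating a CSV file row by row with CSV.File and updating a LinReg incrementally. The file name and the column names (x1, x2, y) are assumptions for illustration, not part of any real dataset:

```julia
# Sketch: stream rows from a CSV file into an OnlineStats LinReg,
# updating the fit one observation at a time.
using CSV, OnlineStats

o = LinReg()
for row in CSV.File("mydata.csv")
    # One observation is a (predictor vector, response) pair.
    fit!(o, ([row.x1, row.x2], row.y))
end
coef(o)   # coefficient estimates after seeing every row
```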


#3

But this would only fit the regression to each batch of data; I want to fit it using the whole dataset.

And which of those algorithms (SGD, ADAGRAD, ADAM, ADAMAX, RMSPROP, MSPI, …) would you suggest for quickly fitting a linear regression model on a large dataset?

Is there any tutorial with more examples of how to use OnlineStats for this kind of task, and where can I read more details about all the options?


#4

I think you’re misunderstanding how OnlineStats works. Suppose you have a giant dataset (x, y) that is split into batches (x1,y1), (x2,y2), (x3,y3).

The following two things will give you the same exact answer:

fit!(LinReg(), (x, y))   # fit everything at once

o = LinReg()             # or fit batch by batch
fit!(o, (x1, y1))
fit!(o, (x2, y2))
fit!(o, (x3, y3))

You’ll need to read about those algorithms yourself to determine what’s best for you. They’re all comparable in terms of speed.


#5

OK, I’ll see how to do it.
I thought I could do it easily from OnlineStats or JuliaDB.


#6

What are you missing? CSV.File lets you load a CSV file incrementally and OnlineStats lets you fit the linear regression incrementally. Is the thing that’s missing the outer loop that calls CSV.File and passes blocks of it to OnlineStats?
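A minimal version of that outer loop might look like the following sketch. The file name, the column names (x1, x2, y), and the block size are assumptions for illustration:

```julia
# Sketch of the outer loop: read the CSV in blocks of rows and pass each
# block to fit!, so only one block lives in memory at a time.
using CSV, OnlineStats
using Base.Iterators: partition

o = LinReg()
for block in partition(CSV.Rows("mydata.csv"; types = Float64), 10_000)
    X = [getproperty(r, c) for r in block, c in (:x1, :x2)]  # predictor matrix
    y = [r.y for r in block]                                 # response vector
    fit!(o, (X, y))   # same result as fitting everything at once
end
```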


#7

Actually none of that is necessary 🙂 (at least as far as I can tell; I’ve never used this feature in real applications, but from quick REPL experimentation everything seems to be in place). One can use loadtable to load many CSVs together, say:

t = loadtable(myfiles)

and then one can “reduce” a table by an online-stat, so if we want LinReg() we would do:

reduce(LinReg(), t, select = (:x, :y))

with the extra advantage that, as our table is loaded in a distributed way, the fitting can happen in parallel.
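Putting the pieces above together, a sketch of the whole JuliaDB approach might look like this (the file names are hypothetical):

```julia
# Sketch: load several CSV chunks as one logical JuliaDB table, then
# reduce the table with an online statistic to fit the regression.
using JuliaDB, OnlineStats

myfiles = ["mydata_part1.csv", "mydata_part2.csv"]
t = loadtable(myfiles)                       # one table over many files
o = reduce(LinReg(), t, select = (:x, :y))   # regression over all rows
coef(o)
```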


#8

Is that the CSV.jl way?
I guess if I have just one file I don’t need to use reduce.

How is it done with JuliaDB instead?