OnlineStats: StatLearn vs LinReg vs other options for linear regression on large datasets

#1

Hello.

What’s the difference between using StatLearn and LinReg for linear regression?

fit!(StatLearn(5, MSPI()), (x, y))
fit!(LinReg(), (x, y))

both are supposed to be able to “Fit a model that is linear in the parameters.”

What method is internally used when you use LinRegBuilder?
Can I choose an algorithm different from SGD?

I’m looking for the fastest method for fitting regression models on large datasets that don’t fit in memory. What is my best option? And what about logistic regression? Should I use Flux.jl instead?

OnlineStats docs have this example:

x = randn(100_000, 10)
y = x * range(-1, 1, length=10) + randn(100_000)
o = StatLearn(10, .5 * L2DistLoss(), L1Penalty(), fill(.1, 10), SGD())
s = Series(o)
fit!(s, x, y)

But my real data is in a CSV file and it’s very big, sometimes larger than memory. How can I tell fit! to use the data directly from disk without first loading everything into memory?

Imagine I create this data and I save it on a file called mydata.csv:

using DataFrames

N = 30_000
x1 = repeat(1:N, outer=N)
x2 = repeat(1:N, inner=N)
x3 = sqrt.(1:N^2)
x1x2 = x1 .* x2
gg = repeat(1:5, inner=div(N^2, 5))
y = 1 .- 2 .* x1 .+ 3 .* x2 .+ 0.5 .* x1x2 .+ rand(N^2) .+ x3 .* rand(N^2)
data = DataFrame(x1=x1, x2=x2, x3=x3, x1x2=x1x2, y=y, gg=gg)
categorical!(data, :gg)

And I want to fit it as if I were using the command

fit(LinearModel, @formula(y ~ x1+x2+x3+x1x2+gg), data)

but I want to do it loading the data directly from mydata.csv. How would you do it? The data doesn’t fit in memory, and I don’t know how to create a stream (or whatever it’s called).


#2
  • LinReg is exact regression.

  • Everything StatLearn does is approximate.

  • LinRegBuilder is a more general version of LinReg.

  • StatLearn has many algorithm options: SGD, ADAGRAD, ADAM, ADAMAX, RMSPROP, MSPI, …

  • StatLearn will be faster, LinReg will be more correct. For linear regression I would use LinReg. For logistic regression your only option (in OnlineStats) is StatLearn(p, LogitMarginLoss()).

  • You can repeatedly fit! an OnlineStats object on new batches of data, but OnlineStats is agnostic on how you get your data into Julia. It doesn’t have helpers for streaming a CSV file.

  • If you want to iterate through the rows of a CSV file one by one without loading it into memory, see CSV.File.
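To make the last two points concrete, here is a minimal sketch combining them: iterating a CSV file row by row with CSV.File and updating a LinReg incrementally. The file name and the column names (x1, x2, y) are assumptions for illustration, not part of any real dataset:

```julia
# Sketch: stream rows from a CSV file into an OnlineStats LinReg,
# updating the fit one observation at a time.
using CSV, OnlineStats

o = LinReg()
for row in CSV.File("mydata.csv")
    # One observation is a (predictor vector, response) pair.
    fit!(o, ([row.x1, row.x2], row.y))
end
coef(o)   # coefficient estimates after seeing every row
```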


#3

But this would only fit the regression to each batch of data; I want to fit it using the whole dataset.

And which of those algorithms (SGD, ADAGRAD, ADAM, ADAMAX, RMSPROP, MSPI, …) would you suggest for quickly fitting a linear regression model on a large dataset?

Is there any tutorial with more examples of how to use OnlineStats for this kind of task, and where can I read more details about all the options?


#4

I think you’re misunderstanding how OnlineStats works. Suppose you have a giant dataset (x, y) that is split into batches (x1,y1), (x2,y2), (x3,y3).

The following two things will give you the same exact answer:

fit!(LinReg(), (x, y))   # fit everything at once

o = LinReg()             # or fit batch by batch
fit!(o, (x1, y1))
fit!(o, (x2, y2))
fit!(o, (x3, y3))

You’ll need to read about those algorithms yourself to determine what’s best for you. They’re all comparable in terms of speed.


#5

OK, I’ll see how to do it.
I thought I could do it easily from OnlineStats or JuliaDB.


#6

What are you missing? CSV.File lets you load a CSV file incrementally and OnlineStats lets you fit the linear regression incrementally. Is the thing that’s missing the outer loop that calls CSV.File and passes blocks of it to OnlineStats?
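A minimal version of that outer loop might look like the following sketch. The file name, the column names (x1, x2, y), and the block size are assumptions for illustration:

```julia
# Sketch of the outer loop: read the CSV in blocks of rows and pass each
# block to fit!, so only one block lives in memory at a time.
using CSV, OnlineStats
using Base.Iterators: partition

o = LinReg()
for block in partition(CSV.Rows("mydata.csv"; types = Float64), 10_000)
    X = [getproperty(r, c) for r in block, c in (:x1, :x2)]  # predictor matrix
    y = [r.y for r in block]                                 # response vector
    fit!(o, (X, y))   # same result as fitting everything at once
end
```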


#7

Actually none of that is necessary 🙂 (at least as far as I can tell; I’ve never used this feature in real applications, but from quick REPL experimentation everything seems to be in place). One can use loadtable to load many CSVs together, say:

t = loadtable(myfiles)

and then one can “reduce” a table by an online-stat, so if we want LinReg() we would do:

reduce(LinReg(), t, select = (:x, :y))

with the extra advantage that, as our table is loaded in a distributed way, the fitting can happen in parallel.
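Putting the pieces above together, a sketch of the whole JuliaDB approach might look like this (the file names are hypothetical):

```julia
# Sketch: load several CSV chunks as one logical JuliaDB table, then
# reduce the table with an online statistic to fit the regression.
using JuliaDB, OnlineStats

myfiles = ["mydata_part1.csv", "mydata_part2.csv"]
t = loadtable(myfiles)                       # one table over many files
o = reduce(LinReg(), t, select = (:x, :y))   # regression over all rows
coef(o)
```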


#8

Is that the CSV.jl way?
I guess if I have just one file I don’t need to use reduce.

How is it done with JuliaDB instead?