Julia run from the terminal on a 1GB dataset shows an out-of-memory error


#1

Can Julia actually handle large datasets? My machine has 15GB of RAM, and most of the time when I try parallel processing with a dataset of only about 1GB, it shows an out-of-memory error. If Julia supports this, how much data can it take in?

Is it because the RAM is insufficient, or can someone tell me how to specify the memory available to Julia? Where am I getting stuck? Thanks in advance.


#2

The answer to your first question is yes. To help with your issue, we’ll need more information. What format is the data in? What else is your code doing? Ideally, a reproducible example is best; you could post a script that generates dummy data for folks to try.


#3

The dataset is a CSV file, the common diabetes data, replicated:

 6,148,72,35,0,33.6,0.627,50,1
 1,85,66,29,0,26.6,0.351,31,0
 8,183,64,0,0,23.3,0.672,32,1

I am implementing logistic regression. I create 4 processes, split the data, and assign a chunk to each process. It is a simple program of about 30 lines, with only two functions used for the algorithm. The code works at times, but most other times an "out of memory" error occurs. What could be the reason?

By the way, how much data can Julia support? And is any configuration change possible to allocate memory specifically for Julia, as in Hadoop and Spark?


#4

What is meant by that is that you post your code right here, or, if it is larger, as a gist on gist.github.com. In the gist you can also post a part of your CSV file. Including the .jl file should then make the test case runnable as-is.

One option may be to use memory-mapped files (but I’ve never used them, so I cannot be more specific).
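For what it’s worth, here is a minimal sketch of the memory-mapped-file idea using Julia’s `Mmap` standard library. The filename, dimensions, and the assumption that the CSV has first been written out as raw `Float64`s are all hypothetical; the point is only that the array is backed by the file rather than by RAM:

```julia
using Mmap

# Hypothetical row/column counts for a pre-converted binary version of the data.
n, m = 100_000, 9
io = open("diabetics2.bin", "w+")

# The matrix is backed by the file; pages are swapped in and out by the OS,
# so the whole array never has to fit in RAM at once.
A = Mmap.mmap(io, Matrix{Float64}, (n, m))
A[1, :] = [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1]

Mmap.sync!(A)   # flush pending changes to the file
close(io)
```

On Julia 0.6 the same functionality lives under `Base.Mmap`; `using Mmap` is the 1.x spelling.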


#5

Julia can load as much data as you have memory. Unlike the JVM (Hadoop, Spark) you do not need to pass any options to tell it how much memory it will need. It’s hard to know why you’re running out of memory without you posting some code showing what you’re trying to do and what error output you get.


#6

A snippet of the code. Note: the code works and shows output, but not every time I run it.

    function Master()
        println(nprocs())
        table = readdlm("diabetics2.csv", ',')
        data = convert(DataFrame, table)
        features = convert(Array, data[:, 1:8])
        labels = convert(Array, data[:, 9])
        procs = nprocs()
        i = 1
        n, m = size(features)
        #println(n, "--", features[i:Int(n/procs)])
        while procs != 0
            k = Int(n/procs)
            println(k)
            data_procs = features[i:k, :]
            labels_procs = labels[i:k, 1]
            i += k
            n -= k
            procs -= 1
        end
        println("Assigning to processes the work..")
        for i in 1:nprocs()
            @spawnat i trainLogisticRegression(data_i, label_i)
        end
    end

#7

Error

julia> Master()
4
ERROR: OutOfMemoryError()
 in convert(::Type{Array{Float64,2}}, ::DataFrames.DataFrame) at /home/devisree/.julia/v0.6/DataFrames/src/abstractdataframe/abstractdataframe.jl:519
 in Master() at ./REPL[132]:5

Next on executing…

julia> Master()
4
ERROR: OutOfMemoryError()
 in Base.DataFmt.DLMStore{T}(::Type{Float64}, ::Tuple{Int64,Int64}, ::Bool, ::String, ::Bool, ::Char) at ./datafmt.jl:217
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:339
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:361
 in #readdlm_auto#11(::Array{Any,1}, ::Function, ::String, ::Char, ::Type{T}, ::Char, ::Bool) at ./datafmt.jl:132
 in #readdlm#7(::Array{Any,1}, ::Function, ::String, ::Char, ::Char) at ./datafmt.jl:81
 in #readdlm#6(::Array{Any,1}, ::Function, ::String, ::Char) at ./datafmt.jl:73
 in Master() at ./REPL[132]:3

Further

julia> Master()
4
ERROR: OutOfMemoryError()
 in Base.DataFmt.DLMStore{T}(::Type{Float64}, ::Tuple{Int64,Int64}, ::Bool, ::String, ::Bool, ::Char) at ./datafmt.jl:217
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:339
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:361
 in #readdlm_auto#11(::Array{Any,1}, ::Function, ::String, ::Char, ::Type{T}, ::Char, ::Bool) at ./datafmt.jl:132
 in #readdlm#7(::Array{Any,1}, ::Function, ::String, ::Char, ::Char) at ./datafmt.jl:81
 in #readdlm#6(::Array{Any,1}, ::Function, ::String, ::Char) at ./datafmt.jl:73
 in Master() at ./REPL[132]:3

julia> Master()
4
ERROR: OutOfMemoryError()
 in convert(::Type{Array{Float64,2}}, ::DataFrames.DataFrame) at /home/devisree/.julia/v0.6/DataFrames/src/abstractdataframe/abstractdataframe.jl:519
 in Master() at ./REPL[132]:5

julia> Master()
4
*** Error in `julia': corrupted double-linked list: 0x00007f7c80a06ca0 ***

#8

Yes, I can see Julia consuming memory up to my full RAM size (watching with `free -h -s 5`). How can I configure Julia to use only a certain limit of RAM? While Julia is running, all the other processes seem to get stuck because of Julia’s memory consumption. Please guide me.


#9

You cannot tell Julia to use less RAM: it always uses the amount of RAM it needs, not more, but not less either. Indeed if your data does not fit in RAM, limiting the amount of memory Julia is allowed to use would just make the error appear earlier. There’s no way to handle data if it’s too big for your RAM.

That said, in the present case, it looks like you could use less memory for the same task. I don’t understand why you convert table to a DataFrame before extracting some columns. This creates lots of copies of your data. table is already an array, so you shouldn’t need this step.
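To illustrate: `readdlm` already returns a plain numeric matrix for an all-numeric CSV, so the columns can be sliced from it directly, skipping the DataFrame round-trip and its copies. A sketch (the filename is the one from the thread):

```julia
using DelimitedFiles  # readdlm lives here on Julia 1.x; it was in Base on 0.6

table = readdlm("diabetics2.csv", ',')   # Matrix{Float64} for all-numeric data
features = table[:, 1:8]                 # one copy instead of several
labels   = table[:, 9]
```

On Julia 1.x, `@view table[:, 1:8]` would avoid even that one copy.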


#10

Distributing this also seems like overkill and unlikely to give any speedup – not because of Julia but just because 1GB of data is really not very large these days and the overhead of splitting the data up and communicating between processes is unlikely to be worth any speedup. If there’s some computational kernel where threading might help, you could try Base.Threads.@threads on a for loop.
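To make the threading suggestion concrete, here is a hedged sketch on a recent Julia, with a made-up per-row scoring function standing in for the logistic-regression internals (none of these names come from the thread):

```julia
using Base.Threads
using LinearAlgebra  # for dot

# Hypothetical kernel: score each row independently, one row per task.
function score_rows(features::Matrix{Float64}, w::Vector{Float64})
    out = Vector{Float64}(undef, size(features, 1))
    @threads for i in 1:size(features, 1)
        # rows are independent, so iterations can run on different threads
        out[i] = 1 / (1 + exp(-dot(features[i, :], w)))
    end
    return out
end
```

Note that `@threads` only helps if Julia was started with more than one thread (e.g. `JULIA_NUM_THREADS=4`); the loop still produces the same result single-threaded.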


#11

If you really want the data as a DataFrame, just use DataFrames.readtable directly.


#12

Thanks very much for the detailed reply! Julia is said to perform very well on large datasets and for ML algorithms. So does the Julia language perform better than Scala in Spark?


#13

It would be great to be able to work with larger-than-memory datasets, letting Julia decide how to stream/chunk/reflow/map them between disk and memory.


#14

Check out https://github.com/joshday/OnlineStats.jl. Online algorithms are well suited for streaming data, or for when the data is too large to hold in memory.

As to the original question: I agree with Stefan’s earlier post, doing logistic regression in parallel is likely overkill for 1GB of data. The first couple of lines of the `Master()` function also consume a lot of unnecessary memory.
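The streaming pattern the package encourages looks roughly like the sketch below. `Mean` is used here only because it is the simplest statistic; the chunked `fit!` loop is the point, since each chunk could be read from disk and discarded:

```julia
using OnlineStats

o = Mean()
# Stand-ins for chunks that would be read from disk one at a time.
for chunk in ([1.0, 2.0], [3.0, 4.0])
    fit!(o, chunk)   # updates the running statistic in place
end
value(o)             # mean of all observations seen so far: 2.5
```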


#15

Hello

But can we use it for any kind of computation, or just for the small subset of functions already implemented there?
For example, can I use the MixedModels package with it, or do something similar?


#16

This is like asking “Does Scala do streaming data processing?”. The question doesn’t make sense for a programming language, it only makes sense for a computational framework. If you use Scala with an in-memory framework, then it will require you to load all data in memory. If you use Spark – which is a framework written in Scala – since it’s a framework designed for out-of-memory streamed computation, yes, you can do that. OnlineStats is a Julia package (read lightweight framework) for online computation which can do exactly what you’re asking about – e.g. logistic regression on extremely large data without loading it all into memory at once. You can also use Spark from Julia, which has access to everything built into Spark and allows you to write Spark jobs in Julia.


#17

I expected Julia to be both a language and a framework with packages.


#18

Julia is a language; there are frameworks and packages for it, such as OnlineStats and Spark.jl. Hopefully that helps.


#19

How can I fit a Mixed-effects model with OnlineStats?

For example, imagine the data from this example
https://dmbates.github.io/MixedModels.jl/latest/man/fitting/#A-simple-example-1
were 10000 times larger and didn’t fit in memory.

What would you write instead of this:

    fit!(lmm(@formula(Yield ~ 1 + (1 | Batch)), ds), true);