Julia run from the terminal on a 1GB dataset shows an out-of-memory error


#1

Can Julia actually handle large datasets? My machine has 15GB of RAM, and most of the time when I try parallel processing with a dataset of only about 1GB, it shows an out-of-memory error. If Julia supports this, how much data can it take in?

Is it because the RAM is insufficient, or can someone tell me how to specify the memory available to Julia? Where am I getting stuck? Thanks in advance.


#2

The answer to your first question is yes. To help with your issue, we’ll need more information. What format is the data in? What else is your code doing? Ideally, a reproducible example is best; you could post a script that generates dummy data for folks to try.


#3

The dataset is a CSV file, the common diabetes data, replicated:

 6,148,72,35,0,33.6,0.627,50,1
 1,85,66,29,0,26.6,0.351,31,0
 8,183,64,0,0,23.3,0.672,32,1

I am implementing logistic regression. I create 4 processes, split the data, and assign a chunk to each process. It is a simple program of about 30 lines, with only two functions used for the algorithm. The code works at times, but most other times an "out of memory" error occurs. What could be the reason?

By the way, how much data can Julia support? And is any configuration change possible to allocate memory specifically for Julia, as in Hadoop and Spark?


#4

What is meant by that is that you post your code right here, or, if it is larger, as a gist on gist.github.com. In the gist you can also post a part of your CSV file. Including the .jl file should then make the test case runnable as-is.

One option may be to use memory-mapped files (but I’ve never used them, so I cannot be more specific).
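For what it’s worth, here is a minimal sketch of the memory-mapped-file idea using Julia’s `Mmap` standard library. The filename, dimensions, and the assumption that the CSV has first been written out as raw `Float64`s are all hypothetical; the point is only that the array is backed by the file rather than by RAM:

```julia
using Mmap

# Hypothetical row/column counts for a pre-converted binary version of the data.
n, m = 100_000, 9
io = open("diabetics2.bin", "w+")

# The matrix is backed by the file; pages are swapped in and out by the OS,
# so the whole array never has to fit in RAM at once.
A = Mmap.mmap(io, Matrix{Float64}, (n, m))
A[1, :] = [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1]

Mmap.sync!(A)   # flush pending changes to the file
close(io)
```

On Julia 0.6 the same functionality lives under `Base.Mmap`; `using Mmap` is the 1.x spelling.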


#5

Julia can load as much data as you have memory. Unlike the JVM (Hadoop, Spark) you do not need to pass any options to tell it how much memory it will need. It’s hard to know why you’re running out of memory without you posting some code showing what you’re trying to do and what error output you get.


#6

A snippet of the code. Note: the code works and shows output, but not every time I run it.

    function Master()
        println(nprocs())
        table = readdlm("diabetics2.csv", ',')
        data = convert(DataFrame, table)
        features = convert(Array, data[:, 1:8])
        labels = convert(Array, data[:, 9])
        procs = nprocs()
        i = 1
        n, m = size(features)
        #println(n, "--", features[i:Int(n/procs)])
        while procs != 0
            k = Int(n/procs)
            println(k)
            data_procs = features[i:k, :]
            labels_procs = labels[i:k, 1]
            i += k
            n -= k
            procs -= 1
        end
        println("Assigning to processes the work..")
        for i in 1:nprocs()
            @spawnat i trainLogisticRegression(data_i, label_i)
        end
    end

#7

Error

julia> Master()
4
ERROR: OutOfMemoryError()
 in convert(::Type{Array{Float64,2}}, ::DataFrames.DataFrame) at /home/devisree/.julia/v0.6/DataFrames/src/abstractdataframe/abstractdataframe.jl:519
 in Master() at ./REPL[132]:5

Next on executing…

julia> Master()
4
ERROR: OutOfMemoryError()
 in Base.DataFmt.DLMStore{T}(::Type{Float64}, ::Tuple{Int64,Int64}, ::Bool, ::String, ::Bool, ::Char) at ./datafmt.jl:217
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:339
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:361
 in #readdlm_auto#11(::Array{Any,1}, ::Function, ::String, ::Char, ::Type{T}, ::Char, ::Bool) at ./datafmt.jl:132
 in #readdlm#7(::Array{Any,1}, ::Function, ::String, ::Char, ::Char) at ./datafmt.jl:81
 in #readdlm#6(::Array{Any,1}, ::Function, ::String, ::Char) at ./datafmt.jl:73
 in Master() at ./REPL[132]:3

Further

julia> Master()
4
ERROR: OutOfMemoryError()
 in Base.DataFmt.DLMStore{T}(::Type{Float64}, ::Tuple{Int64,Int64}, ::Bool, ::String, ::Bool, ::Char) at ./datafmt.jl:217
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:339
 in readdlm_string(::String, ::Char, ::Type{T}, ::Char, ::Bool, ::Dict{Symbol,Union{Char,Integer,Tuple{Integer,Integer}}}) at ./datafmt.jl:361
 in #readdlm_auto#11(::Array{Any,1}, ::Function, ::String, ::Char, ::Type{T}, ::Char, ::Bool) at ./datafmt.jl:132
 in #readdlm#7(::Array{Any,1}, ::Function, ::String, ::Char, ::Char) at ./datafmt.jl:81
 in #readdlm#6(::Array{Any,1}, ::Function, ::String, ::Char) at ./datafmt.jl:73
 in Master() at ./REPL[132]:3

julia> Master()
4
ERROR: OutOfMemoryError()
 in convert(::Type{Array{Float64,2}}, ::DataFrames.DataFrame) at /home/devisree/.julia/v0.6/DataFrames/src/abstractdataframe/abstractdataframe.jl:519
 in Master() at ./REPL[132]:5

julia> Master()
4
*** Error in `julia': corrupted double-linked list: 0x00007f7c80a06ca0 ***

#8

Yes, I can see Julia consuming memory up to my full RAM size (watching with `free -h -s 5`). How can I configure Julia to use only a certain limit of RAM? While Julia is running, all the other processes seem to get stuck because of Julia’s memory consumption. Please guide me.


#9

You cannot tell Julia to use less RAM: it always uses the amount of RAM it needs, not more, but not less either. Indeed if your data does not fit in RAM, limiting the amount of memory Julia is allowed to use would just make the error appear earlier. There’s no way to handle data if it’s too big for your RAM.

That said, in the present case, it looks like you could use less memory for the same task. I don’t understand why you convert table to a DataFrame before extracting some columns. This creates lots of copies of your data. table is already an array, so you shouldn’t need this step.
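To illustrate: `readdlm` already returns a plain numeric matrix for an all-numeric CSV, so the columns can be sliced from it directly, skipping the DataFrame round-trip and its copies. A sketch (the filename is the one from the thread):

```julia
using DelimitedFiles  # readdlm lives here on Julia 1.x; it was in Base on 0.6

table = readdlm("diabetics2.csv", ',')   # Matrix{Float64} for all-numeric data
features = table[:, 1:8]                 # one copy instead of several
labels   = table[:, 9]
```

On Julia 1.x, `@view table[:, 1:8]` would avoid even that one copy.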


#10

Distributing this also seems like overkill and unlikely to give any speedup – not because of Julia but just because 1GB of data is really not very large these days and the overhead of splitting the data up and communicating between processes is unlikely to be worth any speedup. If there’s some computational kernel where threading might help, you could try Base.Threads.@threads on a for loop.
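To make the threading suggestion concrete, here is a hedged sketch on a recent Julia, with a made-up per-row scoring function standing in for the logistic-regression internals (none of these names come from the thread):

```julia
using Base.Threads
using LinearAlgebra  # for dot

# Hypothetical kernel: score each row independently, one row per task.
function score_rows(features::Matrix{Float64}, w::Vector{Float64})
    out = Vector{Float64}(undef, size(features, 1))
    @threads for i in 1:size(features, 1)
        # rows are independent, so iterations can run on different threads
        out[i] = 1 / (1 + exp(-dot(features[i, :], w)))
    end
    return out
end
```

Note that `@threads` only helps if Julia was started with more than one thread (e.g. `JULIA_NUM_THREADS=4`); the loop still produces the same result single-threaded.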


#11

If you really want the data as a DataFrame, just use DataFrames.readtable directly.


#12

Thanks very much for the detailed reply! Julia is said to perform very well on large datasets and for ML algorithms. So does the Julia language perform better than Scala in Spark?


#13

It would be great to be able to work with larger-than-memory datasets, letting Julia decide how to stream/chunk/reflow/map them between disk and memory.


#14

Check out https://github.com/joshday/OnlineStats.jl. Online algorithms are well suited for streaming data, or for when the data is too large to hold in memory.

As to the original question: I agree with Stefan’s earlier post, doing logistic regression in parallel is likely overkill for 1GB of data. The first couple of lines of the `Master()` function also consume a lot of unnecessary memory.
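The streaming pattern the package encourages looks roughly like the sketch below. `Mean` is used here only because it is the simplest statistic; the chunked `fit!` loop is the point, since each chunk could be read from disk and discarded:

```julia
using OnlineStats

o = Mean()
# Stand-ins for chunks that would be read from disk one at a time.
for chunk in ([1.0, 2.0], [3.0, 4.0])
    fit!(o, chunk)   # updates the running statistic in place
end
value(o)             # mean of all observations seen so far: 2.5
```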


#15

Hello

But can we use it for any kind of computation, or just for the small subset of functions already implemented there?
For example, can I use the MixedModels package with it, or do something similar?


#16

This is like asking “Does Scala do streaming data processing?”. The question doesn’t make sense for a programming language, it only makes sense for a computational framework. If you use Scala with an in-memory framework, then it will require you to load all data in memory. If you use Spark – which is a framework written in Scala – since it’s a framework designed for out-of-memory streamed computation, yes, you can do that. OnlineStats is a Julia package (read lightweight framework) for online computation which can do exactly what you’re asking about – e.g. logistic regression on extremely large data without loading it all into memory at once. You can also use Spark from Julia, which has access to everything built into Spark and allows you to write Spark jobs in Julia.


#17

I expected Julia to be both a language and a framework with packages.


#18

Julia is a language; there are frameworks and packages for it, such as OnlineStats and Spark.jl. Hopefully that helps.


#19

How can I fit a Mixed-effects model with OnlineStats?

For example, imagine the data from this example
https://dmbates.github.io/MixedModels.jl/latest/man/fitting/#A-simple-example-1
were 10000 times larger and didn’t fit in memory.

What would you write instead of this:

    fit!(lmm(@formula(Yield ~ 1 + (1 | Batch)), ds), true);