Why does DataFrames v0.21.2 (Julia v1.4.2) require more memory than the previous version?

I’ve updated to DataFrames v0.21.2 and Julia v1.4.2.

When I try to load a 9 GB CSV file, 32 GB of RAM is not enough.

Is that reasonable? Why does DataFrames consume so much memory?

Which CSV reader package are you using? I ask because @davidanthoff specifically mentioned in some of his YouTube tutorials that the Queryverse CSVFiles package doesn’t pre-read the whole file, so

using CSVFiles, DataFrames
load("my.csv") |> DataFrame

should use a reasonable amount of buffering. I don’t know if that’s the case for CSV.

I use the CSV.jl reader.

What do you suggest I use?

whoops, I edited my comment while you were posting yours… see above.

Yes, I was already trying it, and I can load the data now. Thank you.
I wonder why it uses twice as much memory, though: 18 GB of RAM for a 9 GB file.

My guess is the CSV parser reads the whole file into RAM and then parses it… which results in a lot more memory usage, basically linear in the size of the file. The CSVFiles version reads just a bunch of lines at a time, so it only uses a more or less constant size buffer regardless of the file size. This is just a guess though.

Yes, you are right.

In any case, right now 18 GB of RAM is being used for a single 9 GB file. That’s too much, I think.

You’re talking about after parsing, once it’s just sitting in the DataFrame? What happens when you run GC.gc()?

Also, what are the column types that load chose? If they’re Any then that could be an issue.
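For reference, a quick way to check which element types the parser ended up with (df here is just a placeholder for the loaded DataFrame):

eltype.(eachcol(df))    # one element type per column
describe(df, :eltype)   # the same information as a small summary table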

Yes, I am talking about the DataFrame, after parsing.

GC.gc() doesn’t help.

P.S. During parsing, RAM usage is even higher. =))

What are the data types? If it’s a file full of numbers but the parser decides the columns are Any, then you would expect maybe 64 bits for pointers plus 64 bits for the floats, i.e. twice the RAM that would be needed if it were able to figure out the types.
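As a minimal illustration of that effect (rough numbers, 64-bit system assumed):

x = rand(10^6)          # concrete Vector{Float64}: 8 bytes per element
y = Vector{Any}(x)      # same values, but every element is boxed behind a pointer
Base.summarysize(x)     # ≈ 8 MB
Base.summarysize(y)     # much larger: a pointer plus a heap-allocated box per element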

Most of the columns are Int64.

Oh, I understand. In the CSV each value is just a character, so as Int64 it ends up twice as large!!!

I’m 98% sure this is not the case. @quinnj has put in a lot of work to make CSV.jl as fast as possible, and I’m quite certain that stuff is getting parsed on the fly. Whether there’s another explanation for the memory use, I don’t know.

There’s something strange here. Generating a medium-sized dataframe,

julia> using CSV, DataFrames

julia> a = rand(10^7);

julia> Base.summarysize(a)
80000040

julia> df = DataFrame(a = a);

julia> Base.summarysize(df)
80000624

julia> CSV.write("test.csv", df)
"test.csv"

…then reading it back,

julia> df2 = CSV.read("test.csv")
10000000×1 DataFrame
│ Row      │ a          │
│          │ Float64    │
├──────────┼────────────┤
│ 1        │ 0.150381   │
│ 2        │ 0.159632   │
│ 3        │ 0.869726   │
 ...

julia> Base.summarysize(df2.a)
281297536

Why does the dataframe’s memory footprint inflate by 3.5x between being written and read from disk?

CSVFiles helps me load the data. Unfortunately, CSV can’t.

This is an interesting observation and may merit a topic of its own (unless the OP’s problem is similar and he is willing to use this as an MWE in the original post).

This problem doesn’t occur when you use CSV.File("test.csv") |> DataFrame.
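For example, repeating the summarysize check from above with that form (output omitted here; the exact number will vary):

julia> df3 = CSV.File("test.csv") |> DataFrame;

julia> Base.summarysize(df3.a)   # should be back in the ~80 MB range of the original vector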

So I guess this is technically fixed.

You might want to try reading the file in chunks using DataConvenience.jl:

using DataConvenience

# read all columns as String
for chunk in CsvChunkIterator(filepath)
  # chunk is a DataFrame where each column is a String
  # do something with chunk
end
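For instance, a rough sketch of streaming over the chunks without holding the whole table in memory (the path and the per-chunk work are just placeholders):

using DataConvenience, DataFrames

total_rows = 0
for chunk in CsvChunkIterator("my.csv")
    # each chunk is a DataFrame; do the real per-chunk work here
    global total_rows += nrow(chunk)
end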

Could you give us more information on how CSV fails? Have you tried using CSV.File rather than CSV.read?

It’s a table with 2000 columns of 0s and 1s.
I didn’t try CSV.File.
I always used CSV.read.

Right now I use CSVFiles.

Sorry for repeating the question.

But can you try

CSV.File("test.csv") |> DataFrame

instead of CSV.read("test.csv")?

If I recall correctly, @quinnj mentioned in a post that CSV.read is going to be deprecated soon and should not be used anymore.