Why does DataFrames v0.21.2 (Julia v1.4.2) require more memory than the previous version?

I’ve updated to DataFrames v0.21.2 and Julia v1.4.2.

When I try to load a 9 GB CSV file, 32 GB of RAM is not enough.

Is that reasonable? Why does DataFrames consume so much memory?

Which CSV reader package are you using? I ask because @davidanthoff specifically mentioned in some of his YouTube tutorials that the Queryverse CSVFiles package doesn’t pre-read the whole file, so

using CSVFiles, DataFrames
load("my.csv") |> DataFrame

should use a reasonable amount of buffering. I don’t know if that’s the case for CSV.

I use the CSV.jl reader.

What do you suggest I use?

whoops, I edited my comment while you were posting yours… see above.

Yes, I was already trying it, and I can load the data now. Thank you.
I wonder why it uses twice as much memory, though: 18 GB of RAM for a 9 GB file.

My guess is the CSV parser reads the whole file into RAM and then parses it… which results in a lot more memory usage, basically linear in the size of the file. The CSVFiles version reads just a bunch of lines at a time, so it only uses a more or less constant size buffer regardless of the file size. This is just a guess though.

Yes, you are right.

In any case, right now 18 GB of RAM is being used for a single 9 GB file. That’s too much, I think.

You’re talking about after parsing, once it’s just sitting in the DataFrame? What happens when you run GC.gc()?

Also, what are the column types that load chose? If they’re Any then that could be an issue.
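For reference, a quick way to check which element types the parser ended up with (df here is just a placeholder for the loaded DataFrame):

eltype.(eachcol(df))    # one element type per column
describe(df, :eltype)   # the same information as a small summary table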

Yes, I am talking about the DataFrame, after parsing.

GC.gc() doesn’t help.

P.S. During parsing, RAM usage is even higher. =))

What are the data types? If it’s a file full of numbers but the parser decides the columns are Any, then you would expect maybe 64 bits for pointers plus 64 bits for the floats, i.e. twice the RAM that would be needed if it were able to figure out the types.
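As a minimal illustration of that effect (rough numbers, 64-bit system assumed):

x = rand(10^6)          # concrete Vector{Float64}: 8 bytes per element
y = Vector{Any}(x)      # same values, but every element is boxed behind a pointer
Base.summarysize(x)     # ≈ 8 MB
Base.summarysize(y)     # much larger: a pointer plus a heap-allocated box per element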

Most of the columns are Int64.

Oh, I understand. In the CSV each value is just a character, so as Int64 it ends up twice as large!!!

I’m 98% sure this is not the case. @quinnj has put in a lot of work to make CSV.jl as fast as possible, and I’m quite certain that stuff is getting parsed on the fly. Whether there’s another explanation for the memory use, I don’t know.

There’s something strange here. Generating a medium-sized dataframe,

julia> using CSV, DataFrames

julia> a = rand(10^7);

julia> Base.summarysize(a)
80000040

julia> df = DataFrame(a = a);

julia> Base.summarysize(df)
80000624

julia> CSV.write("test.csv", df)
"test.csv"

…then reading it back,

julia> df2 = CSV.read("test.csv")
10000000×1 DataFrame
│ Row      │ a          │
│          │ Float64    │
├──────────┼────────────┤
│ 1        │ 0.150381   │
│ 2        │ 0.159632   │
│ 3        │ 0.869726   │
 ...

julia> Base.summarysize(df2.a)
281297536

Why does the dataframe’s memory footprint inflate by 3.5x between being written and read from disk?

CSVFiles helps me load the data. Unfortunately, CSV can’t.

This is an interesting observation and may merit a topic of its own (unless the OP’s problem is similar and he is willing to use this as an MWE in the original post).

This problem doesn’t occur when you use CSV.File("test.csv") |> DataFrame.
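For example, repeating the summarysize check from above with that form (output omitted here; the exact number will vary):

julia> df3 = CSV.File("test.csv") |> DataFrame;

julia> Base.summarysize(df3.a)   # should be back in the ~80 MB range of the original vector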

So I guess this is technically fixed.

You might want to try reading the file in chunks using DataConvenience.jl:

using DataConvenience

# read all columns as String
for chunk in CsvChunkIterator(filepath)
  # chunk is a DataFrame where each column is a String
  # do something with chunk
end
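For instance, a rough sketch of streaming over the chunks without holding the whole table in memory (the path and the per-chunk work are just placeholders):

using DataConvenience, DataFrames

total_rows = 0
for chunk in CsvChunkIterator("my.csv")
    # each chunk is a DataFrame; do the real per-chunk work here
    global total_rows += nrow(chunk)
end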

Could you give us more information on how CSV fails? Have you tried using CSV.File rather than CSV.read?

It’s a table with 2000 columns of 0s and 1s.
I didn’t try CSV.File.
I always used CSV.read.

Right now I use CSVFiles.

Sorry for repeating the question.

But can you try

CSV.File("test.csv") |> DataFrame

instead of CSV.read("test.csv")?

If I recall correctly, @quinnj mentioned in a post that CSV.read is going to be deprecated soon and should not be used anymore.