Updated CSV reader benchmarks?

An update for Julia 1.7.2 on Jupyter notebooks:

This wasn’t an exact duplicate since I used CSV.read("file_name.csv", SinkType) for convenience (I vaguely recall someone mentioning it does slightly better on memory since it avoids some shenanigans with column headers, which appears to be true from this benchmark):

I decided to try only using CSV.read and here’s what I got:

I think you should have started a new thread. Also, why are the allocations for pydf = pd.read_csv("test_data.csv") so low? I believe the size of the CSV is probably larger than that, so in the end you seem to be getting two distinct things, i.e., you are not getting a Julia object of the same format with both approaches.


Forgive me a bit, but what do you mean by “started a new thread”? :thinking: I’m still relatively new to the Discourse platform, so I’m not sure what you mean.

Also, why are the allocations for pydf = pd.read_csv("test_data.csv") so low? I believe the size of the CSV is probably larger than that, so in the end you seem to be getting two distinct things, i.e., you are not getting a Julia object of the same format with both approaches.

Ah, thanks for catching that. :+1: I think the notebook was truncating an error or something, because I got something different doing

@btime pydf = pd.read_csv("test_data.csv")

through the terminal in VS Code. It gave this:

C:\Users\kevin\.julia\conda\3\lib\site-packages\pandas\util\_decorators.py:311: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import 
or set low_memory=False.
  return func(*args, **kwargs)

Which repeated a bunch of times in the terminal until it spat out something similar to the benchmark time I posted before:

704.192 ms (5 allocations: 272 bytes)

The weird part is that the terminal displays part of the data correctly (but maybe it’s defaulting to the last valid input?).

In any case, I adjusted the pd.read_csv("test_data.csv") to

julia> @btime pydf = pd.read_csv("test_data.csv", low_memory=false)
  944.290 ms (19 allocations: 1.09 KiB)
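For what it’s worth, low_memory=False isn’t the only way to quiet that warning: the warning message itself suggests pinning the dtype of the offending column. A minimal pandas-side sketch (using a tiny in-memory CSV as a stand-in for test_data.csv, with a hypothetical mixed column "c"):

```python
import io
import pandas as pd

# Small stand-in for test_data.csv: column "c" mixes ints and strings,
# which is the kind of thing that triggers DtypeWarning on large files.
csv_text = "a,b,c\n1,x,10\n2,y,oops\n3,z,30\n"

# Letting pandas infer: the mixed column comes back with dtype object.
df = pd.read_csv(io.StringIO(csv_text))

# Pinning the dtype up front, as the warning message suggests,
# removes the ambiguity without needing low_memory=False.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"c": str})

print(df["c"].dtype)           # object
print(df_typed["c"].iloc[0])   # "10" (a string now, not an int)
```

On a file small enough to fit in one parsing chunk you won’t see the warning at all, which is presumably why it only shows up on the real dataset.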

The typeof(pd.read_csv("test_data.csv", low_memory=false)) returns PyObject, which makes sense since Julia is reaching into Python with the PyCall.jl package.  It is still a bit of a mystery why the allocations and memory are so low… Maybe that’s an effect of using the @btime macro on code that isn’t Julia? It seems plausible that @btime isn’t made to track memory allocations of other languages that Julia invokes, but that’s just a guess. :man_shrugging:

It means “created a new topic/post” instead of posting in this thread. The old discussion is one year old and was already finished. If you found something new related to it, it would be better to create a new thread and link to this one instead. When you replied to this topic, all the people from the previous conversation were notified, which is considered a bit rude.
It is possible that the admins will do this for you (i.e., move all the new posts to a new thread). Posting in an old thread is more usual when someone reports, for example, that the bug found was finally fixed in release x.y.z of package A, which may be of interest to the people in the previous discussion.

The weird part is that the terminal displays part of the data correctly (but maybe it’s defaulting to the last valid input?).

But it has all the data?

The typeof(pd.read_csv("test_data.csv", low_memory=false)) returns PyObject which makes sense since Julia is reaching into Python with the PyCall.jl package. It is still a bit of a mystery why the allocations and memory are so low…

It is possible that the memory is allocated elsewhere (i.e., in a Python environment that Julia has to communicate with to get the info), or that the Python solution is lazy, that is, it does not read the file but instead just creates a pointer to it and fetches information from the file directly when asked. How large is the CSV file? Did you re-run the CSV.File line since you changed the CSV file?
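One way to check where the memory actually went is to ask pandas directly: DataFrame.memory_usage(deep=True) reports the bytes the frame holds in the Python heap, which is memory a Julia-side benchmark would plausibly never see. A sketch using a generated in-memory CSV as a stand-in for the real file:

```python
import io
import pandas as pd

# Stand-in for test_data.csv; in the real case this would be the full file.
csv_text = "a,b\n" + "\n".join(f"{i},row_{i}" for i in range(1000))

df = pd.read_csv(io.StringIO(csv_text))

# memory_usage(deep=True) counts the actual bytes held by each column,
# including the per-row Python string objects in column "b" -- all of it
# living in the Python process, not in Julia's GC-tracked heap.
total_bytes = df.memory_usage(deep=True).sum()
print(f"{total_bytes} bytes held on the pandas side")
```

Even this toy frame reports far more than the 1.09 KiB @btime showed, which supports the idea that the Julia-side allocation count only covers the PyObject wrapper, not the data itself.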
