Updated CSV reader benchmarks?

An update for Julia 1.7.2 on Jupyter notebooks:

This wasn’t an exact duplicate since I used CSV.read("file_name.csv", SinkType) for convenience (I vaguely recall someone mentioning it does slightly better on memory since it avoids some shenanigans with column headers, which appears to be true from this benchmark):

I decided to try only using CSV.read and here’s what I got:

I think you should have started a new thread. Also, why are the allocations for pydf = pd.read_csv("test_data.csv") so low? I believe the size of the CSV is probably larger than that, so in the end you seem to be getting two distinct things, i.e., you are not getting a Julia object of the same format with both approaches.


Forgive me a bit, but what do you mean by “started a new thread”? :thinking: I’m still relatively new to the Discourse platform, so I’m not sure what you mean.

Also, why are the allocations for pydf = pd.read_csv("test_data.csv") so low? I believe the size of the CSV is probably larger than that, so in the end you seem to be getting two distinct things, i.e., you are not getting a Julia object of the same format with both approaches.

Ah, thanks for catching that. :+1: I think the notebook was truncating an error or something, because I got something different doing

@btime pydf = pd.read_csv("test_data.csv")

through the terminal in VS Code. It gave this:

C:\Users\kevin\.julia\conda\3\lib\site-packages\pandas\util\_decorators.py:311: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import 
or set low_memory=False.
  return func(*args, **kwargs)

Which repeated a bunch of times in the terminal until it spat out something similar to the benchmark time I posted before:

704.192 ms (5 allocations: 272 bytes)

The weird part is that the terminal displays part of the data correctly (but maybe it’s defaulting to the last valid input?).

In any case, I adjusted the pd.read_csv("test_data.csv") to

julia> @btime pydf = pd.read_csv("test_data.csv", low_memory=false)
  944.290 ms (19 allocations: 1.09 KiB)
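For what it’s worth, low_memory=False isn’t the only way to quiet that warning: the warning message itself suggests pinning the dtype of the offending column. A minimal pandas-side sketch (using a tiny in-memory CSV as a stand-in for test_data.csv, with a hypothetical mixed column "c"):

```python
import io
import pandas as pd

# Small stand-in for test_data.csv: column "c" mixes ints and strings,
# which is the kind of thing that triggers DtypeWarning on large files.
csv_text = "a,b,c\n1,x,10\n2,y,oops\n3,z,30\n"

# Letting pandas infer: the mixed column comes back with dtype object.
df = pd.read_csv(io.StringIO(csv_text))

# Pinning the dtype up front, as the warning message suggests,
# removes the ambiguity without needing low_memory=False.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"c": str})

print(df["c"].dtype)           # object
print(df_typed["c"].iloc[0])   # "10" (a string now, not an int)
```

On a file small enough to fit in one parsing chunk you won’t see the warning at all, which is presumably why it only shows up on the real dataset.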

The typeof(pd.read_csv("test_data.csv", low_memory=false)) returns PyObject, which makes sense since Julia is reaching into Python with the PyCall.jl package.  It is still a bit of a mystery why the allocations and memory are so low… Maybe that’s an effect of using the @btime macro on code that isn’t Julia? It seems plausible that @btime isn’t made to track memory allocations of other languages that Julia invokes, but that’s just a guess. :man_shrugging:

It means “created a new topic/post” instead of posting in this thread. The old discussion is one year old and was already finished. If you found something new related to it, it would be better to create a new thread and link to this one instead. When you replied to this topic, all the people from the previous conversation were notified, which is considered a bit rude.
It is possible that the admins will do this for you (i.e., move all the new posts to a new thread). Posting in an old thread is more usual when someone reports, for example, that the bug found was finally fixed in release x.y.z of package A, which may be of interest to the people in the previous discussion.

The weird part is that the terminal displays part of the data correctly (but maybe it’s defaulting to the last valid input?).

But it has all the data?

The typeof(pd.read_csv("test_data.csv", low_memory=false)) returns PyObject which makes sense since Julia is reaching into Python with the PyCall.jl package. It is still a bit of a mystery why the allocations and memory are so low…

It is possible that the memory is allocated elsewhere (i.e., in a Python environment that Julia has to communicate with to get the info), or that the Python solution is lazy, that is, it does not read the file but instead just creates a pointer to it and fetches information from the file directly when asked. How large is the CSV file? Did you re-run the CSV.File line since you changed the CSV file?
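One way to check where the memory actually went is to ask pandas directly: DataFrame.memory_usage(deep=True) reports the bytes the frame holds in the Python heap, which is memory a Julia-side benchmark would plausibly never see. A sketch using a generated in-memory CSV as a stand-in for the real file:

```python
import io
import pandas as pd

# Stand-in for test_data.csv; in the real case this would be the full file.
csv_text = "a,b\n" + "\n".join(f"{i},row_{i}" for i in range(1000))

df = pd.read_csv(io.StringIO(csv_text))

# memory_usage(deep=True) counts the actual bytes held by each column,
# including the per-row Python string objects in column "b" -- all of it
# living in the Python process, not in Julia's GC-tracked heap.
total_bytes = df.memory_usage(deep=True).sum()
print(f"{total_bytes} bytes held on the pandas side")
```

Even this toy frame reports far more than the 1.09 KiB @btime showed, which supports the idea that the Julia-side allocation count only covers the PyObject wrapper, not the data itself.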
