CSV reading is much slower than in other languages

Hello,

The Julia language claims to be fast, but in my own practice I have found it hard to convince myself of this, and now I have run into a terrible problem with Julia's speed again. I hope someone here can give me some advice.

Let’s say I have a file, mydata.tsv, with a size of 2 GB; the first 2 columns are strings and the other columns are floats.

When I read it in R, I use the data.table::fread function, and it only takes 3 seconds to read the data into a data frame.

Sys.time(); df <- fread("mydata.tsv", sep = "\t"); Sys.time()

But it takes 180 seconds to do the same job with CSV.jl and DataFrames.jl in Julia.

@time df = CSV.read("mydata.tsv", DataFrame; delim = "\t")

So, is there a problem with my Julia code? No offense, but as a user I was really disappointed.

What version of Julia are you running? Have you read through the Performance Tips? In particular, one tip that might apply here is that the first time you call CSV.read it has to compile. How long does it take to read the CSV if you call it again?

Also, if you expect to have to read/write this file a lot, I might suggest a more performance-amenable format like Arrow.
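For instance, a minimal round-trip sketch (assuming CSV.jl and Arrow.jl are installed, and using the mydata.tsv file from above) could look like this:

```julia
using CSV, DataFrames, Arrow

# One-time cost: parse the TSV once...
df = CSV.read("mydata.tsv", DataFrame; delim = "\t")

# ...then store it in the Arrow format.
Arrow.write("mydata.arrow", df)

# Later sessions can reload it almost instantly; Arrow.Table
# memory-maps the file rather than re-parsing text.
df2 = DataFrame(Arrow.Table("mydata.arrow"))
```

After the one-time conversion, subsequent loads skip text parsing entirely.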

The Julia language claims to be fast, but in my own practice I have found it hard to convince myself of this, and now I have run into a terrible problem with Julia's speed again

As a response to your more abstract concern: I find Julia is indeed incredibly fast for in-memory computation and long-running hot loops, but there are still occasional pieces of the ecosystem that remain slow, in my experience particularly in “plumbing” like IO and networking. It is possible you ran into such a piece (your problem is possibly related to this known issue). This is not for lack of potential in the language, only for lack of developer resources to design and create performant implementations.

5 Likes

You should share a randomly generated sample of the file if you want others to look into it. There is a lack of developers in this area, partly because Julia is such a nice glue language that you can always use PythonCall.jl for the IO part.

CSV.jl is super fast, but it can take a little while to compile on first use, so the first read can be slow.
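To make that effect concrete, here is a Base-only sketch (no packages needed, `f` is just a stand-in function) of the same first-call compilation cost that CSV.read pays on its first invocation:

```julia
# Any Julia function is JIT-compiled on its first call for a given
# argument type; subsequent calls reuse the compiled code.
f(x) = sum(abs2, x)

t1 = @elapsed f(rand(1000))  # includes compilation time
t2 = @elapsed f(rand(1000))  # already compiled

println("first call: $t1 s, second call: $t2 s")
```

The same reasoning applies to CSV.read: time it at least twice before drawing conclusions.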

Hi, @tiZ ,

As adienes mentioned, it is always a good idea to check out the Performance Tips, as every language has its idiosyncrasies. However, you are completely right that 180 seconds is incredibly unreasonable, and something more must be happening here.

For starters, I tried to reproduce your issue, but I actually see Julia being much faster than what you observed. Here is my quick test script. Could you please verify that the script I am running also runs fast for you? That would quickly establish that nothing is truly broken with your Julia install.

julia> using CSV, DataFrames

julia> N = 40000000;
       df = DataFrame(
           c1 = string.(rand(Int, N), base = 20),
           c2 = string.(rand(Int, N), base = 20),
           c3 = rand(N),
           c4 = rand(N));
       CSV.write("testfile.tsv", df, delim = "\t")

 shell> ls -lh testfile.tsv
-rw-r--r-- 1 stefan stefan 2.7G Jun 20 10:17 testfile.tsv

julia> @time CSV.read("testfile.tsv",DataFrame,delim="\t");
  2.698588 seconds (3.77 M allocations: 3.566 GiB, 229.66% compilation time: 5% of which was recompilation)

julia> @time CSV.read("testfile.tsv",DataFrame,delim="\t");
  1.163380 seconds (3.84 k allocations: 3.315 GiB)

As you can see, I seem to be reading a 2.7 GB file in 1.16 s.

After this first step, I would be happy to spend some more time on trying to debug your particular problem, but that would require some more information on your end. A few things to consider:

  • You are talking about relatively big files. Could you confirm that this is not an issue related to your OS caching the file in memory? E.g., what happens if you run the CSV read command twice in a row? That way the OS filesystem cache will be warm and you will truly be testing only the performance of the CSV library.
  • Julia has a rather special compilation model that we should account for. Just to double-check whether first-call compilation latency affects you (it really should not, at this scale), could you always run your @time twice?
  • Could you share your exact versions? You can use versioninfo() and Pkg.status() for that.
  • Could you describe the data you are reading in a bit more detail? Could you make a fake-data generator like the one in my example, so that we can test together?
15 Likes

Although, especially on 1.9, that shouldn’t matter so much.

Here’s a 2-string, 3-float-column CSV with 30 million rows (~2 GB):

julia> using CSV, DataFrames, Random

julia> n = 30_000_000;

julia> CSV.write("out.csv", (x1 = [randstring(5) for _ ∈ 1:n], x2 = [randstring(5) for _ ∈ 1:n], x3 = rand(n), x4 = rand(n), x5 = rand(n)));

julia> filesize("out.csv")/1e6
2094.284949

julia> @time CSV.read("out.csv", DataFrame);
 15.820271 seconds (37.30 k allocations: 1.462 GiB, 2.80% gc time, 0.56% compilation time)

julia> @time CSV.read("out.csv", DataFrame);
 12.807349 seconds (1.21 k allocations: 1.460 GiB, 0.06% gc time)

That’s on a single thread. Then with julia -t auto:

julia> Threads.nthreads()
8

julia> using CSV, DataFrames

julia> @time CSV.read("out.csv", DataFrame);
  4.906700 seconds (1.64 M allocations: 1.456 GiB, 0.58% gc time, 122.26% compilation time)

julia> @time CSV.read("out.csv", DataFrame);
  3.687869 seconds (2.98 k allocations: 1.349 GiB, 0.27% gc time)

So 3 seconds is about the right order of magnitude on a reasonable laptop, provided you use multiple threads.
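As a side note, if restarting Julia with more threads is inconvenient, CSV.jl also exposes an `ntasks` keyword controlling how many concurrent parsing tasks it uses (a sketch, assuming a recent CSV.jl; note it only helps when Julia actually has multiple threads to run those tasks on):

```julia
using CSV, DataFrames

# Explicitly ask for 8 parsing tasks; the default is based on
# Threads.nthreads(), so with `julia -t 1` this buys you nothing.
df = CSV.read("out.csv", DataFrame; ntasks = 8)
```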

Edited to add: for some reason, when I read the file with fread in R, I see it take about 100 seconds :person_shrugging: (probably a threading issue as well; I don’t know much about fread).

5 Likes

Hi @Krastanov, thanks for the suggestion. I tried to do a more in-depth comparison, but the results confused me even more: it seems the speed depends on the shape of the data.

I compared the time to read two datasets in R vs. Julia: one is testfile.tsv, generated following your guide, with a size of 4e7 × 4, and the other has a size of 14699 × 37070.

data demo

using Random, CSV, DataFrames, Tables
Random.seed!(42)
N = 40000000;
df = DataFrame(
    c1 = string.(rand(Int, N), base = 20),
    c2 = string.(rand(Int, N), base = 20),
    c3 = rand(N),
    c4 = rand(N));
CSV.write("testfile.tsv", df, delim = "\t") # 2.7G

df = rand(Float16, 14699, 37070)
CSV.write("testfile2.tsv", Tables.table(df), delim = "\t") # 3.5G

then, in the R we have:

library("data.table")
setDThreads(12)

file1 = "testfile.tsv"
file2 = "testfile2.tsv"

readfile <- function(file) {
  a <- Sys.time() 
  df <- fread(file,sep="\t")
  b <- Sys.time()
  print(difftime(b,a))
}

for (i in 1:4) {
  readfile(file1)
  readfile(file2)
}

and now for Julia, we have:

# julia -t 12
using CSV, DataFrames

file1 = "testfile.tsv";
file2 = "testfile2.tsv";

# ls -sh $file1 $file2

function readfile(file)
    df = CSV.read(file, DataFrame; delim = "\t")
end

for i in 1:4
    @time readfile(file1)
    @time readfile(file2)
end

Here are the comparison results:


Here, “R1” means running R’s fread() with a single thread, and “R12” means running it with 12 threads, the same as Julia.

For file1, Julia is about 4× faster than R’s fread(), but for file2 I have no idea why the difference is so big 😂.

2 Likes

Great, it seems we do not have to worry about the first test file (the tall one with only a few columns, where Julia seems to be plenty fast), but let’s try to figure out whether we can make the wide one fast as well.

First, here are my results running on your testfile2.tsv (after running the read a second time, to ensure there is no slowdown due to cold OS file caches, etc.):

# 12 threads
julia> @time a = CSV.read("testfile2.tsv", DataFrame; delim = "\t");
  4.790267 seconds (15.84 M allocations: 4.713 GiB)

# 1 thread
julia> @time a = CSV.read("testfile2.tsv", DataFrame; delim = "\t");
 17.789729 seconds (6.65 M allocations: 4.646 GiB, 0.06% gc time)

Using R took 1.2 s and 5.6 s respectively (so R is about a factor of 4 faster), but that still does not explain the much larger differences in performance you observed.

In both cases this is very far from the 200 seconds you observed. It seems you have already taken care of ensuring that your tests are not affected by compilation latency or OS filesystem caches. Thus, let’s try to focus on any version differences between our systems. I doubt this is the source of the slowdown, but after we rule it out I can ping some of the DataFrames developers, who take reports of slow performance very seriously and will be interested in debugging this.

Anyway, my system is as follows:

julia> versioninfo()
Julia Version 1.10.0-DEV.1524
Commit 427b1236f13 (2023-06-19 23:16 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores

julia> Pkg.status()
Status `/tmp/Project.toml`
  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.5.0
  [bd369af6] Tables v1.10.1

Could you share the same information about your system? If you are on different versions, I can make sure to test on the same versions as you. If it turns out you have very outdated libraries, that might also be an issue.

As an aside, people really should stop using CSV (plain text) for datasets this large; use Arrow/Feather!

You can use Arrow.jl in Julia and pyarrow in Python, and I’m sure there’s an R integration.

1 Like

Yes, Arrow is wonderful. But I must read the plain text once before I can store it as HDF5 or Arrow. :joy:

Thanks @Krastanov and all the friends here. With your kind help, I have now found the reason for the poor performance in my case.

In short, I was running CSV.jl with just 1 thread, while R’s fread() was using multiple threads.

Details below:
In fact, I had misunderstood how threads work in Julia. When I start a Julia session with

julia -t 12

I can check the number of threads in the Julia REPL, and everything looks fine:

julia> Threads.nthreads()
12

But this number can be misleading: in my case I run Julia on an HPC cluster with SLURM, and when I submitted the job with default parameters, only one CPU core was allocated to it, so nthreads() did not reflect the number of cores actually available to the Julia session.

In this case, if I run

# julia -t auto
julia> Threads.nthreads()

it tells me that the number of threads is only 1.
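For anyone hitting the same trap, here is a quick Base-only sanity check comparing the thread count Julia was started with against the logical cores the OS reports (note: on SLURM, Sys.CPU_THREADS may still report the whole node rather than your allocation, so this is a rough sketch, not a scheduler-aware check):

```julia
nt = Threads.nthreads()   # what -t / JULIA_NUM_THREADS gave you
nc = Sys.CPU_THREADS      # logical cores the OS reports

println("Julia threads: $nt, logical CPU cores: $nc")
if nt > nc
    @warn "More Julia threads than logical cores; they will contend for CPU time."
end
```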

Finally, I reran the comparison with true multi-threading in Julia.

Here are the results (R5: R with 5 threads; j5: Julia with 5 threads; and so on):

Thank you again. I hope you all have a nice day.

18 Likes

Some additional information on this topic: for situations where you want to double-check how many logical CPU cores (including hyperthreading) are available to you, you can use

julia> Sys.CPU_THREADS
32

help?> Sys.CPU_THREADS
  Sys.CPU_THREADS::Int

  The number of logical CPU cores available in the system, i.e. the number of
  threads that the CPU can run concurrently. Note that this is not necessarily
  the number of CPU cores, for example, in the presence of hyper-threading
  (https://en.wikipedia.org/wiki/Hyper-threading).

  See Hwloc.jl or CpuId.jl for extended information, including number of
  physical cores.

That is different from the number of threads that Julia will attempt to run on the available cores (which can be bigger or smaller), which is set with -t and can be checked with Threads.nthreads().

Edit: I had a typo in the thread flag above. Thanks for the correction!

3 Likes

And there is also the quite awesome ThreadPinning.jl package, which can be fun to use when you want low-level control over how threads are allocated to your cores: GitHub - carstenbauer/ThreadPinning.jl: Readily pin Julia threads to CPU processors

I believe you mean -t.

1 Like