Hi, I have a moderately large text file with 1305639×114 entries in a matrix, with the values on each line separated by runs of whitespace (not always fixed width). It's about 1.5 GB or so with the spaces.
On Julia 1.8.2 the file reads fine with vanilla readdlm from DelimitedFiles.jl, but it takes about 100 s:
@time A = readdlm("myfile.dat", Float64)
106.114928 seconds (446.89 M allocations: 14.807 GiB, 1.00% gc time, 0.67% compilation time)
1305639×114 Matrix{Float64}:
.
But because this is the first step on a cluster with many CPUs already sitting idle, this wastes a lot of CPU time. I have found the following approach using DataFrames.jl and CSV.jl to be a whole lot faster.
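Roughly something like this (the exact keyword options are my guesses for a headerless file with runs of spaces between the columns):

```julia
using CSV, DataFrames

# Read the headerless, whitespace-separated file, then convert to a plain matrix.
# delim=' ' plus ignorerepeated=true collapses the repeated spaces between columns.
@time A = Matrix{Float64}(CSV.read("myfile.dat", DataFrame;
    header=false, delim=' ', ignorerepeated=true, types=Float64))
```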
But I don’t want to add CSV and DataFrames as package dependencies, since that means more dependencies for me to keep an eye on. Or am I just being finicky / is there a better way than shown above? Thanks.
One thing worth mentioning is that if you can avoid working with text files for matrices, you can expect pretty major speedups. Binary formats such as Arrow are often ~100x faster to read and write.
I absolutely agree, but unfortunately I need to write something that reads a horrible legacy format of instrument data for a whole lot of people. I guess one thing to do could be to read the text in and write it out as a binary before I start the cluster job … but I would like to avoid duplicating the input data if possible.
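If I did go that route, I imagine it would look roughly like this (Arrow.jl assumed; the file names are just placeholders):

```julia
using DelimitedFiles, Arrow, Tables

# One-off slow read of the legacy text format, then write a binary copy once.
A = readdlm("myfile.dat", Float64)
Arrow.write("myfile.arrow", Tables.table(A))

# Later, in each cluster job, read the binary copy instead of the text file.
A = Tables.matrix(Arrow.Table("myfile.arrow"))
```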
This isn’t the first call, so I’ve taken out the compilation overhead (although I’m on Julia 1.9, so TTFX shouldn’t be a big issue anyway). (Also, the above writes Float64 numbers, so the file is 2.6 GB rather than 1.5 GB, but that won’t make a huge difference to the timings.)
Ah sorry, I think I had already done using DataFrames in the session before I tried this. Tables is already a dependency of CSV but not re-exported, so you’d have to do CSV.Tables.matrix.
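Something like this, roughly (the keyword options are guesses for the headerless, repeated-space layout described above):

```julia
using CSV

# Go through Tables.jl (already a dependency of CSV.jl) instead of DataFrames.
@time A = CSV.Tables.matrix(CSV.File("myfile.dat";
    header=false, delim=' ', ignorerepeated=true, types=Float64))
```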