Best practice for reading large matrix with repeated spaces as text

Hi, I have a moderately large text file containing a 1305639×114 matrix, with entries separated by repeated whitespace (not always fixed width) on each line. The file is about 1.5 GB including the spaces.

On Julia 1.8.2 the file reads fine with plain readdlm from DelimitedFiles.jl, but it takes about 100 s.

@time A = readdlm("myfile.dat", Float64)
106.114928 seconds (446.89 M allocations: 14.807 GiB, 1.00% gc time, 0.67% compilation time)
1305639×114 Matrix{Float64}:
⋮

But because this is the first step on a cluster with many CPUs already running idle, this wastes a lot of CPU time. I have found the following, using DataFrames.jl and CSV.jl, to be a whole lot faster:

A = Array{Float64, 2}(CSV.File("myfile.dat"; ignorerepeated=true, types=Float64, header=false, delim=' ')|>DataFrame)
42.190986 seconds (470.89 k allocations: 3.345 GiB, 35.74% compilation time)
1305639×114 Matrix{Float64}
⋮

But I don’t want to add CSV and DataFrames as package dependencies, as this means more dependencies for me to keep track of. Or am I just being finicky? Is there a better way than shown above? Thanks.

One thing worth mentioning is that if you can avoid working with text files for matrices, you can expect pretty major speedups. Binary formats such as Arrow are often ~100x faster to read and write.


I absolutely agree, but unfortunately I need to write something that reads a horrible legacy format of instrument data for a whole lot of people. I guess one thing to do could be to read in the text and write it out as binary before I start the cluster job … but I would like to avoid duplicating input data if possible.

If you are going to read it more than once, I’d recommend reading as text and writing binary first. Otherwise I’d just use the CSV solution.
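A minimal sketch of the "read as text once, write binary, read binary afterwards" idea, using only Base Julia. The function names write_binary and read_binary and the file layout (two Int64 dimensions followed by the raw Float64 data) are assumptions made for illustration, not an established format. In real use the matrix would come from readdlm("myfile.dat", Float64) or the CSV route rather than from rand.

```julia
# Hypothetical helper: store a Float64 matrix as raw binary.
# Layout (an assumption for this sketch): nrows and ncols as Int64,
# then the matrix data in column-major order, native endianness.
function write_binary(path::AbstractString, A::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(A, 1)))
        write(io, Int64(size(A, 2)))
        write(io, A)
    end
    return path
end

# Hypothetical helper: read back a matrix written by write_binary.
function read_binary(path::AbstractString)
    open(path, "r") do io
        nrows = read(io, Int64)
        ncols = read(io, Int64)
        A = Matrix{Float64}(undef, nrows, ncols)
        read!(io, A)  # fills A directly from the file
        return A
    end
end

# Round trip on a small stand-in matrix.
A = rand(4, 3)
path = write_binary(tempname(), A)
B = read_binary(path)
```

The file is not portable between machines with different endianness, and it does duplicate the input data on disk, which is the trade-off mentioned above.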

To add to this: you just need CSV.jl. You do not need to add DataFrames.jl as a dependency.

Here’s how you might do it without DataFrames:

julia> using CSV, DelimitedFiles

julia> writedlm("myfile.dat", rand(1_300_000, 114));

julia> @time Tables.matrix((CSV.File("myfile.dat"; ignorerepeated = true, types = Float64, header = false, delim = '\t')));
 15.626529 seconds (48.21 k allocations: 2.209 GiB, 0.59% gc time)

This isn’t the first call, so I’ve taken out the compilation overhead (although I’m on Julia 1.9, so TTFX shouldn’t be a big issue anyway). Also, the above writes Float64 numbers, so the file is 2.6 GB rather than 1.5 GB, but that won’t make a huge difference to the timings.


I see, but could you please clarify? I can’t do

julia> using CSV

julia> data = 
       """
       1.0 0.0 1.0
       0.0 1.0 0.0
       0.0 0.0 1.0
       """
"1.0 0.0 1.0\n0.0 1.0 0.0\n0.0 0.0 1.0\n"

julia> CSV.File(IOBuffer(data), header=false)|>DataFrame|>Matrix
ERROR: UndefVarError: DataFrame not defined
Stacktrace:
 [1] top-level scope
   @ REPL[3]:1

as this does not work if I don’t do a using DataFrames first.

This is indeed fast. However, I seem to need to do using Tables to get this to work on Julia 1.8.2. Won’t that add Tables as a dependency then?

Ah sorry, I think I had already done using DataFrames in the session before I tried this. Tables is already a dependency of CSV but is not re-exported, so you’d have to do CSV.Tables.matrix.
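For the small example from earlier in the thread, a sketch of that approach might look like the following. It assumes CSV.jl is installed and that Tables is reachable as CSV.Tables, as described above. The keyword arguments are the ones already used in this thread.

```julia
using CSV

data = """
1.0 0.0 1.0
0.0 1.0 0.0
0.0 0.0 1.0
"""

# Parse space-separated Float64 values, treating runs of spaces as one delimiter.
f = CSV.File(IOBuffer(data); header = false, delim = ' ',
             ignorerepeated = true, types = Float64)

# Tables is a dependency of CSV but is not re-exported, so qualify it via CSV.
A = CSV.Tables.matrix(f)
```

This needs only using CSV at the top, with neither DataFrames nor a separate using Tables.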