Best practice for reading large matrix with repeated spaces as text

sparrowhawk · April 20, 2023, 2:43am

Hi, I have a moderately large text file containing a 1305639×114 matrix, with the entries on each line separated by repeated whitespace (not always fixed width). It is about 1.5 GB including the spaces.

On Julia 1.8.2, the file reads fine with vanilla readdlm from DelimitedFiles.jl, but it takes over 100 s.

@time A = readdlm("myfile.dat", Float64)
106.114928 seconds (446.89 M allocations: 14.807 GiB, 1.00% gc time, 0.67% compilation time)
1305639×114 Matrix{Float64}:
.

But because this is the first step on a cluster with many CPUs already running idle, this wastes a lot of CPU time. I have found the following, using DataFrames.jl and CSV.jl, to be a whole lot faster:

@time A = Array{Float64, 2}(CSV.File("myfile.dat"; ignorerepeated=true, types=Float64, header=false, delim=' ')|>DataFrame)
42.190986 seconds (470.89 k allocations: 3.345 GiB, 35.74% compilation time)
1305639×114 Matrix{Float64}
.

But I don’t want to add CSV and DataFrames as package dependencies as this means more dependency alertness on my part. Or am I just being finicky / is there a better way than shown above? Thanks.
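For reference, a dependency-free baseline can be written with Base alone. The sketch below is not from the thread and has not been benchmarked against readdlm or CSV.jl, so it may well be slower than either; the function name read_whitespace_matrix is hypothetical. It assumes every non-blank line holds the same number of Float64-parseable tokens.

```julia
# Hypothetical helper (not from the thread): parse a whitespace-separated
# numeric text file into a Matrix{Float64} using only Base.
function read_whitespace_matrix(path::AbstractString)
    vals = Float64[]
    ncols = 0
    nrows = 0
    for line in eachline(path)
        tokens = split(line)          # splits on any run of whitespace
        isempty(tokens) && continue   # skip blank lines
        if ncols == 0
            ncols = length(tokens)    # first data row fixes the column count
        elseif length(tokens) != ncols
            error("row $(nrows + 1) has $(length(tokens)) entries, expected $ncols")
        end
        for tok in tokens
            push!(vals, parse(Float64, tok))
        end
        nrows += 1
    end
    # vals holds the data row by row; Julia arrays are column-major,
    # so reshape to ncols×nrows and then swap the dimensions.
    return permutedims(reshape(vals, ncols, nrows))
end
```

Calling read_whitespace_matrix("myfile.dat") would then return the matrix directly, at the cost of maintaining the parsing code yourself.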

Oscar_Smith · April 20, 2023, 2:56am

One thing worth mentioning is that if you can avoid working with text files for matrices, you can expect pretty major speedups. Binary formats such as Arrow are often ~100x faster to read and write.

sparrowhawk · April 20, 2023, 3:09am

I absolutely agree, but unfortunately I need to write something that reads a horrible legacy format of instrument data for a whole lot of people. I guess one thing to do could be to read in the text and write it out as binary before I start the cluster job … but I would like to avoid duplicating input data if possible.

Oscar_Smith · April 20, 2023, 3:33am

If you are going to read it more than once, I’d recommend reading as text and writing binary first. Otherwise I’d just use the CSV solution.
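A minimal sketch of the "read as text once, write binary" idea, using only Base I/O rather than Arrow. The helper names save_matrix and load_matrix and the file layout (two Int64 dimensions followed by the raw Float64 data) are assumptions made for illustration, not something proposed in the thread. The raw bytes are written in the machine's native byte order, so the file is only portable between machines with the same endianness.

```julia
# Hypothetical helpers: cache a Float64 matrix as raw binary so later runs
# can skip text parsing. Layout: nrows (Int64), ncols (Int64), then the data.
function save_matrix(path::AbstractString, A::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(A, 1)))
        write(io, Int64(size(A, 2)))
        write(io, A)                  # raw column-major bytes
    end
    return path
end

function load_matrix(path::AbstractString)
    open(path, "r") do io
        nrows = read(io, Int64)
        ncols = read(io, Int64)
        A = Matrix{Float64}(undef, nrows, ncols)
        read!(io, A)                  # fill A directly from the file
        return A
    end
end
```

The text file would be parsed once (for example with the CSV.jl call from this thread), passed to save_matrix, and every subsequent job would call load_matrix on the binary copy instead.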

bkamins · April 20, 2023, 6:58am

To add to this: you just need CSV.jl. You do not need to add DataFrames.jl as a dependency.

nilshg · April 20, 2023, 8:35am

Here’s how you might do it without DataFrames:

julia> using CSV, DelimitedFiles

julia> writedlm("myfile.dat", rand(1_300_000, 114));

julia> @time Tables.matrix((CSV.File("myfile.dat"; ignorerepeated = true, types = Float64, header = false, delim = '\t')));
 15.626529 seconds (48.21 k allocations: 2.209 GiB, 0.59% gc time)

This isn’t the first call, so I’ve taken out the compilation overhead (although I’m on Julia 1.9, so TTFX shouldn’t be a big issue anyway). Also, the above writes Float64 numbers, so the file is 2.6 GB rather than 1.5 GB, but that won’t make a huge difference to the timings.

sparrowhawk · April 20, 2023, 8:49am

I see, but could you please clarify? I can’t do

julia> using CSV

julia> data = 
       """
       1.0 0.0 1.0
       0.0 1.0 0.0
       0.0 0.0 1.0
       """
"1.0 0.0 1.0\n0.0 1.0 0.0\n0.0 0.0 1.0\n"

julia> CSV.File(IOBuffer(data), header=false)|>DataFrame|>Matrix
ERROR: UndefVarError: DataFrame not defined
Stacktrace:
 [1] top-level scope
   @ REPL[3]:1

as this does not work if I don’t do a using DataFrames first.

sparrowhawk · April 20, 2023, 8:57am

This is indeed fast. However, I seem to need to do using Tables to get this to work on Julia 1.8.2. Won’t that add Tables as a dependency then?

nilshg · April 20, 2023, 9:00am

Ah sorry, I think I had already done using DataFrames in the session before I tried this. Tables is already a dependency of CSV but not re-exported so you’d have to do CSV.Tables.matrix.
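Putting this together with the small example from earlier in the thread, a sketch of the CSV-only version could look like the following. It assumes CSV.jl is installed and relies on Tables being reachable as CSV.Tables, as described above; the data with repeated spaces is made up for illustration.

```julia
using CSV

# Small made-up sample with repeated spaces between the columns.
data = """
1.0   0.0 1.0
0.0 1.0    0.0
0.0  0.0 1.0
"""

# ignorerepeated=true treats a run of spaces as a single delimiter;
# header=false stops the first row being consumed as column names.
file = CSV.File(IOBuffer(data); header=false, delim=' ',
                ignorerepeated=true, types=Float64)

# Tables is not re-exported by CSV, so qualify it through the CSV module.
A = CSV.Tables.matrix(file)
```

Here A is a 3×3 Matrix{Float64}, and only CSV needs to be listed as a direct dependency.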
