Making string to float conversion faster?

purplishrock · March 14, 2021, 8:56pm

I have to read a 50,000,000-line file, and it currently takes about 1.5 s per million lines, so pulling in the whole file takes on the order of 80 s. (For comparison, my fairly non-trivial math on the 50M points takes about 1 s.)

data format on each line is

,,,<float64>,<float64>

i’m using the following code to parse (this is in a loop of course)

line=readline(f)
fields = split(line[4:end], ",")
x[i] = parse(Float64, fields[1])
y[i] = parse(Float64, fields[2])

Where x and y are pre-allocated.

Is it possible for me to do this any faster? (Unfortunately, the values are NOT fixed-length strings.)

Thank you!

ericphanson · March 14, 2021, 8:58pm

I’d try starting Julia with multiple threads and using CSV.jl

purplishrock · March 14, 2021, 9:24pm

The reason I’m using my own “mini reader” is that in the past, for this kind of file, I have found CSV to be slower.

However, I tried it anyway and read the data into a DataFrame. It is much, much slower (> 6 s per 1M lines). I’m still trying to figure out how to tell CSV to read floats and give me back an array instead of a DataFrame.

I also tried readdlm (when i can use readdlm it is really fast), but readdlm doesn’t like the missing values and throws an error. There does not appear to be anything to tell readdlm to ignore missing values.

The other thing to note is that I am also running this code on Windows, and on the same machine the Windows run is about 2x slower, which is something else I’m looking into.

tbeason · March 14, 2021, 9:28pm

CSV.jl and memory map. I don’t think you can do better than that without substantial effort. 50M by 5 is not large. And if you don’t need the first 3 columns, you can only read the last 2.
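As a rough illustration of memory mapping with only the standard library (not what CSV.jl does internally), the sketch below maps the file and counts newlines, which gives the number of rows without a separate size hint. The function name `count_lines` is made up for this example.

```julia
using Mmap

# Hypothetical helper: count the lines of a file through a memory map.
function count_lines(path::AbstractString)
    buf = Mmap.mmap(path)                 # Vector{UInt8} backed by the file
    n = count(==(UInt8('\n')), buf)       # one newline per complete line
    # A final line without a trailing newline still counts as a line.
    if !isempty(buf) && buf[end] != UInt8('\n')
        n += 1
    end
    return n
end
```

The result could be used to size pre-allocated `x` and `y` arrays before parsing; whether the extra pass over the file pays off is something to measure.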

jling · March 14, 2021, 9:31pm

Would it be possible to provide a small sample file (maybe 5% of the lines) and the complete function you’re using right now? It’s easier for others to improve upon it that way.

stevengj · March 14, 2021, 9:36pm

(Under the hood, CSV.jl uses Parsers.jl, which provides a lot more options than the built-in parse function, e.g. for parsing a stream directly without allocating a string.)
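For comparison, here is a minimal sketch that stays with the built-in `parse` but skips the array that `split` builds for every line, by locating the last two commas and parsing `SubString` views. It is not the Parsers.jl approach, the name `parse_last_two` is made up, and it assumes every line ends with exactly two numeric fields.

```julia
# Hypothetical helper: parse the last two comma-separated fields of a line
# as Float64 without creating the intermediate array returned by `split`.
function parse_last_two(line::AbstractString)
    j = findlast(==(','), line)                    # comma between the two floats
    j === nothing && throw(ArgumentError("line contains no comma"))
    i = findprev(==(','), line, prevind(line, j))  # comma before the first float
    i === nothing && throw(ArgumentError("line contains only one comma"))
    # SubString is a view into `line`, so no field strings are copied.
    x = parse(Float64, SubString(line, nextind(line, i), prevind(line, j)))
    y = parse(Float64, SubString(line, nextind(line, j)))
    return x, y
end

# Usage inside the reading loop would look like:
#     x[i], y[i] = parse_last_two(line)
```

Each `readline` call still allocates a string, so this only removes part of the per-line overhead; any speedup should be checked with `@time` on the real file.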

purplishrock · March 14, 2021, 10:43pm

Sure, I’ve provided code below. Although it’s trivial, I should have provided it to begin with.

Note that I’m aware I’m timing the creation and zeroing of the arrays, but that takes a very minor amount of time.

Here’s the full code to read it:

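function main()
    n = 1000000
    x = zeros(n)
    y = zeros(n)
    # The do-block closes the file even if parsing throws.
    open("test.csv") do f
        for i = 1:n
            line = readline(f)
            fields = split(line, ",")
            # Fields 1-3 are empty; the two floats are fields 4 and 5.
            x[i] = parse(Float64, fields[4])
            y[i] = parse(Float64, fields[5])
        end
    end
    (x, y)
end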

and some code to generate the data

using Printf

function write_data(n)
    f = open("test.csv", "w")
    for i=1:n
        @printf(f, ",,,%g,%g\n", randn(), randn())
    end
    close(f)
end

purplishrock · March 14, 2021, 10:48pm

I’ll try mmap. It also occurs to me that this should be a very straightforward thing to multi-thread, as @ericphanson was saying. I know how big the file is, so I can break this into N threads where each thread reads 1/N of the file.

that sounds like a fun thing to try
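A minimal sketch of that idea, using only Base: read the file into a byte buffer, cut it into chunks that end on newline boundaries, and parse each chunk in its own task. All function names here are made up for illustration, each chunk is copied before parsing for simplicity, and Julia has to be started with multiple threads (for example `-t 4`) for the tasks to run in parallel.

```julia
# Split 1:length(buf) into at most `nchunks` ranges that each end on a newline.
function chunk_ranges(buf::Vector{UInt8}, nchunks::Int)
    ranges = UnitRange{Int}[]
    n = length(buf)
    start = 1
    for k in 1:nchunks
        start > n && break
        stop = k == nchunks ? n : max(start, (n * k) ÷ nchunks)
        # Extend to the next newline so that no line is split across chunks.
        while stop < n && buf[stop] != UInt8('\n')
            stop += 1
        end
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

# Parse the last two comma-separated fields of every line in one chunk.
function parse_chunk(buf::Vector{UInt8}, r::UnitRange{Int})
    xs = Float64[]
    ys = Float64[]
    for line in eachline(IOBuffer(buf[r]))   # buf[r] copies the chunk
        isempty(line) && continue
        j = findlast(==(','), line)
        i = findprev(==(','), line, prevind(line, j))
        push!(xs, parse(Float64, SubString(line, nextind(line, i), prevind(line, j))))
        push!(ys, parse(Float64, SubString(line, nextind(line, j))))
    end
    return xs, ys
end

# Parse all chunks concurrently and concatenate the results in file order.
function read_xy_chunked(path::AbstractString; nchunks::Int = Threads.nthreads())
    buf = read(path)
    tasks = [Threads.@spawn parse_chunk(buf, r) for r in chunk_ranges(buf, nchunks)]
    parts = fetch.(tasks)
    x = reduce(vcat, first.(parts); init = Float64[])
    y = reduce(vcat, last.(parts); init = Float64[])
    return x, y
end
```

Calling `read_xy_chunked("test.csv")` returns the two columns as vectors. This reads the whole file into memory first, which is reasonable for a file of this size but is a design choice rather than a requirement.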

ericphanson · March 14, 2021, 10:54pm

What I was saying was that CSV.jl will do that for you, as long as you start Julia with multiple threads

tbeason · March 14, 2021, 11:05pm

Even without memory mapping, reading in 10M rows with 6 cores active I get this

julia> @time f_in = CSV.File("test.csv";select=[4,5],header=false)
  0.206670 seconds (3.13 k allocations: 295.919 MiB, 4.51% gc time, 3.44% compilation time)

First run takes longer due to precompile. You can materialize into DataFrame or whatever structure you need.

And without threads for completeness:

julia> @time f_no = CSV.File("test.csv";select=[4,5],header=false,threaded=false)
  1.120931 seconds (198 allocations: 302.636 MiB, 2.14% gc time)

jling · March 14, 2021, 11:11pm

can make it just slightly faster by giving it more information:

delim=",", types=[Missing, Missing, Missing,Float64, Float64]

Btw I think select= doesn’t help reduce the parsing time because when I tried to only provide two types= it complained.

ericphanson · March 14, 2021, 11:12pm

I tried

using CSV
using Mmap: mmap
@time file = CSV.File(mmap("test.csv"), header=false, select=[4,5], type=Float64)

and got 5s after compilation time (11s first run) with 2 threads for a 50M line file. Strangely though when I did -t auto which gives 8 threads, it would seem to hang forever (unless I pass threaded=false).

edit: filed as CSV.jl issue for the hang, CSV.jl#817.

purplishrock · March 14, 2021, 11:18pm

i have what i’m sure is a ridiculously simple question but I just can’t find the answer in the CSV documentation.

once i have a CSV.File object, how do i extract arrays from it ?

Oscar_Smith · March 14, 2021, 11:22pm

What version of Julia and CSV are you using?

ericphanson · March 14, 2021, 11:29pm

Both 1.5.3 and 1.6-rc1, with CSV v0.8.4. I suppose I should update to 1.5.4 and rc2 though

ericphanson · March 14, 2021, 11:31pm

You can do file.Column4 (if that’s what the column is called; you can choose the names by passing a header) to get one column, or Matrix(file) to make the whole thing a matrix. CSV.File objects support the Tables.jl interface, so there is a lot you can do with them.

purplishrock · March 14, 2021, 11:31pm

OK, things are really improved. I have provided the code below. The run time is now 10.7 s for all 50,000,000 lines!

If I don’t provide “types=”, the read time goes up to 12.7 s.

If I provide the “types=” line and do NOT use mmap, the read time is 11.2 s.

A great improvement over my original almost 80 s! However, this is all on Linux; now I have to go try it on Windows.

And for the record I am using 1.5.1 and CSV 0.8.4

What’s particularly impressive, is that you don’t even provide a size hint to CSV and it still gets these speeds…

edit: the threading options -t 2 and -t 4 make absolutely no difference.
edit2: windows performance is about 13s. mmap does not seem to make any difference, nor does threading options.

I’m quite surprised threading doesn’t help. This seems like something that can be multi-threaded very efficiently.

using CSV
using DelimitedFiles
using Mmap: mmap

function test3()
    f_no = CSV.File(mmap("test.csv"),
                    delim=",",
                    types=[Missing, Missing, Missing, Float64, Float64],
                    header=[:a, :b, :c, :x, :y],
                    threaded=false)

    f_no
end

function main()
    @time df = test3()
    println(typeof(df))
    println(df)
    x = df.x
    y = df.y
    println(x[1:10])
    println(y[1:10])
end

main()

Thank you very much, everyone. I will try it on Windows and report back the results.
