Making string to float conversion faster?

I have to read a 50,000,000-line file, and it currently takes about 1.5s per million lines, so pulling in the whole file takes on the order of 80s. (I do have to mention that my fairly non-trivial math on the 50M points takes about 1s :slight_smile: )

data format on each line is

,,,<float64>,<float64>

i’m using the following code to parse (this is in a loop of course)

line=readline(f)
fields = split(line[4:end], ",")
x[i] = parse(Float64, fields[1])
y[i] = parse(Float64, fields[2])

Where x and y are pre-allocated.

Is it possible for me to do this any faster? (Unfortunately, the values are NOT fixed-length strings.)

Thank you !

I’d try starting Julia with multiple threads and using CSV.jl

The reason I’m using my own “mini reader” is that in the past, for this kind of file, I have found CSV to be slower.

However, I tried it anyway and read the data into a DataFrame. It is much, much slower (> 6s per 1M lines). I’m still trying to figure out how to tell CSV to read floats and give me back an array instead of a DataFrame.

I also tried readdlm (when I can use readdlm it is really fast), but readdlm doesn’t like the missing values and throws an error. There does not appear to be an option to tell readdlm to ignore missing values.

The other thing to note is that I am also running this code on Windows, and on the same machine the Windows run is about 2x slower, which is something else I’m looking into.

Use CSV.jl with a memory map. I don’t think you can do better than that without substantial effort. 50M by 5 is not large. And if you don’t need the first 3 columns, you can read only the last 2.

Is it possible to provide a small sample file (maybe 5% of the lines) and the complete function you’re using right now? It’s easier for others to improve upon that way.

(Under the hood, CSV.jl uses Parsers.jl, which provides a lot more options than the built-in parse function, e.g. for parsing a stream directly without allocating a string.)
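
To make the allocation point concrete, here is a minimal Base-only sketch (it does not use the Parsers.jl API). It parses the last two fields of a line by locating the commas and parsing `SubString` views, so the intermediate array that `split` builds is skipped. The function name `parse_last_two` is hypothetical, and the sketch assumes every line contains at least two commas.

```julia
# Hypothetical helper: parse the last two comma-separated fields of `line`
# as Float64 without calling `split`. Assumes at least two commas per line.
function parse_last_two(line::AbstractString)
    c2 = findlast(',', line)           # comma before the last field
    c1 = findprev(',', line, c2 - 1)   # comma before the second-to-last field
    # SubString is a view into `line`; no field strings are copied.
    x = parse(Float64, SubString(line, c1 + 1, c2 - 1))
    y = parse(Float64, SubString(line, c2 + 1))
    return x, y
end

x, y = parse_last_two(",,,0.25,-1.5")
```

Whether this beats the `split` version in practice would need to be measured on the real file; `readline` still allocates one string per line.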

Sure, I’ve provided code below. Although it’s trivial, I should have provided it to begin with.

Be aware that I know I’m timing the creation and zeroing of the arrays, but it’s a very minor amount of time.

Here’s the full code to read it:

function main()
    f = open("test.csv")
    n = 1000000
    x=zeros(n)
    y=zeros(n)
    for i=1:n
        line = readline(f)
        fields = split(line, ",")
        x[i]=parse(Float64,fields[4])
        y[i]=parse(Float64,fields[5])
    end
    (x,y)
end

and some code to generate the data

using Printf

function write_data(n)
    f = open("test.csv", "w")
    for i=1:n
        @printf(f, ",,,%g,%g\n", randn(), randn())
    end
    close(f)
end

I’ll try mmap. It also occurs to me that this should be a very straightforward thing to multi-thread, as @ericphanson was saying. I know how big the file is, so I can break this into N threads where each thread reads 1/N of the file.

that sounds like a fun thing to try :slight_smile:
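
A rough sketch of that hand-rolled idea, using only Base, might look like the following. It is an illustration under assumptions, not how CSV.jl does it: the names `chunk_ranges`, `parse_chunk`, and `read_xy_threaded` are hypothetical, the whole file is read into memory first, each chunk is copied before parsing, and every non-empty line is assumed to end in two float fields.

```julia
# Split `bytes` into up to `nchunks` index ranges, each ending on a newline
# (or at the end of the data), so that no line is cut in half.
function chunk_ranges(bytes::Vector{UInt8}, nchunks::Integer)
    n = length(bytes)
    ranges = UnitRange{Int}[]
    start = 1
    for k in 1:nchunks
        start > n && break
        stop = k == nchunks ? n : max(start, (n * k) ÷ nchunks)
        nl = findnext(==(UInt8('\n')), bytes, stop)  # extend to the next newline
        stop = nl === nothing ? n : nl
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

# Parse the last two fields of every non-empty line in one chunk.
function parse_chunk(bytes::Vector{UInt8}, r::UnitRange{Int})
    xs = Float64[]
    ys = Float64[]
    for line in eachline(IOBuffer(bytes[r]))  # bytes[r] copies the chunk
        isempty(line) && continue
        c2 = findlast(',', line)
        c1 = findprev(',', line, c2 - 1)
        push!(xs, parse(Float64, SubString(line, c1 + 1, c2 - 1)))
        push!(ys, parse(Float64, SubString(line, c2 + 1)))
    end
    return xs, ys
end

# Parse the chunks on separate tasks and concatenate the results in file order.
function read_xy_threaded(path::AbstractString; nchunks::Integer = Threads.nthreads())
    bytes = read(path)
    tasks = [Threads.@spawn parse_chunk(bytes, r) for r in chunk_ranges(bytes, nchunks)]
    parts = fetch.(tasks)
    x = reduce(vcat, first.(parts); init = Float64[])
    y = reduce(vcat, last.(parts); init = Float64[])
    return x, y
end
```

Calling `read_xy_threaded("test.csv")` only runs in parallel if Julia was started with more than one thread (for example `julia -t 4`); any speedup would have to be measured.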

What I was saying was that CSV.jl will do that for you, as long as you start Julia with multiple threads :slightly_smiling_face:

Even without memory mapping, reading in 10M rows with 6 cores active I get this

julia> @time f_in = CSV.File("test.csv";select=[4,5],header=false)
  0.206670 seconds (3.13 k allocations: 295.919 MiB, 4.51% gc time, 3.44% compilation time)

The first run takes longer due to compilation. You can materialize the result into a DataFrame or whatever structure you need.

And without threads for completeness:

julia> @time f_no = CSV.File("test.csv";select=[4,5],header=false,threaded=false)
  1.120931 seconds (198 allocations: 302.636 MiB, 2.14% gc time)

You can make it just slightly faster by giving it more information:

delim=",", types=[Missing, Missing, Missing,Float64, Float64]

Btw, I think select= doesn’t reduce the parsing time, because when I tried to provide only two entries in types= it complained.

I tried

using CSV
using Mmap: mmap
@time file = CSV.File(mmap("test.csv"), header=false, select=[4,5], type=Float64)

and got 5s after compilation time (11s first run) with 2 threads for a 50M line file. Strangely though when I did -t auto which gives 8 threads, it would seem to hang forever (unless I pass threaded=false).

edit: filed the hang as a CSV.jl issue, CSV.jl#817.

i have what i’m sure is a ridiculously simple question but I just can’t find the answer in the CSV documentation.

once i have a CSV.File object, how do i extract arrays from it ?

What version of Julia and CSV are you using?

Both 1.5.3 and 1.6-rc1, with CSV v0.8.4. I suppose I should update to 1.5.4 and rc2 though :sweat_smile:

You can do file.Column4 (if that’s what the column is called; you can choose the names by passing a header) to get one column, or Matrix(file) to make the whole thing a matrix. CSV.File objects support the Tables.jl interface, so there is a lot you can do with them.
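
As a small, hedged illustration of column access: the sketch below assumes CSV.jl and Tables.jl are both installed in the active environment, and it uses a two-line in-memory sample instead of the real file. It uses `Tables.matrix` from the Tables.jl interface to build the matrix.

```julia
using CSV, Tables

# In-memory sample in the thread's format: three empty fields, then two floats.
data = IOBuffer(",,,0.25,-1.5\n,,,2.5,4.75\n")

# With header=false the columns get auto-generated names Column1, Column2, ...
file = CSV.File(data; header=false, select=[4, 5])

x = collect(file.Column4)   # copy the 4th column into a plain Vector
y = collect(file.Column5)   # copy the 5th column into a plain Vector
m = Tables.matrix(file)     # rows × 2 matrix of the selected columns
```

The `collect` calls are only there to get ordinary `Vector`s; the column objects returned by `file.Column4` can often be used directly.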

OK, things are really improved. I have provided the code below. The run time is now 10.7s for all 50,000,000 lines!

If I don’t provide “types=”, the read time goes up to 12.7s.

If I provide the “types=” line and do NOT use mmap, the read time is 11.2s.

A great improvement over my original almost 80s!! However this is all in Linux, now i have to go try it in windows :frowning:

And for the record I am using 1.5.1 and CSV 0.8.4

What’s particularly impressive, is that you don’t even provide a size hint to CSV and it still gets these speeds…

edit: the threading options -t 2 and -t 4 make absolutely no difference.
edit2: windows performance is about 13s. mmap does not seem to make any difference, nor does threading options.

I’m quite surprised threading doesn’t help. This seems like something that can be multi-threaded very efficiently.

using CSV
using DelimitedFiles
using Mmap: mmap

function test3()
    f_no = CSV.File(mmap("test.csv"),
                    delim=",",
                    types=[Missing, Missing, Missing, Float64, Float64],
                    header=[:a, :b, :c, :x, :y],
                    threaded=false)

    f_no
end

function main()
    @time df = test3()
    println(typeof(df))
    println(df)
    x = df.x
    y = df.y
    println(x[1:10])
    println(y[1:10])
end

main()

Thank you very much, everyone. I will try it on Windows and report back the results.
