TTFX with DataFrames and CSV

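Well, @btime cannot be used to determine the time-to-first-dataframe: it runs the expression many times and reports the minimum, so the compilation cost of the first call is not included.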
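A minimal sketch of how a first-call measurement differs from later calls, using only Base's @elapsed. The function parse_pings and the sample string are hypothetical stand-ins, not part of the benchmark in this post; the first timing is only meaningful in a fresh Julia session.

```julia
# Hypothetical stand-in for any function whose first call triggers compilation.
function parse_pings(s)
    lines = split(strip(s), '\n')
    # take the last comma-separated field of each line and parse it as Float64
    return [parse(Float64, last(split(line, ','))) for line in lines]
end

sample = "1,25.7\n2,31.8"

t_first  = @elapsed parse_pings(sample)   # includes JIT compilation of parse_pings
t_second = @elapsed parse_pings(sample)   # runs the already compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

A repeated-sampling tool such as @btime reports something close to the second number, which is why it says nothing about time-to-first-x.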

But the solution suggested by @rafael.guerra seems to solve my initial problem. Code:

@time using DataFrames, DelimitedFiles

const input="""
time,ping
1,25.7
2,31.8
"""

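function read_csv(inp)
    # readdlm returns the data matrix and a 1×n header matrix
    @time data, header = readdlm(IOBuffer(inp), ',',header=true)
    # @time df = DataFrame(data, vec(header))
    # replace empty fields with missing, then narrow the element type of each column
    @time df = identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
    # in this example the time column is read as Float64; convert it to Int64
    @time df[!,:time] = convert.(Int64,df[:,:time])
    df
end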

df = read_csv(input)

Output:

julia> @time include("bench5.jl")

  0.837795 seconds (1.92 M allocations: 132.477 MiB, 4.66% gc time, 0.51% compilation time)
  0.114726 seconds (90.08 k allocations: 4.882 MiB, 99.82% compilation time)
  0.594612 seconds (1.70 M allocations: 93.770 MiB, 9.75% gc time, 99.64% compilation time)
  0.080426 seconds (201.72 k allocations: 11.051 MiB, 99.58% compilation time)
  2.109000 seconds (5.50 M allocations: 327.066 MiB, 8.52% gc time, 60.14% compilation time)
2×2 DataFrame
 Row │ time   ping    
     │ Int64  Float64 
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

Summary:

Time-to-first-dataframe

Python (Pandas): 0.3s
DelimitedFiles:  2.1s
CSV:            19.0s

DelimitedFiles does not support two features by default:
a. detecting different column types
b. detecting missing values
The code above handles this correctly for this toy example.
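A possible generalization of the two points above, as a sketch only: narrow every column after readdlm instead of converting a single named column. The helpers narrow_column and read_columns and the three-column sample input are hypothetical and not from this thread; only DelimitedFiles is used, and the result is a NamedTuple of column vectors.

```julia
using DelimitedFiles

# Hypothetical helper: narrow one column of the matrix returned by readdlm.
# Empty strings become `missing`; columns holding only whole numbers become Int.
function narrow_column(col)
    vals = [v == "" ? missing : v for v in col]
    present = collect(skipmissing(vals))
    if !isempty(present) && all(v -> v isa Real && isinteger(v), present)
        vals = [ismissing(v) ? missing : Int(v) for v in vals]
    end
    return identity.(vals)   # tighten the element type of the vector
end

# Hypothetical helper: read comma-separated text with a header line
# and return the columns as a NamedTuple of vectors.
function read_columns(inp)
    data, header = readdlm(IOBuffer(inp), ',', header=true)
    colnames = Symbol.(vec(header))
    cols = [narrow_column(data[:, j]) for j in axes(data, 2)]
    return NamedTuple{Tuple(colnames)}(Tuple(cols))
end

# Sample input with one empty field in the middle column (an assumption for illustration).
cols = read_columns("time,ping,host\n1,25.7,a\n2,,b\n")
println(eltype(cols.time))
println(cols.ping)
```

A NamedTuple of vectors can be passed to the DataFrame constructor if a DataFrame is needed. The sketch does not handle quoted fields or date columns, which is part of what CSV.jl provides.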

Open questions:

  • Is it possible to make CSV.jl faster, to avoid the need for two different solutions depending on the size of the problem?
  • If CSV.jl cannot be made fast, would it be good to have a package CSVlight.jl that has the same interface as CSV.jl and can serve as a drop-in replacement?