Well, @btime cannot be used to determine the time-to-first-dataframe.
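(A minimal sketch of why, with a hypothetical toy function f: BenchmarkTools' @btime evaluates the expression many times after a warmup run and reports steady-state time, so the one-off compilation cost of the first call never shows up; a plain @time on the first call does include it.)

```julia
# Hypothetical toy function to show why first-call timing differs:
f(x) = sum(x .^ 2)

v = rand(100)
@time f(v)   # first call: includes one-off compilation
@time f(v)   # second call: already compiled, far faster

# BenchmarkTools' @btime would run f(v) many times after a warmup,
# reporting only the steady-state time, so the compilation cost
# ("time to first result") never appears in its output.
```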
But the solution suggested by @rafael.guerra seems to solve my initial problem. Code:
@time using DataFrames, DelimitedFiles
const input="""
time,ping
1,25.7
2,31.8
"""
function read_csv(inp)
    @time data, header = readdlm(IOBuffer(inp), ',', header=true)
    # @time df = DataFrame(data, vec(header))
    @time df = identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
    @time df[!, :time] = convert.(Int64, df[:, :time])
    df
end
df = read_csv(input)
Output:
julia> @time include("bench5.jl")
0.837795 seconds (1.92 M allocations: 132.477 MiB, 4.66% gc time, 0.51% compilation time)
0.114726 seconds (90.08 k allocations: 4.882 MiB, 99.82% compilation time)
0.594612 seconds (1.70 M allocations: 93.770 MiB, 9.75% gc time, 99.64% compilation time)
0.080426 seconds (201.72 k allocations: 11.051 MiB, 99.58% compilation time)
2.109000 seconds (5.50 M allocations: 327.066 MiB, 8.52% gc time, 60.14% compilation time)
2×2 DataFrame
Row │ time ping
│ Int64 Float64
─────┼────────────────
1 │ 1 25.7
2 │ 2 31.8
Summary:
Time-to-first-dataframe
Python (Pandas): 0.3s
DelimitedFiles: 2.1s
CSV: 19.0s
DelimitedFiles does not support two features by default:
a. detecting different column types
b. detecting missing values
The code above handles both correctly for this toy example.
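To illustrate what the broadcast trick is doing (a minimal sketch with toy data standing in for the readdlm output): readdlm returns a Matrix{Any}, so the DataFrame columns start out as Any; replacing "" with missing and then broadcasting identity over the DataFrame lets each column's element type be re-inferred to something concrete.

```julia
using DataFrames

# Toy stand-in for readdlm output: a Matrix{Any} where empty fields are "".
data = Any[1 25.7; 2 ""]

# Replace empty strings with `missing`, then broadcast `identity` over the
# DataFrame so each column's element type is narrowed from Any:
df = identity.(DataFrame(ifelse.(data .== "", missing, data), [:time, :ping]))

eltype(df.time)   # Int64 instead of Any
eltype(df.ping)   # Union{Missing, Float64} instead of Any
```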
Open questions:
- is it possible to make CSV.jl faster, to avoid needing two different solutions depending on the size of the problem?
- if CSV.jl cannot be made fast, would it be good to have a package CSVlight.jl that has the same interface as CSV.jl and can serve as a drop-in replacement?