This is a known issue with CSV.jl; still, it is good to keep track of it explicitly in an issue, so thank you for opening it.
Use DelimitedFiles
And how would I parse them into a data frame?
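The snippet below assumes an input string holding the two-column example; a reconstructed setup (the original post does not show it, but it matches the output that follows):

julia> input = """
       time,ping
       1,25.7
       2,31.8
       """;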
julia> data, header = readdlm(IOBuffer(input), ',', header=true)
([1.0 25.7; 2.0 31.8], AbstractString["time" "ping"])
julia> DataFrame(data, vec(header))
2×2 DataFrame
 Row │ time     ping
     │ Float64  Float64
─────┼──────────────────
   1 │     1.0     25.7
   2 │     2.0     31.8
and something I will blog about next week is:
julia> input="""
time,ping,name
1,25.7,a
2,,b
"""
"time,ping,name\n1,25.7,a\n2,,b\n"
julia> data, header = readdlm(IOBuffer(input), ',', header=true)
(Any[1 25.7 "a"; 2 "" "b"], AbstractString["time" "ping" "name"])
julia> identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
2×3 DataFrame
 Row │ time   ping       name
     │ Int64  Float64?   SubStrin…
─────┼─────────────────────────────
   1 │     1      25.7   a
   2 │     2   missing   b
(I have just realized how far we can go with plain DelimitedFiles and one line of code)
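For anyone puzzled by the identity.(...) part: readdlm returns a Matrix{Any} here, so the freshly built DataFrame has Any-typed columns, and broadcasting identity over the whole DataFrame rebuilds each column with a narrowed, concrete element type. A minimal sketch of the effect, reusing data and header from above:

raw = DataFrame(ifelse.(data .== "", missing, data), vec(header))
eltype.(eachcol(raw))   # [Any, Any, Any] - columns are still untyped

df = identity.(raw)     # broadcasting identity narrows each column
eltype.(eachcol(df))    # [Int64, Union{Missing, Float64}, SubString{String}]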
For the OP's input example, @btime of your code on a fresh first run shows readdlm() taking ~0.12 s and DataFrame() taking ~0.25 s.
Well, @btime cannot be used to determine the time-to-first-dataframe.
But the solution suggested by @rafael.guerra seems to solve my initial problem. Code:
@time using DataFrames, DelimitedFiles
const input="""
time,ping
1,25.7
2,31.8
"""
function read_csv(inp)
    @time data, header = readdlm(IOBuffer(inp), ',', header=true)
    # @time df = DataFrame(data, vec(header))
    # Map empty cells to missing, then narrow the Any columns:
    @time df = identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
    # readdlm parsed the integer column as Float64; convert it back:
    @time df[!, :time] = convert.(Int64, df[:, :time])
    df
end
df = read_csv(input)
Output:
julia> @time include("bench5.jl")
0.837795 seconds (1.92 M allocations: 132.477 MiB, 4.66% gc time, 0.51% compilation time)
0.114726 seconds (90.08 k allocations: 4.882 MiB, 99.82% compilation time)
0.594612 seconds (1.70 M allocations: 93.770 MiB, 9.75% gc time, 99.64% compilation time)
0.080426 seconds (201.72 k allocations: 11.051 MiB, 99.58% compilation time)
2.109000 seconds (5.50 M allocations: 327.066 MiB, 8.52% gc time, 60.14% compilation time)
2×2 DataFrame
 Row │ time   ping
     │ Int64  Float64
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8
Summary (time-to-first-dataframe):
- Python (Pandas): 0.3s
- DelimitedFiles: 2.1s
- CSV: 19.0s
DelimitedFiles does not support two features by default:
a. detecting different column types
b. detecting missing values
The code above handles this correctly for this toy example.
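Without the timing macros, the same approach can be wrapped into a small helper. A sketch (the name read_dlm_df is mine, and mapping "" to missing assumes empty fields are the only missing-value marker):

using DataFrames, DelimitedFiles

# Read delimited text, map empty cells to missing, narrow the column types.
function read_dlm_df(io; delim=',')
    data, header = readdlm(io, delim; header=true)
    df = DataFrame(ifelse.(data .== "", missing, data), vec(header))
    return identity.(df)   # narrow the Any columns to concrete element types
end

read_dlm_df(IOBuffer("time,ping\n1,25.7\n2,\n"))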
Open questions:
- is it possible to make CSV.jl fast enough to avoid needing two different solutions depending on the size of the problem?
- if CSV.jl cannot be made fast, would it be good to have a package CSVlight.jl that has the same interface as CSV.jl and can serve as a drop-in replacement?
Why is that?
Sorry, that was a typo. I mean, @btime cannot be used because it is explicitly designed not to include compilation overhead, but here I am mainly interested in the compilation and inference time, not in the runtime, which is negligible for small datasets.
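The difference in a nutshell: BenchmarkTools' @btime runs the expression many times after a warm-up and reports the minimum, so first-run compilation is deliberately excluded, while a plain @time around the first call includes it. A small sketch:

using BenchmarkTools, DelimitedFiles

csv() = IOBuffer("time,ping\n1,25.7\n2,31.8\n")

@time readdlm(csv(), ','; header=true)   # first call: compilation included
@time readdlm(csv(), ','; header=true)   # second call: already compiled
@btime readdlm(csv(), ','; header=true)  # steady-state runtime only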
Just played around with this. Essentially the problem is huge functions with type instability.
Breaking up the functions so that unnecessary parts are not compiled is a really easy win. Every if block with a bunch of code in it should just call multiple inner functions. Sections with known types should be separated out from larger functions so that only the method for known types has to be compiled, instead of everything being boxed.
Then we could even precompile a bunch of these for specific types.
Moving the ctx.threaded block out of the conditional to another function shaved 4s off the time for me, from 16s to 12s. Things like that can be done pretty much everywhere in file.jl, and that's where nearly all the time is spent.
https://github.com/JuliaData/CSV.jl/pull/975
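Schematically, the kind of change described above looks like this (hypothetical names, not the actual CSV.jl code - see the PR for the real diff):

# Before: one large function; both branches are compiled together and
# any type instability leaks into the whole body.
function parsefile(ctx)
    if ctx.threaded
        # ... long threaded-path body ...
    else
        # ... long serial-path body ...
    end
end

# After: each branch lives in its own function. Only the branch actually
# taken gets compiled, and the calls act as function barriers behind
# which argument types are concrete again - which also makes the inner
# functions precompilable for specific types.
parsefile(ctx) = ctx.threaded ? parse_threaded(ctx) : parse_serial(ctx)
parse_threaded(ctx) = nothing   # threaded-path body would go here
parse_serial(ctx)   = nothing   # serial-path body would go here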
I'm stuck in the Paris airport a few more hours; let's see how far I can get with this.
Rooting for further delays on your inbound plane!
I hope plane(inport, outport) has compilation issues and a looong TTFX.
Didn't get much further. A problem with doing this is that there is so much noise in the timings that it's hard to be sure of small improvements. We need a @ctime macro that runs the code in a fresh session multiple times and takes an average, although it's going to take a long time to run.
It also seems like reorganising code is only effective for really large blocks.
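Until a @ctime macro exists, fresh-session timing can be approximated by shelling out to new Julia processes and averaging. A rough sketch (the helper name fresh_time is hypothetical):

# Run `code` in n fresh Julia processes and average the wall-clock time,
# so compilation (and startup) is paid and measured on every run.
function fresh_time(code::AbstractString; n::Int = 5)
    total = 0.0
    for _ in 1:n
        total += @elapsed run(`julia --startup-file=no -e $code`)
    end
    return total / n
end

fresh_time("using DelimitedFiles; readdlm(IOBuffer(\"a,b\\n1,2\"), ',')"; n = 3)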
Well, I think laptops also change their speed with CPU temperature more than desktops do… it is easier to do the benchmarking on a desktop.
I was trying something like that with https://github.com/jkrumbiegel/VersionBenchmarks.jl/ to compare code across different versions or commits, with different Julia versions if that's desired. For my test case of improving GridLayoutBase.jl latency, it has still been pretty noisy so far, with more between-trial variation than I would have liked. Maybe one needs to collect 10 runs or more so the average is meaningful.
Interestingly, fixing type stability and reorganising things didn't do that much for the timing. But after doing that, adding some precompile methods had a large effect (where they had none previously) - I'm getting a full TTFX, including using, of 8 seconds, and 7 if we remove the @refargs macros. There are a few more patches of instability preventing further precompilation, but hopefully they are fixable and it can mostly precompile away.
Finally, to round out this saga: nearly all of the time ends up being resolved by precompilation in Parsers.jl, making most of the other changes I made much less effective.
https://github.com/JuliaData/Parsers.jl/pull/108
This is also the case for JSON.jl, and Blink.jl from the other TTFX thread. It's nearly all Parsers.jl.
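For reference, package-level precompilation of this sort just executes representative precompile statements (or calls) at module definition time, so the compiled code lands in the package's precompile cache instead of being paid at first use. A schematic sketch, not the actual Parsers.jl code:

module TinyParsers

# Concrete, type-stable entry points...
parseint(s::AbstractString)   = parse(Int, s)
parsefloat(s::AbstractString) = parse(Float64, s)

# ...plus precompile statements executed while the package precompiles,
# so these methods are compiled once and cached.
precompile(parseint, (String,))
precompile(parsefloat, (String,))

end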
Great job! Hopefully your PR gets merged soon!
Wow, that's really impressive! It turns out that lots of TTFX improvements don't even require any compiler optimizations.