TTFX with DataFrames and CSV

This is a known issue with CSV.jl. Still, it is good to keep track of it explicitly in an issue, so thank you for opening it.

Use DelimitedFiles

And how would I parse them into a data frame?

julia> data, header = readdlm(IOBuffer(input), ',', header=true)
([1.0 25.7; 2.0 31.8], AbstractString["time" "ping"])

julia> DataFrame(data, vec(header))
2×2 DataFrame
 Row │ time     ping
     │ Float64  Float64
─────┼──────────────────
   1 │     1.0     25.7
   2 │     2.0     31.8

And something I will blog about next week:

julia> input="""
       time,ping,name
       1,25.7,a
       2,,b
       """
"time,ping,name\n1,25.7,a\n2,,b\n"

julia> data, header = readdlm(IOBuffer(input), ',', header=true)
(Any[1 25.7 "a"; 2 "" "b"], AbstractString["time" "ping" "name"])

julia> identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
2×3 DataFrame
 Row │ time   ping       name
     │ Int64  Float64?   SubStrin…
─────┼─────────────────────────────
   1 │     1       25.7  a
   2 │     2  missing    b

(I have just realized how far we can go with plain DelimitedFiles and one line of code)

For OP’s input example, @btime of your code on a fresh first run shows readdlm() taking ~0.12 s and DataFrame() taking ~0.25 s.

Well, @btime cannot be used to determine the time-to-first-dataframe.

But the solution suggested by @rafael.guerra seems to solve my initial problem. Code:

@time using DataFrames, DelimitedFiles

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(inp)
    @time data, header = readdlm(IOBuffer(inp), ',',header=true)
    # @time df = DataFrame(data, vec(header))
    @time df = identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
    @time df[!,:time] = convert.(Int64,df[:,:time])
    df
end

df = read_csv(input)

Output:

julia> @time include("bench5.jl")

  0.837795 seconds (1.92 M allocations: 132.477 MiB, 4.66% gc time, 0.51% compilation time)
  0.114726 seconds (90.08 k allocations: 4.882 MiB, 99.82% compilation time)
  0.594612 seconds (1.70 M allocations: 93.770 MiB, 9.75% gc time, 99.64% compilation time)
  0.080426 seconds (201.72 k allocations: 11.051 MiB, 99.58% compilation time)
  2.109000 seconds (5.50 M allocations: 327.066 MiB, 8.52% gc time, 60.14% compilation time)
2×2 DataFrame
 Row │ time   ping
     │ Int64  Float64
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

Summary:

Time-to-first-dataframe

Python (Pandas): 0.3s
DelimitedFiles:  2.1s
CSV:            19.0s

DelimitedFiles does not support two features by default:
a. detecting different column types
b. detecting missing values
The code above handles this correctly for this toy example.
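
For anyone who wants to reuse this, here is a minimal sketch of the same idea as a function. `read_columns` is a made-up name, not part of DelimitedFiles or DataFrames, and I have only reasoned it through for small inputs like the ones in this thread. It uses only the DelimitedFiles standard library and returns a NamedTuple of column vectors, which should be accepted directly by `DataFrame(...)`.

```julia
using DelimitedFiles

# Hypothetical helper (not an existing API): read delimited text with a header,
# turn empty fields into `missing`, and give each column its own element type.
function read_columns(io::IO; delim=',')
    data, header = readdlm(io, delim, header=true)
    colnames = Symbol.(vec(header))
    cols = map(1:size(data, 2)) do j
        # Empty fields arrive as "" from readdlm; replace them with `missing`.
        col = [x == "" ? missing : x for x in data[:, j]]
        # Broadcasting `identity` narrows the element type where possible,
        # the same trick as `identity.(DataFrame(...))` in the posts above.
        identity.(col)
    end
    return NamedTuple{Tuple(colnames)}(Tuple(cols))
end

cols = read_columns(IOBuffer("time,ping,name\n1,25.7,a\n2,,b\n"))
println(typeof(cols.time))
println(typeof(cols.ping))
```

This does none of the other things CSV.jl handles (quoting, dates, custom missing strings), so treat it as a toy for small, clean files.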

Open questions:

  • is it possible to make CSV faster to avoid the need for two different solutions depending on the size of the problem?
  • if CSV.jl cannot be made fast, would it be good to have a package CSVlight.jl that has the same interfaces as CSV.jl and can serve as a drop-in replacement?

Why is that?

Sorry, that was a typo. I meant that @btime cannot be used because it is explicitly designed not to take the compilation overhead into account, but here I am mainly interested in the compilation and inference time, not in the run time, which is negligible for small datasets.
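
A minimal sketch of the difference, using only Base (the function `sumsq` is made up for illustration): in a fresh session the first call to a method pays for inference and compilation, while the second call only pays for execution. A benchmarking macro that runs the code many times and reports the fastest runs will, by design, only show you the second kind of number.

```julia
# Made-up example function; any function not yet compiled in this session would do.
sumsq(xs) = sum(x -> x^2, xs)

xs = rand(1000)

# First call: includes compilation of sumsq for Vector{Float64}.
t_first = @elapsed sumsq(xs)

# Second call: the compiled code is reused, so this is run time only.
t_second = @elapsed sumsq(xs)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The exact numbers depend on the machine and the Julia version, so I am not quoting any here.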

Just played around with this. Essentially the problem is huge functions with type instability.

Breaking up the functions so that unnecessary parts are not compiled is a really easy win. Every if block with a bunch of code in it should just call multiple inner functions. Sections with known types should be separated out from larger functions so that only the method for known types has to be compiled, instead of everything being boxed.
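
To illustrate the pattern only (this is a toy sketch, not code from CSV.jl, and both function names are made up): the outer function below is type-unstable because the element type of `col` is only known at run time, and the work on `col` is moved into a small inner function behind a function barrier. How much compile time such a split saves in practice depends on inlining and on what inference can prove, so it needs to be measured case by case.

```julia
# Inner kernel: a separate method that is specialized per concrete vector type.
function total(col::Vector{T}) where {T}
    acc = zero(T)
    for x in col
        acc += x
    end
    return acc
end

# Outer function: `col` is either Vector{Int} or Vector{Float64}, decided at run time.
# Instead of writing the loop here, it hands `col` to `total` (the function barrier).
function column_total(kind::Symbol, n::Int)
    col = kind === :int ? collect(1:n) : collect(1.0:n)
    return total(col)
end

println(column_total(:int, 3))
println(column_total(:float, 3))
```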

Then we could even precompile a bunch of these for specific types.
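
As a rough sketch of what such a precompile statement looks like (the function `parse_field` is made up for illustration and is not the Parsers.jl or CSV.jl API): `precompile(f, argtypes)` asks Julia to compile a specific method specialization without running it. Placed at the top level of a package module, statements like these are executed while the package is being precompiled.

```julia
# Made-up example function standing in for a parsing kernel.
parse_field(::Type{T}, s::AbstractString) where {T} = parse(T, s)

# Request compilation for concrete argument types ahead of the first call.
# `precompile` returns a Bool indicating whether it succeeded.
ok_int = precompile(parse_field, (Type{Int}, String))
ok_float = precompile(parse_field, (Type{Float64}, String))

println("precompiled for Int: ", ok_int)
println("precompiled for Float64: ", ok_float)
println(parse_field(Int, "42"))
```

This only helps for the signatures that are listed, and only for code the compiler can reach through inference from them, which is where type stability comes in.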

Moving the ctx.threaded block out of the conditional to another function shaved 4s off the time for me, from 16s to 12s. Things like that can be done pretty much everywhere in file.jl, and that’s where nearly all the time is spent.

https://github.com/JuliaData/CSV.jl/pull/975

I’m stuck in the Paris airport for a few more hours, so I’ll see how far I can get with this.

Rooting for further delays on your inbound plane!

I hope plane(inport,outport) has compilation issues and a looong TTFX.

Didn’t get much further.

A problem with doing this is that there is so much noise in the timings that it’s hard to be sure of small improvements. We need a @ctime macro that will run code in a fresh session multiple times and take an average, although it’s going to take a long time to run.
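
Something along these lines might work as a starting point. To be clear, `fresh_session_times` is a hypothetical helper written for this post, not an existing macro or package function. It starts a new Julia process for every run, so each measurement includes Julia's own startup time along with package loading and compilation.

```julia
# Hypothetical helper: time `code` in `runs` fresh Julia processes.
# Note: the child process does not automatically use the parent's active project;
# add a --project flag to the command if the code needs a specific environment.
function fresh_session_times(code::AbstractString; runs::Int=3)
    julia = Base.julia_cmd()          # command for the currently running Julia binary
    times = Float64[]
    for _ in 1:runs
        # Wall-clock time of the whole child process, startup included.
        t = @elapsed run(`$julia --startup-file=no -e $code`)
        push!(times, t)
    end
    return (mean=sum(times) / length(times), minimum=minimum(times), times=times)
end

result = fresh_session_times("sum(rand(10))"; runs=2)
println(result)
```

For a TTFX measurement the `code` string would be something like the `using` line plus the first call of interest, and `runs` would probably need to be fairly large given the noise.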

It also seems like reorganising code is only effective for really large blocks.

Well, I think laptops also change their speed depending on the CPU temperature more than desktops… It is easier to do the benchmarking on a desktop.

I was trying something like that with https://github.com/jkrumbiegel/VersionBenchmarks.jl/ to compare code across different versions or commits, with different Julia versions if that’s desired. For my test case of improving GridLayoutBase.jl latency, it has still been pretty noisy so far, however, with more between-trial variation than I would have liked. Maybe one needs to collect 10 runs or more so the average is meaningful.

Interestingly, fixing type stability and reorganising things didn’t do that much for the timing. But after doing that, adding some precompile methods had a large effect (where they had none previously) - I’m getting a full TTFX, including using, of 8 seconds, and 7 if we remove the @refargs macros. There are a few more patches of instability preventing further precompilation, but hopefully they are fixable and it can mostly precompile away.

Finally, to round out this saga, nearly all of the time ends up being resolved by precompilation in Parsers.jl, making most of the other changes I made much less effective.

https://github.com/JuliaData/Parsers.jl/pull/108

This is also the case for JSON.jl, and Blink.jl from the other TTFX thread. It’s nearly all Parsers.jl.

https://github.com/JuliaIO/JSON.jl/pull/337

Great job! Hopefully your PR gets merged soon! :grinning:

Wow, that’s really impressive! Turns out that lots of TTFX improvements don’t even require any compiler optimizations.