This is a known issue with CSV.jl; still, it is good to keep track of it explicitly in an issue, so thank you for opening it.
Use DelimitedFiles
And how would I parse them into a data frame?
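The snippet below assumes an input string holding the two-column example; a reconstructed setup (the original post does not show it, but it matches the output that follows):

julia> input = """
       time,ping
       1,25.7
       2,31.8
       """;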
julia> data, header = readdlm(IOBuffer(input), ',', header=true)
([1.0 25.7; 2.0 31.8], AbstractString["time" "ping"])
julia> DataFrame(data, vec(header))
2×2 DataFrame
 Row │ time     ping
     │ Float64  Float64
─────┼──────────────────
   1 │     1.0     25.7
   2 │     2.0     31.8
and something I will blog about next week is:
julia> input="""
time,ping,name
1,25.7,a
2,,b
"""
"time,ping,name\n1,25.7,a\n2,,b\n"
julia> data, header = readdlm(IOBuffer(input), ',', header=true)
(Any[1 25.7 "a"; 2 "" "b"], AbstractString["time" "ping" "name"])
julia> identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
2×3 DataFrame
 Row │ time   ping       name
     │ Int64  Float64?   SubStrin…
─────┼─────────────────────────────
   1 │     1      25.7   a
   2 │     2   missing   b
(I have just realized how far we can go with plain DelimitedFiles and one line of code)
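For anyone puzzled by the identity.(...) part: readdlm returns a Matrix{Any} here, so the freshly built DataFrame has Any-typed columns, and broadcasting identity over the whole DataFrame rebuilds each column with a narrowed, concrete element type. A minimal sketch of the effect, reusing data and header from above:

raw = DataFrame(ifelse.(data .== "", missing, data), vec(header))
eltype.(eachcol(raw))   # [Any, Any, Any] - columns are still untyped

df = identity.(raw)     # broadcasting identity narrows each column
eltype.(eachcol(df))    # [Int64, Union{Missing, Float64}, SubString{String}]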
For the OP's input example, @btime of your code on a fresh first run shows readdlm() taking ~0.12 s and DataFrame() taking ~0.25 s.
Well, @btime cannot be used to determine the time-to-first-dataframe.
But the solution suggested by @rafael.guerra seems to solve my initial problem. Code:
@time using DataFrames, DelimitedFiles
const input="""
time,ping
1,25.7
2,31.8
"""
function read_csv(inp)
    @time data, header = readdlm(IOBuffer(inp), ',', header=true)
    # @time df = DataFrame(data, vec(header))
    # Map empty cells to missing, then narrow the Any columns:
    @time df = identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
    # readdlm parsed the integer column as Float64; convert it back:
    @time df[!, :time] = convert.(Int64, df[:, :time])
    df
end
df = read_csv(input)
Output:
julia> @time include("bench5.jl")
0.837795 seconds (1.92 M allocations: 132.477 MiB, 4.66% gc time, 0.51% compilation time)
0.114726 seconds (90.08 k allocations: 4.882 MiB, 99.82% compilation time)
0.594612 seconds (1.70 M allocations: 93.770 MiB, 9.75% gc time, 99.64% compilation time)
0.080426 seconds (201.72 k allocations: 11.051 MiB, 99.58% compilation time)
2.109000 seconds (5.50 M allocations: 327.066 MiB, 8.52% gc time, 60.14% compilation time)
2×2 DataFrame
 Row │ time   ping
     │ Int64  Float64
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8
Summary (time-to-first-dataframe):
- Python (Pandas): 0.3s
- DelimitedFiles: 2.1s
- CSV: 19.0s
DelimitedFiles does not support two features by default:
a. detecting different column types
b. detecting missing values
The code above handles this correctly for this toy example.
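Without the timing macros, the same approach can be wrapped into a small helper. A sketch (the name read_dlm_df is mine, and mapping "" to missing assumes empty fields are the only missing-value marker):

using DataFrames, DelimitedFiles

# Read delimited text, map empty cells to missing, narrow the column types.
function read_dlm_df(io; delim=',')
    data, header = readdlm(io, delim; header=true)
    df = DataFrame(ifelse.(data .== "", missing, data), vec(header))
    return identity.(df)   # narrow the Any columns to concrete element types
end

read_dlm_df(IOBuffer("time,ping\n1,25.7\n2,\n"))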
Open questions:
- is it possible to make CSV.jl fast enough to avoid needing two different solutions depending on the size of the problem?
- if CSV.jl cannot be made fast, would it be good to have a package CSVlight.jl that has the same interface as CSV.jl and can serve as a drop-in replacement?
Why is that?
Sorry, that was a typo. I mean, @btime cannot be used because it is explicitly designed not to include compilation overhead, but here I am mainly interested in the compilation and inference time, not in the runtime, which is negligible for small datasets.
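The difference in a nutshell: BenchmarkTools' @btime runs the expression many times after a warm-up and reports the minimum, so first-run compilation is deliberately excluded, while a plain @time around the first call includes it. A small sketch:

using BenchmarkTools, DelimitedFiles

csv() = IOBuffer("time,ping\n1,25.7\n2,31.8\n")

@time readdlm(csv(), ','; header=true)   # first call: compilation included
@time readdlm(csv(), ','; header=true)   # second call: already compiled
@btime readdlm(csv(), ','; header=true)  # steady-state runtime only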
Just played around with this. Essentially the problem is huge functions with type instability.
Breaking up the functions so that unnecessary parts are not compiled is a really easy win. Every if block with a bunch of code in it should just call multiple inner functions. Sections with known types should be separated out from larger functions so that only the method for known types has to be compiled, instead of everything being boxed.
Then we could even precompile a bunch of these for specific types.
Moving the ctx.threaded block out of the conditional to another function shaved 4s off the time for me, from 16s to 12s. Things like that can be done pretty much everywhere in file.jl, and that's where nearly all the time is spent.
https://github.com/JuliaData/CSV.jl/pull/975
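Schematically, the kind of change described above looks like this (hypothetical names, not the actual CSV.jl code - see the PR for the real diff):

# Before: one large function; both branches are compiled together and
# any type instability leaks into the whole body.
function parsefile(ctx)
    if ctx.threaded
        # ... long threaded-path body ...
    else
        # ... long serial-path body ...
    end
end

# After: each branch lives in its own function. Only the branch actually
# taken gets compiled, and the calls act as function barriers behind
# which argument types are concrete again - which also makes the inner
# functions precompilable for specific types.
parsefile(ctx) = ctx.threaded ? parse_threaded(ctx) : parse_serial(ctx)
parse_threaded(ctx) = nothing   # threaded-path body would go here
parse_serial(ctx)   = nothing   # serial-path body would go here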
I'm stuck in the Paris airport a few more hours; let's see how far I can get with this.
Rooting for further delays on your inbound plane!
I hope plane(inport, outport) has compilation issues and a looong TTFX.
Didn't get much further. A problem with doing this is that there is so much noise in the timings that it's hard to be sure of small improvements. We need a @ctime macro that runs the code in a fresh session multiple times and takes an average, although it's going to take a long time to run.
It also seems like reorganising code is only effective for really large blocks.
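Until a @ctime macro exists, fresh-session timing can be approximated by shelling out to new Julia processes and averaging. A rough sketch (the helper name fresh_time is hypothetical):

# Run `code` in n fresh Julia processes and average the wall-clock time,
# so compilation (and startup) is paid and measured on every run.
function fresh_time(code::AbstractString; n::Int = 5)
    total = 0.0
    for _ in 1:n
        total += @elapsed run(`julia --startup-file=no -e $code`)
    end
    return total / n
end

fresh_time("using DelimitedFiles; readdlm(IOBuffer(\"a,b\\n1,2\"), ',')"; n = 3)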
Well, I think laptops also change their speed with CPU temperature more than desktops do… it is easier to do the benchmarking on a desktop.
I was trying something like that with https://github.com/jkrumbiegel/VersionBenchmarks.jl/ to compare code across different versions or commits, with different Julia versions if that's desired. For my test case of improving GridLayoutBase.jl latency, it has still been pretty noisy so far, with more between-trial variation than I would have liked. Maybe one needs to collect 10 runs or more so the average is meaningful.
Interestingly, fixing type stability and reorganising things didn't do that much for the timing. But after doing that, adding some precompile methods had a large effect (where they had none previously) - I'm getting a full TTFX, including using, of 8 seconds, and 7 if we remove the @refargs macros. There are a few more patches of instability preventing further precompilation, but hopefully they are fixable and it can mostly precompile away.
Finally, to round out this saga: nearly all of the time ends up being resolved by precompilation in Parsers.jl, making most of the other changes I made much less effective.
https://github.com/JuliaData/Parsers.jl/pull/108
This is also the case for JSON.jl, and Blink.jl from the other TTFX thread. It's nearly all Parsers.jl.
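For reference, package-level precompilation of this sort just executes representative precompile statements (or calls) at module definition time, so the compiled code lands in the package's precompile cache instead of being paid at first use. A schematic sketch, not the actual Parsers.jl code:

module TinyParsers

# Concrete, type-stable entry points...
parseint(s::AbstractString)   = parse(Int, s)
parsefloat(s::AbstractString) = parse(Float64, s)

# ...plus precompile statements executed while the package precompiles,
# so these methods are compiled once and cached.
precompile(parseint, (String,))
precompile(parsefloat, (String,))

end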
Great job! Hopefully your PR gets merged soon!
Wow, that's really impressive! It turns out that lots of TTFX improvements don't even require any compiler optimizations.