TTFX with DataFrames and CSV

I have a very simple little program:

@time using CSV, DataFrames

input="""
time,ping
1,25.7
2,31.8
"""

io = IOBuffer(input)
@time df = CSV.read(io, DataFrame)
close(io)
df

Output:

julia> include("bench.jl")
  3.062944 seconds (7.52 M allocations: 453.843 MiB, 5.73% gc time, 67.24% compilation time)
 16.101461 seconds (49.23 M allocations: 2.039 GiB, 6.23% gc time, 99.98% compilation time)
2×2 DataFrame
 Row │ time   ping    
     │ Int64  Float64 
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

So it needs about 19s on my machine (Intel® Core™ i7-7700K on Linux).

Is there a way to reduce this time?

3 Likes

You can certainly help by profiling. Maybe you can narrow the problem down to either CSV.jl or DataFrames.jl separately?

How can I create a profile? Or where is that documented?

Still recommended: The Juno.jl Front-End · Juno Documentation
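
For reference, the bare-bones workflow with the Profile standard library looks roughly like this (a minimal sketch with a throwaway CSV string; ProfileView.jl or the IDE front end then renders the collected samples as a flame graph):

using Profile, CSV, DataFrames

# Collect samples while the call of interest runs (the first call includes compilation)
@profile CSV.read(IOBuffer("time,ping\n1,25.7\n2,31.8\n"), DataFrame)

# Plain-text report of where the samples landed
Profile.print()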

Well, I thought @profile is only for profiling runtime performance and not for debugging inference and compilation time. Am I wrong?

@profile can do both (but if you want to measure compile time you need to run it in a fresh session).

1 Like

Adding to Oscar’s answer: sometimes you have to be careful not to include compilation in your profile (so you run your code twice: first to compile and then to measure).
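
A minimal sketch of that run-twice pattern (read_once here is just an illustrative wrapper around the CSV call from the original post):

using Profile, CSV, DataFrames

read_once() = CSV.read(IOBuffer("time,ping\n1,25.7\n2,31.8\n"), DataFrame)

read_once()            # first call: pays the compilation cost
Profile.clear()        # discard any samples collected so far
@profile read_once()   # second call: profile only the runtime work
Profile.print()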

Thanks for your answers. Next issue:

julia> using ProfileView
Gtk-Message: 23:20:15.044: Failed to load module "unity-gtk-module"
Gtk-Message: 23:20:15.061: Failed to load module "canberra-gtk-module"
Gtk-Message: 23:20:15.062: Failed to load module "canberra-gtk-module"

Any idea?

Is this VS Code or Juno? (VS Code profiling still seems to be lagging behind Juno and that is the reason I’m using both IDEs).

You can install those packages via apt-get on Linux. Not sure why they aren't installed automatically.

But I mean more basic profiling: timing using CSV and using DataFrames separately, and in the CSV.read call, separating out CSV.File and DataFrame(...).

These are all things you can do to narrow down the problem without a complicated flame graph like the one from ProfileView.jl.
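
As a sketch, that kind of coarse timing could look like this (the numbers will of course differ per machine):

@time using CSV          # load time of CSV.jl alone
@time using DataFrames   # load time of DataFrames.jl alone

io = IOBuffer("time,ping\n1,25.7\n2,31.8\n")
@time file = CSV.File(io)    # parsing (and schema-specific compilation)
@time df = DataFrame(file)   # conversion of the parsed table to a DataFrame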

1 Like

OK, and you can debug with println (which I do often enough). But IMHO it would be better if adequate tools were available and users were educated to use them.

I managed to create a profile view. The “Failed to load module…” messages are only warnings. I am not so much interested in the time for using CSV, DataFrames, because that is reasonably short and the package authors probably try to keep it low anyway.

New code:

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using ProfileView

@profview read_csv(input)

[Screenshot: ProfileView flame graph (Profile.png)]

In the left block, if I hover, I mainly see:

  • abstractinterpretation.jl
  • typeinfer.jl

In the right section I mainly see:

  • file.jl
  • CSV.jl
  • loading.jl

But this is difficult to interpret for me…

Here is what I see with Atom/Juno using

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using Profile
@profile read_csv(input)
Juno.profiler() 

in a fresh session. Output in console:

3.309279 seconds (7.74 M allocations: 456.073 MiB, 4.35% gc time, 60.60% compilation time)

Graphical output:

The dominating function calls are:

typeinf_ext_toplevel(mi::MethodInstance, world::UInt) = typeinf_ext_toplevel(NativeInterpreter(world), mi)

around 20% of the runtime, and

    df = CSV.read(io, DataFrame)

around 80 % of the runtime.

Edit: @ufechner7 the graph in Atom/Juno is interactive: you can click on a bar and the IDE navigates to the corresponding code.

1 Like

There's a major TTFX discussion going on at the moment; there are plenty of tips and discussion in that thread.

The OP issue seems mostly related to CSV.jl, not to DataFrames. I don’t use the latter and still experience similar times-to-first-CSV-read with CSV.read(file, rowtable) or columntable.
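
For reference, that DataFrames-free variant looks roughly like this (a sketch, assuming Tables.jl is loaded to provide the sink functions):

using CSV, Tables

io = IOBuffer("time,ping\n1,25.7\n2,31.8\n")
@time tbl = CSV.read(io, Tables.columntable)   # NamedTuple of column vectors
# or: CSV.read(io, Tables.rowtable)            # Vector of NamedTuples, one per row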

I ran another test, and I agree with @aplavin: the issue is related to CSV.jl.

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    @time file = CSV.File(io)
    @time df = DataFrame(file)
    close(io)
    df
end

df = read_csv(input)

Output:

julia> @time include("bench2.jl")
  3.071134 seconds (7.53 M allocations: 454.289 MiB, 5.26% gc time, 67.56% compilation time)
 11.473243 seconds (2.32 M allocations: 96.563 MiB, 99.98% compilation time)
  0.106111 seconds (166.72 k allocations: 9.127 MiB, 99.85% compilation time)
 19.250656 seconds (56.72 M allocations: 2.479 GiB, 6.03% gc time, 94.79% compilation time)
2×2 DataFrame
 Row │ time   ping    
     │ Int64  Float64 
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

11.5 s of the 19.25 s total time are spent in the call to CSV.File.

1 Like

Created an issue in CSV.jl: https://github.com/JuliaData/CSV.jl/issues/974

1 Like

This doesn't seem unusual; AFAIK CSV.File has to be compiled for the specific schema of your data (i.e. the specific types of your columns).
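
A sketch to illustrate the schema point (how much the last call actually recompiles depends on CSV.jl internals): a second read with the same column types is fast, while a table with different column types pays part of the specialization cost again.

using CSV, DataFrames

@time CSV.read(IOBuffer("time,ping\n1,25.7\n2,31.8\n"), DataFrame)  # compiles for (Int64, Float64)
@time CSV.read(IOBuffer("time,ping\n3,28.1\n4,30.2\n"), DataFrame)  # same schema: fast
@time CSV.read(IOBuffer("host,ping\na,25.7\nb,31.8\n"), DataFrame)  # (String, Float64): some new specialization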

Well, but why? For small CSV files there should be a code path that avoids compilation of specialized code.

Equivalent Python code for comparison:

from io import StringIO
import pandas as pd

input = StringIO("""time,ping
1,25.7
2,31.8
""")

df = pd.read_csv(input)
print(df)

Output:

time python3 bench.py
   time  ping
0     1  25.7
1     2  31.8

real    0m0.295s
user    0m0.458s
sys     0m0.272s

Python: 0.3s
Julia: 19.3s

Doesn't look good. If Julia could get down to 3 s, I would already be very happy.