TTFX with DataFrames and CSV

ufechner7 · January 29, 2022, 8:05pm

I have a very simple little program:

@time using CSV, DataFrames

input="""
time,ping
1,25.7
2,31.8
"""

io = IOBuffer(input)
@time df = CSV.read(io, DataFrame)
close(io)
df

Output:

julia> include("bench.jl")
  3.062944 seconds (7.52 M allocations: 453.843 MiB, 5.73% gc time, 67.24% compilation time)
 16.101461 seconds (49.23 M allocations: 2.039 GiB, 6.23% gc time, 99.98% compilation time)
2×2 DataFrame
 Row │ time   ping    
     │ Int64  Float64 
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

So it needs about 19s on my machine (Intel® Core™ i7-7700K on Linux).

Is there a way to reduce this time?

pdeffebach · January 29, 2022, 9:41pm

You can certainly help by profiling. Maybe you can narrow the problem down to either CSV.jl or DataFrames.jl separately?

ufechner7 · January 29, 2022, 9:47pm

How can I create a profile? Or where is that documented?

goerch · January 29, 2022, 9:52pm

Still recommended: The Juno.jl Front-End · Juno Documentation

ufechner7 · January 29, 2022, 10:06pm

Well, I thought @profile is only for profiling runtime performance and not for debugging inference and compilation time. Am I wrong?

Oscar_Smith · January 29, 2022, 10:07pm

@profile can do both (but if you want to measure compile time, you need to run it in a fresh session).

goerch · January 29, 2022, 10:20pm

Adding to Oscar’s answer: sometimes you have to be careful not to include compilation in your profile (so you run your code twice: first to compile and then to measure).
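
A minimal sketch of that two-run pattern, using only the Profile standard library. The function work is a hypothetical stand-in for whatever call is being investigated; it is not from this thread. In a fresh session the first profile should contain inference and compilation frames, while the second should mostly show the run time of the already compiled code.

```julia
using Profile

# Hypothetical stand-in workload; replace it with the call you want to inspect.
work(n) = sum(sqrt(i) for i in 1:n)

# First call in a fresh session: samples include type inference and code generation.
Profile.clear()
@profile work(10_000_000)
Profile.print(maxdepth = 12)

# Second call: the method is already compiled, so samples reflect run time only.
Profile.clear()
@profile work(10_000_000)
Profile.print(maxdepth = 12)
```

Calling Profile.clear() between the runs keeps the samples of the first run out of the second report.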

ufechner7 · January 29, 2022, 10:21pm

Thanks for your answers. Next issue:

julia> using ProfileView
Gtk-Message: 23:20:15.044: Failed to load module “unity-gtk-module”
Gtk-Message: 23:20:15.061: Failed to load module “canberra-gtk-module”
Gtk-Message: 23:20:15.062: Failed to load module “canberra-gtk-module”

Any idea?

goerch · January 29, 2022, 10:24pm

Is this VS Code or Juno? (VS Code profiling still seems to be lagging behind Juno and that is the reason I’m using both IDEs).

pdeffebach · January 29, 2022, 10:28pm

You can install those packages via apt-get on linux. Not sure why they aren’t downloaded automatically.

But I mean more basic profiling: timing using CSV and using DataFrames separately and, within the CSV.read call, separating out CSV.File and DataFrame(...).

These are all things you can do to narrow down the problem without a complicated flame graph like ProfileView.jl

goerch · January 29, 2022, 10:33pm

OK, and you can debug with println (which I do often enough). But IMHO it would be better if adequate tools were available and users educated to use them.

ufechner7 · January 29, 2022, 10:37pm

I managed to create a profile view. The “Failed to load module…” messages are only warnings. I am not so much interested in the time for using CSV, DataFrames, because that’s reasonably short and the package authors probably try to keep it low anyway.

New code:

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using ProfileView

@profview read_csv(input)

file:///home/ufechner/Bilder/Profile.png

In the left block if I hover I mainly see:

abstractinterpretation.jl
typeinfer.jl

In the right section I mainly see:

file.jl
CSV.jl
loading.jl

But this is difficult to interpret for me…

goerch · January 29, 2022, 10:57pm

Here is what I see with Atom/Juno using

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using Profile
@profile read_csv(input)
Juno.profiler()

in a fresh session. Output in console:

3.309279 seconds (7.74 M allocations: 456.073 MiB, 4.35% gc time, 60.60% compilation time)

Graphical output:

The dominating function calls are:

typeinf_ext_toplevel(mi::MethodInstance, world::UInt) = typeinf_ext_toplevel(NativeInterpreter(world), mi)

around 20 % of runtime and

    df = CSV.read(io, DataFrame)

around 80 % of the runtime.

Edit: @ufechner7 the graph in Atom/Juno is interactive: you can click at a bar and the IDE navigates to the corresponding code.

lawless-m · January 30, 2022, 10:47am

There’s a major TTFX discussion going on atm, with plenty of tips in it.

aplavin · January 30, 2022, 10:51am

The OP issue seems mostly related to CSV.jl, not to DataFrames. I don’t use the latter and still experience similar times-to-first-CSV-read with CSV.read(file, rowtable) or columntable.

ufechner7 · January 30, 2022, 1:06pm

I ran another test, and I agree with @aplavin: the issue is related to CSV.jl.

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    @time file = CSV.File(io)
    @time df = DataFrame(file)
    close(io)
    df
end

df = read_csv(input)

Output:

julia> @time include("bench2.jl")
  3.071134 seconds (7.53 M allocations: 454.289 MiB, 5.26% gc time, 67.56% compilation time)
 11.473243 seconds (2.32 M allocations: 96.563 MiB, 99.98% compilation time)
  0.106111 seconds (166.72 k allocations: 9.127 MiB, 99.85% compilation time)
 19.250656 seconds (56.72 M allocations: 2.479 GiB, 6.03% gc time, 94.79% compilation time)
2×2 DataFrame
 Row │ time   ping    
     │ Int64  Float64 
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

About 11.5 s of the 19.25 s total is spent in the call to CSV.File.

ufechner7 · January 30, 2022, 1:16pm

Created an issue in CSV.jl: https://github.com/JuliaData/CSV.jl/issues/974

nilshg · January 30, 2022, 5:33pm

This doesn’t seem unusual, AFAIK CSV.File has to be compiled for the specific schema of your data (i.e. the specific types of your columns)
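
To illustrate the general mechanism (this is not CSV.jl’s actual code, only a sketch of how Julia specializes a function per argument type): the hypothetical column_total below is compiled once for a Vector{Int64} and again for a Vector{Float64}, so the first call with each new column type pays a compilation cost and repeated calls do not.

```julia
# Hypothetical stand-in for code that gets specialized on the column type.
function column_total(col)
    acc = zero(eltype(col))
    for x in col
        acc += x
    end
    return acc
end

ints = [1, 2]          # same values as the "time" column
floats = [25.7, 31.8]  # same values as the "ping" column

t_int_first  = @elapsed column_total(ints)    # compiles for Vector{Int64}
t_int_second = @elapsed column_total(ints)    # reuses the compiled code
t_flt_first  = @elapsed column_total(floats)  # new element type: compiles again
t_flt_second = @elapsed column_total(floats)  # reuses the compiled code

println("Int64   first/second call: ", t_int_first, " s / ", t_int_second, " s")
println("Float64 first/second call: ", t_flt_first, " s / ", t_flt_second, " s")
```

For a function this small the compile cost is tiny; the point is only that it is paid once per combination of argument types.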

ufechner7 · January 30, 2022, 9:57pm

Well, but why? For small CSV files there should be a code path that avoids compilation of specialized code.

ufechner7 · January 30, 2022, 10:09pm

Equivalent Python code for comparison:

from io import StringIO
import pandas as pd

input = StringIO("""time,ping
1,25.7
2,31.8
""")

df = pd.read_csv(input)
print(df)

Output:

time python3 bench.py
   time  ping
0     1  25.7
1     2  31.8

real    0m0.295s
user    0m0.458s
sys     0m0.272s

Python: 0.3s
Julia: 19.3s

Doesn’t look good. If Julia reached 3 s, I would already be very happy.
