I have a very simple little program:
@time using CSV, DataFrames
input="""
time,ping
1,25.7
2,31.8
"""
io = IOBuffer(input)
@time df = CSV.read(io, DataFrame)
close(io)
df
Output:
julia> include("bench.jl")
3.062944 seconds (7.52 M allocations: 453.843 MiB, 5.73% gc time, 67.24% compilation time)
16.101461 seconds (49.23 M allocations: 2.039 GiB, 6.23% gc time, 99.98% compilation time)
2×2 DataFrame
 Row │ time   ping
     │ Int64  Float64
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8
So it needs about 19s on my machine (Intel® Core™ i7-7700K on Linux).
Is there a way to reduce this time?
You can certainly help by profiling. Maybe you can narrow the problem down to either CSV.jl or DataFrames.jl separately?
How can I create a profile? Or where is that documented?
goerch
January 29, 2022, 9:52pm
4
Well, I thought @profile is only for profiling runtime performance, not for debugging inference and compilation time. Am I wrong?
@profile can do both (but if you want to measure compile time you need to run it in a fresh session).
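For example, a minimal sketch of profiling with the Profile stdlib in a fresh session, so that the samples include compilation (the tiny inline CSV string here just stands in for your workload; ProfileView.jl can render the same data as a flame graph):
using Profile
using CSV, DataFrames
io = IOBuffer("time,ping\n1,25.7\n2,31.8\n")
@profile CSV.read(io, DataFrame)  # first call: samples include inference and compilation
Profile.print()                   # text report of where the time went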
goerch
January 29, 2022, 10:20pm
7
Adding to Oscar’s answer: sometimes you have to be careful not to include compilation in your profile (so you run your code twice: first to compile and then to measure).
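For example (a minimal sketch; run_once is a hypothetical helper standing in for whatever you want to measure):
using Profile, CSV, DataFrames
run_once() = CSV.read(IOBuffer("time,ping\n1,25.7\n2,31.8\n"), DataFrame)
run_once()                 # first run: triggers compilation
Profile.clear()            # discard any samples collected so far
@profile run_once()        # second run: samples now reflect runtime only
Profile.print()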
Thanks for your answers. Next issue:
julia> using ProfileView
Gtk-Message: 23:20:15.044: Failed to load module “unity-gtk-module”
Gtk-Message: 23:20:15.061: Failed to load module “canberra-gtk-module”
Gtk-Message: 23:20:15.062: Failed to load module “canberra-gtk-module”
Any idea?
goerch
January 29, 2022, 10:24pm
9
Is this VS Code or Juno? (VS Code profiling still seems to be lagging behind Juno and that is the reason I’m using both IDEs).
You can install those packages via apt-get on Linux. Not sure why they aren't installed automatically.
But I mean more basic profiling: timing using CSV and using DataFrames separately, and within the CSV.read call, separating out CSV.File and DataFrame(...). These are all things you can do to narrow down the problem without a complicated flame graph like the one from ProfileView.jl.
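A minimal sketch of the first half of that suggestion, timing the two imports separately (run in a fresh session; the second @time measures the incremental cost on top of the first):
@time using CSV          # load + compile cost of CSV.jl alone
@time using DataFrames   # incremental cost of DataFrames.jl on top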
goerch
January 29, 2022, 10:33pm
11
OK, and you can debug with println (which I do often enough). But IMHO it would be better if adequate tools were available and users were educated to use them.
I managed to create a profile view. The “Failed to load module…” messages are only warnings. I am not so much interested in the time for using CSV, DataFrames, because that’s reasonably short, and the package authors presumably try to keep it low anyway.
New code:
@time using CSV, DataFrames
const input="""
time,ping
1,25.7
2,31.8
"""
function read_csv(in)
io = IOBuffer(in)
df = CSV.read(io, DataFrame)
close(io)
df
end
using ProfileView
@profview read_csv(input)
[flame graph screenshot: Profile.png]
If I hover in the left block, I mainly see:
abstractinterpretation.jl
typeinfer.jl
In the right section I mainly see:
file.jl
CSV.jl
loading.jl
But this is difficult to interpret for me…
goerch
January 29, 2022, 10:57pm
13
Here is what I see with Atom/Juno using
@time using CSV, DataFrames
const input="""
time,ping
1,25.7
2,31.8
"""
function read_csv(in)
io = IOBuffer(in)
df = CSV.read(io, DataFrame)
close(io)
df
end
using Profile
@profile read_csv(input)
Juno.profiler()
in a fresh session. Output in console:
3.309279 seconds (7.74 M allocations: 456.073 MiB, 4.35% gc time, 60.60% compilation time)
Graphical output:
[flame graph screenshot]
The dominating function calls are:
typeinf_ext_toplevel(mi::MethodInstance, world::UInt) = typeinf_ext_toplevel(NativeInterpreter(world), mi)
around 20 % of runtime and
df = CSV.read(io, DataFrame)
around 80 % of the runtime.
Edit: @ufechner7 the graph in Atom/Juno is interactive: you can click on a bar and the IDE navigates to the corresponding code.
There’s a major TTFX discussion going on at the moment, with plenty of tips and discussion:
So much work has been done to improve Julia compilation and remove invalidations. Things really feel a lot snappier to use.
But there is still the problem of packages that were written without much thought to startup time, and TTFX (the time to first whatever a package does most) is sometimes really slow.
Some common problems I see are (reorganised to be in order of importance):
- Lack of type stability slowing down compilation.
- Not calling precompile for common methods, or just running them … (see the sketch below)
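As a rough sketch of that precompile tip (hypothetical here; how much it helps depends on how much of the compiled work Julia can cache), using the read_csv helper defined earlier in this thread:
# Ask Julia to compile read_csv for a String argument before the first real call.
# In a package, a call like this in the module body lets the inference results
# be cached in the package's precompile file.
precompile(read_csv, (String,))   # returns true if compilation succeeded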
aplavin
January 30, 2022, 10:51am
15
The OP issue seems mostly related to CSV.jl, not to DataFrames. I don’t use the latter and still experience similar times-to-first-CSV-read with CSV.read(file, rowtable) or columntable.
I ran another test, and I agree with @aplavin: the issue is related to CSV.jl.
@time using CSV, DataFrames
const input="""
time,ping
1,25.7
2,31.8
"""
function read_csv(in)
io = IOBuffer(in)
@time file = CSV.File(io)
@time df = DataFrame(file)
close(io)
df
end
df = read_csv(input)
Output:
julia> @time include("bench2.jl")
3.071134 seconds (7.53 M allocations: 454.289 MiB, 5.26% gc time, 67.56% compilation time)
11.473243 seconds (2.32 M allocations: 96.563 MiB, 99.98% compilation time)
0.106111 seconds (166.72 k allocations: 9.127 MiB, 99.85% compilation time)
19.250656 seconds (56.72 M allocations: 2.479 GiB, 6.03% gc time, 94.79% compilation time)
2×2 DataFrame
 Row │ time   ping
     │ Int64  Float64
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8
11.5 s of the 19.25 s total time are spent in the call to CSV.File.
nilshg
January 30, 2022, 5:33pm
18
This doesn’t seem unusual; AFAIK CSV.File has to be compiled for the specific schema of your data (i.e. the specific types of your columns).
Well, but why? For small CSV files there should be a code path that avoids compilation of specialized code.
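(As an aside, one lighter-weight path for small, simple files is the DelimitedFiles stdlib, which parses without per-schema specialization, but also without CSV.jl’s features such as per-column type detection, quoting rules, or missing handling. A minimal sketch, reusing the input string from the snippets above:)
using DelimitedFiles, DataFrames
data, header = readdlm(IOBuffer(input), ',', header=true)
# Note: readdlm infers one element type for the whole matrix, so here both
# columns become Float64 (CSV.jl parsed the time column as Int64).
df = DataFrame(data, vec(header))   # building a DataFrame from a Matrix is cheap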
Equivalent Python code for comparison:
from io import StringIO
import pandas as pd
input = StringIO("""time,ping
1,25.7
2,31.8
""")
df = pd.read_csv(input)
print(df)
Output:
time python3 bench.py
time ping
0 1 25.7
1 2 31.8
real 0m0.295s
user 0m0.458s
sys 0m0.272s
Python: 0.3s
Julia: 19.3s
Doesn’t look good. If Julia could reach 3 s, I would already be very happy.
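(One way to get below that target today, as a hedged sketch: bake the packages into a custom system image with PackageCompiler.jl, so their compilation cost is paid once up front rather than at every session start. The sysimage file name is a placeholder; bench.jl is the script from above.)
using PackageCompiler
# Compile CSV and DataFrames, plus whatever bench.jl exercises, into a sysimage.
create_sysimage([:CSV, :DataFrames];
                sysimage_path = "csv_sysimage.so",
                precompile_execution_file = "bench.jl")
# Then start Julia with: julia --sysimage csv_sysimage.so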