Is this VS Code or Juno? (VS Code profiling still seems to be lagging behind Juno and that is the reason I’m using both IDEs).
You can install those packages via apt-get on Linux. Not sure why they aren’t downloaded automatically.
But I mean more basic profiling: timing `using CSV` and `using DataFrames` separately, and, inside the `CSV.read` call, separating out `CSV.File` and `DataFrame(...)`.
These are all things you can do to narrow down the problem without a complicated flame graph like the one ProfileView.jl produces.
OK, and you can debug with println
(which I do often enough). But IMHO it would be better if adequate tools were available and users were educated to use them.
I managed to create a profile view. The “Failed to load module…” messages are only warnings. I am not so much interested in the time for `using CSV, DataFrames`, because that’s reasonably short and the package authors probably try to keep it low anyway.
New code:
```julia
@time using CSV, DataFrames

const input = """
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using ProfileView
@profview read_csv(input)
```
(screenshot: Profile.png, ProfileView flame graph)
In the left block, when I hover, I mainly see:
- abstractinterpretation.jl
- typeinfer.jl
In the right section I mainly see:
- file.jl
- CSV.jl
- loading.jl
But this is difficult for me to interpret…
Here is what I see with Atom/Juno using
```julia
@time using CSV, DataFrames

const input = """
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using Profile
@profile read_csv(input)
Juno.profiler()
```
in a fresh session. Output in console:
```
3.309279 seconds (7.74 M allocations: 456.073 MiB, 4.35% gc time, 60.60% compilation time)
```
Graphical output:
The dominating function calls are:
```julia
typeinf_ext_toplevel(mi::MethodInstance, world::UInt) = typeinf_ext_toplevel(NativeInterpreter(world), mi)
```
at around 20 % of the runtime, and
```julia
df = CSV.read(io, DataFrame)
```
at around 80 % of the runtime.
Edit: @ufechner7 the graph in Atom/Juno is interactive: you can click on a bar and the IDE navigates to the corresponding code.
There’s a major TTFX discussion going on at the moment; there are plenty of tips and pointers in that thread.
The OP issue seems mostly related to CSV.jl, not to DataFrames. I don’t use the latter and still experience similar times-to-first-CSV-read with `CSV.read(file, rowtable)` or `columntable`.
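For completeness, here is a minimal sketch of what that looks like without DataFrames (assuming CSV.jl and Tables.jl are installed; `rowtable`/`columntable` here are the Tables.jl materializers):

```julia
using CSV, Tables

# Any Tables.jl "sink" works as the second argument of CSV.read:
# Tables.rowtable gives a Vector of NamedTuples,
# Tables.columntable a NamedTuple of Vectors.
rows = CSV.read(IOBuffer("time,ping\n1,25.7\n2,31.8\n"), Tables.rowtable)
cols = CSV.read(IOBuffer("time,ping\n1,25.7\n2,31.8\n"), Tables.columntable)
```

Both sinks exercise the same parsing machinery as the DataFrame sink, so they are a reasonable way to isolate CSV.jl's share of the latency.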
I ran another test, and I agree with @aplavin: the issue is related to CSV.jl.
```julia
@time using CSV, DataFrames

const input = """
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    @time file = CSV.File(io)
    @time df = DataFrame(file)
    close(io)
    df
end

df = read_csv(input)
```
Output:
```
julia> @time include("bench2.jl")
  3.071134 seconds (7.53 M allocations: 454.289 MiB, 5.26% gc time, 67.56% compilation time)
 11.473243 seconds (2.32 M allocations: 96.563 MiB, 99.98% compilation time)
  0.106111 seconds (166.72 k allocations: 9.127 MiB, 99.85% compilation time)
 19.250656 seconds (56.72 M allocations: 2.479 GiB, 6.03% gc time, 94.79% compilation time)
2×2 DataFrame
 Row │ time   ping
     │ Int64  Float64
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8
```
11.5 s of the 19.25 s total time are spent in the call to `CSV.File`.
Created an issue in CSV.jl: First call of CSV.File very slow · Issue #974 · JuliaData/CSV.jl · GitHub
This doesn’t seem unusual; AFAIK `CSV.File` has to be compiled for the specific schema of your data (i.e. the specific types of your columns).
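A quick way to check that interpretation (a sketch, assuming CSV.jl is installed): time the same call twice, then change the schema. The second call with an identical schema should be fast; a new set of column types may trigger some additional specialization, though most of the cost is paid on the very first call.

```julia
using CSV

@time CSV.File(IOBuffer("time,ping\n1,25.7\n"))   # first call: pays the bulk of the compilation cost
@time CSV.File(IOBuffer("time,ping\n2,31.8\n"))   # same schema (Int, Float64): fast
@time CSV.File(IOBuffer("name,ok\nfoo,true\n"))   # new schema (String, Bool): may specialize further
```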
Well, but why? For small CSV files there should be a code path that avoids compilation of specialized code.
Equivalent Python code for comparison:
```python
from io import StringIO
import pandas as pd

input = StringIO("""time,ping
1,25.7
2,31.8
""")
df = pd.read_csv(input)
print(df)
```
Output:
```
time python3 bench.py
   time  ping
0     1  25.7
1     2  31.8

real    0m0.295s
user    0m0.458s
sys     0m0.272s
```
Python: 0.3 s
Julia: 19.3 s
That doesn’t look good. If Julia reached 3 s, I would already be very happy.
This is a known issue with CSV.jl, still - it is good to keep track of it explicitly in the issue so thank you for opening it.
Use DelimitedFiles
And how would I parse them into a data frame?
```
julia> data, header = readdlm(IOBuffer(input), ',', header=true)
([1.0 25.7; 2.0 31.8], AbstractString["time" "ping"])

julia> DataFrame(data, vec(header))
2×2 DataFrame
 Row │ time     ping
     │ Float64  Float64
─────┼──────────────────
   1 │     1.0     25.7
   2 │     2.0     31.8
```
and something I will blog about next week is:
```
julia> input = """
       time,ping,name
       1,25.7,a
       2,,b
       """
"time,ping,name\n1,25.7,a\n2,,b\n"

julia> data, header = readdlm(IOBuffer(input), ',', header=true)
(Any[1 25.7 "a"; 2 "" "b"], AbstractString["time" "ping" "name"])

julia> identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
2×3 DataFrame
 Row │ time   ping       name
     │ Int64  Float64?   SubStrin…
─────┼─────────────────────────────
   1 │     1      25.7   a
   2 │     2   missing   b
```
(I have just realized how far we can go with plain DelimitedFiles and one line of code.)
For the OP’s input example, `@btime` of your code on a fresh first run shows `readdlm()` taking ~0.12 s and `DataFrame()` taking ~0.25 s.
Well, @btime cannot be used to determine the time-to-first-dataframe.
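To make the distinction concrete, here is a small sketch (stdlib DelimitedFiles only; the variable name `csv_text` is just for this example): the first `@time` of a call includes JIT compilation, which is exactly the time-to-first-dataframe cost, while a second `@time` (or `@btime`, which runs the expression many times after warm-up) only shows the steady-state runtime.

```julia
using DelimitedFiles

csv_text = "time,ping\n1,25.7\n2,31.8\n"   # same toy data as above

# First call: @time includes compiling readdlm for these argument types.
@time readdlm(IOBuffer(csv_text), ',', header=true)

# Second call: already compiled, so this is roughly what @btime would report.
@time readdlm(IOBuffer(csv_text), ',', header=true)
```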
But the solution suggested by @rafael.guerra seems to solve my initial problem. Code:
```julia
@time using DataFrames, DelimitedFiles

const input = """
time,ping
1,25.7
2,31.8
"""

function read_csv(inp)
    @time data, header = readdlm(IOBuffer(inp), ',', header=true)
    # @time df = DataFrame(data, vec(header))
    @time df = identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
    @time df[!, :time] = convert.(Int64, df[:, :time])
    df
end

df = read_csv(input)
```
df = read_csv(input)
Output:
```
julia> @time include("bench5.jl")
  0.837795 seconds (1.92 M allocations: 132.477 MiB, 4.66% gc time, 0.51% compilation time)
  0.114726 seconds (90.08 k allocations: 4.882 MiB, 99.82% compilation time)
  0.594612 seconds (1.70 M allocations: 93.770 MiB, 9.75% gc time, 99.64% compilation time)
  0.080426 seconds (201.72 k allocations: 11.051 MiB, 99.58% compilation time)
  2.109000 seconds (5.50 M allocations: 327.066 MiB, 8.52% gc time, 60.14% compilation time)
2×2 DataFrame
 Row │ time   ping
     │ Int64  Float64
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8
```
Summary: time-to-first-dataframe
- Python (Pandas): 0.3 s
- DelimitedFiles: 2.1 s
- CSV.jl: 19.0 s
DelimitedFiles does not support two features by default:
a. detecting different column types
b. detecting missing values
The code above handles both correctly for this toy example.
Open questions:
- is it possible to make CSV.jl faster, to avoid the need for two different solutions depending on the size of the problem?
- if CSV.jl cannot be made fast, would it be good to have a package CSVlight.jl that has the same interface as CSV.jl and can serve as a drop-in replacement?
Why is that?