TTFX with DataFrames and CSV

Is this VS Code or Juno? (VS Code profiling still seems to lag behind Juno, which is why I’m using both IDEs.)

You can install those packages via apt-get on Linux. Not sure why they aren’t downloaded automatically.

But I mean more basic profiling: timing `using CSV` and `using DataFrames` separately, and in the CSV.read call, separating out CSV.File and DataFrame(...).

These are all things you can do to narrow down the problem without a complicated flame graph like the one from ProfileView.jl.
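That breakdown might look like the sketch below (the toy input is mine; it assumes CSV.jl and DataFrames.jl are installed):

```julia
# Time each package load and each stage of CSV.read separately.
@time using CSV        # load time of CSV.jl on its own
@time using DataFrames # load time of DataFrames.jl on its own

io = IOBuffer("time,ping\n1,25.7\n2,31.8\n")
@time file = CSV.File(io)   # parsing stage of CSV.read
@time df = DataFrame(file)  # materialization stage
```

Whichever of the four timings dominates tells you where to dig further.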


OK, and you can debug with println (which I do often enough). But IMHO it would be better if adequate tools were available and users were educated to use them.

I managed to create a profile view. The “Failed to load module…” messages are only warnings. I am not so much interested in the time for `using CSV, DataFrames`, because that’s reasonably short, and the package authors probably try to keep it low anyway.

New code:

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using ProfileView

@profview read_csv(input)

[Screenshot: Profile.png — ProfileView flame graph]

In the left block, if I hover, I mainly see:

  • abstractinterpretation.jl
  • typeinfer.jl

In the right section I mainly see:

  • file.jl
  • CSV.jl
  • loading.jl

But this is difficult to interpret for me…

Here is what I see with Atom/Juno using

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    df = CSV.read(io, DataFrame)
    close(io)
    df
end

using Profile
@profile read_csv(input)
Juno.profiler() 

in a fresh session. Output in console:

3.309279 seconds (7.74 M allocations: 456.073 MiB, 4.35% gc time, 60.60% compilation time)

Graphical output:

The dominating function calls are:

typeinf_ext_toplevel(mi::MethodInstance, world::UInt) = typeinf_ext_toplevel(NativeInterpreter(world), mi)

around 20 % of runtime and

    df = CSV.read(io, DataFrame)

around 80 % of the runtime.

Edit: @ufechner7 the graph in Atom/Juno is interactive: you can click on a bar and the IDE navigates to the corresponding code.


There’s a major TTFX discussion going on at the moment, with plenty of tips and discussion there.

The OP’s issue seems mostly related to CSV.jl, not to DataFrames. I don’t use the latter and still see similar times-to-first-CSV-read with `CSV.read(file, rowtable)` or `columntable`.
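For readers unfamiliar with those sinks: `rowtable` and `columntable` come from Tables.jl, and `CSV.read` accepts any such sink function in place of `DataFrame`. A minimal sketch (toy input is mine):

```julia
using CSV, Tables

io = IOBuffer("time,ping\n1,25.7\n2,31.8\n")
# Materialize as a NamedTuple of column vectors instead of a DataFrame;
# Tables.rowtable would give a vector of row NamedTuples instead.
tbl = CSV.read(io, Tables.columntable)
```

This reproduces the same first-call latency without DataFrames in the picture at all.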

I ran another test, and I agree with @aplavin: the issue is related to CSV.jl.

@time using CSV, DataFrames

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(in)
    io = IOBuffer(in)
    @time file = CSV.File(io)
    @time df = DataFrame(file)
    close(io)
    df
end

df = read_csv(input)

Output:

julia> @time include("bench2.jl")
  3.071134 seconds (7.53 M allocations: 454.289 MiB, 5.26% gc time, 67.56% compilation time)
 11.473243 seconds (2.32 M allocations: 96.563 MiB, 99.98% compilation time)
  0.106111 seconds (166.72 k allocations: 9.127 MiB, 99.85% compilation time)
 19.250656 seconds (56.72 M allocations: 2.479 GiB, 6.03% gc time, 94.79% compilation time)
2×2 DataFrame
 Row │ time   ping    
     │ Int64  Float64 
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

11.5 s of the 19.25 s total are spent in the first call to CSV.File.


Created an issue in CSV.jl: First call of CSV.File very slow · Issue #974 · JuliaData/CSV.jl · GitHub


This doesn’t seem unusual; AFAIK CSV.File has to be compiled for the specific schema of your data (i.e. the specific types of your columns).

Well, but why? For small CSV files there should be a code path that avoids compiling specialized code.
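A partial workaround: CSV.jl’s documented `types` keyword lets you supply the schema up front, which skips type detection; whether it also avoids the schema-specific compilation is a separate question. A sketch (toy input is mine):

```julia
using CSV, DataFrames

io = IOBuffer("time,ping\n1,25.7\n2,31.8\n")
# Declare the column types explicitly so CSV.jl does not have to infer them;
# the specialized parsing code may still compile on first use, though.
df = CSV.read(io, DataFrame; types=[Int64, Float64])
```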

Equivalent Python code for comparison:

from io import StringIO
import pandas as pd

input = StringIO("""time,ping
1,25.7
2,31.8
""")

df = pd.read_csv(input)
print(df)

Output:

time python3 bench.py
   time  ping
0     1  25.7
1     2  31.8

real    0m0.295s
user    0m0.458s
sys     0m0.272s

Python: 0.3s
Julia: 19.3s

Doesn’t look good. If Julia reached 3 s, I would already be very happy.

This is a known issue with CSV.jl; still, it is good to track it explicitly in an issue, so thank you for opening it.


Use DelimitedFiles

And how would I parse them into a data frame?

julia> data, header = readdlm(IOBuffer(input), ',', header=true)
([1.0 25.7; 2.0 31.8], AbstractString["time" "ping"])

julia> DataFrame(data, vec(header))
2×2 DataFrame
 Row │ time     ping
     │ Float64  Float64
─────┼──────────────────
   1 │     1.0     25.7
   2 │     2.0     31.8

And something I will write a blog post about next week is:

julia> input="""
       time,ping,name
       1,25.7,a
       2,,b
       """
"time,ping,name\n1,25.7,a\n2,,b\n"

julia> data, header = readdlm(IOBuffer(input), ',', header=true)
(Any[1 25.7 "a"; 2 "" "b"], AbstractString["time" "ping" "name"])

julia> identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
2×3 DataFrame
 Row │ time   ping       name
     │ Int64  Float64?   SubStrin…
─────┼─────────────────────────────
   1 │     1       25.7  a
   2 │     2  missing    b

(I have just realized how far we can go with plain DelimitedFiles and one line of code)


For OP’s input example, @btime of your code on a fresh first run shows readdlm() taking ~0.12 s and DataFrame() taking ~0.25 s.

Well, @btime cannot be used to determine the time-to-first-dataframe: it runs the code repeatedly and reports the minimum, which excludes the one-time compilation cost we are trying to measure.
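The difference is easy to see with a toy function (nothing CSV-specific; the function is mine):

```julia
# @time on the first call includes compilation; the second call reuses the
# compiled code. @btime (from BenchmarkTools) runs many samples and reports
# the minimum, so it hides exactly this one-time cost.
f(x) = sum(abs2, x)
@time f(rand(10))   # first call: dominated by compilation
@time f(rand(10))   # second call: orders of magnitude faster
```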

But the solution suggested by @rafael.guerra seems to solve my initial problem. Code:

@time using DataFrames, DelimitedFiles

const input="""
time,ping
1,25.7
2,31.8
"""

function read_csv(inp)
    @time data, header = readdlm(IOBuffer(inp), ',', header=true)
    # @time df = DataFrame(data, vec(header))
    @time df = identity.(DataFrame(ifelse.(data .== "", missing, data), vec(header)))
    @time df[!,:time] = convert.(Int64,df[:,:time])
    df
end

df = read_csv(input)

Output:

julia> @time include("bench5.jl")

  0.837795 seconds (1.92 M allocations: 132.477 MiB, 4.66% gc time, 0.51% compilation time)
  0.114726 seconds (90.08 k allocations: 4.882 MiB, 99.82% compilation time)
  0.594612 seconds (1.70 M allocations: 93.770 MiB, 9.75% gc time, 99.64% compilation time)
  0.080426 seconds (201.72 k allocations: 11.051 MiB, 99.58% compilation time)
  2.109000 seconds (5.50 M allocations: 327.066 MiB, 8.52% gc time, 60.14% compilation time)
2×2 DataFrame
 Row │ time   ping    
     │ Int64  Float64 
─────┼────────────────
   1 │     1     25.7
   2 │     2     31.8

Summary:

Time-to-first-dataframe

Python (Pandas): 0.3s
DelimitedFiles:  2.1s
CSV:            19.0s

DelimitedFiles does not support two features by default:
a. detecting different column types
b. detecting missing values
The code above handles this correctly for this toy example.
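The pieces above could be bundled into one small helper (the name `read_csv_light` is mine, not from any package; it assumes DelimitedFiles and DataFrames are available):

```julia
using DelimitedFiles, DataFrames

# Read delimited text into a DataFrame: empty cells become `missing`,
# and broadcasting `identity` narrows the Any columns to concrete types.
function read_csv_light(io::IO; delim=',')
    data, header = readdlm(io, delim, header=true)
    df = DataFrame(ifelse.(data .== "", missing, data), vec(header))
    identity.(df)
end

df = read_csv_light(IOBuffer("time,ping\n1,25.7\n2,31.8\n"))
```

Note that, as in the readdlm example above, numeric columns all come back as Float64; converting e.g. the time column to Int64 still needs an explicit `convert` step.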

Open questions:

  • is it possible to make CSV faster, to avoid the need for two different solutions depending on the size of the problem?
  • if CSV.jl cannot be made fast, would it be good to have a package CSVlight.jl with the same interface as CSV.jl that can serve as a drop-in replacement?

Why is that?