Logging vs pushing to dataframe

I am currently tracking progress during an iterative process in two different domains - numerical solutions of nonlinear equations, and machine learning. In order to track the progress, I have been simply pushing to a DataFrame in the iterative loop.

I have however just come across JuliaML/ValueHistories.jl and JuliaLogging/TensorBoardLogger.jl. It looks like such a logging approach is specifically made for what I am doing, which makes me feel like I should look more into it.

However - plotting the DataFrame after training is real simple. There is a noteworthy difference in overhead from tracking progress:

julia> @btime begin
           mvh = MVHistory()
           for i in 1:100
               push!(mvh, :squared, i, i^2)
               push!(mvh, :cubed,   i, i^3)
  11.200 ΞΌs (193 allocations: 10.92 KiB)

julia> @btime begin
               df = DataFrame(i=[], squared=[], cubed=[])
               for i in 1:100
                   push!(df, [i i^2 i^3])
  44.500 ΞΌs (309 allocations: 17.98 KiB)

julia> @btime begin
               df = DataFrame(i=Int64[], squared=Int64[], cubed=Int64[])
               for i in 1:100
                   push!(df, [i i^2 i^3])
  34.500 ΞΌs (309 allocations: 18.08 KiB)

, but this time scale is so small that it is not a problem for my application. Setting up TensorBoardLogger.jl, or extracting data from the types defined in ValueHistories.jl, seems like a bit of extra work. So before I do something like that, I was wondering:

What are the potential benefits of logging, as opposed to pushing to a DataFrame?

My perspective (as DataFrames.jl maintainer): as you noted pushing to a data frame is expected to be a bit slower than using a specialized package, at the benefit of having a rich and flexible set of functions you can use to process this data frame.

Alternatively you can construct your object with logs and only at the end convert it to a data frame.

TensorBoard is also a gui tool for analysing those logs, so it has a somewhat different scope than barebones logging to a common datastructure or to the console.

Performance is unlikely a major focus or advantage of those TensorBoard of ValueHistory loggers. For example, plain julia tables are much faster:

julia> @btime begin
                      tbl = []
                      for i in 1:100
                          push!(tbl, (;i, squared=i^2, cubed=i^3))
  1.887 ΞΌs (104 allocations: 5.03 KiB)

Of course, these don’t directly integrate with any GUI solution like tensorboard, but are convenient to postprocess and analyse from within julia.

For which I personally like to use ProgressMeter.jl and UnicodePlots.jl, which together provide a very nice interface for these kind of things while staying in the REPL

1 Like

I think the benefits of logging for this kind of use are mostly in whatever help the logging backend can give you (in other words, I agree pushing to a DataFrame or calling a logging macro aren’t very different and one isn’t really better than the other). [IMO the other big benefit of the logging system is to decouple things so that library code can decide what it wants to log out, and application code that uses those libraries can decide how to handle those logs, by setting a global logger and then running the library code. But it doesn’t sound like that’s as relevant here].

For example, if you log so many values that your DataFrame won’t fit into memory, then having a backend that logs to a file or uploads logs somewhere might be useful. Or if the logging backend can help organize logs by date/run/parameter choices and compare them. I’ve recently started using Wandb at work for that, and it has a logger interface through Wandb.jl.

If you wanted, you could also write a logger that’s backed by a DataFrame :slight_smile:. Check out the Julia logging org docs for more on writing/using loggers: https://julialogging.github.io/.

1 Like

So timing ended up being relevant when I realized I was using 10’s of percent of the time pushing to a dataframe. So I made the following timer comparison, using @kristoffer.carlsson 's excellent TimerOutputs.jl:

using TimerOutputs
using DataFrames
using ValueHistories

to = TimerOutput()

    df = DataFrame(i = Vector{Int64}(undef, 0), isq = Vector{Int64}(undef, 0))
    named_tuple = (v1 = Vector{Int64}(undef, 0), v2 = Vector{Int64}(undef, 0))
    mat = Matrix{Int64}(undef, 0, 2)
    vhist = History(Int64)
    qvhist = QHistory(Int64)
    mvhist = MVHistory()
    iobuffer1 = IOBuffer()
    iobuffer2 = IOBuffer()
    dict = Dict(:v1=>Vector{Int64}(undef, 0), :v2=>Vector{Int64}(undef, 0))
    v1 = Vector{Int64}(undef, 0)
    v2 = Vector{Int64}(undef, 0)

    for i in 1:2:10000
        isq = i^2
        @timeit to "DataFrame"     push!(df, [i isq])
        @timeit to "NamedTuple"    (push!(named_tuple.v1, i); push!(named_tuple.v2, isq))
        @timeit to "Matrix"        vcat(mat, [i isq])
        @timeit to "MValueHistory" push!(mvhist, :isq, i, isq)
        @timeit to "QValueHistory" push!(qvhist, i, isq)
        @timeit to "ValueHistory"  push!(vhist, i, isq)
        @timeit to "IObuffers"     (write(iobuffer1, i); write(iobuffer2, isq))
        @timeit to "Dictonary"     (push!(dict[:v1], i); push!(dict[:v2], isq))
        @timeit to "Vectors"       (push!(v1, i); push!(v2, isq)) 
to |> println

On my machine, this produces the following:

                                Time                    Allocations      
                         ───────────────────────   ────────────────────────  
    Tot / % measured:         222ms /   6.1%           22.5MiB /  28.6%      

 Section         ncalls     time    %tot     avg     alloc    %tot      avg  
 DataFrame        5.00k   3.17ms   23.2%   634ns   1.42MiB   22.0%     297B  
 NamedTuple       5.00k   1.71ms   12.6%   343ns    865KiB   13.1%     177B  
 Matrix           5.00k   1.59ms   11.7%   319ns    938KiB   14.3%     192B  
 IObuffers        5.00k   1.45ms   10.7%   291ns    592KiB    9.0%     121B  
 MValueHistory    5.00k   1.37ms   10.1%   275ns    704KiB   10.7%     144B  
 Dictonary        5.00k   1.29ms    9.5%   259ns    552KiB    8.4%     113B  
 QValueHistory    5.00k   1.04ms    7.6%   208ns    373KiB    5.7%    76.4B  
 ValueHistory     5.00k   1.01ms    7.4%   203ns    552KiB    8.4%     113B  
 Vectors          5.00k    985ΞΌs    7.2%   197ns    552KiB    8.4%     113B  

The three fastest alternatives, QHistory, History and Vectors alternate which is fastest between runs.