Logging vs pushing to dataframe

I am currently tracking progress during an iterative process in two different domains: numerical solution of nonlinear equations, and machine learning. To track the progress, I have simply been pushing to a DataFrame inside the iterative loop.

However, I have just come across JuliaML/ValueHistories.jl and JuliaLogging/TensorBoardLogger.jl. Such a logging approach looks like it was made specifically for what I am doing, which makes me feel I should look into it more.

However, plotting the DataFrame after training is really simple. There is a noteworthy difference in the overhead of tracking progress:

julia> @btime begin
           mvh = MVHistory()
           for i in 1:100
               push!(mvh, :squared, i, i^2)
               push!(mvh, :cubed,   i, i^3)
           end
       end
  11.200 μs (193 allocations: 10.92 KiB)

julia> @btime begin
               df = DataFrame(i=[], squared=[], cubed=[])
               for i in 1:100
                   push!(df, [i i^2 i^3])
               end
           end
  44.500 μs (309 allocations: 17.98 KiB)

julia> @btime begin
               df = DataFrame(i=Int64[], squared=Int64[], cubed=Int64[])
               for i in 1:100
                   push!(df, [i i^2 i^3])
               end
           end
  34.500 μs (309 allocations: 18.08 KiB)

But this time scale is so small that it is not a problem for my application. Setting up TensorBoardLogger.jl, or extracting data from the types defined in ValueHistories.jl, seems like a bit of extra work. So before doing something like that, I was wondering:

What are the potential benefits of logging, as opposed to pushing to a DataFrame?

My perspective (as DataFrames.jl maintainer): as you noted pushing to a data frame is expected to be a bit slower than using a specialized package, at the benefit of having a rich and flexible set of functions you can use to process this data frame.

Alternatively you can construct your object with logs and only at the end convert it to a data frame.
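A minimal sketch of that idea (not from the thread): push rows to a plain vector of named tuples during the loop, then call the `DataFrame` constructor once at the end, so the per-iteration cost stays low while you still get a data frame to work with afterwards.

```julia
using DataFrames

# Accumulate rows cheaply in a concretely typed vector of named tuples.
rows = NamedTuple{(:i, :squared, :cubed), Tuple{Int, Int, Int}}[]
for i in 1:100
    push!(rows, (i = i, squared = i^2, cubed = i^3))
end

# Convert once, after the loop; DataFrames accepts a vector of rows.
df = DataFrame(rows)
```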

TensorBoard is also a GUI tool for analysing those logs, so it has a somewhat different scope than barebones logging to a common data structure or to the console.

Performance is unlikely to be a major focus or advantage of the TensorBoard or ValueHistories loggers. For example, plain Julia tables are much faster:

julia> @btime begin
                      tbl = []
                      for i in 1:100
                          push!(tbl, (;i, squared=i^2, cubed=i^3))
                      end
                  end
  1.887 μs (104 allocations: 5.03 KiB)

Of course, these don’t directly integrate with any GUI solution like TensorBoard, but they are convenient to postprocess and analyse from within Julia.

For that I personally like to use ProgressMeter.jl and UnicodePlots.jl, which together provide a very nice interface for this kind of thing while staying in the REPL.
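A rough sketch of that workflow (the loop body and names are made up, not from the thread): track values in a plain vector, show live progress with ProgressMeter's `next!`, then plot the series in the terminal with UnicodePlots' `lineplot` after the loop.

```julia
using ProgressMeter, UnicodePlots

losses = Float64[]
p = Progress(100)                      # progress bar over 100 iterations
for i in 1:100
    loss = 1 / i                       # stand-in for a real objective value
    push!(losses, loss)
    next!(p; showvalues = [(:iter, i), (:loss, loss)])
end

# Quick REPL plot of the tracked series, no GUI needed.
println(lineplot(1:100, losses; title = "loss", xlabel = "iteration"))
```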


I think the benefits of logging for this kind of use are mostly in whatever help the logging backend can give you (in other words, I agree pushing to a DataFrame or calling a logging macro aren’t very different and one isn’t really better than the other). [IMO the other big benefit of the logging system is to decouple things so that library code can decide what it wants to log out, and application code that uses those libraries can decide how to handle those logs, by setting a global logger and then running the library code. But it doesn’t sound like that’s as relevant here].
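To make the decoupling point concrete, here is a minimal sketch using only the Logging standard library (the function name is hypothetical): the "library" code just emits log records with `@info`, and the "application" code chooses the backend by installing a logger around the call.

```julia
using Logging

# Hypothetical "library" code: it only emits records, with no opinion
# about where they end up.
function train_step(i)
    loss = 1 / i
    @info "progress" iter = i loss = loss
    return loss
end

# "Application" code decides how records are handled, here by routing
# them to a plain console logger for the duration of the block.
with_logger(ConsoleLogger(stderr, Logging.Info)) do
    for i in 1:3
        train_step(i)
    end
end
```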

For example, if you log so many values that your DataFrame won’t fit into memory, then having a backend that logs to a file or uploads logs somewhere might be useful. Or if the logging backend can help organize logs by date/run/parameter choices and compare them. I’ve recently started using Wandb at work for that, and it has a logger interface through Wandb.jl.

If you wanted, you could also write a logger that’s backed by a DataFrame 🙂. Check out the Julia logging org docs for more on writing/using loggers: https://julialogging.github.io/.
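For illustration, a minimal sketch of such a DataFrame-backed logger (the struct name and row layout are made up): it implements the four methods of the `AbstractLogger` interface and turns every keyword passed to a logging macro into a row.

```julia
using DataFrames, Logging

# Hypothetical logger that records every logged key/value pair as a row.
struct DataFrameLogger <: AbstractLogger
    df::DataFrame
end
DataFrameLogger() = DataFrameLogger(DataFrame(key = Symbol[], value = Any[]))

Logging.min_enabled_level(::DataFrameLogger) = Logging.Info
Logging.shouldlog(::DataFrameLogger, args...) = true
Logging.catch_exceptions(::DataFrameLogger) = false

function Logging.handle_message(l::DataFrameLogger, level, message,
                                _module, group, id, file, line; kwargs...)
    for (k, v) in kwargs
        push!(l.df, (key = k, value = v))
    end
end

logger = DataFrameLogger()
with_logger(logger) do
    for i in 1:100
        @info "progress" squared = i^2 cubed = i^3
    end
end
```

Afterwards `logger.df` holds all logged values and can be processed with the usual DataFrames tooling.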


So timing ended up being relevant after all, when I realized I was spending tens of percent of the runtime pushing to a DataFrame. So I made the following comparison, using @kristoffer.carlsson 's excellent TimerOutputs.jl:

using TimerOutputs
using DataFrames
using ValueHistories

to = TimerOutput()

begin
    
    df = DataFrame(i = Vector{Int64}(undef, 0), isq = Vector{Int64}(undef, 0))
    named_tuple = (v1 = Vector{Int64}(undef, 0), v2 = Vector{Int64}(undef, 0))
    mat = Matrix{Int64}(undef, 0, 2)
    vhist = History(Int64)
    qvhist = QHistory(Int64)
    mvhist = MVHistory()
    iobuffer1 = IOBuffer()
    iobuffer2 = IOBuffer()
    dict = Dict(:v1=>Vector{Int64}(undef, 0), :v2=>Vector{Int64}(undef, 0))
    v1 = Vector{Int64}(undef, 0)
    v2 = Vector{Int64}(undef, 0)

    for i in 1:2:10000
        isq = i^2
        @timeit to "DataFrame"     push!(df, [i isq])
        @timeit to "NamedTuple"    (push!(named_tuple.v1, i); push!(named_tuple.v2, isq))
        @timeit to "Matrix"        vcat(mat, [i isq]) # note: result discarded, mat stays 0×2
        @timeit to "MValueHistory" push!(mvhist, :isq, i, isq)
        @timeit to "QValueHistory" push!(qvhist, i, isq)
        @timeit to "ValueHistory"  push!(vhist, i, isq)
        @timeit to "IObuffers"     (write(iobuffer1, i); write(iobuffer2, isq))
        @timeit to "Dictionary"    (push!(dict[:v1], i); push!(dict[:v2], isq))
        @timeit to "Vectors"       (push!(v1, i); push!(v2, isq)) 
    end
end
to |> println

On my machine, this produces the following:

                                Time                    Allocations      
                         ───────────────────────   ────────────────────────  
    Tot / % measured:         222ms /   6.1%           22.5MiB /  28.6%      

 Section         ncalls     time    %tot     avg     alloc    %tot      avg  
 ──────────────────────────────────────────────────────────────────────────  
 DataFrame        5.00k   3.17ms   23.2%   634ns   1.42MiB   22.0%     297B  
 NamedTuple       5.00k   1.71ms   12.6%   343ns    865KiB   13.1%     177B  
 Matrix           5.00k   1.59ms   11.7%   319ns    938KiB   14.3%     192B  
 IObuffers        5.00k   1.45ms   10.7%   291ns    592KiB    9.0%     121B  
 MValueHistory    5.00k   1.37ms   10.1%   275ns    704KiB   10.7%     144B  
 Dictionary       5.00k   1.29ms    9.5%   259ns    552KiB    8.4%     113B  
 QValueHistory    5.00k   1.04ms    7.6%   208ns    373KiB    5.7%    76.4B  
 ValueHistory     5.00k   1.01ms    7.4%   203ns    552KiB    8.4%     113B  
 Vectors          5.00k    985μs    7.2%   197ns    552KiB    8.4%     113B  
 ──────────────────────────────────────────────────────────────────────────

The three fastest alternatives (QHistory, History, and plain Vectors) alternate in which is fastest between runs.
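On the earlier point about extracting data from ValueHistories types being "a bit of extra work", here is a small sketch of what that looks like, assuming all series share the same iteration indices: `get` on an `MVHistory` returns the iterations and values for a key, which can then feed a `DataFrame` constructor.

```julia
using ValueHistories, DataFrames

mvh = MVHistory()
for i in 1:100
    push!(mvh, :squared, i, i^2)
    push!(mvh, :cubed,   i, i^3)
end

# get returns (iterations, values) for a given key.
iters, squared = get(mvh, :squared)
_, cubed       = get(mvh, :cubed)

df = DataFrame(i = iters, squared = squared, cubed = cubed)
```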