Performance: Fast way to access numbers in Dataframes or alternatives

pjuergens · November 7, 2022, 2:28pm

Hello together,

I’m searching for a fast (or the fastest) way to access datapoints/numbers in a dataframes-like structure. I have 57 timeseries (columns) with 43800 datapoints (rows) each that are read in from a csv-file to a dataframe (DataFrames.jl). Currently I’m accessing single datapoints via row-index and column-label, i.e. e.g. GLD[i, "hour"] where GLD is the dataframe and i some index. However, this seems to be very time consuming.

So my question is: what is the fastest way to deal with this kind of data? Is there a faster way to access the data in the dataframe? Should I convert it to a (named) array or tuple or is there any other more performant datastructure?

gbaraldi · November 7, 2022, 2:34pm

Since you want to iterate through them all you might want a container that’s a bit faster than the usual dataframe from DataFrames.jl. Maybe GitHub - JuliaData/TypedTables.jl: Simple, fast, column-based storage for data analysis in Julia is a bit faster?

pdeffebach · November 7, 2022, 2:39pm

Accessing the data should not be time-consuming at all. TypedTables shouldn’t be an improvement over DataFrames for data access alone.

It’s likely the real problem is that you are not using a function barrier. For best performance, write functions that act on the vectors directly, not the whole data frame.

function foo(a, b)
    a[100] + b[101]
end
foo(df.a, df.b)

DataFramesMeta.jl’s @with macro can help with this.

@with df begin 
    :a[100] + :b[101]
end

bkamins · November 7, 2022, 3:19pm

Per comments below - the answer is “it depends” on your code. In some cases the way to go it so call Tables.columntable and convert it to a NamedTuple, but often, as @pdeffebach commented using a function barrier or DataFramesMeta.jl will be enough.

pjuergens · November 8, 2022, 7:45am

Thank you, @pdeffebach , the function barrier solved the issue - decreasing computation time and allocations massively.

Can you roughly explain why?

Sukera · November 8, 2022, 8:57am

Julia is built around functions, which are compiled to fast, native code. While you can do everything in global scope, it won’t be optimal as global variables need to have their types, size etc checked on every access, which takes time.

As for why allocations change, it’s hard to say without knowing the code you’re running.

aplavin · November 8, 2022, 9:41am

Is it natural in your case to have a function that operates on whole objects/rows (it often is)? Such as, func(obj) = obj.value / (obj.day + obj.hour/24), and then apply this function to rows of your table.

If that’s the case, then arrays are an obvious and builtin datastructure: tbl = [(day=1, hour=2, value=123.456), ...]. Thanks to Tables.jl, this is a fully capable table in Julia, and can be converted from/to other table types at will. Access to a single row is cheap and done by tbl[i]; apply the function to all rows is also performant, just do map(func, tbl).

For even better performance, you may want columnar storage, such as StructArrays or the already mentioned TypedTables. They are both tables again! And the interface to access a single row or apply the function to all rows stays the same, tbl[i] and map(func, tbl).

pjuergens · November 9, 2022, 8:55am

Unfortunately, the function barrier reduced the problem, but it seems to me like getting data from the dataframe is still the bottleneck in my code. I tried to put it in a minimal working example below, reproducing the structure of my code. So basically in the function foo I just need a slice of some columns of the dataframe that I need to access by name and operations are just done on single numbers.

The profiler shows that the line calling foo is critical, mainly due to call of getindex and string. @time results in 0.058843 seconds (875.52 k allocations: 44.086 MiB, 30.16% gc time), so also a lot of allocations and garbage collection is happening here.

Some background: I’m translating some pascal-code to julia. In pascal the dataframe is just a matrix accessed by indizes. The pascal-code takes 5 seconds to run, my julia-translation currently takes 30 seconds, so there should be some performance problem But maybe I’m also missing something fundamental, as I am new to julia.

using DataFrames

len_df = 5 * 8760
df = DataFrame(
    "column 1" => rand(len_df),
    "column 2" => rand(len_df),
    "other column 1" => rand(len_df),
    "other column 2" => rand(len_df),
    "a" => rand(len_df),
    "b" => rand(len_df),
    "c" => rand(len_df),
    "d" => rand(len_df),
    "e" => rand(len_df),
)

function foo(start_index::Integer, a::Array{Float64}, b::Array{Float64})
    sum = 0.0
    for i in start_index:start_index+24
        sum += a[i] * b[i]
    end
    return sum
end

function bar(df, len_df)
    for i in 1:(len_df-24)
        for j in 1:2
            foo(i, df[!, "column $j"], df[!, "other column $j"])
        end
    end
end

@time bar(df, len_df)
@profview bar(df, len_df)

kristoffer.carlsson · November 9, 2022, 9:29am

The performance docs have a section about this: Performance Tips · The Julia Language

lawless-m · November 9, 2022, 9:55am

even just pre-generating your column names makes a significant difference

julia> function bar(df, len_df)
    for i in 1:(len_df-24)
        for j in 1:2
            foo(i, df[!, "column $j"], df[!, "other column $j"])
        end
    end
end

julia> @time bar(df, len_df)
  0.082328 seconds (963.66 k allocations: 48.902 MiB, 5.15% gc time, 42.16% compilation time)

julia> @time bar(df, len_df)
  0.047580 seconds (875.52 k allocations: 44.086 MiB, 7.91% gc time)

vs

julia> function bar(df, len_df)
           for i in 1:(len_df-24)
               foo(i, df[!, "column 1"], df[!, "other column 1"])
               foo(i, df[!, "column 2"], df[!, "other column 2"])
           end
       end
bar (generic function with 1 method)

julia> @time bar(df, len_df)
  0.023997 seconds (18.70 k allocations: 1.009 MiB, 61.50% compilation time)

julia> @time bar(df, len_df)
  0.011519 seconds

pjuergens · November 9, 2022, 11:20am

wow, that’s indeed significant. However, I only see the difference in my minimal example and cannot reproduce it in my actual code. Maybe there is overall another problem in the code that I just don’t see at the moment…

The profiler points to the functions getindex from dataframe.jl:525, indexed_iterate from tuple.jl:89 and the return statement of my function - which is foo in my example (with Flags: GC). Even if I delete the content of the function just returning a fixed number, it doesn’t change anything - that’s why I thought the problem lies in passing the input-arguments to the function.

Thanks already for all your help!

nilshg · November 9, 2022, 11:37am

It might well be that your MWE is too trivial, but aren’t you just doing

using RollingFunctions
rolling(sum, df."column 1" .* df."other column 1", 24)

pjuergens · November 15, 2022, 9:35am

Thanks to everybody! I further increased performance the following way

initializing arrays for the columns of the dataframe outside of the for-loops and using them when calling the function instead of accessing the dataframe there
I had some issues with the struct that stored the return-value of my function. Passing the type as parameter to the struct as explained in the performance tips and fixing the dimensions of the array had a huge impact on performance and avoids dynamic dispatch

and @nilshg you are very right, my MWE was too trivial