I’m searching for a fast (ideally the fastest) way to access single data points in a DataFrame-like structure. I have 57 time series (columns) with 43,800 data points (rows) each, read from a CSV file into a DataFrame (DataFrames.jl). Currently I access single data points by row index and column label, e.g. GLD[i, "hour"], where GLD is the DataFrame and i is some index. However, this seems to be very time-consuming.
So my question is: what is the fastest way to work with this kind of data? Is there a faster way to access the data in the DataFrame? Should I convert it to a (named) array or tuple, or is there some other, more performant data structure?
Accessing the data should not be time-consuming at all. TypedTables shouldn’t be an improvement over DataFrames for data access alone.
It’s likely the real problem is that you are not using a function barrier. For best performance, write functions that act on the column vectors directly, not on the whole data frame:
function foo(a, b)
    a[100] + b[101]
end
foo(df.a, df.b)  # pass the column vectors, not the whole data frame
DataFramesMeta.jl’s @with macro can help with this.
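For illustration, a minimal sketch of the same access using @with (assuming the df with columns a and b from above); inside the block, :a and :b refer to the columns, and the macro passes them through a function barrier for you:

using DataFramesMeta

result = @with df begin
    :a[100] + :b[101]  # the columns arrive as concretely typed vectors
end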
Per the comments below, the answer is “it depends” on your code. In some cases the way to go is to call Tables.columntable to convert the data frame to a NamedTuple of column vectors, but often, as @pdeffebach commented, a function barrier or DataFramesMeta.jl will be enough.
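For example, a sketch reusing the df and foo from above: Tables.columntable returns a NamedTuple of the column vectors, whose type records every column’s element type, so code receiving it specializes fully:

using Tables

nt = Tables.columntable(df)  # NamedTuple of the column vectors
foo(nt.a, nt.b)              # same access, now type-stable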
Julia is built around functions, which are compiled to fast, native code. While you can do everything in global scope, it won’t be optimal, as global variables need to have their types, sizes, etc. checked on every access, which takes time.
As for why allocations change, it’s hard to say without knowing the code you’re running.
Is it natural in your case to have a function that operates on whole objects/rows (it often is)? For example, func(obj) = obj.value / (obj.day + obj.hour/24), which you then apply to the rows of your table.
If that’s the case, then arrays are an obvious and built-in data structure: tbl = [(day=1, hour=2, value=123.456), ...]. Thanks to Tables.jl, this is a fully capable table in Julia and can be converted from/to other table types at will. Access to a single row is cheap: tbl[i]; applying the function to all rows is also performant: just do map(func, tbl).
For even better performance, you may want columnar storage, such as StructArrays or the already mentioned TypedTables. They are both tables again! And the interface to access a single row or apply the function to all rows stays the same, tbl[i] and map(func, tbl).
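A sketch of both layouts (the day/hour/value columns and sizes are placeholders, not the OP’s data):

using StructArrays

func(obj) = obj.value / (obj.day + obj.hour/24)

# Row-oriented: a plain Vector of NamedTuples is already a Tables.jl table.
tbl = [(day=rand(1:365), hour=rand(0:23), value=rand()) for _ in 1:43800]
tbl[100]        # cheap single-row access
map(func, tbl)  # apply the function to all rows

# Column-oriented storage with the same row interface:
sa = StructArray(tbl)
sa[100]
map(func, sa)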
Unfortunately, while the function barrier reduced the problem, it seems to me that getting data out of the DataFrame is still the bottleneck in my code. I tried to condense it into the minimal working example below, which reproduces the structure of my code. Basically, in the function foo I just need a slice of some columns of the DataFrame, which I have to access by name; the operations themselves are done on single numbers.
The profiler shows that the line calling foo is critical, mainly due to calls to getindex and string. @time reports 0.058843 seconds (875.52 k allocations: 44.086 MiB, 30.16% gc time), so a lot of allocation and garbage collection is happening here as well.
Some background: I’m translating some Pascal code to Julia. In Pascal the data is just a matrix accessed by indices. The Pascal code takes 5 seconds to run, while my Julia translation currently takes 30 seconds, so there must be some performance problem. But maybe I’m also missing something fundamental, as I am new to Julia.
using DataFrames

len_df = 5 * 8760
df = DataFrame(
    "column 1" => rand(len_df),
    "column 2" => rand(len_df),
    "other column 1" => rand(len_df),
    "other column 2" => rand(len_df),
    "a" => rand(len_df),
    "b" => rand(len_df),
    "c" => rand(len_df),
    "d" => rand(len_df),
    "e" => rand(len_df),
)

function foo(start_index::Integer, a::Vector{Float64}, b::Vector{Float64})
    s = 0.0
    for i in start_index:start_index+24
        s += a[i] * b[i]
    end
    return s
end

function bar(df, len_df)
    for i in 1:(len_df-24)
        for j in 1:2
            foo(i, df[!, "column $j"], df[!, "other column $j"])
        end
    end
end

@time bar(df, len_df)
@profview bar(df, len_df)
Even just pre-generating your column names (here, writing them out instead of interpolating "column $j" on every iteration) makes a significant difference:
julia> function bar(df, len_df)
           for i in 1:(len_df-24)
               for j in 1:2
                   foo(i, df[!, "column $j"], df[!, "other column $j"])
               end
           end
       end

julia> @time bar(df, len_df)
  0.082328 seconds (963.66 k allocations: 48.902 MiB, 5.15% gc time, 42.16% compilation time)

julia> @time bar(df, len_df)
  0.047580 seconds (875.52 k allocations: 44.086 MiB, 7.91% gc time)
vs
julia> function bar(df, len_df)
           for i in 1:(len_df-24)
               foo(i, df[!, "column 1"], df[!, "other column 1"])
               foo(i, df[!, "column 2"], df[!, "other column 2"])
           end
       end
bar (generic function with 1 method)

julia> @time bar(df, len_df)
  0.023997 seconds (18.70 k allocations: 1.009 MiB, 61.50% compilation time)

julia> @time bar(df, len_df)
  0.011519 seconds
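Taking this one step further, the column lookups can be hoisted out of the loop entirely, so the hot loop never touches the DataFrame; a sketch (bar2 and bar_inner are hypothetical variants of the MWE’s bar):

function bar_inner(cols, len_df)
    for i in 1:(len_df-24)
        for (a, b) in cols
            foo(i, a, b)
        end
    end
end

function bar2(df, len_df)
    # df[!, name] returns the stored vector without copying
    cols = [(df[!, "column $j"], df[!, "other column $j"]) for j in 1:2]
    bar_inner(cols, len_df)  # function barrier: cols is concretely typed here
end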
Wow, that’s indeed significant. However, I only see the difference in my minimal example and cannot reproduce it in my actual code. Maybe there is another problem in the code overall that I just don’t see at the moment…
The profiler points to the functions getindex from dataframe.jl:525, indexed_iterate from tuple.jl:89, and the return statement of my function (foo in my example, with flags: GC). Even if I delete the body of the function so that it just returns a fixed number, nothing changes; that’s why I thought the problem lies in passing the input arguments to the function.
Thanks to everybody! I further increased performance in the following ways:
Initializing arrays for the columns of the DataFrame outside of the for loops and passing those to the function instead of accessing the DataFrame inside the loops.
I had some issues with the struct that stored the return value of my function. Parameterizing the struct on its field types, as explained in the performance tips, and fixing the dimensionality of its array field had a huge impact on performance and avoids dynamic dispatch (see the sketch at the end of this post).
And @nilshg, you are very right: my MWE was too trivial.
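For reference, a hypothetical sketch of the struct fix described above (SlowResult and FastResult are made-up names, not the actual code):

# Abstract field types force dynamic dispatch on every field access:
struct SlowResult
    total::Real              # abstract type
    series::Array{Float64}   # dimensionality not fixed
end

# Parameterized, concrete fields with fixed array dimensionality:
struct FastResult{T<:Real}
    total::T
    series::Vector{T}
end

FastResult(1.5, [0.1, 0.2])  # inferred as FastResult{Float64}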