I want to read a CSV file and do some path-dependent calculations involving multiple columns (i.e., vectorization is not an option in the actual problem).
I was surprised to see that, in a simple example, iterating over a DataFrame introduces over 100x overhead. Is there a faster way to iterate over rows? Or, alternatively, a way to parse a CSV directly into a NamedTuple?
julia> using DataFrames, BenchmarkTools
julia> d = [(a=rand(),b=rand()) for _ in 1:10^6];
julia> df = DataFrame(d);
julia> function f(xs)
           s = 0.0
           for x in xs
               s += x.a * x.b
           end
           s
       end
f (generic function with 1 method)
julia> function g(xs)
           s = 0.0
           for x in eachrow(xs)
               s += x.a * x.b
           end
           s
       end
g (generic function with 1 method)
julia> @btime f($d)
577.269 μs (0 allocations: 0 bytes)
249855.20496448214
julia> @btime g($df)
105.782 ms (6998979 allocations: 122.05 MiB)
249855.20496448386
EDIT: The reason for the slowdown is that iteration over the rows of a DataFrame is type-unstable. I’m not a big fan of Julia’s DataFrame API anyway, so I’ll just stick with NamedTuples.
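On the second question (parsing a CSV straight into NamedTuples), here is a minimal sketch, assuming CSV.jl and Tables.jl are installed. The in-memory `csv_text` is made-up data standing in for a real file, and `Tables.rowtable` is one way to materialize the parsed rows as a `Vector` of `NamedTuple`s; it is not necessarily the fastest route for very wide files.

```julia
using CSV, Tables

# Illustrative in-memory CSV standing in for a real file on disk.
csv_text = """
a,b
0.5,2.0
1.5,4.0
"""

# Parse once, then materialize the rows as a Vector of NamedTuples.
rows = Tables.rowtable(CSV.File(IOBuffer(csv_text)))

# Same loop as `f` in the post above; the element type of `rows` is concrete,
# so the field accesses are type-stable.
function f(xs)
    s = 0.0
    for x in xs
        s += x.a * x.b
    end
    s
end

total = f(rows)
```

With a real file, `CSV.File("path/to/file.csv")` would take the place of the `IOBuffer` call.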
OK. If your use case is time-critical, you may want to work with several vectors instead of a DataFrame. Encoding the character data as Ints or similar might also speed things up.
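A minimal sketch of the "several vectors" approach, using a small made-up DataFrame: the columns are pulled out once and passed to a helper (here the hypothetical `h`), so the loop compiles against concrete vector types rather than the DataFrame itself.

```julia
using DataFrames

# Hypothetical helper: loops over two plain vectors in lockstep.
function h(a::AbstractVector, b::AbstractVector)
    s = 0.0
    for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end

df = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# `df.a` and `df.b` return the stored column vectors without copying.
total = h(df.a, df.b)
```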
For reference, you don’t even need to use IndexedTables. You can just pass a type-stable iterator to a function. In the OP, call g(Tables.columntable(df)) instead of g(df) to pass g a named tuple of vectors, and then replace eachrow(xs) with Tables.rows(xs).
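Put together, that suggestion might look like the sketch below, which assumes a small made-up DataFrame in place of the one from the OP. `Tables.columntable` converts the DataFrame to a named tuple of vectors, whose type carries the column names and element types, so the loop inside `g` can be compiled with that information.

```julia
using DataFrames, Tables

# `g` from the OP with eachrow(xs) replaced by Tables.rows(xs).
function g(xs)
    s = 0.0
    for x in Tables.rows(xs)
        s += x.a * x.b
    end
    s
end

df = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Pass a named tuple of vectors instead of the DataFrame itself.
total = g(Tables.columntable(df))
```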
(IndexedTables is great if you have few columns, but with hundreds of columns of different types it’s going to stress the compiler.)
If you want performance, you cannot just use for row in Tables.namedtupleiterator(df), right? You would still need to pass the rows iterator as a function argument:
function g(rows)
    s = 0.0
    for row in rows
        s += row.a * row.b
    end
    s
end
# compile once
g(eachrow(df))
# faster, but recompiles for each new dataframe schema (column names and types)
g(Tables.namedtupleiterator(df))
To get around the long compilation times, can we subset the DataFrame to fetch only the columns we want, i.e., Tables.namedtupleiterator(df[!, [:a, :b]])?
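A minimal sketch of that idea, assuming a made-up three-column DataFrame: only the two columns used in the loop are selected before building the iterator, so the compiled method depends on the names and types of those two columns only. Whether this shortens compilation enough in practice will depend on the actual data.

```julia
using DataFrames, Tables

function g(rows)
    s = 0.0
    for row in rows
        s += row.a * row.b
    end
    s
end

# Column `c` stands in for the many columns the loop never touches.
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0], c = ["x", "y"])

# `df[!, [:a, :b]]` selects the two columns without copying their data.
total = g(Tables.namedtupleiterator(df[!, [:a, :b]]))
```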
Hi Rob @robsmith11 and all on this thread, might you have an insight or a new approach here?
I took up the solution of using IndexedTables.jl from Fast Iterations over rows of a Dataframe, and then tried to use IndexedTables for multidimensional 2D and 3D scatter plots (using IndexedTables to graph N-Dimensional data). I ran into issues trying to collect the iterable in for iter in eachindex(keys(tab_t1.index.columns)), because tab_t1.index.columns returns ERROR: LoadError: type IndexedTable has no field index, as you’ll see when you run the linked Julia pseudocode.
Any new insights or approaches to getting the keys from tab_t1.index.columns, or another way to automatically graph general multidimensional 2D and 3D (and perhaps 4D, like www.wolframalpha.com?) scatter plots, would be appreciated.
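One possible way to get column names without reaching into struct fields is the generic Tables.jl interface. The sketch below uses a plain named tuple of vectors (`tbl`, made-up data) as a stand-in for `tab_t1`; it assumes the installed IndexedTables version implements the Tables.jl interface, which has not been checked against the code in question.

```julia
using Tables

# Stand-in for `tab_t1`: any Tables.jl-compatible source should work here.
tbl = (x = [1.0, 2.0], y = [3.0, 4.0], z = [5.0, 6.0])

cols = Tables.columns(tbl)
colnames = Tables.columnnames(cols)

# Collect each column by name, e.g. to hand the vectors to a plotting call.
data = [Tables.getcolumn(cols, name) for name in colnames]
```

The number of columns, `length(colnames)`, could then be used to choose between a 2D and a 3D scatter call.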