I want to read a CSV file and do some path-dependent calculations involving multiple columns (i.e., the actual problem cannot be vectorized).
I was surprised to see that, in a simple example, iterating over the rows of a DataFrame is over 100x slower than iterating over a vector of NamedTuples. Is there a faster way to iterate over rows? Or alternatively, a way to parse a CSV directly into NamedTuples?
julia> using DataFrames, BenchmarkTools
julia> d = [(a=rand(),b=rand()) for _ in 1:10^6];
julia> df = DataFrame(d);
julia> function f(xs)
           s = 0.0
           for x in xs
               s += x.a * x.b
           end
           s
       end
f (generic function with 1 method)
julia> function g(xs)
           s = 0.0
           for x in eachrow(xs)
               s += x.a * x.b
           end
           s
       end
g (generic function with 1 method)
julia> @btime f($d)
577.269 μs (0 allocations: 0 bytes)
julia> @btime g($df)
105.782 ms (6998979 allocations: 122.05 MiB)
For reference, you don’t even need to use IndexedTables. You can just pass a type-stable iterator to a function. In the OP, call g(Tables.columntable(df)) instead of g(df) to pass g a named tuple of vectors, and replace eachrow(xs) with Tables.rows(xs) inside g.
(IndexedTables is great if you have few columns, but with hundreds of columns of different types it’s going to stress the compiler.)
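A minimal sketch of that suggestion, on a tiny hand-made table instead of the 10^6-row one from the OP. The name g_rows and the sample values are made up for illustration; Tables.columntable and Tables.rows are the calls mentioned above.

```julia
using DataFrames, Tables

# Same reduction as g in the OP, but iterating Tables.rows instead of eachrow.
function g_rows(xs)
    s = 0.0
    for x in Tables.rows(xs)
        s += x.a * x.b
    end
    return s
end

# Small illustrative table (hypothetical values, not the benchmark data).
df = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Convert once to a NamedTuple of column vectors, so the column types are
# visible to the compiler when g_rows is specialized on the argument.
cols = Tables.columntable(df)

total = g_rows(cols)
```

The conversion happens outside the hot loop, so the function that does the row-wise work only ever sees the concretely typed named tuple. To compare against the numbers in the OP, something like @btime g_rows($cols) on the original df should do.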
Hi Rob @robsmith11 and everyone on this thread, might you have an insight or a new approach here?
I took up the solution of using IndexedTables.jl for fast iteration over the rows of a DataFrame, here. I then tried to use IndexedTables to graph N-dimensional data as 2D and 3D scatter plots, but ran into issues trying to collect the iterable in
for iter in eachindex(keys(tab_t1.index.columns))
because tab_t1.index.columns fails with
ERROR: LoadError: type IndexedTable has no field index
as you’ll see when you run the Julia pseudocode listed here.
Any new insights or approaches to getting the keys from tab_t1.index.columns, or another way to automatically graph general multidimensional 2D, 3D (and perhaps 4D, like www.wolframalpha.com) scatter plots, would be appreciated.
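One possible direction, sketched with a plain named tuple of vectors standing in for tab_t1: ask for the column names through the generic Tables.jl interface rather than reaching into internal fields. This assumes the IndexedTable in question supports the Tables.jl interface, which is not verified here; the stand-in table and its values are hypothetical.

```julia
using Tables

# Stand-in for the table: a NamedTuple of column vectors (hypothetical data).
tab = (x = [1.0, 2.0], y = [3.0, 4.0], z = [5.0, 6.0])

cols = Tables.columns(tab)              # column-access view of the table
col_names = Tables.columnnames(cols)    # column names as Symbols

# Collect each column by name, e.g. to hand them to a plotting call.
data = [Tables.getcolumn(cols, n) for n in col_names]
```

With the names and columns in hand, picking two or three of them for a 2D or 3D scatter plot is a matter of indexing into data.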