Fast iteration over rows of a DataFrame

robsmith11 · May 26, 2019, 1:26am

I want to read a CSV file and do some path-dependent calculations involving multiple columns (i.e., no vectorization allowed in the actual problem.)

I was surprised to see that in a simple example, DataFrames introduce over 100x overhead. Is there a faster way to iterate over rows? Or alternatively, a way to parse a CSV directly into a NamedTuple?

julia> d = [(a=rand(),b=rand()) for _ in 1:10^6];

julia> df = DataFrame(d);

julia> function f(xs)
        s = 0.0;
        for x in xs
         s += x.a * x.b
        end
        s
       end
f (generic function with 1 method)

julia> function g(xs)
        s = 0.0
        for x in eachrow(xs)
         s += x.a * x.b
        end
        s
       end
g (generic function with 1 method)

julia> @btime f($d)
  577.269 μs (0 allocations: 0 bytes)
249855.20496448214

julia> @btime g($df)
  105.782 ms (6998979 allocations: 122.05 MiB)
249855.20496448386

Elrod · May 26, 2019, 3:26am

If all the entries can be promoted to the same type, you may get better performance with readdlm, which returns a matrix.

bernhard · May 26, 2019, 4:00pm

The slowdown is indeed surprising.
It seems you should just use a Matrix instead of a DataFrame (if your use case allows it).

The function below is about 38% faster than yours, but still far slower than the matrix.

function u(xs)
    s = 0.0
    @inbounds for i in 1:size(xs, 1)
        s += xs[i, :a] * xs[i, :b]
    end
    s
end

robsmith11 · May 26, 2019, 4:34pm

My actual use case does involve multiple types of columns.

I’ve opened an issue on the DataFrames.jl GitHub repository to see if they have any ideas:
https://github.com/JuliaData/DataFrames.jl/issues/1827

EDIT: The reason is that iteration over the rows of a DataFrame is type-unstable, hence the slowdown. I’m not a big fan of Julia’s DataFrame API anyway, so I’ll just stick with NamedTuples.
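
A minimal sketch of what "type-unstable" means here, assuming DataFrames.jl is installed. The helper `geta` is hypothetical and only serves to ask the compiler what type it infers for `r.a`; `Base.return_types` is an introspection utility rather than a stable public API, so treat the printed result as indicative only. Because a DataFrame does not carry its column types in its own type, the field access on a DataFrameRow should not be inferred as a concrete type, whereas for a NamedTuple it is.

```julia
using DataFrames

# Hypothetical helper, used only to inspect what the compiler infers for `r.a`.
geta(r) = r.a

df = DataFrame(a = rand(3), b = rand(3))
df_row = first(eachrow(df))   # a DataFrameRow
nt_row = (a = 1.0, b = 2.0)   # a NamedTuple with concrete field types

# Inferred return type of `r.a` for each kind of row.
inferred_df = only(Base.return_types(geta, (typeof(df_row),)))
inferred_nt = only(Base.return_types(geta, (typeof(nt_row),)))

println("DataFrameRow: ", inferred_df)
println("NamedTuple:   ", inferred_nt)
```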

bernhard · May 26, 2019, 7:19pm

OK. If your use case is time-critical, you may want to work with several vectors instead of a DataFrame. Encoding the character data as Ints or similar might also speed up the process.
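
A rough sketch of the "several vectors" idea, assuming DataFrames.jl is installed. The function name `h` is made up for illustration. The columns are pulled out of the DataFrame once and handed to a function, so the loop is compiled for the concrete vector types.

```julia
using DataFrames

# Hypothetical function: loops over two plain vectors instead of DataFrame rows.
function h(a::AbstractVector, b::AbstractVector)
    s = 0.0
    for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

df = DataFrame(a = rand(10^4), b = rand(10^4))

# Extract the columns once; the function call acts as a barrier.
total = h(df.a, df.b)
```

The loop body can still be path-dependent (for example, carry state from one iteration to the next); only the column access has moved out of the loop.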

robsmith11 · December 22, 2019, 5:23am

For anyone else googling this, the solution is to use IndexedTables.jl.

Unlike DataFrames.jl, it is type-stable when iterating over rows, so the performance is just as fast as working with raw vectors.

I’m surprised IndexedTables.jl isn’t more popular. It seems to have most of the same features without the performance pitfalls.

bashonubuntu · December 22, 2019, 6:00am

Also look at https://github.com/piever/JuliaDBMeta.jl

nalimilan · December 26, 2019, 9:43pm

For reference, you don’t even need to use IndexedTables. You can just pass a type-stable iterator to a function. In the OP, call g(Tables.columntable(df)) instead of g(df) to pass g a named tuple of vectors, and then replace eachrow(xs) with Tables.rows(xs).

(IndexedTables is great if you have few columns, but with hundreds of columns of different types it’s going to stress the compiler.)
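
Put together, the change described above might look like the sketch below. It assumes both DataFrames.jl and Tables.jl are installed in the active environment; the function is named `g` only to mirror the original post.

```julia
using DataFrames, Tables

# Same computation as `g` in the original post, but iterating with
# Tables.rows over a named tuple of vectors instead of eachrow(df).
function g(xs)
    s = 0.0
    for x in Tables.rows(xs)
        s += x.a * x.b
    end
    return s
end

df = DataFrame(a = rand(10^4), b = rand(10^4))

# Tables.columntable converts the DataFrame to a NamedTuple of vectors,
# whose type records the column names and element types.
cols = Tables.columntable(df)
total = g(cols)
```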

msekino · February 16, 2020, 2:01pm

Thank you for your useful information!
Finally, I could iterate over rows using:

tbl = Tables.rowtable(df)
for row in Tables.rows(tbl)

bkamins · February 16, 2020, 2:57pm

Actually, the recommended way once DataFrames 0.21 is released (or on current #master) is:

for row in Tables.namedtupleiterator(df)

if you need performance but you are willing to pay the cost of compilation.

If your computation is small and you want to avoid compilation cost (which for very wide tables can be significant) use what you have indicated above:

for row in eachrow(df)

matthieu · February 16, 2020, 3:23pm

If you want performance, you cannot just use for row in Tables.namedtupleiterator(df), right? You would still need to pass the rows iterator as a function argument:

function g(rows)
   s = 0.0
   for row in rows
      s += row.a * row.b
   end
   s
 end
# compile once
g(eachrow(df))
# faster but recompiles for each dataframe
g(Tables.namedtupleiterator(df))

bkamins · February 16, 2020, 4:20pm

Yes - I was too brief. Thank you for correcting. You need a barrier function as you have indicated.

In this specific case one could also write the following to get a barrier:

mapreduce(row -> row.a * row.b, +, Tables.namedtupleiterator(df), init=0.0)

aaowens · February 16, 2020, 4:37pm

To get around the long compilation times, can we subset the DataFrame to fetch only the columns we want, i.e., Tables.namedtupleiterator(df[!, [:a, :b]])?

bkamins · February 16, 2020, 5:05pm

Sure - but I did not want to complicate the code with another change. Actually the fastest way would probably be just:

Tables.rows((a=df.a, b=df.b))
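
As a sketch of how this combines with the barrier function discussed earlier (assuming DataFrames.jl and Tables.jl are installed; `sumprod` and the extra column `c` are made up for illustration): only the two needed columns are wrapped, so the compiled code does not depend on the other columns of the table.

```julia
using DataFrames, Tables

# Hypothetical barrier function: compiled for the concrete type of `rows`.
function sumprod(rows)
    s = 0.0
    for row in rows
        s += row.a * row.b
    end
    return s
end

# A table with an extra column that the computation does not need.
df = DataFrame(a = rand(1000), b = rand(1000), c = string.(1:1000))

# Wrap only the needed columns in a named tuple of vectors.
rows = Tables.rows((a = df.a, b = df.b))
total = sumprod(rows)
```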

Marc.Cox · June 30, 2020, 3:20pm

Hi Rob (@robsmith11) and everyone on this thread, might you have an insight or a new approach here?

I took up the solution of using IndexedTables.jl for fast iteration over the rows of a DataFrame. I then tried to use IndexedTables to plot N-dimensional data as 2D and 3D scatter plots, but ran into issues trying to collect the iterable in

for iter in eachindex(keys(tab_t1.index.columns))

because tab_t1.index.columns fails with

ERROR: LoadError: type IndexedTable has no field index

as you’ll see when you run the Julia pseudocode listed here.

Any new insights or approaches for getting the keys from tab_t1.index.columns, or another way to automatically draw general multidimensional 2D and 3D (and perhaps 4D, like www.wolframalpha.com) scatter plots, would be appreciated.

TY
-Marc
