Fast iteration over rows of a DataFrame

I want to read a CSV file and do some path-dependent calculations involving multiple columns (i.e., no vectorization allowed in the actual problem.)

I was surprised to see that in a simple example, DataFrames introduce over 100x overhead. Is there a faster way to iterate over rows? Or alternatively, a way to parse a CSV directly into a NamedTuple?

julia> d = [(a=rand(),b=rand()) for _ in 1:10^6];

julia> df = DataFrame(d);

julia> function f(xs)
        s = 0.0;
        for x in xs
         s += x.a * x.b
        end
        s
       end
f (generic function with 1 method)

julia> function g(xs)
        s = 0.0
        for x in eachrow(xs)
         s += x.a * x.b
        end
        s
       end
g (generic function with 1 method)

julia> @btime f($d)
  577.269 μs (0 allocations: 0 bytes)
249855.20496448214

julia> @btime g($df)
  105.782 ms (6998979 allocations: 122.05 MiB)
249855.20496448386
4 Likes

If all the entries can be promoted to the same type, you may get better performance with readdlm, which returns a matrix.

1 Like

the slow down is indeed surprising.
Seems like you should just use a Matrix instead of a DataFrame (if your use case allows this)

the function below is about 38% faster than yours, but still way slower than the matrix

function u(xs)
        s = 0.0;
        @inbounds for i=1:size(xs,1)
            s+=xs[i,:a]*xs[i,:b]
        end
        s
       end

My actual use case does involve multiple types of columns.

I’ve opened an issue on the DataFrames.jl github to see if they have any ideas:

EDIT: The reason is that iteration over the rows of a DataFrame is type-unstable, hence the slow down. I’m not a big fan of Julia’s DataFrame API anyway, so I’ll just stick with NamedTuples.

1 Like

Ok. If your use case is time critical, you may want to work with several vectors instead of a dataframe. Also encoding the character data in Ints or similar might speed up the process

For anyone else googling this, the solution is to use IndexedTables.jl.

Unlike DataFrames.jl, it is type-stable when iterating over rows, so the performance is just as fast as working with raw vectors.

I’m surprised IndexedTables.jl isn’t more popular. It seems to have most of the same features without the performance pitfalls.

5 Likes

Also look at https://github.com/piever/JuliaDBMeta.jl

2 Likes

For reference, you don’t even need to use IndexedTables. You can just pass a type-stable iterators to a function. In the OP, call g(Tables.columntable(df)) instead of g(df) to pass g a named tuple of vectors, and then replace eachrow(xs) with Tables.rows(xs).

(IndexedTables is great if you have few columns, but with hundreds of columns of different types it’s going to stress the compiler.)

7 Likes

Thank you for your useful information!
Finally, I could iterate over rows using:

  • tbl = Tables.rowtable(df)
  • for row in Tables.rows(tbl)

Actually the recommended way after DataFrames 0.21 will be released (or on current #master) is:

for row in Tables.namedtupleiterator(df)

if you need performance but you are willing to pay the cost of compilation.

If your computation is small and you want to avoid compilation cost (which for very wide tables can be significant) use what you have indicated above:

for row in eachrow(df)
9 Likes

If you want performance, you cannot just use for row in Tables.namedtupleiterator(df), right? You would still need to pass the rows iterator as a function argument:

function g(rows)
   s = 0.0
   for row in rows
      s += row.a * row.b
   end
   s
 end
# compile once
g(eachrow(df))
# faster but recompiles for each dataframe
g(Tables.namedtupleiterator(df))
4 Likes

Yes - I was too brief. Thank you for correcting. You need a barrier function as you have indicated.

In this specific case one could also write the following to get a barrier:

mapreduce(row -> row.a+row.b, +, Tables.namedtupleiterator(df), init=0.0)

To get around the long compilation times, can we subset the DataFrame to only fetch the columns we want?, ie, Tables.namedtupleiterator(df[!, [:a, :b]])

1 Like

Sure - but I did not want to complicate the code with another change. Actually the fastest way would probably be just:

Tables.rows((a=df.a, b=df.b))
4 Likes

Hi Rob @robsmith11 and all on this thread - might you have an insight or new approach here ?

So I took up the solution of using IndexedTables.jl for Fast Iterations over rows of a Dataframe
here. And then I proceeded to attempt to use IndexedTables for multidimensional 2D,3D (scatter) Plots
using IndexedTables to graph N-Dimensional data; but ran into issues trying to
collect the iterable for iter in eachindex(keys(tab_t1.index.columns))
because tab_t1.index.columns returns
ERROR: LoadError: type IndexedTable has no field index
as you’ll see when you run the Julia pseudocode listed here

Any new insights or approaches to getting the keys from tab_t1.index.columns ;
-or- another way to automatically graph general multidimensional 2D,3D,(? and 4D like www.wolframalpha.com ?) scatter Plots is appreciated.

TY
-Marc