Fast iteration over rows of a DataFrame

I want to read a CSV file and do some path-dependent calculations involving multiple columns (i.e., no vectorization is possible in the actual problem).

I was surprised to see that in a simple example, DataFrames introduce over 100x overhead. Is there a faster way to iterate over rows? Or alternatively, a way to parse a CSV directly into a NamedTuple?

julia> d = [(a=rand(),b=rand()) for _ in 1:10^6];

julia> df = DataFrame(d);

julia> function f(xs)
        s = 0.0;
        for x in xs
         s += x.a * x.b
        end
        s
       end
f (generic function with 1 method)

julia> function g(xs)
        s = 0.0
        for x in eachrow(xs)
         s += x.a * x.b
        end
        s
       end
g (generic function with 1 method)

julia> @btime f($d)
  577.269 μs (0 allocations: 0 bytes)
249855.20496448214

julia> @btime g($df)
  105.782 ms (6998979 allocations: 122.05 MiB)
249855.20496448386

If all the entries can be promoted to the same type, you may get better performance with readdlm, which returns a matrix.
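
For reference, a minimal sketch of that approach (the file name and element type are assumptions):

using DelimitedFiles

# Hypothetical CSV with purely numeric columns and a header row;
# with header=true, readdlm returns a (data, header) tuple, where
# data is a Matrix{Float64}.
m, hdr = readdlm("data.csv", ',', Float64; header=true)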

The slowdown is indeed surprising.
It seems like you should just use a Matrix instead of a DataFrame (if your use case allows this).

The function below is about 38% faster than yours, but still much slower than the matrix version:

function u(xs)
    # index the columns by symbol instead of iterating over DataFrameRows
    s = 0.0
    @inbounds for i in 1:size(xs, 1)
        s += xs[i, :a] * xs[i, :b]
    end
    s
end
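
For comparison, a rough sketch of the pure-Matrix version (assuming all columns can share a numeric element type; Matrix(df) copies the columns into a Matrix{Float64}):

function f_matrix(m)
    s = 0.0
    @inbounds for i in 1:size(m, 1)
        s += m[i, 1] * m[i, 2]
    end
    s
end

m = Matrix(df)       # copy the DataFrame columns into a plain matrix
@btime f_matrix($m)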

My actual use case does involve multiple types of columns.

I’ve opened an issue on the DataFrames.jl GitHub to see if they have any ideas:

EDIT: The reason is that iteration over the rows of a DataFrame is type-unstable, hence the slowdown. I’m not a big fan of Julia’s DataFrames API anyway, so I’ll just stick with NamedTuples.
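
For what it’s worth, a minimal sketch of one way to do that via Tables.jl (which DataFrames implements), materializing the rows as a Vector of NamedTuples so each row has a concrete type and f stays type-stable:

using Tables

rows = Tables.rowtable(df)   # Vector of NamedTuples with concrete field types
@btime f($rows)              # should be close to the NamedTuple timing above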

OK. If your use case is time-critical, you may want to work with several vectors instead of a DataFrame. Also, encoding the character data as Ints or similar might speed up the process.
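
As a sketch of the several-vectors idea: pull the columns out once and pass them to a kernel function (a standard function-barrier pattern), so the hot loop only sees concretely typed Vectors:

function h(a, b)
    s = 0.0
    @inbounds for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end

@btime h($(df.a), $(df.b))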