Piping DataFrame rows

Lincoln_Hannah · November 4, 2020, 12:01pm

@pipe eachrow( my_DataFrame ) .|> _

This gives an error:
“Objects of type DataFrameRow{DataFrame, DataFrames.Index} are not callable”

It would be very useful to pipe the rows of a DataFrame. Is there a way to avoid the error?
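For context, `.|>` applies whatever is on its right-hand side to each element, so the right-hand side has to be a function. The error message suggests that with a bare `_` the row itself ends up being called. A minimal sketch in plain Julia (no Pipe.jl), assuming a toy column `a` that is not part of the original question:

```julia
using DataFrames

# Toy data; the column name `a` is only for illustration
df = DataFrame(a = [1.0, 2.0, 3.0])

# The right-hand side of .|> is an anonymous function that receives one row
doubled = eachrow(df) .|> (row -> row.a * 2)
```

Here `doubled` is an ordinary vector with one entry per row.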

fbanning · November 4, 2020, 1:11pm

Why not do

using DataFrames

df = DataFrame(a = rand(100))

for row in eachrow(df)
    row.a * 2 # or do whatever you need to do
end

You can also do this via piping:

@pipe eachrow(df) .|> _.a * 2

In this benchmark the piped version is faster and needs fewer allocations:

julia> @benchmark for row in eachrow(df)
       row.a * 2
       end
BenchmarkTools.Trial:
  memory estimate:  20.34 KiB
  allocs estimate:  501
  --------------
  minimum time:     16.399 μs (0.00% GC)
  median time:      17.100 μs (0.00% GC)
  mean time:        19.240 μs (5.33% GC)
  maximum time:     2.709 ms (99.02% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark @pipe eachrow(df) .|> _.a * 2
BenchmarkTools.Trial:
  memory estimate:  4.34 KiB
  allocs estimate:  210
  --------------
  minimum time:     10.399 μs (0.00% GC)
  median time:      11.600 μs (0.00% GC)
  mean time:        12.083 μs (0.96% GC)
  maximum time:     1.176 ms (98.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
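Two things may affect the comparison above: the loop discards each result while the piped version collects them into a vector, and `df` is a non-constant global that is not interpolated into the benchmark. A hedged sketch of a more like-for-like setup, where `double_loop` and `double_pipe` are hypothetical helper names and no timings are claimed:

```julia
using DataFrames, BenchmarkTools

df = DataFrame(a = rand(100))

# Hypothetical helper: loop version that also collects its results
function double_loop(df)
    out = Vector{Float64}(undef, nrow(df))
    for (i, row) in enumerate(eachrow(df))
        out[i] = row.a * 2
    end
    return out
end

# Hypothetical helper: broadcasted pipe, equivalent to the @pipe form above
double_pipe(df) = eachrow(df) .|> (row -> row.a * 2)

# Both versions should now do the same work
@assert double_loop(df) == double_pipe(df)

# `$df` interpolates the global so its lookup is not part of the measurement
@btime double_loop($df);
@btime double_pipe($df);
```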

Lincoln_Hannah · November 5, 2020, 3:30am

Thanks Frederik.
That’s helpful.
I’d like to create many new columns using row-level logic, and have new columns depend on other new columns. The shortest way I’ve found is to convert the static DataFrameRow object to a dynamic Dict and then to a DotMap for nicer notation. This takes three conversions across four types:
DataFrameRow → NamedTuple → Dictionary → DotMap.
Is there a better approach?

using DataFrames, Pipe, DotMaps, NamedTupleTools

df = DataFrame(a = rand(10))

@pipe eachrow(df) .|> begin
    r      = DotMap(convert(Dict, NamedTuple(_)))
    r.New  = r.a * 2
    r.New2 = r.New + 10
    r
end |> DataFrame

xiaodai · November 5, 2020, 4:20am

The logic you have doesn’t require row-by-row processing:

df.New = 2*df.a
df.New2 = df.New .+ 10
df
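The same column-wise idea can also be written with `transform!` and `ByRow`, which keeps a per-element function but still works on whole columns. This is a sketch with toy data, not a recommendation from the thread; each call is separate so that the second one can see the column created by the first:

```julia
using DataFrames

df = DataFrame(a = [1.0, 2.0, 3.0])

# First call adds :New from :a, one element at a time
transform!(df, :a => ByRow(x -> 2x) => :New)

# Second call can use :New because it already exists in df
transform!(df, :New => ByRow(x -> x + 10) => :New2)
```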

tbeason · November 5, 2020, 4:27am

I’m sure your problem is more complex than the MWE, but I will at least partially second what @xiaodai said. It doesn’t quite look like you need row-by-row logic because you never mentioned anything about referring to other rows – only to other columns. I would use row-by-row if I needed to keep track of the elements from the previous row, for example. If you just need to refer to other columns, it seems like you could do it “normally” and it should work fine and relatively fast.
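As an illustration of a case that does refer to another row, here is a minimal sketch that computes the change from the previous row. The column name `delta` and the data are made up for the example:

```julia
using DataFrames

df = DataFrame(a = [1.0, 3.0, 6.0, 10.0])

# The first row has no predecessor, so the column must allow missing
df.delta = Vector{Union{Missing, Float64}}(missing, nrow(df))

# Each value depends on the current row and the one before it
for i in 2:nrow(df)
    df.delta[i] = df.a[i] - df.a[i - 1]
end
```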

xiaodai · November 5, 2020, 4:37am

I was thinking the same thing too.

Lincoln_Hannah · November 5, 2020, 4:38am

I could have 10 or more calculated columns to add. Some might require more complex operations. It’s nice to put them in a block and forget about vectorising over other dimensions (columns) within the block.
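One possible way to keep that block style without the Dict and DotMap conversions is to put the row logic in a function that returns a NamedTuple of the new values. This is only a sketch: `new_columns` is a hypothetical helper name, and the data is a toy example.

```julia
using DataFrames

# Hypothetical helper: all row-level logic lives in one block
function new_columns(row)
    New  = row.a * 2
    New2 = New + 10   # may depend on a value defined earlier in the block
    return (New = New, New2 = New2)
end

df = DataFrame(a = [1.0, 2.0, 3.0])

# Broadcasting over the rows gives a vector of NamedTuples,
# which the DataFrame constructor accepts as a table
extra = DataFrame(new_columns.(eachrow(df)))

# Put the new columns next to the original ones
result = hcat(df, extra)
```

Inside the function, later values are ordinary local variables, so they can refer to earlier ones without any dynamic container.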

xiaodai · November 5, 2020, 5:06am

That might be slow for large datasets, but for smaller datasets it should be fine.
