Piping DataFrame rows

@pipe eachrow( my_DataFrame ) .|> _

This gives an error:
“Objects of type DataFrameRow{DataFrame, DataFrames.Index} are not callable”

It would be very useful to pipe the rows of a DataFrame. Is there a way to avoid the error?

Why not do this?

using DataFrames

df = DataFrame(a = rand(100))

for row in eachrow(df)
    row.a * 2 # or do whatever you need to do
end

You can also do this via piping:

@pipe eachrow(df) .|> _.a * 2

In this benchmark, piping is faster and needs fewer allocations:

julia> @benchmark for row in eachrow(df)
       row.a * 2
       end
BenchmarkTools.Trial:
  memory estimate:  20.34 KiB
  allocs estimate:  501
  --------------
  minimum time:     16.399 μs (0.00% GC)
  median time:      17.100 μs (0.00% GC)
  mean time:        19.240 μs (5.33% GC)
  maximum time:     2.709 ms (99.02% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark @pipe eachrow(df) .|> _.a * 2
BenchmarkTools.Trial:
  memory estimate:  4.34 KiB
  allocs estimate:  210
  --------------
  minimum time:     10.399 μs (0.00% GC)
  median time:      11.600 μs (0.00% GC)
  mean time:        12.083 μs (0.96% GC)
  maximum time:     1.176 ms (98.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
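
For comparison, the piped expression above can probably also be written without the `@pipe` macro, since Base Julia lets `.|>` broadcast an anonymous function over the rows. This is only a minimal sketch: the three-row `df` and the names `doubled`, `doubled_map` and `doubled_col` are made up for illustration.

```julia
using DataFrames

# Small illustrative table (not the one benchmarked above).
df = DataFrame(a = [1.0, 2.0, 3.0])

# Broadcast an anonymous function over the rows; no macro needed.
doubled = eachrow(df) .|> (row -> row.a * 2)

# The same values via map over the rows, or directly on the column.
doubled_map = map(row -> row.a * 2, eachrow(df))
doubled_col = df.a .* 2
```

All three forms should give the same vector; the column version skips the per-row `DataFrameRow` objects entirely.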

Thanks Frederik.
That’s helpful.
I’d like to create many new columns using row-level logic, and have new columns depend on other new columns. The shortest way I’ve found is to convert the static DataFrameRow object to a dynamic Dict and then to a DotMap for nicer notation. This conversion passes through four types, so three steps:
DataFrameRow → NamedTuple → Dict → DotMap.
Is there a better approach?

using DataFrames, Pipe, DotMaps, NamedTupleTools

df = DataFrame(a = rand(10))

@pipe eachrow(df) .|> begin  
    r          = DotMap(convert(Dict,NamedTuple(_)))    
    r.New      = r.a * 2
    r.New2     = r.New + 10
    r
end |> DataFrame

The logic you have doesn’t require row-by-row processing:

df.New = 2*df.a
df.New2 = df.New .+ 10
df

I’m sure your problem is more complex than the MWE, but I will at least partially second what @xiaodai said. It doesn’t quite look like you need row-by-row logic because you never mentioned anything about referring to other rows – only to other columns. I would use row-by-row if I needed to keep track of the elements from the previous row, for example. If you just need to refer to other columns, it seems like you could do it “normally” and it should work fine and relatively fast.
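
To illustrate the case where row-by-row logic is actually needed, here is a minimal sketch in which each value depends on the previous row. The data and the column name `delta` are hypothetical, chosen only for this example.

```julia
using DataFrames

# Illustrative data; `delta` is a made-up column name.
df = DataFrame(a = [1.0, 2.0, 4.0, 7.0])

# Change in `a` relative to the previous row; the first row has no predecessor.
df.delta = zeros(nrow(df))
for i in 2:nrow(df)
    df.delta[i] = df.a[i] - df.a[i - 1]
end
```

When a column only depends on other columns in the same row, the broadcast form shown earlier is usually simpler.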


I was thinking the same thing too.

I could have 10 or more calculated columns to add. Some might require more complex operations. It’s nice to put them in a block and forget about vectorising over other dimensions (columns) within the block.
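
One way to keep that "everything in one block" style without the Dict and DotMap conversions might be a plain function that takes a row and returns a NamedTuple. This is a sketch under assumptions: `add_columns` is a hypothetical helper name, and the three-row `df` is illustrative.

```julia
using DataFrames

df = DataFrame(a = [1.0, 2.0, 3.0])

# Hypothetical helper: all row-level logic lives in one block.
function add_columns(row)
    New  = row.a * 2
    New2 = New + 10   # may depend on a value computed earlier in the block
    # Keep the original columns and append the new ones.
    return merge(NamedTuple(row), (New = New, New2 = New2))
end

# A vector of NamedTuples converts back into a DataFrame.
result = DataFrame(map(add_columns, eachrow(df)))
```

Inside `add_columns` every value is a scalar, so no broadcasting dots are needed, and `result` has the columns `a`, `New` and `New2`.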

That might be slow for large datasets, but for smaller datasets it should be fine.