@pipe eachrow( my_DataFrame ) .|> _
This gives an error:
“Objects of type DataFrameRow{DataFrame, DataFrames.Index} are not callable”
It would be very useful to Pipe the rows of a DataFrame. Is there a way to avoid the error?
Why not do
using DataFrames
df = DataFrame(a = rand(100))
for row in eachrow(df)
    row.a * 2 # or do whatever you need to do
end
You can also do this via piping:
@pipe eachrow(df) .|> _.a * 2
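The same result is also available without the Pipe package, because `.|>` broadcasts any callable over the rows. A minimal sketch, assuming a single column named `a` as in the example above; the variable names `doubled` and `doubled_cols` are only for illustration:

```julia
using DataFrames

df = DataFrame(a = rand(100))

# Broadcast an anonymous function over the rows; each `r` is a DataFrameRow.
doubled = eachrow(df) .|> (r -> r.a * 2)

# Equivalent column-wise form, shown for comparison.
doubled_cols = df.a .* 2
```

Both forms return a `Vector{Float64}` with one entry per row.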
In this benchmark, piping is faster and needs fewer allocations:
julia> @benchmark for row in eachrow(df)
           row.a * 2
end
BenchmarkTools.Trial:
memory estimate: 20.34 KiB
allocs estimate: 501
--------------
minimum time: 16.399 μs (0.00% GC)
median time: 17.100 μs (0.00% GC)
mean time: 19.240 μs (5.33% GC)
maximum time: 2.709 ms (99.02% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark @pipe eachrow(df) .|> _.a * 2
BenchmarkTools.Trial:
memory estimate: 4.34 KiB
allocs estimate: 210
--------------
minimum time: 10.399 μs (0.00% GC)
median time: 11.600 μs (0.00% GC)
mean time: 12.083 μs (0.96% GC)
maximum time: 1.176 ms (98.44% GC)
--------------
samples: 10000
evals/sample: 1
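One caveat on the numbers above: both expressions read the non-constant global `df`, and the `for` loop discards its results while the piped version builds a vector, so the two are not doing quite the same work. A hedged sketch of a closer comparison, assuming it is acceptable to wrap each variant in a function; `loop_rows` and `broadcast_rows` are hypothetical names, and no timings are claimed here:

```julia
using DataFrames, BenchmarkTools

df = DataFrame(a = rand(100))

# Wrap each variant in a function so the work runs in compiled, local scope.
function loop_rows(df)
    out = Vector{Float64}(undef, nrow(df))
    for (i, row) in enumerate(eachrow(df))
        out[i] = row.a * 2
    end
    return out
end

broadcast_rows(df) = eachrow(df) .|> (r -> r.a * 2)

# `$df` interpolates the global so its lookup is not part of the measurement.
@benchmark loop_rows($df)
@benchmark broadcast_rows($df)
```

With both variants producing the same vector, any remaining difference is easier to attribute to the iteration style itself.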
Thanks Frederik.
That’s helpful.
I’d like to create many new columns using row level logic, and have new columns depend on other new columns. The shortest way I’ve found is to convert the static DataFrameRow object to a dynamic Dictionary and then to a DotMap for nicer notation. This conversion takes four steps:
DataFrameRow → NamedTuple → Dictionary → DotMap.
Is there a better approach?
using DataFrames, Pipe, DotMaps, NamedTupleTools
df = DataFrame(a = rand(10))
@pipe eachrow(df) .|> begin
    r = DotMap(convert(Dict, NamedTuple(_)))
    r.New = r.a * 2
    r.New2 = r.New + 10
    r
end |> DataFrame
The logic you have doesn’t require row-by-row processing:
df.New = 2*df.a
df.New2 = df.New .+ 10
df
I’m sure your problem is more complex than the MWE, but I will at least partially second what @xiaodai said. It doesn’t quite look like you need row-by-row logic because you never mentioned anything about referring to other rows – only to other columns. I would use row-by-row if I needed to keep track of the elements from the previous row, for example. If you just need to refer to other columns, it seems like you could do it “normally” and it should work fine and relatively fast.
I could have 10 or more calculated columns to add. Some might require more complex operations. It’s nice to put them in a block and forget about vectorising over other dimensions (columns) within the block.
That might be slow for large datasets, but for smaller datasets it should be fine.
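For the block-style row logic described above, one possible alternative to the DataFrameRow → NamedTuple → Dictionary → DotMap chain is to return a NamedTuple of the new values from each row. This is only a sketch under the assumptions of the small example in this thread; `new_columns`, `added`, and `result` are hypothetical names:

```julia
using DataFrames

df = DataFrame(a = rand(10))

# Hypothetical helper: all row-level logic lives in one block,
# and later values can depend on earlier ones.
function new_columns(row)
    New = row.a * 2
    New2 = New + 10
    return (New = New, New2 = New2)
end

# A vector of NamedTuples converts directly to a DataFrame.
added = DataFrame(map(new_columns, eachrow(df)))

# Attach the calculated columns to the original ones.
result = hcat(df, added)
```

Inside the function the intermediate values are ordinary local variables, so the dot notation that DotMap provided is not needed, and only DataFrames is required.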