Specific columns in dataframe transform

mydf = DataFrame(a = [1,2,3], b = [1, missing, 2])

Works:

mydf = transform(mydf, 
                           AsTable([:a, :b])
                           => ByRow(x -> x.a + x.b ) => :c)

Works:

mydf = transform(mydf, 
                           AsTable([:a, :b])
                           => ByRow(x -> sum(x) ) => :c)

Doesn’t work. But in some cases it is convenient to use the numerical index rather than repeat the column names:

mydf = transform(mydf, 
                           AsTable([:a, :b])
                           => ByRow(x -> sum(x[1:2]) ) => :c)

ERROR: MethodError: no method matching getindex(::NamedTuple{(:a, :b),Tuple{Int64,Union{Missing, Int64}}}, ::UnitRange{Int64})
Closest candidates are:
getindex(::NamedTuple, ::Int64) at namedtuple.jl:112
getindex(::NamedTuple, ::Symbol) at namedtuple.jl:113
Stacktrace:
[1] (::var"#359#360")(::NamedTuple{(:a, :b),Tuple{Int64,Union{Missing, Int64}}}) at ./REPL[16]:3
[2] iterate at ./generator.jl:47 [inlined]
[3] collect at ./array.jl:686 [inlined]
[4] (::ByRow{var"#359#360"})(::NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:325
[5] default_table_transformation(::DataFrame, ::ByRow{var"#359#360"}) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selectionfast.jl:126
[6] table_transformation(::DataFrame, ::Function) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selectionfast.jl:54
[7] _transformation_helper(::DataFrame, ::AsTable, ::Base.RefValue{Any}) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:557
[8] select_transform!(::Base.RefValue{Any}, ::DataFrame, ::DataFrame, ::Set{Symbol}, ::Bool, ::Base.RefValue{Bool}, ::BitArray{1}) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:764
[9] _manipulate(::DataFrame, ::Array{Any,1}, ::Bool, ::Bool) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:1621
[10] manipulate(::DataFrame, ::Any, ::Vararg{Any,N} where N; copycols::Bool, keeprows::Bool, renamecols::Bool) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:1541
[11] select(::DataFrame, ::Any, ::Vararg{Any,N} where N; copycols::Bool, renamecols::Bool) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:1171
[12] #transform#482 at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:1238 [inlined]
[13] transform(::DataFrame, ::Any) at /home/jube/.julia/packages/DataFrames/MA4YO/src/abstractdataframe/selection.jl:1238
[14] top-level scope at REPL[16]:1

This is the way how Base Julia is designed, so there is nothing we can do about this limitation.

julia> nt = (a=1,b=2,c=3)
(a = 1, b = 2, c = 3)

julia> nt[1:2]
ERROR: MethodError: no method matching getindex(::NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}, ::UnitRange{Int64})

You can write: sum(collect(x)[1:2]). Unfortunately this will have a performance impact.

2 Likes

Could getindex be broadcasted? Like in:
getindex.(Ref(nt), 1:2)

Yes, but it will allocate.

Now I realized that the best approach is probably:

Tuple(nt)[1:2]

as tuples support such indexing.

2 Likes

Maybe the tiny OP example is not representative, but FWIW (Julia 1.7.2, DataFrames v1.3.4):

using DataFrames, BenchmarkTools

mydf = DataFrame(a = [1,2,3], b = [1, missing, 2])

@btime transform($mydf, AsTable([:a, :b]) => ByRow(x -> sum(Tuple(x)[1:2])) => :c)           # 27 μs (205 allocs: 10.4 KiB)

@btime transform($mydf, AsTable([:a, :b]) => ByRow(x -> sum(getindex.(Ref(x),1:2))) => :c)   # 22 μs (200 allocs: 10.5 KiB)

Yes - I also noticed it, but this is something that hopefully could be improved on compiler side in the future as Tuple(x)[1:2] should be possible to be resolved at compile time since x is a type-stable NamedTuple.

I am not sure whom to ping here for an opinion, maybe @jameson?

1 Like

in terms of performance, the simplest form appears to be the fastest

julia> mydf = DataFrame(a = 1:10^6, b = rand([1, missing, 2],10^6));

julia> @btime transform($mydf, 
                                  AsTable([:a, :b])      
                                  => ByRow(x -> x.a + x.b ) => :c);
  7.879 ms (202 allocations: 24.81 MiB)

julia> @btime  transform($mydf, 
                                  AsTable([:a, :b])      
                                  => ByRow(x -> sum(x) ) => :c);
  27.902 ms (1000202 allocations: 55.32 MiB)

julia> @btime  transform($mydf,  [:a, :b]=> (+) => :c);
  6.795 ms (150 allocations: 24.80 MiB)



julia> @btime transform($mydf, AsTable([:a, :b]) => ByRow(x -> sum(Tuple(x)[1:2])) => :c) ;          # 27 μs (205 allocs: 10.4 KiB)
  217.256 ms (5665980 allocations: 157.03 MiB)

julia> @btime transform($mydf, AsTable([:a, :b]) => ByRow(x -> sum(getindex.(Ref(x),1:2))) => :c) ;
  82.576 ms (3333599 allocations: 187.57 MiB)



1 Like