Cleanest way to convert DataFrame row into a Vector?

I am currently using an ugly hack for it:

for row in eachrow(df)
  vec = convert(Array, row)' # notice the transpose
end

The problem is that vec has type Matrix in this case, not Vector. Is there a cleaner way?

for row in eachrow(df)
  v = vec(convert(Array, row)) # no transpose necessary
end

Thank you @pfitzseb.

Now it looks like it doesn’t work anymore:
MethodError: Cannot convert an object of type
DataFrameRow{DataFrame, DataFrames.Index} to an object of type
Array


A lot of convert methods were deprecated. But collect works and will remain that way for a long time, given that DataFrames.jl is now at 1.0 (which it wasn’t when this thread started).
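
As a minimal sketch of the collect approach (the small data frame and the variable names are made up for illustration, and this assumes DataFrames.jl 1.0 or later):

```julia
using DataFrames

# Hypothetical two-column data frame, for illustration only
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])

for row in eachrow(df)
    v = collect(row)              # a plain Vector holding the row's values
    println(v)
end

v2 = collect(df[2, :])            # a single row selected by index
vf = Vector{Float64}(df[2, :])    # request the element type explicitly
```

The Vector{Float64} form only works when every value in the row can be converted to Float64.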


Yep. There is also the Matrix() constructor, which converts the whole data frame.

I guess the collect function is the best.
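
For comparison, a rough sketch of the Matrix() route, using made-up data. It copies the entire table first, so it probably only pays off when many rows are needed as vectors:

```julia
using DataFrames

df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])   # illustrative data

m = Matrix(df)        # copies the whole table into a Matrix{Float64}
r2 = m[2, :]          # indexing one matrix row yields a Vector
```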

values(df[2,:])

#or

[values(df[2,:])...]

Splatting like this will be inefficient for data frames with many columns. I would not recommend it as a solution, even if it’s possible.

The following tests seem to show that collect() can be nearly 2x slower than the alternatives for a data frame with 1000 columns:

using DataFrames, BenchmarkTools
df = DataFrame(rand(100,1000), :auto)
@btime [v for v in values($df[2,:])]    #  68.2 μs (2007 allocs: 70.9 KiB)
@btime [values($df[2,:])...]            #  91.9 μs (3007 allocs: 86.6 KiB)
@btime collect($df[2,:])                # 117.8 μs (3985 allocs: 85.9 KiB)

It seems that what makes the difference is the values(…) function:

julia> @btime collect(values($df[2,:]));
  56.800 μs (2006 allocations: 63.03 KiB)

julia> @btime [v for v in values($df[2,:])] ;
  59.800 μs (2007 allocations: 70.91 KiB)

julia> @btime [values($df[2,:])...]  ;        
  77.600 μs (3007 allocations: 86.59 KiB)

julia> @btime [v for v in $df[2,:]] ;
  103.500 μs (3983 allocations: 85.83 KiB)
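
Spelled out as a small sketch (the table size and variable names are illustrative only), the fastest variant above first pulls the row's values out as a tuple and then copies them into a vector:

```julia
using DataFrames

df = DataFrame(rand(3, 4), :auto)   # small illustrative table

row = df[2, :]
t = values(row)       # the row's values as a Tuple
v = collect(t)        # copy the tuple into a Vector
```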

This benchmarks ~1.25x faster in the few cases I tested.
(All the values in the row need to be of the same concrete type.)

function dfrow(df::DataFrame, row)
    T = typeof(df[row,1])
    Array{T,1}(df[row,:])
end

Maybe I am mixing up the types or misunderstanding what you say, but it seems that removing the restriction on the column types does not hurt performance.
Quite the opposite:

julia> @btime Vector{Float64}(df[2,:]);
  41.400 μs (1491 allocations: 31.27 KiB)

julia> @btime Vector{Any}(df[2,:]);
  28.500 μs (1002 allocations: 23.62 KiB)


indeed … how unexpected

dfrow(df::DataFrame, row::Int) = Array{Any, 1}(df[row, :])

Forgive me, but my English is not good enough to grasp the nuances of your remark.
Could you explain what you mean more explicitly?
I would also like to take this opportunity to ask whether implementing values() in the following way would be preferable:

_values(dfr) = Tuple(Vector{Any}(dfr))  # Tuple(...) builds a tuple of the values; tuple(x) would wrap the vector in a 1-tuple

If I understood Julia’s performance tips correctly, parametrizing with Any avoids runtime type checking and is therefore faster. However, if we need to do calculations with such an array, we will pay the performance price later.
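
To illustrate the trade-off with a sketch that does not come from the benchmarks above: a Vector{Any} can be narrowed to a concretely typed vector afterwards, for example by broadcasting identity over it. The values here are made up.

```julia
xs_any = Any[0.5, 1.5, 2.0]    # abstract element type, as Vector{Any}(row) would give
xs = identity.(xs_any)         # broadcasting narrows the eltype to the values' common type

println(eltype(xs_any))        # element type before narrowing
println(eltype(xs))            # element type after narrowing
```

The narrowing is an extra pass over the data, so whether it is worth doing depends on how much computation follows.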


This way it preserves the element type and much of the performance:

df = DataFrame(rand(100,1000), :auto)
rowtovec(df::DataFrame,row::Int)=[df[row,i] for i in 1:ncol(df)]

julia> @btime rowtovec(df,2)
  31.400 μs (1003 allocations: 23.64 KiB)
1000-element Vector{Float64}:
 0.31623289609155747
 0.08624728904277756

this does appear to outperform the others while avoiding {Any}

I wondered why [df[row,i] for i in 1:ncol(df)] would perform better, and after confirming that it does, I looked at the output of Meta.@lower. What surprises me is that while it looks like it should be equivalent to collect(Base.Generator(i -> df[2, i], 1:ncol(df))), this form is ~60% slower (48μs vs 30μs).

For context, here’s Meta.@lower [df[2,i] for i in 1:ncol(df)]

:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─       $(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      global var"#76#77"
│        const var"#76#77"
│   %3 = Core._structtype(Main, Symbol("#76#77"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│        var"#76#77" = %3
│        Core._setsuper!(var"#76#77", Core.Function)
│        Core._typebody!(var"#76#77", Core.svec())
└──      return nothing
)))
│   %2  = Core.svec(var"#76#77", Core.Any)
│   %3  = Core.svec()
│   %4  = Core.svec(%2, %3, $(QuoteNode(:(#= none:0 =#))))
│         $(Expr(:method, false, :(%4), CodeInfo(
    @ none within `none`
1 ─ %1 = Base.getindex(df, 2, i)
└──      return %1
)))
│         #76 = %new(var"#76#77")
│   %7  = #76
│   %8  = ncol(df)
│   %9  = 1:%8
│   %10 = Base.Generator(%7, %9)
│   %11 = Base.collect(%10)
└──       return %11
))))

And here’s Meta.@lower collect(Base.Generator(i -> df[2, i], 1:ncol(df)))

:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─       $(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      global var"#46#47"
│        const var"#46#47"
│   %3 = Core._structtype(Main, Symbol("#46#47"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│        var"#46#47" = %3
│        Core._setsuper!(var"#46#47", Core.Function)
│        Core._typebody!(var"#46#47", Core.svec())
└──      return nothing
)))
│   %2  = Core.svec(var"#46#47", Core.Any)
│   %3  = Core.svec()
│   %4  = Core.svec(%2, %3, $(QuoteNode(:(#= REPL[1]:1 =#))))
│         $(Expr(:method, false, :(%4), CodeInfo(
    @ REPL[1]:1 within `none`
1 ─ %1 = Base.getindex(df, 2, i)
└──      return %1
)))
│   %6  = Base.getproperty(Base, :Generator)
│         #46 = %new(var"#46#47")
│   %8  = #46
│   %9  = ncol(df)
│   %10 = 1:%9
│   %11 = (%6)(%8, %10)
│   %12 = collect(%11)
└──       return %12
))))

The only difference I see is that instead of %10 = Base.Generator(%7, %9), the second expansion has:

│   %6  = Base.getproperty(Base, :Generator)
│         #46 = %new(var"#46#47")
...
│   %11 = (%6)(%8, %10)

This seems rather strange to me…
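
One way to probe this further, as an untested idea rather than an explanation: wrap both forms in functions, so that df is a function argument instead of a global, and then benchmark the two functions. The function names below are hypothetical, and the sketch only checks that the two forms agree; it makes no claim about their timings.

```julia
using DataFrames

df = DataFrame(rand(5, 4), :auto)   # small illustrative table

# Hypothetical helper names; both take the data frame as an argument
via_comprehension(df, row) = [df[row, i] for i in 1:ncol(df)]
via_generator(df, row) = collect(Base.Generator(i -> df[row, i], 1:ncol(df)))

a = via_comprehension(df, 2)
b = via_generator(df, 2)
println(a == b)   # the two forms should agree element by element
```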

@tecosaur that seems pretty intriguing to me, I’ll also wait for someone more knowledgeable to chime in here