Cleanest way to convert DataFrame row into a Vector?

juliohm · August 24, 2017, 1:02am

I am currently using an ugly hack for it:

for row in eachrow(df)
  vec = convert(Array, row)' # notice the transpose
end

The problem is that vec has type Matrix in this case, not Vector. I am sure there is a cleaner way?

pfitzseb · August 24, 2017, 9:04am

for row in eachrow(df)
  v = vec(convert(Array, row)) # no transpose necessary
end

juliohm · August 24, 2017, 4:44pm

Thank you @pfitzseb.

Frisus95 · September 10, 2021, 5:45pm

Now it looks like it doesn’t work anymore:
MethodError: Cannot convert an object of type
DataFrameRow{DataFrame, DataFrames.Index} to an object of type
Array

pdeffebach · September 10, 2021, 5:51pm

A lot of convert methods were deprecated. But collect works and will remain that way for a long time, given that DataFrames.jl is now at 1.0 (which it wasn’t when this thread started).

Frisus95 · September 10, 2021, 5:52pm

Yep. Furthermore there’s the method Matrix()
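
For anyone landing here later, a minimal sketch of the two options mentioned above, assuming a recent DataFrames.jl 1.x release. The small three-column `df` is made up purely for illustration:

```julia
using DataFrames

# Hypothetical toy data frame, only for illustration
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0], c = [5.0, 6.0])

# collect iterates over the values of a DataFrameRow and returns a Vector
row_vectors = [collect(row) for row in eachrow(df)]

# Matrix(df) copies the whole table; slicing one row of it yields a Vector
second_row = Matrix(df)[2, :]
```

Note that `Matrix(df)` converts the entire data frame, so it is probably only worthwhile when many rows are needed as vectors.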

rmsmsgood · August 2, 2022, 8:06am

I guess the collect function is the best.

rocco_sprmnt21 · August 2, 2022, 12:05pm

values(df[2,:])

#or

[values(df[2,:])...]

pdeffebach · August 2, 2022, 1:08pm

Splatting like this will be inefficient for data frames with many columns. I would not recommend it as a solution, even if it’s possible.

rafael.guerra · August 2, 2022, 3:25pm

The following tests seem to show that collect() can be nearly 2x slower than a comprehension over values() for a data frame with 1000 columns:

using DataFrames, BenchmarkTools  # BenchmarkTools provides @btime
df = DataFrame(rand(100,1000), :auto)
@btime [v for v in values($df[2,:])]    #  68.2 μs (2007 allocs: 70.9 KiB)
@btime [values($df[2,:])...]            #  91.9 μs (3007 allocs: 86.6 KiB)
@btime collect($df[2,:])                # 117.8 μs (3985 allocs: 85.9 KiB)

rocco_sprmnt21 · August 2, 2022, 4:54pm

It seems that what makes the difference is the values(…) function:

julia> @btime collect(values($df[2,:]));
  56.800 μs (2006 allocations: 63.03 KiB)

julia> @btime [v for v in values($df[2,:])] ;
  59.800 μs (2007 allocations: 70.91 KiB)

julia> @btime [values($df[2,:])...]  ;        
  77.600 μs (3007 allocations: 86.59 KiB)

julia> @btime [v for v in $df[2,:]] ;
  103.500 μs (3983 allocations: 85.83 KiB)

JeffreySarnoff · August 2, 2022, 9:39pm

This benchmarks ~1.25x faster in the few cases I tested.
(all the values in the row need to be of the same concrete type)

function dfrow(df::DataFrame, row)
    T = typeof(df[row,1])
    Array{T,1}(df[row,:])
end

rocco_sprmnt21 · August 3, 2022, 1:42pm

Maybe I am mixing up the types or misunderstanding what you said, but it seems that removing the restriction on the column types does not penalize performance.
Quite the opposite:

julia> @btime Vector{Float64}(df[2,:]);
  41.400 μs (1491 allocations: 31.27 KiB)

julia> @btime Vector{Any}(df[2,:]);
  28.500 μs (1002 allocations: 23.62 KiB)

JeffreySarnoff · August 3, 2022, 2:00pm

indeed … how unexpected

dfrow(df::DataFrame, row::Int) = Array{Any, 1}(df[row, :])

rocco_sprmnt21 · August 3, 2022, 2:14pm

Forgive me, but my English does not allow me to grasp the nuances of your observation.
Could you explain the meaning more explicitly?
I also take this opportunity to ask whether implementing values() in the following way would be preferable:

_values(dfr) = Tuple(Vector{Any}(dfr))  # Tuple(...), since tuple(v) would wrap the vector in a 1-tuple

rafael.guerra · August 3, 2022, 2:18pm

If I understood Julia’s performance tips correctly, parametrizing with Any avoids runtime type checking and is therefore faster. However, if we need to do calculations with such an array, we will pay the performance price later.
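
A small sketch of that trade-off, assuming the `Vector{T}(row)` constructors used earlier in this thread. The variable names are made up for illustration, and no timings are claimed here:

```julia
using DataFrames

df = DataFrame(rand(3, 4), :auto)

# Same values, different element types
v_any = Vector{Any}(df[2, :])      # elements are stored as boxed Any
v_f64 = Vector{Float64}(df[2, :])  # elements are stored as plain Float64

# Both give the same result, but code working on v_any has to dispatch
# on each element at run time, which is where the later cost would come from
total_any = sum(v_any)
total_f64 = sum(v_f64)
```

If the row is only being passed along or printed, the `Any` version may be fine; for numeric work the concretely typed vector is usually the safer choice.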

rocco_sprmnt21 · August 3, 2022, 3:01pm

This way it preserves the element type and much of the performance:

df = DataFrame(rand(100,1000), :auto)
rowtovec(df::DataFrame,row::Int)=[df[row,i] for i in 1:ncol(df)]

julia> @btime rowtovec(df,2)
  31.400 μs (1003 allocations: 23.64 KiB)
1000-element Vector{Float64}:
 0.31623289609155747
 0.08624728904277756

JeffreySarnoff · August 3, 2022, 4:21pm

this does appear to outperform the others while avoiding {Any}

tecosaur · February 10, 2023, 6:51am

I wondered why [df[row,i] for i in 1:ncol(df)] would perform better, and after confirming that it does, I looked at the output of Meta.@lower. What surprises me is that while it looks like it should be equivalent to collect(Base.Generator(i -> df[2, i], 1:ncol(df))), the latter form is ~60% slower (48μs vs 30μs).

For context, here’s Meta.@lower [df[row,i] for i in 1:ncol(df)]

:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─       $(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      global var"#76#77"
│        const var"#76#77"
│   %3 = Core._structtype(Main, Symbol("#76#77"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│        var"#76#77" = %3
│        Core._setsuper!(var"#76#77", Core.Function)
│        Core._typebody!(var"#76#77", Core.svec())
└──      return nothing
)))
│   %2  = Core.svec(var"#76#77", Core.Any)
│   %3  = Core.svec()
│   %4  = Core.svec(%2, %3, $(QuoteNode(:(#= none:0 =#))))
│         $(Expr(:method, false, :(%4), CodeInfo(
    @ none within `none`
1 ─ %1 = Base.getindex(df, 2, i)
└──      return %1
)))
│         #76 = %new(var"#76#77")
│   %7  = #76
│   %8  = ncol(df)
│   %9  = 1:%8
│   %10 = Base.Generator(%7, %9)
│   %11 = Base.collect(%10)
└──       return %11
))))

And here’s Meta.@lower collect(Base.Generator(i -> df[2, i], 1:ncol(df)))

:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─       $(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      global var"#46#47"
│        const var"#46#47"
│   %3 = Core._structtype(Main, Symbol("#46#47"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│        var"#46#47" = %3
│        Core._setsuper!(var"#46#47", Core.Function)
│        Core._typebody!(var"#46#47", Core.svec())
└──      return nothing
)))
│   %2  = Core.svec(var"#46#47", Core.Any)
│   %3  = Core.svec()
│   %4  = Core.svec(%2, %3, $(QuoteNode(:(#= REPL[1]:1 =#))))
│         $(Expr(:method, false, :(%4), CodeInfo(
    @ REPL[1]:1 within `none`
1 ─ %1 = Base.getindex(df, 2, i)
└──      return %1
)))
│   %6  = Base.getproperty(Base, :Generator)
│         #46 = %new(var"#46#47")
│   %8  = #46
│   %9  = ncol(df)
│   %10 = 1:%9
│   %11 = (%6)(%8, %10)
│   %12 = collect(%11)
└──       return %12
))))

The only difference I see is that instead of %10 = Base.Generator(%7, %9), the second expansion has:

│   %6  = Base.getproperty(Base, :Generator)
│         #46 = %new(var"#46#47")
...
│   %11 = (%6)(%8, %10)

This seems rather strange to me…

tom-plaa · February 22, 2023, 3:03pm

@tecosaur that seems pretty intriguing to me, I’ll also wait for someone more knowledgeable to chime in here
