I am currently using an ugly hack for it:
for row in eachrow(df)
vec = convert(Array, row)' # notice the transpose
end
The problem is that vec has type Matrix in this case, not Vector. I am sure there is a cleaner way?
for row in eachrow(df)
v = vec(convert(Array, row)) # no transpose necessary
end
Thank you @pfitzseb.
Now it looks like it doesn’t work anymore:
MethodError: Cannot convert an object of type DataFrameRow{DataFrame, DataFrames.Index} to an object of type Array
A lot of convert methods were deprecated. But collect works and will remain that way for a long time, given that DataFrames.jl is now at 1.0 (which it wasn’t when this thread started).
Yep. Furthermore there’s the method Matrix()
I guess the collect function is the best.
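For reference, here is a minimal sketch of the options mentioned so far, assuming DataFrames.jl 1.x. The small data frame and its column names are made up for illustration.

```julia
using DataFrames

# Illustrative data only; the column names and values are assumptions.
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0], c = [5.0, 6.0])

row = df[2, :]        # a DataFrameRow, i.e. a view into df

v1 = collect(row)     # a Vector holding the row's values
v2 = Vector(row)      # the same values via the Vector constructor
t  = values(row)      # a Tuple of the row's values, not a Vector

# Converting each row inside a loop, as in the original question
for r in eachrow(df)
    v = collect(r)    # a Vector, so no transpose or vec() is needed
    println(v)
end
```

Matrix(df) converts the whole data frame at once, which may be the simpler choice when every row is needed anyway.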
values(df[2,:])
#or
[values(df[2,:])...]
Splatting like this will be inefficient for data frames with many columns. I would not recommend it as a solution, even if it’s possible.
The following tests seem to show that collect() can be about 2x slower for a data frame with 1000 columns:
using DataFrames, BenchmarkTools # BenchmarkTools provides @btime
df = DataFrame(rand(100,1000), :auto)
@btime [v for v in values($df[2,:])] # 68.2 μs (2007 allocs: 70.9 KiB)
@btime [values($df[2,:])...] # 91.9 μs (3007 allocs: 86.6 KiB)
@btime collect($df[2,:]) # 117.8 μs (3985 allocs: 85.9 KiB)
It seems that what makes the difference is the values(…) function:
julia> @btime collect(values($df[2,:]));
56.800 μs (2006 allocations: 63.03 KiB)
julia> @btime [v for v in values($df[2,:])] ;
59.800 μs (2007 allocations: 70.91 KiB)
julia> @btime [values($df[2,:])...] ;
77.600 μs (3007 allocations: 86.59 KiB)
julia> @btime [v for v in $df[2,:]] ;
103.500 μs (3983 allocations: 85.83 KiB)
This benchmarks ~1.25x faster in the few cases I tested (all the values in the row need to be of the same concrete type):
function dfrow(df::DataFrame, row)
T = typeof(df[row,1])
Array{T,1}(df[row,:])
end
Maybe I am getting the types wrong or misunderstanding what you say, but it seems that removing the restriction on the column types does not penalize performance. Quite the opposite:
julia> @btime Vector{Float64}(df[2,:]);
41.400 μs (1491 allocations: 31.27 KiB)
julia> @btime Vector{Any}(df[2,:]);
28.500 μs (1002 allocations: 23.62 KiB)
indeed … how unexpected
dfrow(df::DataFrame, row::Int) = Array{Any, 1}(df[row, :])
Forgive me, but my English does not allow me to grasp the nuances of your observation.
Could you explain the meaning more explicitly?
I take this opportunity to ask whether implementing values() in the following way would be preferable:
_values(dfr)=tuple(Vector{Any}(dfr))
If I understood Julia’s performance tips correctly, parametrizing with Any avoids runtime type checking and is therefore faster. However, if we need to do calculations with such an array, we will pay the performance price later.
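To illustrate the "pay later" point, here is a small sketch in plain Julia. The Any[...] literal is a hypothetical stand-in for a row extracted with Vector{Any}(df[row, :]); the variable names are made up.

```julia
# Hypothetical stand-in for a row extracted as Vector{Any}.
row_any = Any[0.5, 1.5, 2.5]

# Narrow the element type once, before doing numeric work on the data.
row_typed = identity.(row_any)                  # broadcasting picks the tightest eltype
row_f64   = convert(Vector{Float64}, row_any)   # explicit target element type

println(eltype(row_any))    # Any
println(eltype(row_typed))  # Float64
println(sum(row_typed))     # 4.5
```

Operations on row_any have to check the type of every element at run time, whereas row_typed and row_f64 can be handled by code specialized for Float64. The narrowing step itself has a cost, so whether it pays off depends on how much work is done on the vector afterwards.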
This way preserves the element type and much of the performance:
df = DataFrame(rand(100,1000), :auto)
rowtovec(df::DataFrame,row::Int)=[df[row,i] for i in 1:ncol(df)]
julia> @btime rowtovec(df,2)
31.400 μs (1003 allocations: 23.64 KiB)
1000-element Vector{Float64}:
0.31623289609155747
0.08624728904277756
this does appear to outperform the others while avoiding {Any}
I wondered why [df[row,i] for i in 1:ncol(df)] would perform better, and after confirming that it does, I looked at the output of Meta.@lower. What surprises me is that while it looks like it should be equivalent to collect(Base.Generator(i -> df[2, i], 1:ncol(df))), this form is ~60% slower (48μs vs 30μs).
For context, here’s Meta.@lower [df[row,i] for i in 1:ncol(df)]
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ $(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ global var"#76#77"
│ const var"#76#77"
│ %3 = Core._structtype(Main, Symbol("#76#77"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│ var"#76#77" = %3
│ Core._setsuper!(var"#76#77", Core.Function)
│ Core._typebody!(var"#76#77", Core.svec())
└── return nothing
)))
│ %2 = Core.svec(var"#76#77", Core.Any)
│ %3 = Core.svec()
│ %4 = Core.svec(%2, %3, $(QuoteNode(:(#= none:0 =#))))
│ $(Expr(:method, false, :(%4), CodeInfo(
@ none within `none`
1 ─ %1 = Base.getindex(df, 2, i)
└── return %1
)))
│ #76 = %new(var"#76#77")
│ %7 = #76
│ %8 = ncol(df)
│ %9 = 1:%8
│ %10 = Base.Generator(%7, %9)
│ %11 = Base.collect(%10)
└── return %11
))))
And here’s Meta.@lower collect(Base.Generator(i -> df[2, i], 1:ncol(df)))
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ $(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ global var"#46#47"
│ const var"#46#47"
│ %3 = Core._structtype(Main, Symbol("#46#47"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│ var"#46#47" = %3
│ Core._setsuper!(var"#46#47", Core.Function)
│ Core._typebody!(var"#46#47", Core.svec())
└── return nothing
)))
│ %2 = Core.svec(var"#46#47", Core.Any)
│ %3 = Core.svec()
│ %4 = Core.svec(%2, %3, $(QuoteNode(:(#= REPL[1]:1 =#))))
│ $(Expr(:method, false, :(%4), CodeInfo(
@ REPL[1]:1 within `none`
1 ─ %1 = Base.getindex(df, 2, i)
└── return %1
)))
│ %6 = Base.getproperty(Base, :Generator)
│ #46 = %new(var"#46#47")
│ %8 = #46
│ %9 = ncol(df)
│ %10 = 1:%9
│ %11 = (%6)(%8, %10)
│ %12 = collect(%11)
└── return %12
))))
The only difference I see is that instead of %10 = Base.Generator(%7, %9), the second expansion has:
│ %6 = Base.getproperty(Base, :Generator)
│ #46 = %new(var"#46#47")
...
│ %11 = (%6)(%8, %10)
This seems rather strange to me…
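One way to check whether the gap comes from the expression itself or from how it is benchmarked is to wrap both forms in functions and pass the data frame in as an interpolated argument. This is only a sketch, assuming DataFrames.jl and BenchmarkTools.jl are installed; the function names comprehension_row and generator_row are made up, and I have not recorded timings for it.

```julia
using DataFrames, BenchmarkTools

df = DataFrame(rand(100, 1000), :auto)

# Hypothetical helper names; both build a Vector from one row of df.
comprehension_row(df, row) = [df[row, i] for i in 1:ncol(df)]
generator_row(df, row) = collect(Base.Generator(i -> df[row, i], 1:ncol(df)))

# Both forms should produce the same vector.
println(comprehension_row(df, 2) == generator_row(df, 2))

# Interpolating df with $ keeps the untyped global lookup out of the measurement.
@btime comprehension_row($df, 2);
@btime generator_row($df, 2);
```

If the two timings are close when measured this way, the difference seen at the REPL would more likely come from benchmarking in global scope than from the extra Base.getproperty call in the lowered code.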
@tecosaur that seems pretty intriguing to me, I’ll also wait for someone more knowledgeable to chime in here