How can I use the quantile! function row-wise on a dataframe?
df = DataFrame(A = 1:4, B = 5:8, C = 1:4, D = 5:8)
For example, I would like to add a column with the 0.15 percentile for each row.
Thank you for the help!
How can I use the quantile! function row-wise on a dataframe?
df = DataFrame(A = 1:4, B = 5:8, C = 1:4, D = 5:8)
For example, I would like to add a column with the 0.15 percentile for each row.
Thank you for the help!
AsTable
and collect
are useful here
julia> using DataFramesMeta, Statistics
julia> transform(df, AsTable([:A, :B, :C]) => ByRow(t -> quantile(collect(t), .15)) => :q_15)
4ร5 DataFrame
Row โ A B C D q_15
โ Int64 Int64 Int64 Int64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 5 1 5 1.0
2 โ 2 6 2 6 2.0
3 โ 3 7 3 7 3.0
4 โ 4 8 4 8 4.0
On the latest version of DataFramesMeta you can do
julia> @rtransform df :q_15 = quantile(collect(AsTable([:A, :B, :C])), .15)
4ร5 DataFrame
Row โ A B C D q_15
โ Int64 Int64 Int64 Int64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 5 1 5 1.0
2 โ 2 6 2 6 2.0
3 โ 3 7 3 7 3.0
4 โ 4 8 4 8 4.0
EDIT: It looks like you donโt even need collect
in there anywhere .
Thank you, Peter! This is helpful. However, is there a solution for the case where you have many columns? In the data I am working with, I have 10K columns where I want to calculate row-wise percentiles. Again, thank you for the help!
Canโt speak for dataframes but if you can get just a plain Array
then VectorizedStatistics.jl (and NanStatistics.jl) have fast (after compilation) methods with dimension-wise quantiles and percentiles
julia> A = rand(1000, 10000);
julia> @time vquantile!(copy(A), 0.15, dims=2)
0.727348 seconds (3 allocations: 76.302 MiB)
1000ร1 Matrix{Float64}:
0.15339182476244487
0.15116373206593528
0.14991189653752926
0.14992050611491373
0.1502873388769227
0.14873368062909753
โฎ
0.14993164357947808
0.14337271376066515
0.15202934011278724
0.14750348034834604
0.15218943855833048
julia> @time vquantile!(copy(A), 0.15, dims=1)
0.131755 seconds (4 allocations: 76.370 MiB)
1ร10000 Matrix{Float64}:
0.132576 0.13291 0.17419 0.165488 0.168598 0.146831 โฆ 0.151196 0.15443 0.147143 0.146142 0.158048 0.139629
The copy
is because this is an in-place method for efficiency, which you definitely donโt want if youโre going to alternate between row-wise and column-wise percentiles but is great if youโre going to take many different percentiles along the same dimension successively
How many rows do you have? In this instance it may be better to call permutedims
on your data frame and operate by column.
The solution I give will not scale to 10K columns. You probably want a different data structure
If you have enough RAM most likely moving the data to a Matrix
and then doing rowwise quantiles will be most efficient (I give an example using standard functions, specialized packages will be faster).
If you want to do it in DataFrames.jl here is the way to do it:
julia> using DataFrames, Statistics
julia> df = DataFrame(rand(100, 10_000), :auto);
julia> @time transform(df, AsTable(All()) => ByRow((t -> quantile(t, 0.15))โcollect) => :q_15);
0.093942 seconds (242.68 k allocations: 35.970 MiB, 6.62% gc time, 61.12% compilation time)
which is slower but not by much in comparison to:
julia> @time quantile.(eachrow(Matrix(df)), 0.15);
0.016732 seconds (10.21 k allocations: 15.497 MiB)
(all timings are after compilation)
Note that the crucial part is โcollect
. DataFrames.jl detects such composition and avoids excessive compilation in such cases.
On larger data the difference is even smaller:
julia> df = DataFrame(rand(10_000, 10_000), :auto);
julia> @time quantile.(eachrow(Matrix(df)), 0.15);
1.954686 seconds (30.02 k allocations: 1.491 GiB, 1.86% gc time)
julia> @time select(df, AsTable(All()) => ByRow((t -> quantile(t, .15))โcollect) => :q_15);
2.081471 seconds (220.47 k allocations: 1.500 GiB, 7.30% gc time, 2.69% compilation time)
(mostly GC is in play here as the DataFrames.jl method allocates more)