How can I use the quantile! function row-wise on a dataframe?
df = DataFrame(A = 1:4, B = 5:8, C = 1:4, D = 5:8)
For example, I would like to add a column with the 0.15 percentile for each row.
Thank you for the help!
How can I use the quantile! function row-wise on a dataframe?
df = DataFrame(A = 1:4, B = 5:8, C = 1:4, D = 5:8)
For example, I would like to add a column with the 0.15 percentile for each row.
Thank you for the help!
AsTable and collect are useful here
julia> using DataFramesMeta, Statistics
julia> transform(df, AsTable([:A, :B, :C]) => ByRow(t -> quantile(collect(t), .15)) => :q_15)
4ร5 DataFrame
 Row โ A      B      C      D      q_15    
     โ Int64  Int64  Int64  Int64  Float64 
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
   1 โ     1      5      1      5      1.0
   2 โ     2      6      2      6      2.0
   3 โ     3      7      3      7      3.0
   4 โ     4      8      4      8      4.0
On the latest version of DataFramesMeta you can do
julia> @rtransform df :q_15 = quantile(collect(AsTable([:A, :B, :C])), .15)
4ร5 DataFrame
 Row โ A      B      C      D      q_15    
     โ Int64  Int64  Int64  Int64  Float64 
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
   1 โ     1      5      1      5      1.0
   2 โ     2      6      2      6      2.0
   3 โ     3      7      3      7      3.0
   4 โ     4      8      4      8      4.0
EDIT: It looks like you donโt even need collect in there anywhere .
Thank you, Peter! This is helpful. However, is there a solution for the case where you have many columns? In the data I am working with, I have 10K columns where I want to calculate row-wise percentiles. Again, thank you for the help!
Canโt speak for dataframes but if you can get just a plain Array then VectorizedStatistics.jl (and NanStatistics.jl) have fast (after compilation) methods with dimension-wise quantiles and percentiles
julia> A = rand(1000, 10000);
julia> @time vquantile!(copy(A), 0.15, dims=2)
  0.727348 seconds (3 allocations: 76.302 MiB)
1000ร1 Matrix{Float64}:
 0.15339182476244487
 0.15116373206593528
 0.14991189653752926
 0.14992050611491373
 0.1502873388769227
 0.14873368062909753
 โฎ
 0.14993164357947808
 0.14337271376066515
 0.15202934011278724
 0.14750348034834604
 0.15218943855833048
julia> @time vquantile!(copy(A), 0.15, dims=1)
  0.131755 seconds (4 allocations: 76.370 MiB)
1ร10000 Matrix{Float64}:
 0.132576  0.13291  0.17419  0.165488  0.168598  0.146831  โฆ  0.151196  0.15443  0.147143  0.146142  0.158048  0.139629
The copy is because this is an in-place method for efficiency, which you definitely donโt want if youโre going to alternate between row-wise and column-wise percentiles but is great if youโre going to take many different percentiles along the same dimension successively
How many rows do you have? In this instance it may be better to call permutedims on your data frame and operate by column.
The solution I give will not scale to 10K columns. You probably want a different data structure
If you have enough RAM most likely moving the data to a Matrix and then doing rowwise quantiles will be most efficient (I give an example using standard functions, specialized packages will be faster).
If you want to do it in DataFrames.jl here is the way to do it:
julia> using DataFrames, Statistics
julia> df = DataFrame(rand(100, 10_000), :auto);
julia> @time transform(df, AsTable(All()) => ByRow((t -> quantile(t, 0.15))โcollect) => :q_15);
  0.093942 seconds (242.68 k allocations: 35.970 MiB, 6.62% gc time, 61.12% compilation time)
which is slower but not by much in comparison to:
julia> @time quantile.(eachrow(Matrix(df)), 0.15);
  0.016732 seconds (10.21 k allocations: 15.497 MiB)
(all timings are after compilation)
Note that the crucial part is โcollect. DataFrames.jl detects such composition and avoids excessive compilation in such cases.
On larger data the difference is even smaller:
julia> df = DataFrame(rand(10_000, 10_000), :auto);
julia> @time quantile.(eachrow(Matrix(df)), 0.15);
  1.954686 seconds (30.02 k allocations: 1.491 GiB, 1.86% gc time)
julia> @time select(df, AsTable(All()) => ByRow((t -> quantile(t, .15))โcollect) => :q_15);
  2.081471 seconds (220.47 k allocations: 1.500 GiB, 7.30% gc time, 2.69% compilation time)
(mostly GC is in play here as the DataFrames.jl method allocates more)