Row-wise quantile

How can I use the quantile! function row-wise on a dataframe?

df = DataFrame(A = 1:4, B = 5:8, C = 1:4, D = 5:8)

For example, I would like to add a column with the 0.15 percentile for each row.

Thank you for the help!

AsTable and collect are useful here

julia> using DataFramesMeta, Statistics

julia> transform(df, AsTable([:A, :B, :C]) => ByRow(t -> quantile(collect(t), .15)) => :q_15)
4ร—5 DataFrame
 Row โ”‚ A      B      C      D      q_15    
     โ”‚ Int64  Int64  Int64  Int64  Float64 
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1      5      1      5      1.0
   2 โ”‚     2      6      2      6      2.0
   3 โ”‚     3      7      3      7      3.0
   4 โ”‚     4      8      4      8      4.0

On the latest version of DataFramesMeta you can do

julia> @rtransform df :q_15 = quantile(collect(AsTable([:A, :B, :C])), .15)
4ร—5 DataFrame
 Row โ”‚ A      B      C      D      q_15    
     โ”‚ Int64  Int64  Int64  Int64  Float64 
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1      5      1      5      1.0
   2 โ”‚     2      6      2      6      2.0
   3 โ”‚     3      7      3      7      3.0
   4 โ”‚     4      8      4      8      4.0

EDIT: It looks like you donโ€™t even need collect in there anywhere .

Thank you, Peter! This is helpful. However, is there a solution for the case where you have many columns? In the data I am working with, I have 10K columns where I want to calculate row-wise percentiles. Again, thank you for the help!

Canโ€™t speak for dataframes but if you can get just a plain Array then VectorizedStatistics.jl (and NanStatistics.jl) have fast (after compilation) methods with dimension-wise quantiles and percentiles

julia> A = rand(1000, 10000);

julia> @time vquantile!(copy(A), 0.15, dims=2)
  0.727348 seconds (3 allocations: 76.302 MiB)
1000ร—1 Matrix{Float64}:
 0.15339182476244487
 0.15116373206593528
 0.14991189653752926
 0.14992050611491373
 0.1502873388769227
 0.14873368062909753
 โ‹ฎ
 0.14993164357947808
 0.14337271376066515
 0.15202934011278724
 0.14750348034834604
 0.15218943855833048

julia> @time vquantile!(copy(A), 0.15, dims=1)
  0.131755 seconds (4 allocations: 76.370 MiB)
1ร—10000 Matrix{Float64}:
 0.132576  0.13291  0.17419  0.165488  0.168598  0.146831  โ€ฆ  0.151196  0.15443  0.147143  0.146142  0.158048  0.139629

The copy is because this is an in-place method for efficiency, which you definitely donโ€™t want if youโ€™re going to alternate between row-wise and column-wise percentiles but is great if youโ€™re going to take many different percentiles along the same dimension successively

1 Like

How many rows do you have? In this instance it may be better to call permutedims on your data frame and operate by column.

The solution I give will not scale to 10K columns. You probably want a different data structure

If you have enough RAM most likely moving the data to a Matrix and then doing rowwise quantiles will be most efficient (I give an example using standard functions, specialized packages will be faster).
If you want to do it in DataFrames.jl here is the way to do it:

julia> using DataFrames, Statistics

julia> df = DataFrame(rand(100, 10_000), :auto);

julia> @time transform(df, AsTable(All()) => ByRow((t -> quantile(t, 0.15))โˆ˜collect) => :q_15);
  0.093942 seconds (242.68 k allocations: 35.970 MiB, 6.62% gc time, 61.12% compilation time)

which is slower but not by much in comparison to:

julia> @time quantile.(eachrow(Matrix(df)), 0.15);
  0.016732 seconds (10.21 k allocations: 15.497 MiB)

(all timings are after compilation)

Note that the crucial part is โˆ˜collect. DataFrames.jl detects such composition and avoids excessive compilation in such cases.

2 Likes

On larger data the difference is even smaller:

julia> df = DataFrame(rand(10_000, 10_000), :auto);

julia> @time quantile.(eachrow(Matrix(df)), 0.15);
  1.954686 seconds (30.02 k allocations: 1.491 GiB, 1.86% gc time)

julia> @time select(df, AsTable(All()) => ByRow((t -> quantile(t, .15))โˆ˜collect) => :q_15);
  2.081471 seconds (220.47 k allocations: 1.500 GiB, 7.30% gc time, 2.69% compilation time)

(mostly GC is in play here as the DataFrames.jl method allocates more)

1 Like