Julia equivalent of R's quantile function

I am trying to convert R code to Julia and need help converting the logic below.

Currently, the R code processes each row of the percentile_to data frame in a for loop, passing the row's values as arguments to the function func_get_percentile_value.

I know this can be done easily with transform in Julia, but the catch is that func_get_percentile_value computes the quantile on matching values from another data frame (history).

Ex: quantile(history[IN_GR == type, ACT_TIN_MINUTES], probs = prob)

Can another data frame be passed into the transform function, or is there a better way to handle this scenario?

I would also like to ask about the quantile function. I saw there is a quantile function in the Distributions module; can I use it, and how do I handle the probs argument that is used in R?

for (i in 1:nrow(percentile_to.dt)) {
  row = percentile_to.dt[i,]
  percentile_to.dt[i,"TO_MAX"] <- func_get_percentile_value("ABC", "TO", row$depart, row$arrival, row$TO_Percentile_Threshold)
  percentile_to.dt[i,"TO_REPLACE"] <- func_get_percentile_value("ABC", "TO", row$depart, row$arrival, row$TO_Percentile_Replace)
}

func_get_percentile_value <- function(type, mode, origin, destination, percentile){
  if (!is.numeric(percentile) | percentile<0 | percentile>100){
    return(NULL)
  } else{
    prob = percentile/100
  }
  
  if (origin == "*" & destination == "*"){
    if (mode == "TO"){
      return(quantile(history[OUT_GR == type, ACT_OUT_MINUTES], probs = prob))
    } else if (mode == "TI"){
      return(quantile(history[IN_GR == type, ACT_TIN_MINUTES], probs = prob))
    }
  }
}

StatsBase has

 quantile(itr, p; sorted=false, alpha::Real=1.0, beta::Real=alpha)


  Compute the quantile(s) of a collection itr at a specified probability or vector or tuple of probabilities p on
  the interval [0,1]. The keyword argument sorted indicates whether itr can be assumed to be sorted.

  Sample quantiles are defined by Q(p) = (1-γ)*x[j] + γ*x[j+1], where x[j] is the j-th order statistic, and γ is a
  function of j = floor(n*p + m), m = alpha + p*(1 - alpha - beta) and g = n*p + m - j.

  By default (alpha = beta = 1), quantiles are computed via linear interpolation between the points ((k-1)/(n-1),
  v[k]), for k = 1:n where n = length(itr). This corresponds to Definition 7 of Hyndman and Fan (1996), and is the
  same as the R and NumPy default.

  The keyword arguments alpha and beta correspond to the same parameters in Hyndman and Fan, setting them to
  different values allows to calculate quantiles with any of the methods 4-9 defined in this paper:

    •  Def. 4: alpha=0, beta=1

    •  Def. 5: alpha=0.5, beta=0.5

    •  Def. 6: alpha=0, beta=0 (Excel PERCENTILE.EXC, Python default, Stata altdef)

    •  Def. 7: alpha=1, beta=1 (Julia, R and NumPy default, Excel PERCENTILE and PERCENTILE.INC, Python
       'inclusive')

    •  Def. 8: alpha=1/3, beta=1/3

    •  Def. 9: alpha=3/8, beta=3/8

  │ Note
  │
  │  An ArgumentError is thrown if v contains NaN or missing values. Use the skipmissing function to omit
  │  missing entries and compute the quantiles of non-missing values.

  References
  ≡≡≡≡≡≡≡≡≡≡≡≡

    •  Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages", The American Statistician,
       Vol. 50, No. 4, pp. 361-365

    •  Quantile on Wikipedia (https://en.m.wikipedia.org/wiki/Quantile) details the different quantile
       definitions

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> using Statistics
  
  julia> quantile(0:20, 0.5)
  10.0
  
  julia> quantile(0:20, [0.1, 0.5, 0.9])
  3-element Vector{Float64}:
    2.0
   10.0
   18.000000000000004
  
  julia> quantile(skipmissing([1, 10, missing]), 0.5)
  5.5
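
On the probs question: R's probs argument corresponds to the second positional argument p here, so percentile/100 can be passed straight through. A minimal sketch, where the minutes vector is made-up data standing in for the filtered history column:

```julia
using Statistics

# Made-up sample, standing in for history[OUT_GR == type, ACT_OUT_MINUTES].
minutes = [12, 15, 9, 30, 22, 18, 25]

percentile = 90
prob = percentile / 100                      # same conversion as in the R function

q90 = quantile(minutes, prob)                # R: quantile(minutes, probs = 0.9)
qs  = quantile(minutes, [0.25, 0.5, 0.75])   # R: quantile(minutes, probs = c(0.25, 0.5, 0.75))
```

With the default alpha = beta = 1 this uses the same definition (type 7) as R's default, so the numbers should agree with R for data without missing values.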

@jar1 Thank you for the information. Could you also answer my other question: can we pass a data frame to the transform function so that I can use it inside the function call? Or is there a better approach for this?

While it’s not clear to me exactly how you intend to use it, you can certainly use a dataframe as a function argument within the mini-language.
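
For illustration, here is a minimal sketch of one way this could look: the second data frame is simply captured by the anonymous function given to ByRow. The tables, the column names and the lookup helper are all hypothetical stand-ins, not the original data.

```julia
using DataFrames, Statistics

# Hypothetical stand-ins for the real tables.
history = DataFrame(OUT_GR = ["ABC", "ABC", "ABC", "XYZ"],
                    ACT_OUT_MINUTES = [10, 20, 30, 99])
percentile_df = DataFrame(to_percentile_threshold = [0, 50, 100])

# Hypothetical helper: quantile of the minutes that belong to `group` in `h`.
lookup(h, group, pct) = quantile(h[h.OUT_GR .== group, :ACT_OUT_MINUTES], pct / 100)

# `history` is captured by the closure; only the listed column of
# `percentile_df` is passed in row by row.
transform!(percentile_df,
    :to_percentile_threshold => ByRow(p -> lookup(history, "ABC", p)) => :to_max)
```

After this call percentile_df has a new to_max column holding one quantile per row.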

As far as I can tell, your function is not being passed data frames at all, just vectors. You are referencing what looks like a data frame in the body of your function (history[OUT_GR == type, ACT_OUT_MINUTES]). Be aware that this kind of reference to a non-constant global variable inside a function is bound to make your code quite slow, so you should pass it as an argument to your function instead.

Here’s one way of writing your function:

function func_get_percentile_value(df, type, mode, origin, destination, percentile)
    if !(percentile isa Number) || percentile < 0 || percentile > 100
        return missing
    end

    prob = percentile / 100

    # Compare against the string "*" (not the character '*'): "*" == '*' is false in Julia.
    if origin == "*" && destination == "*" # What happens if this is not the case?
        if mode == "TO"
            return quantile(df[df.OUT_GR .== type, :ACT_OUT_MINUTES], prob)
        elseif mode == "TI"
            return quantile(df[df.IN_GR .== type, :ACT_TIN_MINUTES], prob)
        else
            error("mode has to be either TO or TI")
        end
    end
    # As in the R original, nothing is returned if origin/destination are not both "*".
end

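and if I understand your code correctly, the loop can be replaced by a simple broadcasted invocation of that function. The second line in the loop (which generates the TO_MAX column) would be the following, where the history data frame is wrapped in Ref so that it is treated as a scalar rather than broadcast over, and the table is written as percentile_to_dt because percentile_to.dt is not a valid variable name in Julia:

percentile_to_dt.TO_MAX = func_get_percentile_value.(Ref(history), "ABC", "TO",
        percentile_to_dt.depart, percentile_to_dt.arrival, percentile_to_dt.TO_Percentile_Threshold)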

@rocco_sprmnt21 Yeah, I’ve found out recently that we can pass df as function argument - Thank You!

@nilshg Thank you for the detailed clarification. Yes, the code I pasted is R code that needs to be converted to Julia. And yes, they are not passing a data frame; they compute the values inside a for loop for each row, which is why the R code takes longer than expected to run.

I noticed that you used a broadcast invocation and wanted to check whether the transform function or the broadcast operation would be faster.

Ex. transform!(percentile_df, [:arrival, :destination, :to_percentile_threshold] => ByRow((a, d, t) -> func_get_percentile_value(df, "ABC", "TO", a, d, t)) => :to_max)

Benchmark it, I would be surprised if there was a noticeable difference.
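
A rough sketch of how such a comparison could be set up with BenchmarkTools. The data is random and the names lookup, via_broadcast, via_transform and via_prefilter are hypothetical helpers made up for this example, so treat any timings as indicative only:

```julia
using DataFrames, Statistics, BenchmarkTools

# Random stand-ins for the real tables.
history = DataFrame(OUT_GR = rand(["ABC", "XYZ"], 10_000),
                    ACT_OUT_MINUTES = rand(1:120, 10_000))
percentile_df = DataFrame(pct = rand(0:100, 200))

# Hypothetical helper: filter `h` to one group and take the quantile.
lookup(h, group, pct) = quantile(h[h.OUT_GR .== group, :ACT_OUT_MINUTES], pct / 100)

# Variant 1: plain broadcast, with the data frame wrapped in Ref.
via_broadcast(df, h) = lookup.(Ref(h), "ABC", df.pct)

# Variant 2: transform with ByRow, capturing `h` in the closure.
via_transform(df, h) = transform(df, :pct => ByRow(p -> lookup(h, "ABC", p)) => :to_max)

# Variant 3: filter once, then ask for all quantiles in a single call.
function via_prefilter(df, h)
    vals = h[h.OUT_GR .== "ABC", :ACT_OUT_MINUTES]
    return quantile(vals, df.pct ./ 100)
end

@btime via_broadcast($percentile_df, $history);
@btime via_transform($percentile_df, $history);
@btime via_prefilter($percentile_df, $history);
```

The first two variants repeat the same filtering for every row, so I would expect that work to dominate in both. If the group is the same for all rows, filtering once as in the third variant is probably the bigger win.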
