I am trying to convert R code to Julia and need help converting the logic below.
Currently the R code processes each row of the percentile_to data frame in a for loop, passing arguments to the function func_get_percentile_value.
I know this can be done easily with transform in Julia, but the catch is that func_get_percentile_value computes a quantile over matching values from another data frame.
Can another data frame be passed into the transform function? Or is there a better way to handle this scenario?
I would also like to check on the quantile function. I saw there is a quantile function in the Distributions module and need confirmation on whether I can use it, and on how to handle the probs argument that is used in R.
quantile(itr, p; sorted=false, alpha::Real=1.0, beta::Real=alpha)
Compute the quantile(s) of a collection itr at a specified probability or vector or tuple of probabilities p on
the interval [0,1]. The keyword argument sorted indicates whether itr can be assumed to be sorted.
Samples quantile are defined by Q(p) = (1-γ)*x[j] + γ*x[j+1], where x[j] is the j-th order statistic, and γ is a
function of j = floor(n*p + m), m = alpha + p*(1 - alpha - beta) and g = n*p + m - j.
By default (alpha = beta = 1), quantiles are computed via linear interpolation between the points ((k-1)/(n-1),
v[k]), for k = 1:n where n = length(itr). This corresponds to Definition 7 of Hyndman and Fan (1996), and is the
same as the R and NumPy default.
The keyword arguments alpha and beta correspond to the same parameters in Hyndman and Fan, setting them to
different values allows to calculate quantiles with any of the methods 4-9 defined in this paper:
• Def. 4: alpha=0, beta=1
• Def. 5: alpha=0.5, beta=0.5
• Def. 6: alpha=0, beta=0 (Excel PERCENTILE.EXC, Python default, Stata altdef)
• Def. 7: alpha=1, beta=1 (Julia, R and NumPy default, Excel PERCENTILE and PERCENTILE.INC, Python
'inclusive')
• Def. 8: alpha=1/3, beta=1/3
• Def. 9: alpha=3/8, beta=3/8
│ Note
│
│ An ArgumentError is thrown if v contains NaN or missing values. Use the skipmissing function to omit
│ missing entries and compute the quantiles of non-missing values.
References
≡≡≡≡≡≡≡≡≡≡≡≡
• Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages", The American Statistician,
Vol. 50, No. 4, pp. 361-365
• Quantile on Wikipedia (https://en.m.wikipedia.org/wiki/Quantile) details the different quantile
definitions
Examples
≡≡≡≡≡≡≡≡≡≡
julia> using Statistics
julia> quantile(0:20, 0.5)
10.0
julia> quantile(0:20, [0.1, 0.5, 0.9])
3-element Vector{Float64}:
2.0
10.0
18.000000000000004
julia> quantile(skipmissing([1, 10, missing]), 0.5)
5.5
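To tie the docstring back to the question about R's probs: the sketch below shows how R-style calls could map onto the quantile documented above, which the examples load with `using Statistics`. The data vector `x` and the variable names are made up for illustration, and the mapping of R's `type = 6` to `alpha = 0, beta = 0` is taken from the Def. 6 entry in the list above.

```julia
using Statistics  # the docstring examples above load quantile from here

# Made-up sample data, purely for illustration.
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

# R: quantile(x, probs = 0.9)
# Julia: the probability is the second positional argument.
q90 = quantile(x, 0.9)

# R: quantile(x, probs = c(0.25, 0.5, 0.75))
# Julia: pass a vector (or tuple) of probabilities.
quartiles = quantile(x, [0.25, 0.5, 0.75])

# R: quantile(x, probs = 0.9, type = 6)
# Julia: select Hyndman and Fan definition 6 through alpha and beta.
q90_def6 = quantile(x, 0.9; alpha = 0.0, beta = 0.0)
```

Note that the probabilities are on the interval [0, 1], so a percentile such as 90 has to be divided by 100 first.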
@jar1 Thank you for the information. Could you also answer my other question: can a data frame be passed to the transform function so that I can use it inside the function call? Or is there a better approach for this?
As far as I can tell, your function is not being passed data frames at all, just vectors. You are referencing what looks like a data frame in the body of your function (history[OUT_GR == type, ACT_OUT_MINUTES]). Be aware that referencing global variables inside functions like this is bound to make your code quite slow, so you should pass the data frame as an argument to your function instead.
Here’s one way of writing your function:
using Statistics  # provides quantile

function func_get_percentile_value(df, type, mode, origin, destination, percentile)
    # Reject anything that is not a number between 0 and 100
    if !(percentile isa Number) || percentile < 0 || percentile > 100
        return missing
    end
    prob = percentile / 100  # quantile expects a probability in [0, 1]
    # Compare against the string "*": a Char '*' never equals a String in Julia
    if origin == "*" && destination == "*" # What happens if this is not the case?
        if mode == "TO"
            return quantile(df[df.OUT_GR .== type, :ACT_OUT_MINUTES], prob)
        elseif mode == "TI"
            return quantile(df[df.IN_GR .== type, :ACT_TIN_MINUTES], prob)
        else
            error("mode has to be either TO or TI")
        end
    end
end
If I understand your code correctly, the loop can be replaced by a simple broadcasted invocation of that function. The second line in the loop (which generates the TO_MAX column) can be written as a single broadcast call.
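A minimal sketch of what such a broadcasted call might look like, under several assumptions: the column names TYPE, ORIGIN, DESTINATION and PERCENTILE in percentile_to, the TI_MAX column, and all data values are hypothetical, and the wildcards are assumed to be stored as the string "*". The function is restated so the sketch runs on its own. The last statement shows one way the same data frame could be made available inside transform, by capturing it in an anonymous function wrapped in ByRow.

```julia
using DataFrames, Statistics

# Compact restatement of the function from this post, so the sketch is self-contained.
function func_get_percentile_value(df, type, mode, origin, destination, percentile)
    if !(percentile isa Number) || percentile < 0 || percentile > 100
        return missing
    end
    prob = percentile / 100
    if origin == "*" && destination == "*"
        if mode == "TO"
            return quantile(df[df.OUT_GR .== type, :ACT_OUT_MINUTES], prob)
        elseif mode == "TI"
            return quantile(df[df.IN_GR .== type, :ACT_TIN_MINUTES], prob)
        else
            error("mode has to be either TO or TI")
        end
    end
    return missing  # assumption: no value unless both wildcards are "*"
end

# Toy stand-ins for the real tables; every value here is made up.
history = DataFrame(
    OUT_GR = ["A", "A", "A", "B", "B"],
    IN_GR = ["A", "B", "B", "B", "A"],
    ACT_OUT_MINUTES = [10.0, 20.0, 30.0, 5.0, 15.0],
    ACT_TIN_MINUTES = [4.0, 6.0, 8.0, 10.0, 12.0],
)
percentile_to = DataFrame(
    TYPE = ["A", "B"],          # hypothetical column names
    ORIGIN = ["*", "*"],
    DESTINATION = ["*", "*"],
    PERCENTILE = [50, 100],
)

# Broadcast over the rows of percentile_to. Ref(history) makes broadcasting
# treat the whole data frame as a single value instead of iterating over it.
percentile_to.TO_MAX = func_get_percentile_value.(
    Ref(history), percentile_to.TYPE, "TO",
    percentile_to.ORIGIN, percentile_to.DESTINATION, percentile_to.PERCENTILE,
)

# The same idea with transform!: the anonymous function captures history,
# and ByRow applies it to each row of the selected columns.
transform!(
    percentile_to,
    [:TYPE, :ORIGIN, :DESTINATION, :PERCENTILE] =>
        ByRow((t, o, d, p) -> func_get_percentile_value(history, t, "TI", o, d, p)) =>
        :TI_MAX,
)
```

Both forms call the function once per row of percentile_to, so which one to use is mostly a matter of style; any speed difference is something to measure on the real data rather than assume.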
@nilshg Thank you for the detailed clarification. Yes, the code I pasted is R that needs to be converted to Julia. And yes, it does not pass a data frame; instead it computes the values inside a for loop for each row, which is why the R code takes longer than expected to run.
I noticed that you used a broadcasted invocation and wanted to check which would be faster: the transform function or the broadcast operation?