Finding the percentile rank of a value in a dataset. Noob needs comments on approach

anon69491625 · April 11, 2022, 6:21pm

trying to get percentile using a specific value in this case stock close.
The data set is below I loaded it all into a dataframe called df_symbol. The columns are time and close.

time	close
20210412	701.98
20210413	762.32
20210414	732.23
20210415	738.85
20210416	739.78
20210419	714.63
20210420	718.99
20210421	744.12
20210422	719.69
20210423	729.4
20210426	738.2
20210427	704.74
20210428	694.4
20210429	677
20210430	709.44
20210503	684.9
20210504	673.6
20210505	670.94
20210506	663.54
20210507	672.37
20210510	629.04
20210511	617.2
20210512	589.89
20210513	571.69
20210514	589.74
20210517	576.83
20210518	577.87
20210519	563.46
20210520	586.78
20210521	580.88
20210524	606.44
20210525	604.69
20210526	619.13
20210527	630.85
20210528	625.22
20210601	623.9
20210602	605.12
20210603	572.84
20210604	599.05
20210607	605.13
20210608	603.59
20210609	598.78
20210610	610.12
20210611	609.89
20210614	617.69
20210615	599.36
20210616	604.87
20210617	616.6
20210618	623.31
20210621	620.83
20210622	623.71
20210623	656.57
20210624	679.82
20210625	671.87
20210628	688.72
20210629	680.76
20210630	679.7
20210701	677.92
20210702	678.9
20210706	659.58
20210707	644.65
20210708	652.81
20210709	656.95
20210712	685.7
20210713	668.54
20210714	653.38
20210715	650.6
20210716	644.22
20210719	646.22
20210720	660.5
20210721	655.29
20210722	649.26
20210723	643.38
20210726	657.62
20210727	644.78
20210728	646.98
20210729	677.35
20210730	687.2
20210802	709.67
20210803	709.74
20210804	710.92
20210805	714.63
20210806	699.1
20210809	713.76
20210810	709.99
20210811	707.82
20210812	722.25
20210813	717.17
20210816	686.17
20210817	665.71
20210818	688.99
20210819	673.47
20210820	680.26
20210823	706.3
20210824	708.49
20210825	711.2
20210826	701.16
20210827	711.92
20210830	730.91
20210831	735.72
20210901	734.09
20210902	732.39
20210903	733.57
20210907	752.92
20210908	753.87
20210909	754.86
20210910	736.27
20210913	743
20210914	744.49
20210915	755.83
20210916	756.99
20210917	759.49
20210920	730.17
20210921	739.38
20210922	751.94
20210923	753.64
20210924	774.39
20210927	791.36
20210928	777.56
20210929	781.31
20210930	775.48
20211001	775.22
20211004	781.53
20211005	780.59
20211006	782.75
20211007	793.61
20211008	785.49
20211011	791.94
20211012	805.72
20211013	811.08
20211014	818.32
20211015	843.03
20211018	870.11
20211019	864.27
20211020	865.8
20211021	894
20211022	909.68
20211025	1024.86
20211026	1018.43
20211027	1037.86
20211028	1077.04
20211029	1114
20211101	1208.59
20211102	1172
20211103	1213.86
20211104	1229.91
20211105	1222.09
20211108	1162.94
20211109	1023.5
20211110	1067.95
20211111	1063.51
20211112	1033.42
20211115	1013.39
20211116	1054.73
20211117	1089.01
20211118	1096.38
20211119	1137.06
20211122	1156.87
20211123	1109.03
20211124	1116
20211126	1081.92
20211129	1136.99
20211130	1144.76
20211201	1095
20211202	1084.6
20211203	1014.97
20211206	1009.01
20211207	1051.75
20211208	1068.96
20211209	1003.8
20211210	1017.03
20211213	966.41
20211214	958.51
20211215	975.99
20211216	926.92
20211217	932.57
20211220	899.94
20211221	938.53
20211222	1008.87
20211223	1067
20211227	1093.94
20211228	1088.47
20211229	1086.19
20211230	1070.34
20211231	1056.78
20220103	1199.78
20220104	1149.59
20220105	1088.12
20220106	1064.7
20220107	1026.96
20220110	1058.12
20220111	1064.4
20220112	1106.22
20220113	1031.56
20220114	1049.61
20220118	1030.51
20220119	995.65
20220120	996.27
20220121	943.9
20220124	930
20220125	918.4
20220126	937.41
20220127	829.1
20220128	846.35
20220131	936.72
20220201	931.25
20220202	905.66
20220203	891.14
20220204	923.32
20220207	907.34
20220208	922
20220209	932
20220210	904.55
20220211	860
20220214	875.76
20220215	922.43
20220216	923.39
20220217	876.35
20220218	856.98
20220222	821.53
20220223	764.04
20220224	800.77
20220225	809.87
20220228	870.43
20220301	864.37
20220302	879.89
20220303	839.29
20220304	838.29
20220307	804.58
20220308	824.4
20220309	858.97
20220310	838.3
20220311	795.35
20220314	766.37
20220315	801.89
20220316	840.23
20220317	871.6
20220318	905.39
20220321	921.16
20220322	993.98
20220323	999.11
20220324	1013.92
20220325	1010.64
20220328	1091.84
20220329	1099.57
20220330	1093.99
20220331	1077.6
20220401	1084.59
20220404	1145.45
20220405	1091.26
20220406	1045.76
20220407	1057.26
20220408	1025.49
20220411	988.67

I’m still learning Julia so can someone just check my code please.
ANY comments welcomed.

julia> last(df_symbol)
DataFrameRow
 Row │ time      close   
     │ Int64     Float64 
─────┼───────────────────
 254 │ 20220411   988.67

julia> last(df_symbol).close
988.67

julia> days_below = nrow(filter(row -> row.close < last(df_symbol).close,df_symbol) )
185

julia> percent_below = (days_below/nrow(df_symbol))*100
72.83464566929135

julia> percent_above  = 100 -  (days_below/nrow(df_symbol))*100
27.165354330708652

bkamins · April 11, 2022, 6:40pm

Alternative ways to do it are.

This one will be fast:

julia> 100 * count(<(df_symbol.close[end]), df_symbol.close) / nrow(df_symbol)
72.83464566929133

or (this one is probably simple to understand)

julia> using Statistics

julia> 100 * mean(df_symbol.close .< df_symbol.close[end])
72.83464566929135

or (using a function designed for this means, the issue is that you need to learn it ):

julia> using StatsBase

julia> percentilerank(df_symbol.close, df_symbol.close[end], method=:strict)
72.83464566929135

anon69491625 · April 11, 2022, 8:31pm

Thanks so much for taking the time @bkamins I’m going to claim “close but no cigar” on my code attempt

Why is

julia> 100 * count(<(df_symbol.close[end]), df_symbol.close) / nrow(df_symbol)
72.83464566929133

faster than

percent_below = ( ( nrow(filter(row -> row.close < last(df_symbol).close,df_symbol)  /nrow(df_symbol))*100

thanks again for taking the time to help a noob

bkamins · April 12, 2022, 4:35am

Because count version is non-allocating and type stable.

The

filter(row -> row.close < last(df_symbol).close, df_symbol)

expression creates a new data frame. This allocates a lot of memory. Additionally the last(df_symbol).close is captured dynamically. If you wanted to follow your pattern but in a faster way you could write this part as:

filter(:close => <(last(df_symbol).close), df_symbol, view=true)

it would still allocate but much less (only filtering condition) and would be type stable as in <(last(df_symbol).close) part, since this is a partial function application, last(df_symbol).close is evaluated at creation time.

anon69491625 · April 12, 2022, 12:49pm

Thank you for taking the time. Excellent learning session. I’ll spend more time considering the ramifications of what I am doing. I’m going to walk the count route until I develop the ability to find the tools specifically designed to do what I want Thanks for the heads up on StatsBase.

Topic		Replies	Views
How to efficiently get the DataFrame row index of statistical operation result New to Julia statistics , dataframes	1	393	April 30, 2021
Opposite of StatsBase.percentile Statistics	5	479	November 21, 2022
How do I calculate percentiles/IQR with julia? New to Julia question	7	5827	May 27, 2019
How to calculate the statistics/ percentile in julia General Usage	5	1780	May 29, 2023
Julia equivalent of R's quantile function Performance r , functions , julia	7	1363	August 10, 2023

Finding the percentile rank of a value in a dataset. Noob needs comments on approach

Related topics