Returning the n lowest values across columns in a dataframe

Snowy · May 31, 2022, 2:32am

Hi all,
My first post was removed (not sure why), so here is try number 2…
I’m new to Julia and am stuck on a problem.
I have a dataframe defined as follows:

df = DataFrame((name=["S1","S2","S3","S4","S5"],Z1=[41,12,19,17,11],Z2=[13,28,29,99,41],Z3=[7,86,23,12,71],Z4=[4,13,23,11,19],Z5=[41,12,13,19,22],Z6=[11,18,22,46,5]))

For each row, I want to return the column index of the lowest n values (better yet, return the column name). So for instance, if I wanted the 3 lowest values across Row “S1”, it would be columns 5 (Z4), 4 (Z3), and 7 (Z6).

I can get the column indexes with the following code:

partialsortperm(Vector(df[1,2:ncol(df)]),1:3)

But this returns a vector, which isn’t really ideal. (To clarify, since the input is a df, the expected output is also a df… The indexes themselves aren’t really of value, the value is the column name represented by the index, presented in a package of a dataframe that’s easily read by a human that’s used to looking at spreadsheets.)

I am wondering if there is a better solution here? Is there an available function that allows me to return the indexes or column names for the n lowest values and return the result as a dataframe? If not, what might be the suggested solution? I can imagine perhaps running the above code through a eachrow() loop, building a matrix of the results, converting the matrix to a df, then joining that back to a match table that has the column index/names to get a final result… All of that seems a little overly complex for what I imagine is a simple solution.

Appreciate your help.
Snowy

jd-foster · May 31, 2022, 5:03am

This works, but there must be a better one than this:

n_lowest(row::DataFrameRow,n::Integer) = row[[1; partialsortperm(Vector(row[2:end]), 1:n).+1]]

n_lowest.(eachrow(df),3)

This is specific to your case where you have a “name” column in the first position, so we need an ‘offset’ index [[1; and the .+1 shift.

rafael.guerra · May 31, 2022, 9:13am

A small variation that returns the names (NB: edited to select only the columns with names like “Z” + number):

nlowest!(df,n) = df[!,:nlowest] = names.((df,), [ones(Int,n)] .+ partialsortperm.(Vector.(eachrow(select(df,r"Z\d+"))), (1:n,)))

n=3
nlowest!(df,n)

julia> df
5×8 DataFrame
 Row │ name    Z1     Z2     Z3     Z4     Z5     Z6     nlowest
     │ String  Int64  Int64  Int64  Int64  Int64  Int64  Array…
─────┼──────────────────────────────────────────────────────────────────────
   1 │ S1         41     13      7      4     41     11  ["Z4", "Z3", "Z6"]
   2 │ S2         12     28     86     13     12     18  ["Z1", "Z5", "Z4"]
   3 │ S3         19     29     23     23     13     22  ["Z5", "Z1", "Z6"]
   4 │ S4         17     99     12     11     19     46  ["Z4", "Z3", "Z1"]
   5 │ S5         11     41     71     19     22      5  ["Z6", "Z1", "Z4"]

jd-foster · June 1, 2022, 12:35am

Snowy · June 1, 2022, 3:31am

Thanks! This probably gets as close to what I was originally trying to achieve.
Question for both though…
Why is this the case that I can only run the function once? If I run the following code:

nlowest!(df,n) = df[!,:nlowest] = names.((df,), [ones(Int,n)] .+ partialsortperm.(Vector.(eachrow(df[:,2:end])), (1:n,)))

n=3
nlowest!(df,n)

I get the expected result.
If I change n = 4 and rerun nlowest!(df,n), I get the following error:

MethodError: no method matching isless(::Vector{String}, ::Int64)

If I restart the REPL and run the following code:

nlowest!(df,n) = df[!,:nlowest] = names.((df,), [ones(Int,n)] .+ partialsortperm.(Vector.(eachrow(df[:,2:end])), (1:n,)))
n=4
nlowest!(df,n)

Then once again, I get the expected result with no error.
Why is this?

edit
Interestingly, if I reload the dataframe, then rerun the function with a different n value, it works perfectly. I do not understand this behavior.

jules · June 1, 2022, 4:29am

Because the function mutates your dataframe and the next time you run it it includes the new column with strings in it, which fails. The ! suffix of the function indicates that it mutates the dataframe, that’s a common convention for naming things in Julia.

jd-foster · June 1, 2022, 5:00am

A version that doesn’t change the original dataframe:

nlowest(df,n) = names.((df,), [ones(Int,n)] .+ partialsortperm.(Vector.(eachrow(df[:,2:end])), (1:n,)))

rafael.guerra · June 1, 2022, 5:37am

I have updated the function nlowest!() in my post above to fix this. Basically by selecting only the columns with names like “Z” + number, in order to perform the analysis.

Topic		Replies	Views
For each row, return n lowest column indexes or column names General Usage question , dataframes	2	395	May 31, 2022
DataFrame group by first column, and sort by last column Data dataframes	4	1892	October 16, 2018
Sort rows in a dataframe based on a predefined order New to Julia sort , dataframes	5	1827	September 17, 2021
How to efficiently get the DataFrame row index of statistical operation result New to Julia statistics , dataframes	1	350	April 30, 2021
Index of max values of N columns in a DataFrame without manually entering each column New to Julia question , indexing , dataframes , maxima	15	1510	January 7, 2023

Returning the n lowest values across columns in a dataframe

Related topics