Sorting: use of keyword by and lt

I tried to solve the problem proposed here using the function SORT() with the BY and LT keywords.

In particular I tried with the following dataframe and expression:

df = DataFrame(NAME = ["A1","A1","A2",   "A1","A2","A2","A2","A2","A3","A3","A4","A5","A6","A6","A6","A6","A6","A6","A6"],
                CAT = ["INF","NF", "XX" , "INH","AP","CF","UTL","GT","CP","MP","AP","BE","NF","CF","PP","AZ","PY","OP","APX"])
dft=combine(groupby(df,"NAME"), [:CAT=>(t->sort!(t,by=last,lt=(x,y)->(y=='F' || y=='P') || x<=y))=>:CAT,
:CAT=>(t->string.("CAT",1:length(t)))=>:mm]) 

What I’ve observed is that if I run the function COMBINE() twice, I get a different ordering of the dataframe.
This seems to depend on the particular LT function I used. Maybe I should also think more about what happens if the first parameter of the LT function is F or P.

It arose the doubt that this particular operation could also depend on the sorting algorithm, implicitly used.
I have not yet been able to find what are the possible values of the ALG parameter to see some details that may be useful to understand what is happening.
Can anyone help me understand how things are going?

Using a custom isless as following seems work correctly:

dft=combine(groupby(df,"NAME"), [:CAT=>(t->sort!(t,by= last, lt= cIsLess))=>:CAT,
:CAT=>(t->string.("CAT",1:length(t)))=>:mm]) 

function cIsLess(x,y)
    if x in "FP"
        return false 
    elseif y in "FP"
        return true
    else return x<y
    end
end

Note that these are not equivalent. Your second versions seems better to me, so what messed up your ordering was indeed the incorrect lt function in the first approach.
(In particular, the first version fails to enforce the “F+P comes at the end” constraint if one of them is x and x<=y is true)

Yes, I am aware of this.
The thing I can’t explain is why if I execute the expression once the order is right (wrong) if I repeat the execution a second time the order is wrong (right).

I see, the problem is, that the order is unpredictable. Specifically, say f(x, y) is the erroneous first version of lt, and say a="F", b="G". Both f(a, b) and f(b, a) yield true then, so even if the frame was sorted by a first pass of this function, as soon as the order of comparisons changes (through the first sorting), the order will be changed again by a second and following passes of sort with this order function (depending on the sorting algorithm used, really).
I suspect, that the Fs and Ps that cause this problem just happen to be in positions that they are the second arguments of the comparison function whenever it matters.
If you care, you could try the different sorting algorithms and see how they cope with the broken order :slight_smile: I speculate that something that applies < naively from left to right like bubble sort would actually be idempotent with the broken order, while more efficient sorting algorithms may not be.

1 Like

Hi @FPGro,
Thank you for your attention to the subject.
I share all your observations. I too thought that the result using the wrong custom LT could depend not only on the initial position of the “P” and “F”, but also on the sorting algorithm used.
but I still haven’t been able to find how to give inputs different to the alg = parameter and look for an idempotent algorithm (I love this term)
But in any case it still remains to be clarified because with the same custom LT and with the same dataframe executing the COMBINE function twice consecutively we get two different results.
It is as if the second run started from a different “system state” perhaps modified from the first execution.
but what? as? where?

This isn’t related to combine, but to the fact that you sort! the data in place, so the second time you run the operation the data is in a different order:

julia> t = df.CAT[df.NAME .== "A1"]
3-element Vector{String}:
 "INF"
 "NF"
 "INH"

julia> sort!(t,by=last,lt=(x,y)->(y=='F' || y=='P') || x<=y)
3-element Vector{String}:
 "INH"
 "NF"
 "INF"

julia> sort!(t,by=last,lt=(x,y)->(y=='F' || y=='P') || x<=y)
3-element Vector{String}:
 "INF"
 "NF"
 "INH"
1 Like

ok. That seems to be the point.
Now I have tried using the sort() function instead of sort!() and everything seems to work as expected (including errors, such as those underlined by @FPGro)

dft=combine(dfg, [:CAT=>(t->sort(t,by=last,lt=(x,y)->(y=='F' || y=='P') || x<=y))=>:CAT,
:CAT=>(t->string.("CAT",1:length(t)))=>:mm]) 

But if it is immediately evident that applying sort!() twice to the same vector, as you have shown, can produce two different results, it was not, for me at least, so immediate to think that the use of a function within another function acting on a vector obtained from yet another function could change the state of the dataset.
My experience is only on some power query M script, where everything is immutable and therefore I have habits deriving from that context.
The moral is: you have to be very careful with the “!”!
Thank you all!

I see where the confusion comes from now. Indeed, it is not entirely obvious that :CAT stays part of the original dataframe even if it is changed and collected in that pipeline. So take care to use mutating functions only if you intend to, I guess. Have a nice day.

That’s exactly the reason for the ! convention :smile:. Note that it’s just a convention - it’s perfectly possible to define mutating functions that don’t end in !, but it’s a really nice heuristic that when you see it, you know extra caution is advised.

1 Like