Hi,

I was trying to play around with the implementation of fsortperm in SortingLab.jl and wanted to try making an implementation to just compute directly the sorted indices rather than sorting both the values vector and the indices vector.

I ended up having this comparison function to compare my implementation with the one from SortingLab (it is actually an adapted version but conceptually it should be equivalent for the benchmark I am trying).

This is the code I have for the comparison:

```
import SortingAlgorithms.uint_mapping
using BenchmarkTools
const RADIX_SIZE = 11
const RADIX_MASK = 0x7FF
function sort_test_outer(vs::AbstractVector, f_inner; lo::Int=1, hi::Int=length(vs), ind=collect(1:hi), o=Base.Order.ForwardOrdering(), ts=similar(vs), ind1=copy(ind))
T = eltype(vs)
# Init
iters = ceil(Integer, sizeof(T)*8/RADIX_SIZE)
bin = zeros(UInt32, 2^RADIX_SIZE, iters)
if lo > 1; bin[1,:] .= lo-1; end
# Histogram for each element, radix
for i = lo:hi
v = uint_mapping(o, vs[i])
for j = 1:iters
idx = Int((v >> ((j-1)*RADIX_SIZE)) & RADIX_MASK) + 1
@inbounds bin[idx,j] += 1
end
end
args = (
vs = vs,
ts = ts,
ind = ind,
ind1 = ind1,
hi = hi,
)
# Sort!
f_inner(bin,iters,o,args)
end
function inner_t1(bin,iters,o,args)
# Unpack the used variables
vs = args.vs
ind = args.ind
ind1 = args.ind1
swaps = 0
for j = 1:iters
cbin = vcat(0,cumsum(bin[1:end-1,j]))
@inbounds for i in ind # <--- Here I index directly on the index vector
v = uint_mapping(o, vs[i])
idx = Int((v >> ((j-1)*RADIX_SIZE)) & RADIX_MASK) + 1
ci = cbin[idx] += 1
ind1[ci] = i
end
ind,ind1 = ind1,ind
swaps += 1
end
if isodd(swaps)
ind,ind1 = ind1,ind
end
return ind
end
function inner_sortinglab(bin,iters,o,args)
# Unpack the used variables
vs = args.vs
ts = args.ts
hi = args.hi
ind = args.ind
ind1 = args.ind1
swaps = 0
for j = 1:iters
cbin = vcat(0,cumsum(bin[1:end-1,j]))
@inbounds for i in 1:hi
v = uint_mapping(o, vs[i])
idx = Int((v >> ((j-1)*RADIX_SIZE)) & RADIX_MASK) + 1
ci = cbin[idx] += 1
ts[ci] = vs[i]
ind1[ci] = ind[i]
end
vs,ts = ts,vs
ind,ind1 = ind1,ind
swaps += 1
end
if isodd(swaps)
vs,ts = ts,vs
ind,ind1 = ind1,ind
end
return ind
end
a = rand(10^6);
display(sort_test_outer(a,inner_t1) == sort_test_outer(a,inner_sortinglab))
display(@benchmark sort_test_outer($a,inner_t1))
display(@benchmark sort_test_outer($a,inner_sortinglab))
```

where the main difference lies in the inner loops of inner_t1 and inner_sortinglab.

On my machine, these are the results

```
true
BenchmarkTools.Trial:
memory estimate: 23.18 MiB
allocs estimate: 128
--------------
minimum time: 202.476 ms (0.00% GC)
median time: 219.459 ms (0.00% GC)
mean time: 223.232 ms (0.83% GC)
maximum time: 321.975 ms (1.48% GC)
--------------
samples: 23
evals/sample: 1
BenchmarkTools.Trial:
memory estimate: 23.18 MiB
allocs estimate: 128
--------------
minimum time: 61.853 ms (0.00% GC)
median time: 68.881 ms (0.00% GC)
mean time: 68.984 ms (4.23% GC)
maximum time: 86.784 ms (0.00% GC)
--------------
samples: 73
evals/sample: 1
```

And I was trying to understand why the cuustom version is so much slower.

The main idea was to try and reduce the amount of vector reshuffling as this way I only have to reshuffle the indices vector rather than both ind and vs vectors, eventually calling vs[ind] at the end.

This is obviously not beneficial as the computation time explodes though.

Is it something related to CPU cache? (edited)

Edit: I’ll also tag @xiaodai here as he might have explored this while writing the original code