Why does changing this indexing slow the code so much?

Hi,
I was playing around with the implementation of fsortperm in SortingLab.jl and wanted to try an implementation that directly computes the sorted indices, rather than sorting both the values vector and the indices vector.
I ended up with the comparison function below to compare my implementation with the one from SortingLab (it is actually an adapted version, but conceptually the two should be equivalent for this benchmark).
This is the code I have for the comparison:

import SortingAlgorithms.uint_mapping
using BenchmarkTools
const RADIX_SIZE = 11
const RADIX_MASK = 0x7FF
function sort_test_outer(vs::AbstractVector, f_inner; lo::Int=1, hi::Int=length(vs), ind=collect(1:hi), o=Base.Order.ForwardOrdering(), ts=similar(vs), ind1=copy(ind))
    T = eltype(vs)
    # Number of radix passes needed to cover all the bits of T
    iters = ceil(Integer, sizeof(T)*8/RADIX_SIZE)
    # One column of bucket counts per radix pass
    bin = zeros(UInt32, 2^RADIX_SIZE, iters)
    if lo > 1;  bin[1,:] .= lo-1;  end
    # Histogram: count the bucket occupancy of every element for each radix pass
    for i = lo:hi
        v = uint_mapping(o, vs[i])
        for j = 1:iters
            idx = Int((v >> ((j-1)*RADIX_SIZE)) & RADIX_MASK) + 1
            @inbounds bin[idx,j] += 1
        end
    end
    # Bundle the shared state for the inner kernel
    args = (
        vs = vs,
        ts = ts,
        ind = ind,
        ind1 = ind1,
        hi = hi,
    )
    # Sort! Run the radix passes via the supplied inner kernel
    f_inner(bin,iters,o,args)
end
function inner_t1(bin,iters,o,args)
    # Unpack the used variables
    vs = args.vs
    ind = args.ind
    ind1 = args.ind1
    swaps = 0
    for j = 1:iters
        # Exclusive prefix sum: cbin[idx] is the next write position for bucket idx
        cbin = vcat(0,cumsum(bin[1:end-1,j]))
        @inbounds for i in ind      # <--- Here I index directly on the index vector
            v = uint_mapping(o, vs[i])
            idx = Int((v >> ((j-1)*RADIX_SIZE)) & RADIX_MASK) + 1
            ci = cbin[idx] += 1
            ind1[ci] = i
        end
        ind,ind1 = ind1,ind   # swap index buffers for the next pass
        swaps += 1
    end
    if isodd(swaps)
        ind,ind1 = ind1,ind
    end
    return ind
end
function inner_sortinglab(bin,iters,o,args)
    # Unpack the used variables
    vs = args.vs
    ts = args.ts
    hi = args.hi
    ind = args.ind
    ind1 = args.ind1
    swaps = 0
    for j = 1:iters
        # Exclusive prefix sum: cbin[idx] is the next write position for bucket idx
        cbin = vcat(0,cumsum(bin[1:end-1,j]))
        @inbounds for i in 1:hi
            v = uint_mapping(o, vs[i])
            idx = Int((v >> ((j-1)*RADIX_SIZE)) & RADIX_MASK) + 1
            ci = cbin[idx] += 1
            ts[ci] = vs[i]
            ind1[ci] = ind[i]
        end
        vs,ts = ts,vs         # swap value buffers for the next pass
        ind,ind1 = ind1,ind   # swap index buffers for the next pass
        swaps += 1
    end
    if isodd(swaps)
        vs,ts = ts,vs
        ind,ind1 = ind1,ind
    end
    return ind
end
a = rand(10^6);
display(sort_test_outer(a,inner_t1) == sort_test_outer(a,inner_sortinglab))
display(@benchmark sort_test_outer($a,inner_t1))
display(@benchmark sort_test_outer($a,inner_sortinglab))

where the main difference lies in the inner loops of inner_t1 and inner_sortinglab.
On my machine, these are the results:

true
BenchmarkTools.Trial:
  memory estimate:  23.18 MiB
  allocs estimate:  128
  --------------
  minimum time:     202.476 ms (0.00% GC)
  median time:      219.459 ms (0.00% GC)
  mean time:        223.232 ms (0.83% GC)
  maximum time:     321.975 ms (1.48% GC)
  --------------
  samples:          23
  evals/sample:     1
BenchmarkTools.Trial:
  memory estimate:  23.18 MiB
  allocs estimate:  128
  --------------
  minimum time:     61.853 ms (0.00% GC)
  median time:      68.881 ms (0.00% GC)
  mean time:        68.984 ms (4.23% GC)
  maximum time:     86.784 ms (0.00% GC)
  --------------
  samples:          73
  evals/sample:     1

I am trying to understand why the custom version is so much slower.

The main idea was to reduce the amount of vector reshuffling: this way I only have to reshuffle the indices vector rather than both ind and vs, and I can gather with vs[ind] once at the end.
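For concreteness, a minimal sketch of the intended end-to-end usage (the names x and p are mine; it assumes inner_t1 returns a valid permutation, which the equality check above suggests):

x = rand(10^6)
p = sort_test_outer(x, inner_t1)  # compute only the permutation
@assert issorted(x[p])            # one gather with x[p] at the very end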

This is obviously not beneficial, though, as the computation time explodes.

Is it something related to CPU cache?

Edit: I’ll also tag @xiaodai here, as he might have explored this while writing the original code.

How can you do that without sorting the original values along with it?

My guess is that your access pattern is too scattergun. CPUs are fastest when you access elements sequentially, so if you just permute the indices you end up accessing element 100, then 1000, then 20, then 1_000_000, which leads to poor cache performance.
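To make that concrete, here is a small sketch (my own, not from SortingLab; the function and variable names are made up) that isolates sequential versus permuted reads:

using BenchmarkTools, Random

function sum_at(v, idx)
    s = zero(eltype(v))
    @inbounds for i in idx   # the access order is dictated by idx
        s += v[i]
    end
    return s
end

v = rand(10^6)
seq  = collect(1:10^6)   # sequential: the hardware prefetcher can stream ahead
scat = shuffle(seq)      # permuted: most loads miss the cache

@btime sum_at($v, $seq)
@btime sum_at($v, $scat)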

Searching for “Optimising Julia code” by @ChrisRackauckas, going through the 1.5 hr lecture, and paying close attention to the cache and how the CPU is structured should help.

Yes, sorry: the point was not really to compute just the indices, but to avoid the assignment from the vector being sorted into the temporary one, i.e. to do the reshuffling only for the indices rather than for both the indices and the values.

I’ll have a look at the lecture you mention, thanks. I was a bit surprised here because, even in the original case, while access to the vs vector is linear, the accesses into ts and ind1 are also scattered: the indexing depends on the actual realization of v and idx, so those vectors are potentially accessed in a very non-sequential way.

Slow getindex is generally much worse than slow setindex!, because the instructions that follow usually depend on getindex finishing before they can run; that is not the case for setindex!, unless you immediately reload that memory.
Hence, setindex! can take its time without necessarily slowing the rest of your code down.

This is thanks to out-of-order execution.
Basically, all high-performance CPUs today execute instructions in parallel by building a dependency graph. Lots of operations will depend on the loads, so it is crucial those are fast for the CPU to continue execution. But if a store takes a while, that will generally be okay.
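A rough way to observe that asymmetry (again my own sketch; gather! and scatter! are hypothetical names): compare a loop where the random access is a load against one where it is a store:

using BenchmarkTools, Random

function gather!(dst, src, idx)    # random *loads*: each store must wait for its load
    @inbounds for i in eachindex(idx)
        dst[i] = src[idx[i]]
    end
end

function scatter!(dst, src, idx)   # random *stores*: the store buffer can absorb the latency
    @inbounds for i in eachindex(idx)
        dst[idx[i]] = src[i]
    end
end

n = 10^6
src = rand(n); dst = similar(src)
idx = shuffle(collect(1:n))

@btime gather!($dst, $src, $idx)
@btime scatter!($dst, $src, $idx)

On typical hardware the gather version comes out noticeably slower, which matches the getindex/setindex! argument above.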


Thanks a lot for this, always good to learn :smiley: