Bucket sorting in Julia

Hello guys, I wrote a bucket sorting algorithm in Julia, which apparently is not available in the built-in modules, or in any other library. I will be using this bucket sorting for ray tracing, and I need the code to be as efficient as possible.I will leave my code here, for anyone who may need a bucket sorting code.

I am sure that there will be some obvious programming mistakes (I am a geophysicist, not a computer scientist), so will I appreciate any possible feedback that may improve the code’s efficiency.

lmin=1
lmax=20
tt=rand(lmin:lmax,1000000)

function bucket_aq(tt::Vector{},n::Int)
step=lmax/(n-1)
zbuck=[Any[] for i in 1:n]
#indexing
    for i in 1:length(tt)
        ind=Int(ceil(tt[i]/step)+1.0)
        push!(zbuck[ind],tt[i])
    end
    return zbuck
end


@time zbuck=bucket_aq(tt,10)

Best regards.

ok so this is a histogram but keep all original elements, a few problems:

doesn’t work when your bins are not regular

performance killer; you wanna use

zbuck=[eltype(tt)[] for i in 1:n]

I would use GitHub - JuliaArrays/ArraysOfArrays.jl: Efficient storage and handling of nested arrays in Julia
to avoid growing n separate arrays on heap

2 Likes

Thank you so much for your comments Jerry! I will definitely try those changes and share here the result.

Regards!

So I also wanted to point out one thing. In Julia, you generally don’t want to define a variable outside of a function and then use that variable inside of a function (like you did for lmax). You can do it without a major performance hit using constants, but I would recommend avoiding it if you can.

With @jling recommendations I think this is a pretty performant version of the function you’re writing:

#tt is a Vector of values with type T where T can be any type (e.g. float)
function bucket_aq2(tt::Vector{T}, numBuckets::Int) where T

    #zbuck is a Vector of Vectors of T
    zbuck=[T[] for _ in 1:numBuckets]

    #Store the maximum value for each bin
    maxBinVal = collect(range(minimum(tt), maximum(tt), numBuckets+1))

    #Remove the extra bin where the max value is minimum(tt)
    popfirst!(maxBinVal)

    #Loop through all the values in tt
    for val in tt
        #return the index of the first maxBinVal greater than val
        ind = searchsortedfirst(maxBinVal, val)
        push!(zbuck[ind],val)
    end

    return zbuck
end

Just a side comment: please try to contribute the final implementation to an existing package when it is ready and minimally tested. The package SortingAlgorithms.jl seems like a good candidate where you can submit PRs.

1 Like

Thank you so much Robert! Much cleaner in this way. It is impressive how elegant a code becomes when is written by some one who knows.

I recommend you do it in two passes.

In the first pass you count how many items each bucket will contain.

Then, in the second pass, you have an index per bucket by non-inclusive prefix sum of the counts. By scanning the elements, you now know where they have to go.

You can even do it in-place by pointing to the first element that does not belong to each bucket and “chasing” them around.

Doing the above with multiple threads is more involved but also doable. Each thread is responsible for a block (a section) of the array. First, each thread counts it own. Then you sync, and with prefix sums you find the offsets per bucket, per thread. So in the second pass, each thread moves the elements it is responsible for, to their destinations without the need for locks. It will require a ping-pong buffer, read from one, write to other…

See how Radix Sort is implemented, it is exactly the same way

can we use?
zbuck=fill(T[], numBuckets)

So when you use fill on a vector object it makes a reference to that object instead of new copies:

julia> zbuck = fill(Int64[] , 3)
3-element Vector{Vector{Int64}}:
 []
 []
 []

julia> push!(zbuck[1],10)
1-element Vector{Int64}:
 10

julia> zbuck
3-element Vector{Vector{Int64}}:
 [10]
 [10]
 [10]
1 Like

the only possibly shorter way to write that is

map(_->T[], 1:numBuckets)