Bucket sorting in Julia

Alejandro_Quiaro_San · August 4, 2022, 2:05am

Hello guys, I wrote a bucket sorting algorithm in Julia, which apparently is not available in the built-in modules, or in any other library. I will be using this bucket sorting for ray tracing, and I need the code to be as efficient as possible.I will leave my code here, for anyone who may need a bucket sorting code.

I am sure that there will be some obvious programming mistakes (I am a geophysicist, not a computer scientist), so will I appreciate any possible feedback that may improve the code’s efficiency.

lmin=1
lmax=20
tt=rand(lmin:lmax,1000000)

function bucket_aq(tt::Vector{},n::Int)
step=lmax/(n-1)
zbuck=[Any[] for i in 1:n]
#indexing
    for i in 1:length(tt)
        ind=Int(ceil(tt[i]/step)+1.0)
        push!(zbuck[ind],tt[i])
    end
    return zbuck
end


@time zbuck=bucket_aq(tt,10)

Best regards.

jling · August 4, 2022, 2:12am

ok so this is a histogram but keep all original elements, a few problems:

doesn’t work when your bins are not regular

performance killer; you wanna use

zbuck=[eltype(tt)[] for i in 1:n]

I would use GitHub - JuliaArrays/ArraysOfArrays.jl: Efficient storage and handling of nested arrays in Julia
to avoid growing n separate arrays on heap

Alejandro_Quiaro_San · August 4, 2022, 9:32pm

Thank you so much for your comments Jerry! I will definitely try those changes and share here the result.

Regards!

RobertGregg · August 5, 2022, 12:34am

So I also wanted to point out one thing. In Julia, you generally don’t want to define a variable outside of a function and then use that variable inside of a function (like you did for lmax). You can do it without a major performance hit using constants, but I would recommend avoiding it if you can.

With @jling recommendations I think this is a pretty performant version of the function you’re writing:

#tt is a Vector of values with type T where T can be any type (e.g. float)
function bucket_aq2(tt::Vector{T}, numBuckets::Int) where T

    #zbuck is a Vector of Vectors of T
    zbuck=[T[] for _ in 1:numBuckets]

    #Store the maximum value for each bin
    maxBinVal = collect(range(minimum(tt), maximum(tt), numBuckets+1))

    #Remove the extra bin where the max value is minimum(tt)
    popfirst!(maxBinVal)

    #Loop through all the values in tt
    for val in tt
        #return the index of the first maxBinVal greater than val
        ind = searchsortedfirst(maxBinVal, val)
        push!(zbuck[ind],val)
    end

    return zbuck
end

juliohm · August 5, 2022, 1:19am

Just a side comment: please try to contribute the final implementation to an existing package when it is ready and minimally tested. The package SortingAlgorithms.jl seems like a good candidate where you can submit PRs.

Alejandro_Quiaro_San · August 5, 2022, 1:37am

Thank you so much Robert! Much cleaner in this way. It is impressive how elegant a code becomes when is written by some one who knows.

pitsianis · August 5, 2022, 3:26pm

I recommend you do it in two passes.

In the first pass you count how many items each bucket will contain.

Then, in the second pass, you have an index per bucket by non-inclusive prefix sum of the counts. By scanning the elements, you now know where they have to go.

You can even do it in-place by pointing to the first element that does not belong to each bucket and “chasing” them around.

Doing the above with multiple threads is more involved but also doable. Each thread is responsible for a block (a section) of the array. First, each thread counts it own. Then you sync, and with prefix sums you find the offsets per bucket, per thread. So in the second pass, each thread moves the elements it is responsible for, to their destinations without the need for locks. It will require a ping-pong buffer, read from one, write to other…

See how Radix Sort is implemented, it is exactly the same way

Eben60 · August 5, 2022, 8:58pm

can we use?
zbuck=fill(T[], numBuckets)

RobertGregg · August 6, 2022, 2:39am

So when you use fill on a vector object it makes a reference to that object instead of new copies:

julia> zbuck = fill(Int64[] , 3)
3-element Vector{Vector{Int64}}:
 []
 []
 []

julia> push!(zbuck[1],10)
1-element Vector{Int64}:
 10

julia> zbuck
3-element Vector{Vector{Int64}}:
 [10]
 [10]
 [10]

jling · August 6, 2022, 2:41am

the only possibly shorter way to write that is

map(_->T[], 1:numBuckets)

Topic		Replies	Views
Bucket sorting in Julia, with functions General Usage question , hpc	1	331	September 13, 2022
Faster Sorting With GPU? New to Julia gpu , sort , arrays	25	2316	January 17, 2018
External Sorting Algorithms in julia? Data sort	0	380	November 6, 2020
Performance optimization of a custom binning function Performance	43	975	December 15, 2022
GPU Sort Function GPU question , gpuarrays , sort	20	4796	April 2, 2020

Bucket sorting in Julia

Related topics