Julia Binary Heap Benchmark

Hey Community.

It seems possible that Julia will get adopted at Codeforces. To make that happen, I need to provide an implementation of this binary heap benchmark.

My Julia code is quite ugly, because I had to use Int32 to shave off some milliseconds. Since integer literals are apparently always Int64 (on 64-bit systems), I had to pass the Int32 constants into the functions.
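To illustrate the literal issue: a minimal sketch (not from the original post) showing that integer literals default to `Int` and that mixing them with `Int32` values promotes back to 64 bits.

```julia
# Integer literals are Int (Int64 on a 64-bit machine), so arithmetic
# that mixes an Int32 with a literal promotes the result back to Int64.
x = Int32(5)

typeof(1)             # Int (Int64 on 64-bit systems)
typeof(x + 1)         # Int -- the Int32 operand is promoted
typeof(x + Int32(1))  # Int32 -- stays 32-bit
```

This is why the original code passes `Int32(1)` and `Int32(2)` around instead of writing `1` and `2` inline.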

My question is: How can this code be improved? Right now we are on par with the Java implementation and about 25% above the C++ implementation.

(BTW, I fixed a few things in the Java reference: I added a warmup, initialized the array inside the run, and avoided final static to get more speed. So the reference implementation you find in the above link might be a bit slower.)


And because new users can only put 2 links in a post, here is the actual code I have so far.

EDIT: What this is doing is basically bringing the array out of order first, and then fixing it.

From a performance standpoint, this mostly looks good. However, I would probably write it like this, which imo looks much cleaner.

const i32_1 = Int32(1)
const i32_2 = Int32(2)
function pushdown!(h::Vector{Int32}, pos, n)
    @inbounds while pos*i32_2 <= n
        j = pos*i32_2
        if j+i32_1 <= n && h[j+i32_1] > h[j]
            j += i32_1
        end
        if h[pos] >= h[j]
            break
        else
            h[pos], h[j] = h[j], h[pos]
            pos = j
        end
    end
end

I’m also not sure that all of your Int32 stuff is worthwhile. Is this slower?

function pushdown!(h::Vector{Int32}, pos, n)
    @inbounds while pos*2 <= n
        j = pos*2
        if j+1 <= n && h[j+1] > h[j]
            j += 1
        end
        if h[pos] >= h[j]
            break
        else
            h[pos], h[j] = h[j], h[pos]
            pos = j
        end
    end
end

I think it should be the same (or maybe a little faster).


Hey @Oscar_Smith, thanks for the very fast reply.

Have you benchmarked your code? On my machine, your first version takes 40% longer than my version. Not using Int32 increases this to +47%. Maybe I am doing something wrong?

EDIT: The issue is the const declaration. If I pass the two variables into your function, the speed is the same.

I’m definitely not seeing that. I’m seeing my second version as faster than yours. That said, I also changed the benchmark to do all of its operations on Int rather than Int32 (except for the heap elements). The reason for this is that array indices are always Int (that doesn’t depend on the element type of the array). As such, things like h[pos] should be slower when pos is an Int32.
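A quick way to check this at the REPL (a minimal sketch, not from the thread): indexing functions on a `Vector` work in `Int` regardless of the element type, and a non-`Int` index gets converted.

```julia
# Indexing is always in Int, even for a Vector{Int32}:
h = collect(Int32, 1:10)

typeof(firstindex(h))       # Int, not Int32
eltype(eachindex(h))        # Int -- the index range iterates Ints
h[Int32(3)] == h[3]         # true; the Int32 index is converted to Int
```

So keeping `pos`, `j`, and `n` as `Int` avoids repeated `Int32`-to-`Int` conversions at every array access, while the heap *elements* can stay `Int32` for memory bandwidth.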

That sounds great. Could you please link the whole script so I can run and compare it? I am probably calling your function in another way than you are.

function pushdown2!(h::Vector{Int32}, pos, n)
    @inbounds while pos*2 <= n
        j = pos*2
        if j+1 <= n && h[j+1] > h[j]
            j += 1
        end
        if h[pos] >= h[j]
            break
        else
            h[pos], h[j] = h[j], h[pos]
            pos = j
        end
    end
end
function bench2(N)
    h = collect(Int32, 0:N-1)

    for i = div(N, 2)+1:-1:1
        pushdown2!(h, i, N)
    end

    n = N + 1
    while n > 2
        n -= 1
        @inbounds h[1], h[n] = h[n], h[1]
        pushdown2!(h, 1, n-1)
    end
    for i = 0:N-1
        @inbounds @assert i == h[i+1]
    end
    h
end

using BenchmarkTools  # provides @btime
heap = @btime bench2(10_000_000);

So yes, when I called your functions, there were conversions going on. However, your functions are still slower than the initial setup. Here is the script I ran.

heap = @btime bench_ndrsh();
heap = @btime bench_oscar1(10_000_000)
heap = @btime bench_oscar2(10_000_000)

gives me

  786.433 ms (2 allocations: 38.15 MiB)
  870.576 ms (2 allocations: 38.15 MiB)
  870.140 ms (2 allocations: 38.15 MiB)

Can you confirm that running this script gives different results on your machine?

I got the same timing here with Int64 (or Int, or after removing the element type constraint on h in pushdown2! completely).

Edit: I can confirm that @Oscar_Smith’s version is faster than @ndrsh’s (the original post, not the latest post), but I used quite an exotic system (a Synology DiskStation).


Are you allowed to use multithreading?
You can also try @avx (from LoopVectorization.jl).


Single core only, and no external libraries allowed. (Apart from that, I don’t see how multithreading would even work here logically: iterations of the inner loop affect later iterations, so there are happens-before relations.)

Anyway, if anybody can provide a script that runs faster on a standard CPU (they are running an i5-3470), I’d be happy; otherwise don’t worry. I’ll just submit it, it’s not a big deal anyway.

Thank you.

Are you launching Julia with -O3?

Btw. this is what I get on my MacBook (i5):

░ tamasgal@greybox.local:~/tmp
░ 11:05:49 1 > g++ -O3 heap.cpp && ./a.out
Done in 825

░ tamasgal@greybox.local:~/tmp
░ 11:05:56 > julia -O3 heap.jl
ndrsh
  824.606 ms (2 allocations: 38.15 MiB)
oscar
  920.897 ms (2 allocations: 38.15 MiB)

and on my workhorse (Intel® Xeon® Gold 6246R CPU @ 3.40GHz - 32 cores):

tgal@pi1080:~/tmp/binary-heap-benchmark/cpp$ g++ -O3 heap.cpp && ./a.out
Done in 526
tgal@pi1080:~/tmp/binary-heap-benchmark/cpp$ cd ../julia/
tgal@pi1080:~/tmp/binary-heap-benchmark/julia$ julia -O3 heap.jl
ndrsh
  595.149 ms (2 allocations: 38.15 MiB)
oscar
  586.633 ms (2 allocations: 38.15 MiB)

@tamasgal Thanks so much for double-checking. I tried both -O3 and -O2, but on my machine this makes almost no difference (I can’t tell for sure because the variance is too high). But this confirms that everybody in this thread was right; it just depends on the machine. And since Codeforces is benchmarking with an i5, I will submit my solution. Let’s hope it gets accepted! 🙂

Thanks everybody!
