Funny Benchmark with Julia (no longer) at the bottom

P.S. To prevent any confusion: at the time this was originally posted, Julia’s run took about 600 ms and was ~20x slower than Rust. This is no longer the case.

Resuming the original post…

I stumbled upon this funny benchmark with Julia at the bottom of the list.

They (whoever wrote that code - I wasn’t curious to check the git blame) didn’t even account for compilation time.

If somebody has some time, they might be interested in fixing this (I am not even sure the Julia code is doing only the work specified in the benchmark description).

Context: the F# version was also unfairly represented, and somebody fixed it (a little overkill there with some unidiomatic F# - but it is fast).

However, the benchmark doesn’t seem concerned with comparing apples to apples - anybody can jump on it and propose an implementation as long as they do not break the benchmark rules.

4 Likes

I’m betting @quinnj could make this fly, but not sure if worth the effort. It’s a small benchmark at least.

3 Likes

I think there is an update to the code, and on my machine here is the result.

Which beats them all; the best time in the benchmark table was Rust at 53 ms.

1 Like

Yes - I did the update. It’s pretty straightforward and naive - but the goal was to replace the weird initial implementation that took about 0.6 seconds.

5 Likes

I’m not following… the current version of the Julia code is now the fastest? Then what’s to do?

I did the update after writing the initial post (the official result was not yet updated - so we can still see Julia with >0.5 seconds on that table).

1 Like

I guess if someone gripes about calling the function just to compile it prior to timing another call, maybe replace the first call with a precompile and check if that works as well? But the typical tendency of JIT compilation to make its way into 1-run benchmarks has already been adequately demonstrated, and that precompile wouldn’t solve this setup.
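For reference, a minimal sketch of the two warm-up styles mentioned above, with a hypothetical work() function standing in for the benchmarked code (this is not the benchmark’s actual setup):

# Hypothetical stand-in for the benchmarked computation.
work(xs) = sum(abs2, xs)

xs = rand(1000)

# Style 1: a throwaway first call compiles the method, then time the second call.
work(xs)
@time work(xs)

# Style 2: request compilation explicitly, without running the function.
precompile(work, (Vector{Float64},))
@time work(xs)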

@Benny, the compilation time only explained about 100-150 ms of that result (at least on my machine).

It was also a matter of implementation.

Even a pretty naive fix changed things tremendously.

From the most recent PR:

| Language | Time |
| --- | --- |
| Go | 26.39 ms |
| Zig | 38.00 ms |
| Rust | 38.91 ms |
| Java (GraalVM) | 40.00 ms |
| Julia | 42.67 ms (down from >580 ms) |

If somebody has some time and wants to play with it - it might be a nice exercise to improve the speed.

It is not some widely known benchmark, but it tries to showcase a real-world scenario rather than some rarely encountered micro-benchmark.

1 Like

If you want to fix the second column (the one including I/O, which times julia script.jl from the outside), you would need AOT compilation, I guess; not sure it is worth the effort…
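For reference, AOT here could look roughly like building a custom sysimage with PackageCompiler.jl, along these lines (package list and file names are assumptions, untested against this benchmark):

using PackageCompiler

# Bake the needed packages, plus whatever gets compiled while running the
# script once, into a sysimage so a fresh external invocation starts warm.
create_sysimage(["JSON3"];
                sysimage_path = "related_sysimage.so",
                precompile_execution_file = "related.jl")

# The external timing would then run something like:
#   julia --sysimage related_sysimage.so related.jl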

Nope - it is about the first column.

Hi there,

I’m the author of the initial Julia implementations. (Sorry for finishing last!)

My goal wasn’t to write the fastest code but to stay as close as possible to the plain Python and NumPy implementations, because Python is my main language at work. If you’re curious, the Julia version was 3x faster than plain Python and 1.5x faster than NumPy.

Since then the repo has blown up out of proportion and I wanted to write a proper version.
I was able to push it toward 70 ms on my machine with this code:

But I think I’m already outclassed by the Julia wizards of this forum.

In any case, to fix the second column, maybe a tool like DaemonMode.jl could do the job?
Not sure if it’s worth it, and it might be considered cheating.
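For reference, the documented DaemonMode.jl pattern is roughly the following (script name is an assumption; whether the rules allow a persistent daemon is a separate question):

julia --startup-file=no -e 'using DaemonMode; serve()'                  # terminal 1: keep a warm daemon running
julia --startup-file=no -e 'using DaemonMode; runargs()' related.jl     # terminal 2: run the script inside the daemon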

15 Likes

Hi @atthom,

I hope there are no hard feelings on your end. I wasn’t curious to check the git blame, since my only concern was seeing Julia in the Rust/Go/Zig club. Objectively, we can agree the implementation was not close to the Rust level (and you were free to create whatever implementation fit your objectives - you had no obligation to satisfy me or anybody else).

Also - since the benchmark is not some micro-benchmark and actually resembles a scenario likely to be encountered in real-world data processing, it was somewhat painful to see Julia running 20x slower than Rust.

But this was my subjective experience - significant enough that I did something about it. I am waiting for the next benchmark run to reflect the latest changes, but I estimate that Julia will take second place in that benchmark. And that is enough for my personal satisfaction.

However, it would be nice to see somebody from the wizard class get their hands dirty and work some wonders (while also making sure to stay within the boundaries set by the benchmark rules).

6 Likes

Your last PR actually broke the rule:

because this array doesn’t check bounds

1 Like

In my defence, the docs don’t say anything about this: Home · StrideArrays.jl (juliasimd.github.io)

Thanks for bringing this up - I’ll fix it.

1 Like

Looking at the new code, it was a bit long. Here is a shorter version that tries to be both concise and fast (it gets close to the current version in speed, but still skips some of the optimizations the current version does):

using JSON3
using Dates
using DataStructures

struct PostData
    _id::String
    title::String
    tags::Vector{String}
end

struct RelatedPost
    _id::String
    tags::Vector{String}
    related::SubArray{PostData, 1, Vector{PostData}, Tuple{Vector{Int64}}, false}
end

function relatedIO()
    posts = JSON3.read(read("posts.json", String), Vector{PostData})

    start = now()
    all_related_posts = related(posts)
    println("Processing time (w/o IO): $(now() - start)")

    open("../related_posts_julia.json", "w") do f
        JSON3.write(f, all_related_posts)
    end
end

function related(posts, topn = 5)
    # map each tag to the indices of the posts carrying it
    tagmap = Dict{String,Vector{Int}}()
    for (idx, post) in enumerate(posts), tag in post.tags
        tagmap[tag] = push!(get(()->Int[], tagmap, tag), idx)
    end

    # shared-tag counters for all posts; instead of re-zeroing between posts,
    # counts for the current post simply start from scratchbase
    scratch = zeros(Int, length(posts))
    scratchbase = Ref(0)
    return map(enumerate(posts)) do (i, post)
        for t in post.tags, x in tagmap[t]
            x == i && continue
            scratch[x] = ifelse(scratch[x] < scratchbase[], scratchbase[], scratch[x])+1
        end
        tops = nlargest(topn, 1:length(scratch); by=x->scratch[x])
        scratchbase[] = scratch[tops[1]]   # stale counts never exceed scratchbase
        RelatedPost(post._id, post.tags, @view posts[tops])
    end
end

const res = relatedIO()

If anyone wants to try and make it nicer/faster…

One new speed trick is to avoid re-zeroing the scratch array between posts by starting new counters from a higher scratchbase: stale counts left over from earlier posts never exceed scratchbase, so any post sharing at least one tag with the current one immediately outranks them.

4 Likes

Participation is open to any contributor: feel free to fork and update the script at any point.

3 Likes

I solved the StrideArrays mishap, and the latest PR was reverted.

The previous non-StrideArrays version was not bad - but the performance gain from the StrideArrays addition was clearly going against the rules.

@Dan, my current version (from the reverted PR) runs in 32 ms on my machine - the version you proposed takes around 54-56 ms. The weird thing is that, looking at your code, I intuitively expected a performance gain (which didn’t materialise when benchmarking it).

From what I see, some improvements could be made to this function I wrote:

function fastmaxindex!(xs::Vector{Int}, topn, maxn, maxv)
    # maxv keeps the current top-topn values in ascending order (smallest kept
    # value at index 1); maxn keeps the corresponding indices into xs.
    maxn .= 1
    maxv .= 0
    for (i, x) in enumerate(xs)
        if x > maxv[1]
            # x beats the smallest kept value: overwrite it...
            maxv[1] = x
            maxn[1] = i
            # ...and bubble it up until ascending order is restored
            for j in 2:topn
                if maxv[j-1] > maxv[j]
                    maxv[j-1], maxv[j] = maxv[j], maxv[j-1]
                    maxn[j-1], maxn[j] = maxn[j], maxn[j-1]
                end
            end
        end
    end

    # put the index of the largest value first
    reverse!(maxn)

    return maxn
end

Maybe proper usage of a really fast PriorityQueue might help - or maybe somebody can come up with a smart algorithmic trick that is more compiler-friendly.
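For comparison, the textbook priority-queue approach would be a size-bounded min-heap from DataStructures; a sketch only, not benchmarked against fastmaxindex!:

using DataStructures

# Top-n indices of xs via a min-heap of (value, index) pairs, keeping at most
# n entries; the heap root is always the smallest of the kept values.
function topn_heap(xs::Vector{Int}, n::Int)
    heap = BinaryMinHeap{Tuple{Int,Int}}()
    for (i, x) in enumerate(xs)
        if length(heap) < n
            push!(heap, (x, i))
        elseif x > first(heap)[1]
            pop!(heap)            # evict the smallest kept value
            push!(heap, (x, i))
        end
    end
    k = length(heap)
    idx = [pop!(heap)[2] for _ in 1:k]   # popped in ascending value order
    return reverse!(idx)                 # largest value first
end

With topn fixed at 5, the simple insertion in fastmaxindex! may well stay faster than a general heap, but it is the obvious thing to measure.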

1 Like

I tried to use PriorityQueue from DataStructures, but it was very slow and allocating. The problem could be the underlying Dict backing the PriorityQueue.

A new Dict type that supports keys from a limited universe and is implemented with flat memory storage could be useful in other cases too. There are many situations where a Dict is known to have keys from a small universe and speed is paramount.
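As a toy sketch of that idea (hypothetical names; assuming the keys can be mapped to 1..K up front):

# Map each key of the small universe to an integer id once, then use a plain
# Vector in place of a Dict: lookups and updates become array indexing, no hashing.
struct FlatCounter
    ids::Dict{String,Int}    # built once, outside the hot loop
    counts::Vector{Int}      # flat storage indexed by id
end

FlatCounter(keys::Vector{String}) =
    FlatCounter(Dict(k => i for (i, k) in enumerate(keys)), zeros(Int, length(keys)))

keyid(c::FlatCounter, k::String) = c.ids[k]           # cold path: hash once per key
bump!(c::FlatCounter, id::Int) = (c.counts[id] += 1)  # hot path: plain indexing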

1 Like

The Go implementation seems to do this (using the knowledge that there are at most 100 tags).

Dictionaries.jl supports a size hint, but in the current scenario it seems slower than the native Dict.

I realized that this whole function can be replaced by partialsortperm!(), except that it’s much slower right now (a minimal standalone example follows the diff):

@@ -69,27 +51,33 @@ function related(posts)
     relatedposts = Vector{RelatedPost}(undef, length(posts))
     taggedpostcount = Vector{Int64}(undef, length(posts))

-    maxn = Vector{Int64}(undef, topn)
-    maxv = Vector{Int64}(undef, topn)
+    scratch = collect(eachindex(posts))

         taggedpostcount[i] = 0

-        fastmaxindex!(taggedpostcount, topn, maxn, maxv)
+        maxn = partialsortperm!(scratch, taggedpostcount, 1:5; rev=true, initialized=true)

         relatedpost = RelatedPost(post._id, post.tags, SVector{topn}(posts[ix] for ix in maxn))
         relatedposts[i] = relatedpost
     end

     return relatedposts
 end
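
For anyone unfamiliar with partialsortperm!, here is a minimal standalone example of what that call computes, on toy data:

counts = [3, 9, 1, 7, 7, 2, 8, 0, 5, 4]
ix = collect(eachindex(counts))                      # preallocated index buffer
top5 = partialsortperm!(ix, counts, 1:5; rev=true)   # returns a view into ix
# top5 now holds the positions of the 5 largest entries of counts, largest first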

Maybe our sorting algorithm guru @Lilith can provide some comments here.

1 Like