Funny Benchmark with Julia (no longer) at the bottom

The benchmark was updated to include 15K and 30K posts in addition to the original 5K (and 100K posts might be added later; I think the VM being used prevents that for now).

With that said, there is a Python script in the repo for generating posts:

related_post_gen/gen_fake_posts.py at main · jinyus/related_post_gen (github.com)

If somebody still wants to improve the performance, they should also check the current solution against the 5K, 15K, and 30K post datasets.


I don’t think that’s necessary; the performance difference can be reproduced with the included sample, right?

To be clear, we don’t have much to optimize. All I can see is that this triple loop is “slow”:


    for (i, post) in enumerate(posts)
        taggedpostcount .= 0
        # for each post (the `i`-th),
        # and for every tag used in the `i`-th post,
        # give every related post +1 in the `taggedpostcount` shadow vector
        for tag in post.tags
            for idx in tagmap[tag]
                taggedpostcount[idx] += 1
            end
        end
    end
there doesn’t seem to be a way to remove it


The new random data seems to have some impact (even when we keep the same size). Except for Go, which consistently holds first place, the other contenders seem to swap spots depending on the data.

For example, Julia is in third place right now, ahead of Rust. Zig is 4th, but has the fastest 5K result.

So the ranking is now based on the summed time of running the code on all three datasets.

However, there is another pending PR, and in its results Rust is 2nd and Julia is 5th (although there was no change in either the Julia or the Rust code).

So I would say that the data matters: since the ranking is a simple sum, a very good result on the 30K dataset can carry more weight than gaining 1 ms on the 5K one.

But I agree that it's a reasonable heuristic to assume that an improvement on the 5K dataset will be reflected in the others (unless the actual results prove us wrong; I have already experienced quite a few violations of my intuition with this benchmark).


We are allowed to assume that there are at most 100 tags (but we cannot assume anything about the number of posts). That seems like an optimization point (if we drop the standard Dict).
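One way the 100-tag cap could be exploited is to intern each tag string to a small integer id during setup, so the hot loop indexes a `Vector` instead of hashing a `String`. A minimal sketch of that idea (the function name and toy posts are mine, not from the repo):

```julia
# Hypothetical sketch: intern each tag to a small integer id (valid
# because the benchmark guarantees at most 100 distinct tags), so
# later lookups can index `tagposts` directly instead of hashing.
function build_tag_ids(posts)
    tagid = Dict{String,Int}()        # only touched during setup
    tagposts = Vector{Vector{Int}}()  # tagposts[id] = posts using tag `id`
    for (idx, post) in enumerate(posts)
        for tag in post.tags
            id = get!(tagid, tag) do
                push!(tagposts, Int[])
                length(tagposts)      # new id = position in tagposts
            end
            push!(tagposts[id], idx)
        end
    end
    return tagid, tagposts
end

# toy usage with named tuples standing in for PostData
posts = [(tags = ["a", "b"],), (tags = ["b"],)]
tagid, tagposts = build_tag_ids(posts)
# tagposts[tagid["b"]] == [1, 2]
```

Whether this beats `Dict` in practice would of course need benchmarking; the string hashing only moves from the hot loop to the setup phase.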


Right: after asking around in #performance-helpdesk on Slack, there doesn’t seem to be anything immediately obvious. I tried Dictionaries.jl’s ArrayDictionary, which is slightly slower than what we have.

I also manually tried a Vector{Pair{String, Vector{Int}}}; that was also slower.


Here’s a manual implementation of a linear-probe dict:

diff --git a/julia/related.jl b/julia/related.jl
index acd6894..bffdc2e 100644
--- a/julia/related.jl
+++ b/julia/related.jl
@@ -34,9 +34,7 @@ end
 StructTypes.StructType(::Type{PostData}) = StructTypes.Struct()

 function fastmaxindex!(xs::Vector{Int64}, topn, maxn, maxv)
-    maxn .= 1
-    maxv .= 0
-    top = maxv[1]
+    maxv .= top = 0
     for (i, x) in enumerate(xs)
         if x > top
             maxv[1] = x
@@ -60,13 +58,19 @@ function related(posts)
     topn = 5
     # key is every possible "tag" used in all posts
     # value is indicies of all "post"s that used this tag
-    tagmap = Dict{String,Vector{Int64}}()
+    # tagmap = Dict{String,Vector{Int64}}()
+    tagkey = sizehint!(String[], 100)
+    tagvalue = sizehint!(Vector{Int64}[], 100)
+
     for (idx, post) in enumerate(posts)
         for tag in post.tags
-            if !haskey(tagmap, tag)
-                tagmap[tag] = Vector{Int64}()
+            d_idx = findfirst(==(tag), tagkey)
+            if isnothing(d_idx)
+                push!(tagkey, tag)
+                push!(tagvalue, [idx])
+            else
+                push!(tagvalue[d_idx], idx)
             end
-            push!(tagmap[tag], idx)
         end
     end

@@ -82,7 +86,9 @@ function related(posts)
         # and every tag used in the `i`-th post
         # give all related post +1 in `taggedpostcount` shadow vector
         for tag in post.tags
-            for idx in tagmap[tag]
+            d_idx = findfirst(==(tag), tagkey)
+            vs = tagvalue[d_idx]
+            for idx in vs
                 taggedpostcount[idx] += 1
             end
         end
@@ -100,3 +106,5 @@ function related(posts)
 end

which is basically the same speed as Dict.
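For reference, the parallel-arrays idea from the diff above can be written as a self-contained function (the toy posts below are mine, not the benchmark's JSON input):

```julia
# Self-contained sketch of the parallel-arrays tag map from the diff:
# a linear scan over `tagkey` replaces Dict hashing, which is fine
# because the benchmark guarantees at most 100 distinct tags.
function build_tagmap_arrays(posts)
    tagkey = sizehint!(String[], 100)
    tagvalue = sizehint!(Vector{Int64}[], 100)
    for (idx, post) in enumerate(posts)
        for tag in post.tags
            d_idx = findfirst(==(tag), tagkey)
            if isnothing(d_idx)
                push!(tagkey, tag)           # first time we see this tag
                push!(tagvalue, Int64[idx])
            else
                push!(tagvalue[d_idx], idx)  # append post id to existing tag
            end
        end
    end
    return tagkey, tagvalue
end

# toy usage
posts = [(tags = ["x", "y"],), (tags = ["y", "z"],)]
tagkey, tagvalue = build_tagmap_arrays(posts)
# tagvalue[findfirst(==("y"), tagkey)] == [1, 2]
```

The linear `findfirst` scan is O(number of tags) per lookup, but with ≤100 short strings that is competitive with hashing, which matches the "basically same speed" observation.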

Here is another implementation, using a different algorithm, which could potentially be better.
The idea is to exploit how tagmap is generated: each tag’s vector of post ids is built in sorted order. So by walking the sorted vectors of all of a post’s tags simultaneously, one can infer the amount of tag overlap.
The explanation is a little tricky, but here is the code:


function related_new(posts, topn = 5)
    tagmap = Dict{String,Vector{Int}}()
    maxtags = 0
    for (idx, post) in enumerate(posts)
        maxtags = max(maxtags, length(post.tags))
        for tag in post.tags
            setindex!(tagmap, push!(get(()->Int[], tagmap, tag), idx), tag)
        end
    end

    topnids = Vector{Int}(undef, topn)
    topnoverlaps = zeros(Int, topn)
    p = zeros(Int, maxtags)
    indxs = Vector{Int}(undef, maxtags)
    ids = Vector{Int}(undef, maxtags)
    tagvecs = Vector{Vector{Int}}(undef, maxtags)
    res = Vector{RelatedPost}(undef, length(posts))
    for (i, post) in enumerate(posts)
        topnoverlaps .= 0
        ntags = length(post.tags)
        indxs[1:ntags] .= 1
        for i in 1:ntags
            tagvecs[i] = tagmap[post.tags[i]]
        end
        ids[1:ntags] .= getindex.(@view(tagvecs[1:ntags]), @view(indxs[1:ntags]))
        sortperm!(@view(p[1:ntags]), @view(ids[1:ntags]))
        lastid = ids[p[1]]
        overlap = 1
        while ids[p[1]] < length(posts) 
            indxs[p[1]] += 1
            ids[p[1]] = indxs[p[1]] > length(tagvecs[p[1]]) ? length(posts)+1 : tagvecs[p[1]][indxs[p[1]]]
            t = 2
            while t<=ntags && ids[p[t-1]] > ids[p[t]]
                p[t-1], p[t] = p[t], p[t-1]
                t += 1
            end
            if ids[p[1]] == lastid
                overlap += 1
            else
                if lastid != i
                    if overlap > topnoverlaps[1]
                        topnoverlaps[1] = overlap
                        topnids[1] = lastid
                    end
                    s = 2
                    while s <= topn && topnoverlaps[s] < topnoverlaps[s-1]
                        topnoverlaps[s-1], topnoverlaps[s] = topnoverlaps[s], topnoverlaps[s-1]
                        topnids[s-1], topnids[s] = topnids[s], topnids[s-1]
                        s += 1
                    end
                end
                overlap = 1
                lastid = ids[p[1]]
            end
        end
        reverse!(topnids)
        res[i] = RelatedPost(post._id, post.tags, @view posts[topnids])
    end
    return res
end

Since it doesn’t have a taggedpostcount vector and doesn’t need to sort anything, it has the potential to be faster, perhaps more so for larger numbers of posts.

The benchmark is not too good, but maybe there are some more optimizations.
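To illustrate just the core merge idea in isolation (this is a toy two-vector version, not the code above): two sorted id vectors can be walked simultaneously to count how many ids they share, with no auxiliary count vector.

```julia
# Toy illustration of the merge principle: count the ids common to two
# sorted vectors by advancing whichever cursor points at the smaller id.
function shared_ids(a::Vector{Int}, b::Vector{Int})
    i = j = 1
    shared = 0
    while i <= length(a) && j <= length(b)
        if a[i] == b[j]
            shared += 1        # same id in both tag vectors: +1 overlap
            i += 1; j += 1
        elseif a[i] < b[j]
            i += 1
        else
            j += 1
        end
    end
    return shared
end

shared_ids([1, 3, 5, 8], [2, 3, 8, 9])  # ids 3 and 8 are shared → 2
```

The full implementation generalizes this to all of a post’s tag vectors at once, maintaining a small permutation `p` so the vector with the smallest current id is always advanced first.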

what does your benchmark say?

this is illegal according to the repo

Benchmark is 8x worse.

julia> @benchmark related_new(posts)
BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  284.889 ms … 295.876 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     289.561 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   288.686 ms ±   2.789 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃       ▃                 █
  █▇▁▁▁▇▁▁█▁▁▁▇▁▁▁▁▁▁▁▁▁▇▁▁▇█▇▇▁▁▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  285 ms           Histogram: frequency by time          296 ms <

 Memory estimate: 1.25 MiB, allocs estimate: 183.

yeah, similar allocations but 10x slower (the current one we have runs in about 20 ms on my machine)

I didn’t go over the repo, but why is it illegal?

So, why is it illegal? Because it doesn’t limit the number of posts?

Disabling runtime checks (bounds checking, etc.)

you cannot use @inbounds


Removed it. I don’t think it will change anything; it’s only a proof of concept for trying to out-algorithm the other submissions.

Someone should tell them that running benchmarks on GitHub-hosted CI machines is going to be very noisy.


@Lilith just did a nice PR.

At that very moment I was looking at how the Go implementation does not explicitly check for key existence:

for i, post := range posts {
    for _, tag := range post.Tags {
        tagMap[tag] = append(tagMap[tag], isize(i))
    }
}

And I was wondering whether there is something similar in the Julia stdlib.

It turns out there is (code quoted from @Lilith’s PR):

for (idx, post) in enumerate(posts)
    for tag in post.tags
        tags = get!(() -> UInt16[], tagmap, tag)
        push!(tags, idx)
    end
end
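For anyone unfamiliar with it: `get!` returns the existing value for the key, or inserts the default produced by the zero-argument function and returns that, all in a single lookup. A small standalone demonstration:

```julia
# get! returns the existing value for `tag`, or inserts a fresh empty
# vector when the key is absent — one lookup, no separate haskey check.
tagmap = Dict{String,Vector{UInt16}}()
for (idx, tag) in enumerate(["a", "b", "a"])
    push!(get!(() -> UInt16[], tagmap, tag), UInt16(idx))
end
# tagmap["a"] == UInt16[1, 3]
# tagmap["b"] == UInt16[2]
```

This avoids the double hashing that `haskey` followed by `getindex`/`setindex!` would incur.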

I used StrideArrays at some point (going only by the docs, without having seen the repo’s warning).

It was blazingly fast (~30% speed gain on my machine), and I requested a revert of that PR after @jling pointed out the issue.


I think it’s questionable whether the rule means that no one can ever disable bounds checking (in which case Julia is probably excluded as a language; what’s the chance that there’s no @inbounds anywhere in Julia Base?) or just that the benchmark author can’t disable bounds checking in the code they specifically write.

Also, I’m guessing the rule is really aimed at statically pre-compiled binaries where all bounds checking is disabled by a compiler switch on the command line, rather than specific bounds checks disabled at a point in the code where you can prove they’re unnecessary.

If a library can prove that it’s safe to disable bounds checking, simply using that library shouldn’t disqualify your solution.
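For concreteness, the in-code variant being discussed looks like this (a generic illustration, not code from the repo): `@inbounds` elides bounds checks only inside the annotated expression, and `eachindex` makes every index provably valid, so the elision is safe here.

```julia
# Generic illustration: bounds checks are skipped only inside the
# @inbounds expression; `eachindex(xs)` guarantees every `i` is a
# valid index for `xs`, so the elision is provably safe.
function unchecked_sum(xs::AbstractVector{Float64})
    s = 0.0
    for i in eachindex(xs)
        @inbounds s += xs[i]
    end
    return s
end

unchecked_sum([1.0, 2.0, 3.0])  # 6.0
```

This is different in kind from `julia --check-bounds=no`, which globally disables bounds checking for the whole program, the way a compiler switch would for a static binary.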

I’m not entirely following; which situation are we in here?


A similar discussion took place here: Rules clarification · Issue #169 · jinyus/related_post_gen (github.com)

It seems like the stdlib is exempt.
