Funny Benchmark with Julia (no longer) at the bottom

[julia] add comments, reduce getindex(), use MVector to speed up by Moelf · Pull Request #162 · jinyus/related_post_gen · GitHub

I think MVector, and not doing the `> ...` check every single time in getindex(), helped reduce the runtime by 33%. I checked, and the output diff seems okay:

> diff related_posts_julia.json related_posts_julia_new.json



If I understand correctly, that finds the topn largest values in a list? (edit: the indices of the topn largest values)
If so, you might look into this

we need the indices of the topn largest values in a list, which is precisely partialsortperm( ; rev=true), but it’s much slower
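For reference, a minimal example of that exact-but-slower route (the data here is made up):

```julia
v = [10, 3, 7, 42, 7, 19, 1]

# indices of the 5 largest values, in descending order of value
idx = partialsortperm(v, 1:5; rev = true)

v[idx]  # == [42, 19, 10, 7, 7]
```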


That is then exactly (?) what was discussed, but for the minimum.


sortperm and friends are really slow in Julia: https://github.com/JuliaLang/julia/issues/939

Also, partialsort dispatches between partial-scratch-quicksort (good for long inputs) and insertionsort of the whole input (good for short inputs), but does not use partial insertionsort (good when the number of elements being sought is small; asymptotically, when that number is less than log(length)).
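For illustration, here is a hedged sketch of what a partial insertionsort could look like (the function name and details are mine, not anything in Base): keep a sorted buffer of the k smallest values seen so far, and insert a new value only when it beats the current k-th smallest.

```julia
# Sketch only: assumes k <= length(xs); returns the k smallest values, sorted.
function partial_insertionsort(xs, k)
    buf = sort(xs[1:k])           # sorted buffer of the k best so far
    for x in @view xs[k+1:end]
        x >= buf[end] && continue # not better than the current k-th smallest
        # shift larger elements right, dropping the old k-th
        i = k - 1
        while i >= 1 && buf[i] > x
            buf[i+1] = buf[i]
            i -= 1
        end
        buf[i+1] = x
    end
    return buf
end
```

Each accepted insertion costs O(k), so the whole scan is O(n·k) with a tiny constant, which is roughly why it can win when k is that small.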


This should make the second column (“total”) more reasonable: basically, just put it into its own project and make use of the native code caching.

Locally it reduced the total from ~3 seconds to about 400 ms:

> hyperfine 'julia --startup-file=no --project=Related/ related.jl'
Benchmark 1: julia --startup-file=no --project=Related/ related.jl
  Time (mean ± σ):     407.2 ms ±  12.5 ms    [User: 430.0 ms, System: 338.5 ms]
  Range (min … max):   394.0 ms … 431.2 ms    10 runs
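For anyone curious, the “own project + native code caching” setup might look roughly like this. This is a sketch under assumptions: the module name, file layout, JSON3 dependency, and sample-file name are all mine, not necessarily what the repo does; PrecompileTools is the standard way to exercise code at precompile time so Julia 1.9+ can cache the generated native code.

```julia
# Related/src/Related.jl (hypothetical layout)
module Related

using JSON3              # assumed: whatever the script uses for JSON I/O
using PrecompileTools

function main(path::AbstractString)
    # ... the benchmark body would go here ...
end

# Running a small workload during precompilation lets Julia 1.9+
# cache the native code, so the timed run pays little compilation cost.
@setup_workload begin
    @compile_workload begin
        # main("posts_sample.json")  # hypothetical small sample input
    end
end

end # module
```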

Current results:

The underlying data (used for the leaderboard) changes from run to run, and that seems to affect the results regardless of the underlying implementations.

For example, Zig had not been updated in 4 days and jumped to the top in the current run (the same implementation did worse than Java - GraalVM a day earlier).

Maybe a better way to optimize this is to use the random post generator and measure improvements by averaging the results over multiple randomly generated datasets.
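A minimal sketch of that idea, where `generate_posts` and `solve` are hypothetical stand-ins for the repo’s random post generator and a candidate implementation (both names are mine, not the repo’s API):

```julia
using Random, Statistics

# Average the runtime of `solve` over several randomly generated datasets.
# A fixed seed keeps the comparison reproducible across implementations.
function benchmark_avg(solve, generate_posts; ndatasets = 10, seed = 42)
    rng = MersenneTwister(seed)
    times = map(1:ndatasets) do _
        posts = generate_posts(rng)
        solve(posts)             # first call: compile / warm up
        @elapsed solve(posts)    # second call: measure
    end
    (mean = mean(times), std = std(times))
end
```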

Leaving the leaderboard aside, I find this kind of real-world benchmark more valuable than micro-benchmarking (although, admittedly, we are not always comparing apples to apples here, since anybody is free to apply whatever algorithmic tricks they like, and so implementations end up differing from language to language).

This was not merged yet, so it still has a large total-time value.


This function has the funny property that if I add @inbounds it actually becomes slower (given we’re using MVector now).

As an alternative to the highly optimized solution built out of manual loops, I’d like to propose a solution using higher-level data manipulation tools.
It’s about 2x slower for me than the code in the repo, but shorter and easier to understand what’s actually going on, once you get used to each primitive.
It uses NamedTuples instead of custom structs.

using DataManipulation, StructArrays
import Accessors  # for Accessors.getproperties below

function related_2(posts_raw)
	# arguably the processing is more convenient with namedtuples instead of custom structures
	# converting here, but potentially this could be created when reading in the first place:
	posts = @p posts_raw |> map(Accessors.getproperties) |> StructArray |> @insert __.ix = eachindex(__)

	tagmap = @p let
		posts
		flatmap(_.tags, (p, tag) -> (; p.ix, tag))
		group(_.tag)
		map(map(x->x.ix, _))
	end

	nsharedtags = zeros(Int, length(posts))
	@p let
		posts
		map() do post
			nsharedtags .= 0
			for t in post.tags
				@views nsharedtags[tagmap[t]] .+= 1
			end
			nsharedtags[post.ix] = 0

			related = partialsort(posts, 1:5; by=p -> nsharedtags[p.ix], rev=true)
			(; post..., related)
		end
		collect
	end
end


# code above works without these definitions,
# but Base.partialsort is muuuch slower
using DataStructures: nlargest
partialsort(v, k; kwargs...) = partialsort(v, Base.OneTo(k); kwargs...)
partialsort(v, k::Base.OneTo; by, rev=false) = @p v |> nlargest(length(k), eachindex(__); by=i -> by(__[i])) |> map(v[_])

Julia went from last place to the top (though it’s no longer the fastest). It’s not news that Julia code can be optimized (and the now-optimized version isn’t as scalable as, e.g., Go’s).

What I would be curious about, given this known problem with unoptimized code, is whether it could be made faster by improvements to Julia itself (i.e. with no changes to the benchmark code). I’m ok with it, since I know about it, but dynamic code is maybe slower than it needs to be? And it is at least possibly surprising to some (new) users.

It’s sort of good that unoptimized code is not too fast… then people would “know”, or not. Unoptimized code can’t, of course, be made as fast, but what would be a reasonable margin?


Do you mean like LittleDict? It’s often much faster if your Dict is small. But note it’s O(n), which is why it’s not the default (though I think it might be ok for Julia itself, i.e. for what Julia needs when e.g. compiling?).

I did think up a hybrid, LittleDict plus a regular Dict, that I or someone should implement: first scan a limited n entries LittleDict-style, then fall back to the regular Dict, so lookup is back to O(1). It could be ok for Base, or not…
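A rough sketch of that hybrid, with made-up names (HybridDict, SMALL_MAX) and only the two basic operations; a real version would need the full AbstractDict interface:

```julia
# Hypothetical sketch: scan a small inline pair list first (LittleDict-style),
# then fall back to a regular Dict. All names here are invented for illustration.
struct HybridDict{K,V}
    small::Vector{Pair{K,V}}  # scanned linearly; cache-friendly for few keys
    big::Dict{K,V}            # O(1) fallback once `small` is full
end
HybridDict{K,V}() where {K,V} = HybridDict{K,V}(Pair{K,V}[], Dict{K,V}())

const SMALL_MAX = 8  # assumed cutoff; would need tuning

function Base.setindex!(d::HybridDict{K,V}, v, k) where {K,V}
    for (i, p) in pairs(d.small)
        first(p) == k && (d.small[i] = k => v; return d)
    end
    if haskey(d.big, k) || length(d.small) >= SMALL_MAX
        d.big[k] = v
    else
        push!(d.small, k => v)
    end
    return d
end

function Base.getindex(d::HybridDict, k)
    for p in d.small
        first(p) == k && return last(p)
    end
    return d.big[k]
end
```

Lookups that hit the small vector avoid hashing entirely; once it overflows, new keys go to the regular Dict, so worst-case lookup stays O(1) plus a bounded scan.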

That’s not the same then, it seems, though I guess it’s helpful for something.


Are they? And would this PermFast here help in this case:

The issue you linked was closed in 2022 as “greatly improved”, then reopened again (otherwise it would be incredible that a 2012 issue was still open).


I tried another approach, specifically the one Go took, which is to insert at the correct location instead of doing the maxv[j-1], maxv[j] dance every time there’s a new top-5 candidate:

function fastmaxindex!(xs::Vector{Int64}, topn, maxn, maxv)
    maxn .= 1
    _min = 0
    maxv .= 0
    for (i, x) in enumerate(xs)
        x <= _min && continue

        # insert x at its sorted position; shift the tail down,
        # iterating backwards so nothing is overwritten prematurely
        pos = findfirst(<(x), maxv)
        for j in topn-1:-1:pos
            maxv[j+1] = maxv[j]
            maxn[j+1] = maxn[j]
        end
        maxv[pos] = x
        maxn[pos] = i
        _min = last(maxv)
    end

    return maxn
end

Unfortunately this is slower than yours.


what is this? I can’t find anything in Julia that has this name

Given we’re operating on a small (length-5) MVector, I also tried to unroll this via Base.Cartesian.@nexprs, but that made things slower again.
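For the record, an unrolled length-5 scan via @nexprs might look something like this (a made-up minimal example, not the benchmark code):

```julia
using Base.Cartesian: @nexprs

# Index of the maximum of a length-5 vector, with the comparison
# chain fully unrolled at macro-expansion time (no loop at runtime).
function unrolled_argmax5(v)
    best = 1
    @nexprs 5 j -> (v[j] > v[best] && (best = j))
    return best
end
```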


It’s actually in OrderedCollections.jl (and thus also in DataStructures.jl, which includes and extends it), so that lighter, faster-loading dependency may be all you need, at least at one point it was.

OrderedCollections is still faster to import (or use):

julia> @time import OrderedCollections
  0.020884 seconds (21.83 k allocations: 1.599 MiB)

julia> @time using DataStructures
  0.109690 seconds (55.32 k allocations: 4.020 MiB)

Both are then slower still if you use using rather than import, which is not something I’d considered before. So I’ll use import from now on, unless people think you somehow make up for it later if you use using…

@Palli, I would say that the original (the 600 ms one) was just a straightforward implementation without any kind of type instability. Maybe there was some unnecessary allocation (I don’t remember). And I remember that the original implementation actually used partialsortperm! (which was the sensible thing to do).

However, the point is that one might expect to get reasonable performance out of the box, without needing to implement manual stuff like I did with fastmaxindex!.

For example, the Go version seems pretty straightforward and gets decent speed without acrobatics.

So we brought Julia to the top, but now what? One can honestly ask: what was the development-time cost of getting a performant solution for a straightforward, real-world task?

This is in no way a critique (and even less finger-pointing); I’m just trying to be realistic here.


Maybe we can try a different approach altogether and not get anchored (in the bias sense) to how I implemented fastmaxindex!.

That’s what I did by copying Go’s implementation, which inserts and shifts everything. But that is slower than yours in Julia.

Java is using the same algorithm as Go. Hmm, maybe we should try this again with a better implementation; maybe I did something wrong.

It’s also possible that you have a better sorting algorithm than they do, which means we’re extra slow in some other part.
