Also, partialsort dispatches between partial-scratch-quicksort (good for long inputs) and insertionsort of the whole input (good for short inputs), but it does not use partial-insertionsort, which wins when the number of elements being selected is small (asymptotically, when that number is less than log(length)).
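For concreteness, here is a minimal sketch of what a partial-insertionsort can look like; this is not Base’s implementation, just an illustration (the function name is hypothetical). After the call, v[1:k] holds the k smallest elements in sorted order, and elements that cannot enter the prefix cost only one comparison each:

# Hypothetical sketch of partial-insertionsort, not Base's actual code.
function partial_insertionsort!(v::AbstractVector, k::Integer)
    for i in 2:lastindex(v)
        x = v[i]
        if i > k
            x < v[k] || continue  # cannot enter the smallest-k prefix
            v[i] = v[k]           # evict the current k-th smallest
            j = k
        else
            j = i
        end
        # standard insertion step within the sorted prefix
        while j > 1 && x < v[j-1]
            v[j] = v[j-1]
            j -= 1
        end
        v[j] = x
    end
    return v
end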
The underlying data (used for the leaderboard) changes from run to run, and it seems to affect the results regardless of the underlying implementations.
For example, Zig was not updated in 4 days and still jumped to the top in the current run (the same implementation did worse than Java (GraalVM) a day earlier).
Maybe a better way to optimize is to use the random post generator and evaluate improvements by averaging the results over multiple randomly generated datasets.
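A minimal sketch of what that could look like (generate_posts and related are hypothetical stand-ins for the repo’s generator and solution):

# Hypothetical sketch: average timings over several freshly generated datasets.
using Statistics

function benchmark_avg(nruns::Int = 10; nposts::Int = 5_000)
    times = map(1:nruns) do seed
        posts = generate_posts(nposts; seed)  # new random dataset each run
        @elapsed related(posts)               # time one full pass
    end
    return (; mean = mean(times), std = std(times))
end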
Leaving the leaderboard aside - I find this kind of real-world benchmark more valuable than micro-benchmarking (although, admittedly, we are not always comparing apples to apples here: everybody is free to use whatever algorithmic tricks they like, so the implementations end up differing from language to language).
This was not merged yet - so it still has a large total time value.
As an alternative to the highly optimized solution built out of manual loops, I’d like to propose a solution using higher-level data manipulation tools.
It’s about 2x slower for me than the code in the repo, but it’s shorter and it’s easier to understand what’s actually going on, once you get used to each primitive.
It uses NamedTuples instead of custom structs.
using DataManipulation, StructArrays
using Accessors  # for getproperties below; may already be re-exported by DataManipulation

function related_2(posts_raw)
    # arguably the processing is more convenient with namedtuples instead of custom structures;
    # converting here, but potentially this could be created when reading in the first place:
    posts = @p posts_raw |> map(Accessors.getproperties) |> StructArray |> @insert __.ix = eachindex(__)

    # map each tag to the indices of all posts carrying it
    tagmap = @p let
        posts
        flatmap(_.tags, (p, tag) -> (; p.ix, tag))
        group(_.tag)
        map(map(x -> x.ix, _))
    end

    # reusable buffer: nsharedtags[i] = number of tags post i shares with the current post
    nsharedtags = zeros(Int, length(posts))
    @p let
        posts
        map() do post
            nsharedtags .= 0
            for t in post.tags
                @views nsharedtags[tagmap[t]] .+= 1
            end
            nsharedtags[post.ix] = 0  # a post is not related to itself
            related = partialsort(posts, 1:5; by=p -> nsharedtags[p.ix], rev=true)
            (; post..., related)
        end
        collect
    end
end
# code above works without these definitions,
# but Base.partialsort is muuuch slower
using DataStructures: nlargest
partialsort(v, k; kwargs...) = partialsort(v, Base.OneTo(k); kwargs...)
partialsort(v, k::Base.OneTo; by, rev=false) = @p v |> nlargest(length(k), eachindex(__); by=i -> by(__[i])) |> map(v[_])
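For illustration, a minimal hypothetical usage with in-memory NamedTuples standing in for the benchmark’s parsed posts (any element type with a tags field should work; nlargest simply returns fewer than 5 results when there aren’t enough posts):

posts_raw = [(title = "a", tags = ["x", "y"]),
             (title = "b", tags = ["y", "z"]),
             (title = "c", tags = ["z", "x"])]
related_2(posts_raw)  # each element gains an `ix` and a `related` field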
Julia went from last to the top (though it is no longer the fastest). It’s not news that Julia code can be optimized (and the now-optimized version isn’t as scalable as e.g. Go’s either).
What I would be curious about: now that we know the problem with the unoptimized code, could it be made faster by improvements to Julia itself, i.e. with no changes to the code/benchmark? I’m ok with it, since I know about it, but dynamic code is maybe slower than it needs to be, and at least possibly surprising to some (new) users?
It’s sort of good that unoptimized code is not too fast… then people would “know”, or not. Unoptimized code can’t of course be made as fast, but what would be a reasonable margin?
Do you mean LittleDict? It’s often much faster if your Dict is small. But note that its lookups are O(n), which is why it’s not the default (it might still be fine for Julia itself, i.e. for what Julia needs when e.g. compiling?).
I did think up a hybrid, LittleDict plus a regular Dict, that I or someone should implement; then lookups are back to O(1): scan a bounded number of entries in the LittleDict part first, then fall back to the regular Dict. It could be ok for Base, or not…
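A minimal sketch of that hybrid, assuming a fixed cutoff for the linearly scanned part (HybridDict and its fields are hypothetical names, not an existing package type):

# Hypothetical hybrid: a LittleDict for the first `cutoff` entries (fast linear
# scan), spilling into a regular Dict so lookups stay O(1) overall.
using OrderedCollections: LittleDict

struct HybridDict{K,V}
    small::LittleDict{K,V}
    big::Dict{K,V}
    cutoff::Int
end
HybridDict{K,V}(; cutoff = 32) where {K,V} =
    HybridDict{K,V}(LittleDict{K,V}(), Dict{K,V}(), cutoff)

function Base.getindex(h::HybridDict, k)
    # scan the small part first, then fall back to the hash table
    haskey(h.small, k) ? h.small[k] : h.big[k]
end

function Base.setindex!(h::HybridDict, v, k)
    if haskey(h.small, k) || length(h.small) < h.cutoff
        h.small[k] = v
    else
        h.big[k] = v
    end
    return h
end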
That’s not the same thing, it seems, though I guess it’s helpful for something.
I tried some other approaches, especially the one Go took, which is to insert at the correct location instead of doing the maxv[j-1], maxv[j] dance every time there’s a new top-5 candidate:
function fastmaxindex!(xs::Vector{Int64}, topn, maxn, maxv)
    # maxv holds the current top-`topn` values in descending order, maxn their indices
    maxn .= 1
    maxv .= 0
    _min = 0
    for (i, x) in enumerate(xs)
        x <= _min && continue
        # position where x belongs in the descending top list
        pos = findfirst(<(x), maxv)
        # shift the tail down one slot, back to front, so each value moves exactly once
        for j in topn-1:-1:pos
            maxv[j+1] = maxv[j]
            maxn[j+1] = maxn[j]
        end
        maxv[pos] = x
        maxn[pos] = i
        _min = last(maxv)
    end
    return maxn
end
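For reference, a small usage example (the buffers are preallocated and reused across calls, which is the point of the in-place version):

counts = [3, 0, 7, 1, 5, 2, 9, 4]
maxn = ones(Int, 5)
maxv = zeros(Int, 5)
fastmaxindex!(counts, 5, maxn, maxv)  # returns maxn == [7, 3, 5, 8, 1]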
It’s actually in OrderedCollections.jl (and thus also in DataStructures.jl, since the latter includes and adds to it), so that smaller dependency may be all you need, at least at one point it was.
It’s also faster to load:
julia> @time import OrderedCollections
0.020884 seconds (21.83 k allocations: 1.599 MiB)
julia> @time using DataStructures
0.109690 seconds (55.32 k allocations: 4.020 MiB)
Both are slower still if you load them with using rather than import, which is not something I had considered before. So I’ll use import from now on, unless people think you somehow make up for it later if you use using…
@Palli, I would say that the original (the 600ms one) was just a straightforward implementation without any kind of type-instability. Maybe there was some unnecessary allocation (I don’t remember). And I remember that the original implementation actually used partialsortperm! (which was the sensible thing to do).
However - the point is that one might expect to get reasonable performance out of the box, without the need to start implementing manual stuff like I did with fastmaxindex!.
For example, the Go version seems pretty straightforward without needing acrobatics for a decent speed.
So we brought Julia to the top - but now what? One can honestly ask: what was the development-time cost of getting a performant solution for a straightforward, real-world task?
This is in no way a critique (and even less finger-pointing) - just trying to be realistic here.