Funny Benchmark with Julia (no longer) at the bottom

Gotcha, so you can’t use any library that disables bounds checking, except for stuff included as part of the standard library. It seems kind of arbitrary, particularly for a language like Julia where we’ve seen functionality pushed out of the stdlib into packages, but it’s at least a reasonably well-defined criterion. Thanks.

3 Likes

Yeah, I would say, for example, that if the Julia compiler figures out you don’t need a bounds check for

for i in eachindex(ary)
    ary[i]
end

it’s totally fine. But in this case we have the pattern of using the value of one vector to index another, and I don’t see a way to tell the compiler it can ignore bounds checks.
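
To make the distinction concrete, here is a minimal sketch (a hypothetical function, not from the benchmark) of the data-dependent indexing pattern: the indices are runtime values, so the compiler can’t prove they are in bounds.

# Sketch of the data-dependent indexing pattern (hypothetical example).
# `idxs` holds runtime values, so the compiler cannot prove that
# `vals[idxs[i]]` is in bounds and has to keep the check inside the loop.
function gather(vals::Vector{Float64}, idxs::Vector{Int})
    s = 0.0
    for i in eachindex(idxs)   # this bounds check on idxs can be elided
        s += vals[idxs[i]]     # this one on vals cannot
    end
    return s
end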

3 Likes

I seem to remember some recent posts by core devs (cannot find them now) saying that people should stop using @inbounds, not just for safety, but also because it could prevent certain types of analysis and optimization.

Can anyone recall or explain what that was all about?

2 Likes

I’m not finding a load of specific examples, but one was offered in this comment on #48245. Some other topics worth a glance might be #50107 and #50641.

1 Like

It’s unlikely that a priority-queue approach will beat the manual top-five selection shown above (that is, if you just throw the entire array into a priority queue).

function fastmaxindex!(xs::Vector{Int}, topn, maxn, maxv)
    ...
end

function special_sort!(xs::Vector{T}) where T
    # Only sorts the first 32 elements of a vector in > order.
    Lft, Rgt = 15, 18
    if xs[16] < xs[17]
        xs[16], xs[17] = xs[17], xs[16]
    end

    for _ in 1:15
        if xs[Lft] < xs[Rgt]
            xs[Lft], xs[Rgt] = xs[Rgt], xs[Lft]
        end
        Tmp = Rgt
        Val = xs[Tmp]
        while Val > xs[Tmp - 1]
            xs[Tmp] = xs[Tmp - 1]
            Tmp -= 1
        end
        xs[Tmp] = Val

        Tmp = Lft
        Val = xs[Tmp]
        while Val < xs[Tmp + 1]
            xs[Tmp] = xs[Tmp + 1]
            Tmp += 1
        end
        xs[Tmp] = Val

        Rgt += 1
        Lft -= 1
    end
    return
end

function special_heapify!(xs::Vector{T}) where T
    N = length(xs) - 1
    Parent = N ÷ 2
    Child = 2Parent
    while Parent > 0
        CurrParent, CurrChild = Parent, Child
        Val = xs[CurrParent]
        for _ in 1:5
            BiggerChild = CurrChild + (xs[CurrChild + 1] > xs[CurrChild])
            (xs[BiggerChild] <= Val) && break
            xs[CurrParent] = xs[BiggerChild]
            CurrParent = BiggerChild
            CurrChild = 2BiggerChild
            (CurrChild > N) && break
        end
        xs[CurrParent] = Val
        Parent -= 1
        Child -= 2
    end
    return
end

function top_five!(xs::Vector{T}) where T
    special_heapify!(xs)
    xs[32] = xs[end] # Note that we destroy the array here
    special_sort!(xs)
    return
end

"""
julia> Arr = rand(Int, 1_000_000);

julia> top_five!(Arr)

julia> fastmaxindex!(Arr, 5, zeros(Int, 5), zeros(Int, 5))
5-element Vector{Int64}:
 1
 2
 3
 4
 5
"""

Results

julia> U = zeros(Int, 5); V = zeros(Int, 5);

julia> @btime fastmaxindex!(Arr, 5, $U, $V) setup = (Arr = rand(Int, 1_000_000))     
  552.900 μs (0 allocations: 0 bytes)

julia> @btime top_five!(Arr) setup = (Arr = rand(Int, 1_000_000))
  3.377 ms (0 allocations: 0 bytes)

I can make fastmaxindex! faster, though. Putting in a sentinel (i.e. topn += 1; maxv[begin] = typemax(Int)) to remove the idx == 1 && break is not faster, due to the bounds checking it forces.

function fastermaxindex!(xs::Vector{Int}, topn, maxv, maxn)
    maxn .= 1
    maxv .= 0
    req = 0
    for (i, x) in enumerate(xs)
        x > req || continue
        idx = topn 
        u = maxv[idx - 1]
        req = min(u, x)
        while x > u
            maxv[idx] = u
            maxn[idx] = maxn[idx - 1]
            idx -= 1
            idx == 1 && break
            u = maxv[idx - 1]
        end

        maxv[idx] = x
        maxn[idx] = i
    end
    
    return maxn
end

Results

julia> U = zeros(Int, 5); V = zeros(Int, 5);

julia> @btime fastmaxindex!(Arr, 5, $U, $V) setup = (Arr = rand(Int, 1_000_000))    
  558.600 μs (0 allocations: 0 bytes)

julia> @btime fastermaxindex!(Arr, 5, $U, $V) setup = (Arr = rand(Int, 1_000_000))  
  395.900 μs (0 allocations: 0 bytes)

Note that there is some variance in the speed of the algorithms based on the input. For example, if the array is sorted smallest to largest, the fastest algorithm is actually a variant of the priority-queue algorithm I have above (the priority queue has low variance in timings and can be brought to about 2 ms, while in that sorted-smallest-to-largest scenario fastmaxindex! takes 15 ms and fastermaxindex! takes around 3 ms).

1 Like

Another thing I tried was to unroll the loop using @generated, but this yielded no improvement in the random case.

@generated function unrollmaxindex!(xs::Vector{Int}, topn::Val{T}, maxv, maxn) where T
    quote
        maxn .= 1
        maxv .= 0
        req = 0
        for (i, x) in enumerate(xs)
            x > req || continue
            $(
                if T == 1
                    quote
                        maxv[1] = req = x
                        maxn[1] = i
                    end
                else # T > 1
                    Exprs = Expr[]
                    push!(Exprs, :(u = maxv[$(T - 1)]))
                    push!(Exprs, :(req = min(u, x)))
                    for j in T:-1:2
                        push!(Exprs, :(x ≤ u && begin
                                                    maxv[$j] = x
                                                    maxn[$j] = i
                                                    continue
                                                end))                    
                        push!(Exprs, :(maxv[$j] = u))
                        push!(Exprs, :(maxn[$j] = maxn[$(j - 1)]))
                        if j != 2
                            push!(Exprs, :(u = maxv[$(j - 2)]))
                        else
                            push!(Exprs, :(maxv[$(j - 1)] = x))
                            push!(Exprs, :(maxn[$(j - 1)] = i))
                        end
                    end
                    Expr(:block, Exprs...)
                end
            )
        end
    
        return maxn
    end
end

However, this does yield an improvement on the worst case.

julia> @btime fastermaxindex!(Arr, 5, $U, $V) setup = (Arr = collect(1:1_000_000))
  3.271 ms (0 allocations: 0 bytes)

julia> @btime unrollmaxindex!(Arr, Val{5}(), $U, $V) setup = (Arr = collect(1:1_000_000))
  2.237 ms (0 allocations: 0 bytes)

This is the fastest version of the function I have so far, even beating my optimised priority-queue methods. Though for the benchmark, I don’t know how much compilation time this would add.

1 Like

If you test this you may find it to be slower than the current one (it was slower when I tested it locally on the real problem).

It’s about the same speed as the old one.

Notice that the current U and V are MVectors.

Your statement about Julia being fastest is now outdated (and maybe also some of my text below, which is from a DM that I was asked to post to the thread):

Julia is actually now 6% slower than Go, but what is worse, it scales worse: it is 18% slower than Go when scaling to “15k posts” (the new column).

Julia’s startup cost is known, and after that fixed, known cost (plus trivial compilation time) I thought we would be on equal footing, so this is bad. It’s understandable that Zig is now fastest in the first column, and we could likely match or beat it, but probably only with StaticCompiler.jl. It’s sort of interesting to me that Zig loses this edge when scaling up; shouldn’t it scale as well as or better than Go?

It’s very intriguing that Java OOMs (the only language to do so, while its Graal implementation doesn’t).

Would a LittleDict help? Or a hybrid of such (fixed size) and a regular Dict? There are also other Dict implementations available.
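
Not tested against the benchmark, but here is a sketch of what the LittleDict swap might look like (LittleDict comes from OrderedCollections.jl and keeps keys and values in plain vectors, so lookups are a linear scan that only pays off while the dict stays small):

using OrderedCollections  # provides LittleDict

# Hypothetical tag-interning sketch with a LittleDict instead of a Dict.
tagmap = LittleDict{String, Int}()
for tag in ("julia", "go", "zig", "julia")
    get!(tagmap, tag, length(tagmap) + 1)  # assign the next id on first sight
end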

I believe you can avoid bounds checks not just with @inbounds, which is directly disallowed (supposedly since it’s unsafe), but also with asserts (and/or clever use of eachindex and slicing). Arguably, using asserts would be bending the rules, and just as unsafe: a different form of saying @inbounds.

I’m not sure, though; I think asserts in the right places could be safe in all cases (otherwise you abort), and you move the bounds checking out of the loops, where it is the only real problem. And you could say the compiler did it…
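
A minimal sketch of that idea, with a made-up function: validate once before the loop and hope the compiler can then drop the per-iteration check. Whether it actually does is not guaranteed, so treat this as an assumption rather than a recipe.

# Hypothetical example: hoist the precondition out of the hot loop.
function sum_first_n(xs::Vector{Float64}, n::Integer)
    # One explicit check, outside the loop (abort otherwise).
    checkbounds(xs, 1:n)
    s = 0.0
    for i in 1:n
        s += xs[i]   # ideally the compiler can now prove this is in bounds
    end
    return s
end

For the benchmark’s actual hot loop this doesn’t directly apply, since (as noted above) the indices come from another vector’s values.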

I assumed that to be the reason. I haven’t (yet) looked at the code in much detail. If there are unnecessary allocations, then maybe Bumper.jl helps. It sort of gets you Mojo’s advantage, and I think it should be added to Julia as a stdlib, and also that its use should be made transparent by the compiler. I see that by now it has changed and allocates 1/8th of physical memory by default (per task), which sounds bad but is actually great when you understand why (it’s only virtual memory and Linux overcommits; I would, though, be worried about Windows, which doesn’t). It very likely used @inbounds, but I think that could be avoided as explained.
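
For context, this is roughly what Bumper.jl usage looks like (a sketch based on its @no_escape/@alloc interface; the function and names here are made up, not the benchmark code):

using Bumper

# Hypothetical sketch: scratch space comes from a bump allocator instead of
# the GC heap and is reclaimed when the @no_escape block ends.
function count_big(xs::Vector{Int}, cutoff::Int)
    @no_escape begin
        flags = @alloc(Bool, length(xs))   # bump-allocated scratch array
        for i in eachindex(xs)
            flags[i] = xs[i] > cutoff
        end
        count(flags)   # the returned value must not keep `flags` alive
    end
end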

1 Like

this benchmark does not contain startup or precompile time

Lilith has pointed out that Dict-related work accounts for 1% of the run time, so it’s not just that.


I assert it’s impossible to remove the bounds check, because we’re using a value from one array to index another. In fact, I have confirmed that the Go implementation doesn’t elide the bounds check either:

> go run -gcflags="-d=ssa/check_bce" main.go
# command-line-arguments
./main.go:59:28: Found IsInBounds
./main.go:61:20: Found IsInBounds <--------- this is the hotspot
./main.go:64:18: Found IsInBounds
./main.go:79:15: Found IsSliceInBounds
./main.go:79:29: Found IsSliceInBounds
./main.go:82:9: Found IsInBounds
./main.go:89:26: Found IsInBounds
./main.go:93:18: Found IsInBounds
./main.go:94:18: Found IsInBounds
2 Likes

Here is a completely different implementation with fairly comparable speed: it encodes each post’s tag set as a bitmask and counts shared tags with count_ones. This might be “cheating” with regard to the rules, but at least it is of some interest.

function fastermaxindex!(tagmask::Vector{T}, i, topn, maxn, maxv) where T
    maxn .= 1
    maxv .= 0
    req = 0

    A = tagmask[i]

    for (j, mask) in enumerate(tagmask)
        i == j && continue
        x = count_ones(A & mask)
        x > req || continue
        idx = topn 
        u = maxv[idx - 1]
        req = min(u, x)
        while x > u
            maxv[idx] = u
            maxn[idx] = maxn[idx - 1]
            idx -= 1
            idx == 1 && break
            u = maxv[idx - 1]
        end

        maxv[idx] = x
        maxn[idx] = j
    end

    return maxn
end

for Type in [UInt128, Int]
    @eval function getrelatedposts!(tagmap, posts, topn, maxn, maxv, relatedposts, ::Val{$Type})
        tagmask = zeros($Type, length(posts))
        for (idx, post) in enumerate(posts)
            for tag in post.tags
                tagmask[idx] |= one($Type) << tagmap[tag]
            end
        end

        for (i, post) in enumerate(posts)
            fastermaxindex!(tagmask, i, topn, maxn, maxv)
            relatedpost = RelatedPost(post._id, post.tags, SVector{topn}(@view posts[maxn]))
            relatedposts[i] = relatedpost
        end

        return
    end
end

function fasterrelated(posts)
    tagmap = Dict{String, Int}()

    Ct = 0
    for (idx, post) in enumerate(posts)
        for tag in post.tags
            get!(() -> (Ct += 1), tagmap, tag)
        end
    end
    
    topn = 5
    maxn = MVector{topn, Int}(undef)
    maxv = MVector{topn, Int}(undef)
    
    relatedposts = Vector{RelatedPost}(undef, length(posts))
    if Ct < 64
        getrelatedposts!(tagmap, posts, topn, maxn, maxv, relatedposts, Val{Int}())
    else
        getrelatedposts!(tagmap, posts, topn, maxn, maxv, relatedposts, Val{UInt128}())
    end
    
    return relatedposts
end

It is marginally faster on my machine with the small test data. I’m not sure how well my solution would scale comparatively.

julia> @benchmark fasterrelated($posts)
BenchmarkTools.Trial: 334 samples with 1 evaluation.
 Range (min … max):  14.020 ms …  19.219 ms  ┊ GC (min … max): 0.00% … 17.58%
 Time  (median):     14.696 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.994 ms ± 985.883 μs  ┊ GC (mean ± σ):  1.76% ±  4.95%

        ▇█▃▃▃▄
  ▂▅▅▇▇███████▆▅▇▃▅▃▂▄▃▂▃▂▁▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▂▁▁▂▃▂▂▃▃▂▃▁▃▂▃▃▁▃ ▃
  14 ms           Histogram: frequency by time         18.5 ms <

 Memory estimate: 3.66 MiB, allocs estimate: 55014.

julia> @benchmark related($posts)
BenchmarkTools.Trial: 303 samples with 1 evaluation.
 Range (min … max):  15.764 ms …  18.739 ms  ┊ GC (min … max): 0.00% … 9.99%
 Time  (median):     16.419 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.472 ms ± 399.564 μs  ┊ GC (mean ± σ):  0.13% ± 1.00%

          ▃▁▂▅▅▄▁▃▁ █▃▂▁
  ▃▁▄▃▃▃▅▇█████████▆█████▇▅▅▄▇▆█▅▆▃▆▃▄▇▅▃▃▃▃▁▁▃▁▁▁▃▁▁▁▁▃▁▁▁▁▁▃ ▄
  15.8 ms         Histogram: frequency by time         17.9 ms <

 Memory estimate: 1.29 MiB, allocs estimate: 181.
4 Likes

Let’s go concurrent.

I did the most naive thing possible: add a parallel loop with a degree of parallelism equal to nthreads(). Nothing fancy.

If you notice anything flagrantly stupid, please slap me.
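
Roughly the shape of the change (a sketch, not the actual PR; tmap! is a made-up helper):

using Base.Threads: @threads

# Hypothetical sketch of the naive parallelisation: @threads splits the index
# range into nthreads() chunks, one task per thread, and each task writes to
# a disjoint slice of the output vector.
function tmap!(f, out::AbstractVector, xs::AbstractVector)
    @threads for i in eachindex(xs, out)
        out[i] = f(xs[i])
    end
    return out
end

In the real code each task would also need its own scratch buffers (the maxn/maxv MVectors), since the serial version reuses a single pair.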

On my local machine with --threads 4 (because the testing server now has 4 vCPUs):

21ms → 8ms

Concurrent Julia - testing the waters by algunion · Pull Request #178 · jinyus/related_post_gen (github.com)

2 Likes

They updated the machine:

NB: The benchmark runs on the Digital Ocean Droplet (CPU-Optimized).

  • CPU: 4 vCPUs
  • RAM: 8GB
  • OS: Ubuntu 22.04

@jling ,

I revived your PR on my machine and added PrecompileTools.

I use this workflow to ensure the precompilation of the related function and its methods:

using PrecompileTools

@setup_workload begin
    json_string = read("../posts.json", String)
    posts = JSON3.read(json_string, Vector{PostData})

    @compile_workload begin
        related(posts)
    end
end

However, when first calling related using the precompiled project, I get:

allocations: 3.928 MiB, 47.20% compilation time

If I remove the PrecompileTools workflow I get:

1.29 M allocations: 83.222 MiB, 986.72% compilation time

So it is clear that PrecompileTools is doing a great amount of compilation there - but not enough to just be able to call the function right away without some additional compilation?

I didn’t care too much about precompilation until now (it always sufficed to just do a preemptive function call).

Any suggestions?

Let’s hope that this benchmark becomes more popular 🙂

And the first shot at the parallel implementation seems decent, too:

So that everybody knows: the last jump is due to @Lilith’s improvement. My work on the parallel implementation was just adding a naive parallel loop on top of the standard implementation.

8 Likes

Exercise for the reader: plot speed vs. lines of code (excluding comments, docstrings, and blank lines) for the various languages.

7 Likes

There is the bucket queue algorithm for priority queues, which is fast for a small number of values/buckets. Here is a Julia implementation (26 lines): https://github.com/korbinian90/ROMEO.jl/blob/master/src/priorityqueue.jl
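
For readers unfamiliar with the idea, here is a minimal sketch of the concept (hypothetical, not the linked ROMEO.jl code): priorities are small non-negative integers, so each priority value gets its own bucket, and popping the maximum is a scan down from the top bucket.

# Minimal, hypothetical bucket-queue sketch for small integer priorities.
struct BucketQueue
    buckets::Vector{Vector{Int}}   # buckets[p + 1] holds items with priority p
end

BucketQueue(maxprio::Int) = BucketQueue([Int[] for _ in 0:maxprio])

push_item!(q::BucketQueue, item::Int, prio::Int) = push!(q.buckets[prio + 1], item)

function pop_highest!(q::BucketQueue)
    for p in length(q.buckets):-1:1
        isempty(q.buckets[p]) || return pop!(q.buckets[p])
    end
    return nothing   # queue is empty
end

A real implementation (like the linked one) would track the current top non-empty bucket instead of rescanning from the top each time.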

Not sure if this can improve anything, since Julia is already fastest. I might try it today.

6 Likes

No more comment.

Thanks @Lilith for spotting and fixing this:

8 Likes

Compilation time is not relevant for this benchmark. So, if you can confirm the improvement on the dataset, feel free to do a PR.

Also, if you want to test it on multiple random datasets, you can generate new datasets of various sizes using (the script is found in the root directory of the repo):

python3 gen_fake_posts.py 30000

Oh, I tried it, but the bucket queue was 10x slower than fastmaxindex! for retrieving the largest 5 elements.
Two reasons, I think:

  1. It’s hard to preallocate and reuse the memory with a bucket queue.
  2. It solves a slightly different problem well: iteratively enqueueing and dequeueing over a long run.

fastmaxindex! looks super efficient for the given problem.

2 Likes