Gotcha, so you can’t use any library that disables bounds checking, except stuff included as part of the standard library. It seems kinda arbitrary, particularly for a language like Julia where we’ve seen stuff pushed out of stdlib into packages, but it’s at least a reasonably well defined criterion. Thanks.
yeah, I would say, for example, if the Julia compiler figures out you don’t need a bounds check for
for i in eachindex(ary)
    ary[i]
end
it’s totally fine. But in this case, we have the pattern of using the values of one vector to index another, and I don’t see a way to tell the compiler it can ignore bounds checks.
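Here is a minimal illustration of that pattern (my own example, not benchmark code): the index itself comes from another vector’s data, so the compiler cannot prove it is in bounds.
function gather_sum(values::Vector{Float64}, idxs::Vector{Int})
    s = 0.0
    for i in eachindex(idxs)
        s += values[idxs[i]]   # idxs[i] is data-dependent, so the bounds check stays
    end
    return s
end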
I seem to remember some recent posts by core devs (cannot find them now) saying that people should stop using @inbounds, not just for safety, but also because it could prevent certain types of analysis and optimization.
Can anyone recall or explain what that was all about?
I’m not finding a load of specific examples, but one was offered in this comment on #48245. Some other topics worth a glance might be #50107 and #50641.
It’s unlikely that a priority-queue approach will beat the manual top-five selection as per above. (That is, if you just throw the entire array into a priority queue.)
function fastmaxindex!(xs::Vector{Int}, topn, maxn, maxv)
...
end
function special_sort!(xs::Vector{T}) where T
    # Sorts the first 32 elements of the vector into descending (>) order,
    # growing two sorted runs outwards from the middle.
    Lft, Rgt = 15, 18
    if xs[16] < xs[17]
        xs[16], xs[17] = xs[17], xs[16]
    end
    for _ in 1:15
        # Ensure xs[Lft] >= xs[Rgt]; this also acts as a sentinel so the
        # insertions below cannot run past the ends of the region.
        if xs[Lft] < xs[Rgt]
            xs[Lft], xs[Rgt] = xs[Rgt], xs[Lft]
        end
        # Insert xs[Rgt] into the sorted run to its left.
        Tmp = Rgt
        Val = xs[Tmp]
        while Val > xs[Tmp - 1]
            xs[Tmp] = xs[Tmp - 1]
            Tmp -= 1
        end
        xs[Tmp] = Val
        # Insert xs[Lft] into the sorted run to its right.
        Tmp = Lft
        Val = xs[Tmp]
        while Val < xs[Tmp + 1]
            xs[Tmp] = xs[Tmp + 1]
            Tmp += 1
        end
        xs[Tmp] = Val
        Rgt += 1
        Lft -= 1
    end
    return
end
function special_heapify!(xs::Vector{T}) where T
    # Partial heapify: sift each parent down at most 5 levels, which is enough
    # to move the largest values towards the front for the top-five selection.
    N = length(xs) - 1
    Parent = N ÷ 2
    Child = 2Parent
    while Parent > 0
        CurrParent, CurrChild = Parent, Child
        Val = xs[CurrParent]
        for _ in 1:5
            # Pick the larger of the two children.
            BiggerChild = CurrChild + (xs[CurrChild + 1] > xs[CurrChild])
            (xs[BiggerChild] <= Val) && break
            xs[CurrParent] = xs[BiggerChild]
            CurrParent = BiggerChild
            CurrChild = 2BiggerChild
            (CurrChild > N) && break
        end
        xs[CurrParent] = Val
        Parent -= 1
        Child -= 2
    end
    return
end
function top_five!(xs::Vector{T}) where T
    special_heapify!(xs)
    xs[32] = xs[end]  # Note that we destroy the array here
    special_sort!(xs)
    return
end
"""
julia> Arr = rand(Int, 1_000_000);
julia> top_five!(Arr)
julia> fastmaxindex!(Arr, 5, zeros(Int, 5), zeros(Int, 5))
5-element Vector{Int64}:
1
2
3
4
5
"""
Results
julia> U = zeros(Int, 5); V = zeros(Int, 5);
julia> @btime fastmaxindex!(Arr, 5, $U, $V) setup = (Arr = rand(Int, 1_000_000))
552.900 μs (0 allocations: 0 bytes)
julia> @btime top_five!(Arr) setup = (Arr = rand(Int, 1_000_000))
3.377 ms (0 allocations: 0 bytes)
I can make fastmaxindex! faster though. Putting a sentinel (i.e. topn += 1; maxv[begin] = typemax(Int)) to remove the idx == 1 && break is not faster, due to the bounds checking it forces.
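For reference, this is roughly what that rejected sentinel variant would look like (a sketch of my reading of the idea, assuming maxv and maxn get one extra leading slot, i.e. length topn + 1; not the author’s code):
function sentinelmaxindex!(xs::Vector{Int}, topn, maxv, maxn)
    # maxv/maxn must have topn + 1 slots; slot 1 is the sentinel.
    topn += 1
    maxn .= 1
    maxv .= 0
    maxv[begin] = typemax(Int)   # sentinel: nothing can shift past it
    req = 0
    for (i, x) in enumerate(xs)
        x > req || continue
        idx = topn
        u = maxv[idx - 1]
        req = min(u, x)
        while x > u              # terminates at the sentinel, no explicit break needed
            maxv[idx] = u
            maxn[idx] = maxn[idx - 1]
            idx -= 1
            u = maxv[idx - 1]
        end
        maxv[idx] = x
        maxn[idx] = i
    end
    return maxn                  # maxn[2:end] holds the top indices
end
The version that actually turned out faster, fastermaxindex!, is below: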
function fastermaxindex!(xs::Vector{Int}, topn, maxv, maxn)
    # maxv holds the current top values in descending order (maxv[1] largest),
    # maxn the corresponding indices; req is the admission threshold.
    maxn .= 1
    maxv .= 0
    req = 0
    for (i, x) in enumerate(xs)
        x > req || continue
        idx = topn
        u = maxv[idx - 1]
        req = min(u, x)
        while x > u
            maxv[idx] = u
            maxn[idx] = maxn[idx - 1]
            idx -= 1
            idx == 1 && break
            u = maxv[idx - 1]
        end
        maxv[idx] = x
        maxn[idx] = i
    end
    return maxn
end
Results
julia> U = zeros(Int, 5); V = zeros(Int, 5);
julia> @btime fastmaxindex!(Arr, 5, $U, $V) setup = (Arr = rand(Int, 1_000_000))
558.600 μs (0 allocations: 0 bytes)
julia> @btime fastermaxindex!(Arr, 5, $U, $V) setup = (Arr = rand(Int, 1_000_000))
395.900 μs (0 allocations: 0 bytes)
Note that there is some variance in the speed of the algorithms based on the input. For example, if the array is sorted smallest to largest, the fastest algorithm is actually a variant of the priority-queue algorithm I have above (the priority queue has low variance in times and can be brought to about 2 ms, while on that sorted smallest-to-largest scenario fastmaxindex! takes 15 ms and fastermaxindex! takes around 3 ms).
Another thing I tried is to unroll the loop using @generated. But this yielded no improvement on the random case.
# Same insertion logic as fastermaxindex!, but with the inner loop fully
# unrolled for a compile-time topn passed as Val{T}.
@generated function unrollmaxindex!(xs::Vector{Int}, topn::Val{T}, maxv, maxn) where T
    quote
        maxn .= 1
        maxv .= 0
        req = 0
        for (i, x) in enumerate(xs)
            x > req || continue
            $(
                if T == 1
                    quote
                        maxv[1] = req = x
                        maxn[1] = i
                    end
                else # T > 1
                    Exprs = Expr[]
                    push!(Exprs, :(u = maxv[$(T - 1)]))
                    push!(Exprs, :(req = min(u, x)))
                    for j in T:-1:2
                        push!(Exprs, :(x ≤ u && begin
                            maxv[$j] = x
                            maxn[$j] = i
                            continue
                        end))
                        push!(Exprs, :(maxv[$j] = u))
                        push!(Exprs, :(maxn[$j] = maxn[$(j - 1)]))
                        if j != 2
                            push!(Exprs, :(u = maxv[$(j - 2)]))
                        else
                            push!(Exprs, :(maxv[$(j - 1)] = x))
                            push!(Exprs, :(maxn[$(j - 1)] = i))
                        end
                    end
                    Expr(:block, Exprs...)
                end
            )
        end
        return maxn
    end
end
However, this does yield an improvement on the worst case.
julia> @btime fastermaxindex!(Arr, 5, $U, $V) setup = (Arr = collect(1:1_000_000))
3.271 ms (0 allocations: 0 bytes)
julia> @btime unrollmaxindex!(Arr, Val{5}(), $U, $V) setup = (Arr = collect(1:1_000_000))
2.237 ms (0 allocations: 0 bytes)
This is the fastest version of the function I have so far, even beating my optimised priority-queue methods. Though for the benchmark, I don’t know how much compilation time this would add.
If you test this you may find it to be slower than the current one (it was slower when I tested it locally on the real problem).
It’s about the same speed as the old one.
Notice the current U and V are MVectors.
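(MVector comes from StaticArrays.jl; for example:)
using StaticArrays

U = MVector{5, Int}(undef)   # fixed-size mutable buffer; can avoid heap allocation when it doesn’t escape
V = MVector{5, Int}(undef)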
Your statement about Julia being fastest is now outdated (and maybe some of my text below, which is from a DM that it was suggested I post to the thread):
Julia is actually now 6% slower than Go, and what is worse, it scales worse: it is 18% slower than Go when scaling to “15k posts” (the new column).
Julia’s startup cost is known, and after that fixed, known cost (plus trivial compilation time) I thought we would be on equal footing, so this is bad. It’s understandable that Zig is fastest in the first column now, and we could likely match or beat it, but probably only with StaticCompiler.jl. It’s sort of interesting to me that Zig loses this edge; shouldn’t it scale as well as or better than Go?
It’s very intriguing that Java OOMs (the only language to do so), while its Graal implementation doesn’t.
Would a LittleDict help? Or a hybrid of such (fixed size) and a regular Dict? There are also other Dict implementations available.
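For reference, LittleDict comes from OrderedCollections.jl; a tiny usage sketch (not benchmark code), assuming a small tag table:
using OrderedCollections

# A LittleDict stores keys and values in two plain vectors, which tends to be
# fast for small, mostly-read tables.
tagids = LittleDict("julia" => 1, "go" => 2, "zig" => 3)
tagids["go"]            # lookup is a short linear scan
frozen = freeze(tagids) # immutable, tuple-backed variant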
I believe you can avoid bounds checks not just with @inbounds, which is directly disallowed (supposedly since it’s unsafe), but also with asserts (and/or clever use of eachindex and slicing). Arguably using asserts would be bending the rules, and unsafe, just a different form of saying @inbounds.
I’m not sure though; I think asserts in the right places could be safe in all cases (otherwise abort), and you move the bounds checking out of the loops, where it is the only real problem. And you could say the compiler did it…
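To make that concrete, here is a minimal sketch of the idea (my own illustration, not benchmark code): validate all the indices once, before the loop, and leave the per-element accesses as ordinary checked indexing. Whether the compiler actually drops the inner checks is not guaranteed.
function sum_selected(values::Vector{Float64}, idxs::Vector{Int})
    # Check everything up front instead of @inbounds in the loop.
    all(i -> 1 <= i <= length(values), idxs) || throw(BoundsError(values, idxs))
    s = 0.0
    for i in idxs
        s += values[i]   # still written as a normal, checked access
    end
    return s
end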
I assumed that to be the reason. I haven’t (yet) looked at the code in much detail. If there are unnecessary allocations then maybe Bumper.jl helps. It sort of gets you a Mojo advantage, and I think it should be added to Julia as a stdlib, and its use also made transparent by the compiler. I see that by now it has changed and allocates 1/8th of physical memory by default (per Task), which sounds bad but is actually fine when you understand why (it’s only virtual memory and Linux overcommits; I would be worried about Windows, though, which doesn’t). It very likely uses @inbounds, but I think that could be avoided as explained.
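For context, a minimal sketch of the Bumper.jl pattern (my own toy example using its @no_escape/@alloc API, not code from the benchmark):
using Bumper

# Column sums with a bump-allocated scratch vector instead of a fresh GC
# allocation on every call.
function colsum_total(A::Matrix{Float64})
    n = size(A, 2)
    @no_escape begin                # scratch memory is reclaimed when the block exits
        tmp = @alloc(Float64, n)    # allocated from the task-local bump buffer
        for j in 1:n
            tmp[j] = sum(@view A[:, j])
        end
        sum(tmp)                    # only this scalar escapes the block
    end
end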
this benchmark does not contain startup or precompile time
Lilith has pointed out that Dict-related work accounts for 1% of the run time, so it’s not just that.
I assert it’s impossible to remove the bounds check, because we’re using the values from one thing to index another thing. In fact, I have confirmed the Golang implementation doesn’t elide the bounds check either:
> go run -gcflags="-d=ssa/check_bce" main.go
# command-line-arguments
./main.go:59:28: Found IsInBounds
./main.go:61:20: Found IsInBounds <--------- this is the hotspot
./main.go:64:18: Found IsInBounds
./main.go:79:15: Found IsSliceInBounds
./main.go:79:29: Found IsSliceInBounds
./main.go:82:9: Found IsInBounds
./main.go:89:26: Found IsInBounds
./main.go:93:18: Found IsInBounds
./main.go:94:18: Found IsInBounds
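For comparison, a similar check can be done on the Julia side by inspecting the generated IR. This is a small self-contained probe of the same data-dependent-index pattern (my own example, not the benchmark code):
using InteractiveUtils  # for @code_llvm

# The index is read from another array, as in the hotspot above.
lookup(dst::Vector{Int}, src::Vector{Int}, i::Int) = dst[src[i]]

# julia> @code_llvm lookup(zeros(Int, 10), [3, 7], 1)
# The bounds checks show up as conditional branches to blocks that throw a BoundsError.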
Here is a completely different implementation, with fairly comparable speed. This might be “cheating” with regards to the rules, but at least it is of some interest.
function fastermaxindex!(tagmask::Vector{T}, i, topn, maxn, maxv) where T
    maxn .= 1
    maxv .= 0
    req = 0
    A = tagmask[i]
    for (j, mask) in enumerate(tagmask)
        i == j && continue
        # Number of shared tags = number of common set bits.
        x = count_ones(A & mask)
        x > req || continue
        idx = topn
        u = maxv[idx - 1]
        req = min(u, x)
        while x > u
            maxv[idx] = u
            maxn[idx] = maxn[idx - 1]
            idx -= 1
            idx == 1 && break
            u = maxv[idx - 1]
        end
        maxv[idx] = x
        maxn[idx] = j
    end
    maxn
end
for Type in [UInt128, Int]
    @eval function getrelatedposts!(tagmap, posts, topn, maxn, maxv, relatedposts, ::Val{$Type})
        # One bit per tag: build a bitmask for every post.
        tagmask = zeros($Type, length(posts))
        for (idx, post) in enumerate(posts)
            for tag in post.tags
                tagmask[idx] |= one($Type) << tagmap[tag]
            end
        end
        for (i, post) in enumerate(posts)
            fastermaxindex!(tagmask, i, topn, maxn, maxv)
            relatedpost = RelatedPost(post._id, post.tags, SVector{topn}(@view posts[maxn]))
            relatedposts[i] = relatedpost
        end
        return
    end
end
function fasterrelated(posts)
    # Assign each distinct tag a small integer id.
    tagmap = Dict{String, Int}()
    Ct = 0
    for (idx, post) in enumerate(posts)
        for tag in post.tags
            get!(() -> (Ct += 1), tagmap, tag)
        end
    end
    topn = 5
    maxn = MVector{topn, Int}(undef)
    maxv = MVector{topn, Int}(undef)
    relatedposts = Vector{RelatedPost}(undef, length(posts))
    # Use a 64-bit mask when the tag ids fit, otherwise fall back to UInt128.
    if Ct < 64
        getrelatedposts!(tagmap, posts, topn, maxn, maxv, relatedposts, Val{Int}())
    else
        getrelatedposts!(tagmap, posts, topn, maxn, maxv, relatedposts, Val{UInt128}())
    end
    return relatedposts
end
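To illustrate the bitmask trick: each tag gets one bit, so the number of tags two posts share is just the popcount of the AND of their masks.
a = UInt128(0b1011)   # post with tags 0, 1 and 3
b = UInt128(0b0110)   # post with tags 1 and 2
count_ones(a & b)     # == 1: the posts share exactly one tag (tag 1)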
It is marginally faster on my machine with the small test data. I’m not sure how well my solution would scale comparatively.
julia> @benchmark fasterrelated($posts)
BenchmarkTools.Trial: 334 samples with 1 evaluation.
Range (min … max): 14.020 ms … 19.219 ms ┊ GC (min … max): 0.00% … 17.58%
Time (median): 14.696 ms ┊ GC (median): 0.00%
Time (mean ± σ): 14.994 ms ± 985.883 μs ┊ GC (mean ± σ): 1.76% ± 4.95%
▇█▃▃▃▄
▂▅▅▇▇███████▆▅▇▃▅▃▂▄▃▂▃▂▁▁▁▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▂▁▁▂▃▂▂▃▃▂▃▁▃▂▃▃▁▃ ▃
14 ms Histogram: frequency by time 18.5 ms <
Memory estimate: 3.66 MiB, allocs estimate: 55014.
julia> @benchmark related($posts)
BenchmarkTools.Trial: 303 samples with 1 evaluation.
Range (min … max): 15.764 ms … 18.739 ms ┊ GC (min … max): 0.00% … 9.99%
Time (median): 16.419 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.472 ms ± 399.564 μs ┊ GC (mean ± σ): 0.13% ± 1.00%
▃▁▂▅▅▄▁▃▁ █▃▂▁
▃▁▄▃▃▃▅▇█████████▆█████▇▅▅▄▇▆█▅▆▃▆▃▄▇▅▃▃▃▃▁▁▃▁▁▁▃▁▁▁▁▃▁▁▁▁▁▃ ▄
15.8 ms Histogram: frequency by time 17.9 ms <
Memory estimate: 1.29 MiB, allocs estimate: 181.
Let’s go concurrent.
I did the most naive thing possible: add a parallel loop with a degree of parallelism equal to nthreads(). Nothing fancy.
If you notice anything flagrantly stupid, please slap me.
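Not the actual PR, just a minimal self-contained sketch of what the naive parallel loop means (hypothetical helper with placeholder per-element work):
using Base.Threads

# Each index is processed independently, so the loop is split across
# nthreads() tasks, each filling its own slots of the preallocated output.
function process_all!(out::Vector{Int}, inputs::Vector{Int})
    @threads for i in eachindex(inputs)
        out[i] = inputs[i] + 1   # stands in for the per-post related-posts work
    end
    return out
end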
On my local machine with --threads 4 (because the testing server is now 4 vCPUs):
21ms → 8ms
They updated the machine:
NB: The benchmark runs on the Digital Ocean Droplet (CPU-Optimized).
- CPU: 4 vCPUs
- RAM: 8GB
- OS: Ubuntu 22.04
@jling, I revived your PR on my machine and added PrecompileTools. I use this workflow to ensure the precompilation of the related function and its methods:
using PrecompileTools

@setup_workload begin
    json_string = read("../posts.json", String)
    posts = JSON3.read(json_string, Vector{PostData})
    @compile_workload begin
        related(posts)
    end
end
However, when first calling related using the precompiled project, I get:
allocations: 3.928 MiB, 47.20% compilation time
If I remove the PrecompileTools workflow, I get:
1.29 M allocations: 83.222 MiB, 986.72% compilation time
So it is clear that PrecompileTools is doing a great amount of compilation there, but not enough to just be able to call the function right away without some additional compilation?
I didn’t care too much about precompilation until now (it always sufficed to just do a preemptive function call).
Any suggestions?
Let’s hope that this benchmark becomes more popular
And the first shot at the parallel implementation seems decent, too.
So that everybody knows: the last jump is due to @Lilith’s improvement. My work on the parallel implementation was just adding a naive parallel loop over the standard implementation.
Exercise for the reader: do a plot of speed vs. lines of code (excluding comments, docstrings, and blank lines) for the various languages.
There is the bucket queue algorithm for priority queues, which is fast for a small number of values/buckets. Here is a Julia implementation (26 lines): https://github.com/korbinian90/ROMEO.jl/blob/master/src/priorityqueue.jl
Not sure if this can improve anything, since Julia is already fastest. I might try today
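For anyone curious, here is a rough sketch of the bucket idea applied to this problem, not the linked ROMEO.jl code: the priorities (shared-tag counts) are small non-negative integers, so indices can be dropped into one bucket per priority and the top five read off from the highest buckets. The per-call allocations hint at the preallocation problem mentioned further down.
function bucket_top5(prios::Vector{Int}, maxprio::Int)
    buckets = [Int[] for _ in 0:maxprio]   # one bucket per possible priority
    for (i, p) in enumerate(prios)
        push!(buckets[p + 1], i)           # store the index, not the value
    end
    top = Int[]
    for p in maxprio:-1:0                  # walk buckets from high to low
        for i in buckets[p + 1]
            push!(top, i)
            length(top) == 5 && return top
        end
    end
    return top
end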
Compilation time is not relevant for this benchmark. So, if you can confirm the improvement on the dataset, feel free to do a PR.
Also, if you want to test it on multiple random datasets, you can generate new datasets of various sizes using the script found in the root directory of the repo:
python3 gen_fake_posts.py 30000
Oh, I tried it, but the bucket queue was 10x slower than fastmaxindex! for retrieving the largest 5 elements.
Two reasons, I think:
- It’s hard to preallocate and reuse the memory with a bucket queue
- It solves a slightly different problem well: iteratively enqueueing and dequeueing over a long run
fastmaxindex! looks super efficient for the given problem