Hi!
I often use reductions like `sum` or `prod` by specifying dimensions with the `dims` keyword argument. I'm now wondering how to get rid of the resulting singleton dimension. Typically, if `X` is an array of size `(1, n)`, we can do:

- `selectdim(X, 1, 1)`, which produces a view
- `X[1, :]`, which indexes and produces a new array
- `dropdims(X; dims=1)`, which also produces an array
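To make the difference concrete, here is a quick check of what each form returns on a small example (my own sketch):

```julia
X = randn(1, 5)

selectdim(X, 1, 1)   # 5-element view (SubArray) into X, no copy
X[1, :]              # 5-element Vector{Float64}, a freshly allocated copy
dropdims(X; dims=1)  # 5-element Vector{Float64} sharing X's memory (a reshape)
```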
I thought they were equivalent, but timing indicates otherwise:
```julia
using BenchmarkTools: @btime
using LinearAlgebra: dot

X = randn(1, 10000)
Y = randn(1, 10000)

f1(X, Y) = dot(view(X, 1, :), view(Y, 1, :))
f2(X, Y) = dot(X[1, :], Y[1, :])
f3(X, Y) = dot(dropdims(X; dims=1), dropdims(Y; dims=1))

@btime f1($X, $Y)
# 2.471 μs (0 allocations: 0 bytes)
@btime f2($X, $Y)
# 24.621 μs (4 allocations: 156.34 KiB)
@btime f3($X, $Y)
# 2.591 μs (4 allocations: 160 bytes)
```
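For what it's worth, Base also has the `@views` macro, which rewrites the indexing expressions in a block into views, so the NumPy-style syntax keeps the zero-allocation behaviour of `f1` (a quick sketch; `f4` is just my own name for it):

```julia
# Same as f2, but @views turns X[1, :] and Y[1, :] into view(X, 1, :) and view(Y, 1, :),
# so the slices no longer allocate copies.
f4(X, Y) = @views dot(X[1, :], Y[1, :])

@btime f4($X, $Y)  # should match f1's timing and allocations
```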
So I've discovered that indexing allocates, which was quite a surprise to me… I've been using this pattern (which is common in NumPy) extensively in my code since I started using Julia. Looking more closely, I saw that it is mentioned on Stack Overflow and that the behavior of `sum` and `prod` has already sparked debates in Julia's repo. So my questions are the following:
- Should the documentation state more clearly that indexing allocates? Coming from NumPy, such manipulations combined with broadcasting are quite common, so maybe it is worth issuing a warning not to do that in Julia.
- What should we use for the best performance? Views seem best, but is that always the case?
- Are there best practices on which dimension to reduce over? For instance, if I can design my algorithms to reduce over either the first or the last dimension, is one going to be better?
For (1), maybe it's just me.
For (3), I'm thinking for instance of the batch dimension. My understanding is that it is better to put it last, so that each batch is contiguous in memory.
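To check that intuition, here is a small experiment (my own sketch): with Julia's column-major layout, slicing along the last dimension yields a contiguous view, while slicing along the first yields a strided one.

```julia
A = randn(3, 4)           # column-major: elements of each column are adjacent in memory

v_last  = view(A, :, 2)   # one "batch" when the batch dimension is last
v_first = view(A, 2, :)   # one "batch" when the batch dimension is first

strides(v_last)           # (1,) -> contiguous
strides(v_first)          # (3,) -> strided, steps over whole columns
```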
I came across this issue when porting my code to support GPUs. Basically, it results in the following behaviour:
```julia
using CUDA
using NNlib: batched_mul

CUDA.allowscalar(false)

X = cu(randn(1, 2, 12))
S = cu(randn(5, 2, 4))

Rindex = reshape(X[1, :, :, :], 2, 3, 4)
Rdrop = reshape(dropdims(X; dims=1), 2, 3, 4)
Rview = reshape(selectdim(X, 1, 1), 2, 3, 4)

batched_mul(S, Rindex) # Works
batched_mul(S, Rdrop)  # Works
batched_mul(S, Rview)  # Scalar indexing
```
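Inspecting the three results, only the view-based one is no longer a plain `CuArray`, which is (as far as I can tell) why `batched_mul` falls back to a generic method that ends up doing scalar indexing. The exact type parameters below are approximate:

```julia
typeof(Rindex)  # CuArray{Float32, 3, ...}  -> CUDA-aware path
typeof(Rdrop)   # CuArray{Float32, 3, ...}  -> CUDA-aware path
typeof(Rview)   # Base.ReshapedArray{Float32, 3, SubArray{..., CuArray{...}}, ...}  -> generic fallback
```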
Edit: on the previously mentioned issue (I was a bit put off by the volume of messages before starting to read it), it is mentioned that:

> `eachslice` is currently slow because it's type-unstable. However, #32310 fixes this, so `sum(eachslice(...))` should be just as fast as the other two methods starting with Julia 1.9. 2.0 should probably remove `mapslices` and `dims` keyword arguments entirely, since they can be replaced by the more-consistent `map(f, eachslice(...))`.
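For concreteness, here is how I read that `eachslice`-based replacement, with a tiny example (my own sketch of the idiom, not benchmarked):

```julia
A = randn(3, 4)

# dims-style reduction followed by dropping the singleton dimension
r1 = dropdims(sum(A; dims=1); dims=1)

# eachslice-based idiom: reduce each column slice, no singleton dimension to drop
r2 = map(sum, eachslice(A; dims=2))

r1 ≈ r2  # true
```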
Still, I'd be interested in feedback before version 1.9 is out; and questions 1 and 3 still apply.