Best practices for removing a singleton dimension

Hi!
I often use reductions like sum or prod, specifying dimensions with the dims keyword argument, and I’m now wondering how best to get rid of the resulting singleton dimension. Typically, if X is an array of size (1, n), we can do

  • selectdim(X, 1, 1), which produces a view,
  • X[1, :], which indexes and produces an array
  • dropdims(X; dims=1), which also produces an array

I thought they were equivalent, but timing indicates otherwise:

using BenchmarkTools: @btime
using LinearAlgebra: dot
X = randn(1, 10000)
Y = randn(1, 10000)
f1(X, Y) = dot(view(X, 1, :), view(Y, 1, :))
f2(X, Y) = dot(X[1, :], Y[1, :])
f3(X, Y) = dot(dropdims(X, dims=1), dropdims(Y, dims=1))
@btime f1($X, $Y)
#   2.471 μs (0 allocations: 0 bytes)
@btime f2($X, $Y)
#   24.621 μs (4 allocations: 156.34 KiB)
@btime f3($X, $Y)
#   2.591 μs (4 allocations: 160 bytes)

So I’ve discovered that indexing allocates, which is quite a surprise to me… I’ve been using this pattern (which is common with NumPy) extensively in my code since I started using Julia. Looking more closely, I saw that it is mentioned on Stack Overflow and that the behavior of sum and prod has already sparked debate in Julia’s repo. So my questions are the following:

  1. Should the documentation state more clearly that indexing allocates? Coming from NumPy, such manipulations with broadcasting are quite common, so maybe it is worth a warning not to do that in Julia.
  2. What should we use for the best performance? Views seem best, but is that always the case?
  3. Are there best practices on which dimension to reduce over? For instance, if I can design my algorithms to reduce over either the first or the last dimension, is one going to be faster?

For (1), maybe it’s just me :slight_smile:
For (3), I’m thinking for instance of the batch dimension. I gathered that it is better to put it last so that each batch is contiguous in memory.

I came across this issue when porting my code to support GPUs. Basically, it results in the following behaviour:

using CUDA
using NNlib: batched_mul

CUDA.allowscalar(false)
X = cu(randn(1, 2, 12))                        # CuArray with a leading singleton dimension
S = cu(randn(5, 2, 4))
Rindex = reshape(X[1, :, :, :], 2, 3, 4)       # indexing copies into a new CuArray, then reshapes
Rdrop = reshape(dropdims(X; dims=1), 2, 3, 4)  # dropdims returns a CuArray sharing the data
Rview = reshape(selectdim(X, 1, 1), 2, 3, 4)   # selectdim returns a view (SubArray)
batched_mul(S, Rindex)  # Works
batched_mul(S, Rdrop)  # Works
batched_mul(S, Rview)   # Scalar indexing

Edit: in the previously mentioned issue (I was a bit put off by the volume of messages before starting to read it), it is mentioned that

eachslice is currently slow because it’s type-unstable. However, #32310 fixes this, so sum(eachslice(...)) should be just as fast as the other two methods starting with Julia 1.9.
2.0 should probably remove mapslices and dims keyword arguments entirely, since they can be replaced by the more-consistent map(f, eachslice(...)).
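
If I understand correctly, the eachslice-based replacement would look something like this (my own sketch, not code from the issue):

A = randn(3, 4)
sum(A; dims=2)                    # 3×1 Matrix: keeps the singleton dimension
map(sum, eachslice(A; dims=1))    # 3-element Vector: one sum per row slice
vec(sum(A; dims=2))               # same values, without the singleton dimension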

Still, I’d be interested in feedback before version 1.9 is out; and questions 1 & 3 still apply :slight_smile:


In this case, I would tend to just suggest vec(X), which is essentially free for an Array (it shares the same underlying data rather than making a copy).
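
For example, a quick check that vec shares memory rather than copying:

X = randn(1, 5)
v = vec(X)        # 5-element Vector, no copy of the data
v[1] = 0.0
X[1, 1] == 0.0    # true: v and X share the same underlying storage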

You can also do @views X[1, :], which returns an explicit view (a SubArray).

As always, if you are concerned about performance, you should read the Performance Tips chapter of the manual, which has a section Consider using views for slices.

Just put @views in front of your code (e.g. on a line of code, a loop, or a whole function), and then do as much slicing as you want.
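
For instance (rowdot is just a name made up for illustration):

using LinearAlgebra: dot

@views function rowdot(X, Y)
    return dot(X[1, :], Y[1, :])   # these slices are now SubArrays, not copies
end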

This is also covered in the performance tips: Access arrays in memory order, along columns.
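
As a rough illustration (sum_cols_inner and sum_rows_inner are made-up names, and exact timings depend on your machine), the same loop is typically much faster when the inner loop runs down a column, because Julia arrays are stored column-major:

using BenchmarkTools: @btime

function sum_cols_inner(A)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)   # inner loop over rows: contiguous memory access
        s += A[i, j]
    end
    return s
end

function sum_rows_inner(A)
    s = zero(eltype(A))
    for i in axes(A, 1), j in axes(A, 2)   # inner loop over columns: strided, cache-unfriendly
        s += A[i, j]
    end
    return s
end

A = randn(2000, 2000)
@btime sum_cols_inner($A)   # typically several times faster
@btime sum_rows_inner($A)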


Indeed, I read this section a long time ago but did not fully grasp all the information! Thanks for the pointer.
So I’ll use vec for arrays of size (1, n), and dropdims until Julia 1.9; for a reason which escapes me, it seems to work better with GPUs than the view, as I mentioned in my example. If someone has an explanation for that, btw, I’d be glad to hear it!

One thing worth mentioning is that the singleton dimension can often be avoided in the first place. People coming from Python and MATLAB often run into extra singleton dimensions because they create matrices where they wanted vectors (e.g. rand(n, 1) instead of rand(n)).
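
For instance:

v = rand(3)       # 3-element Vector: usually what you want
M = rand(3, 1)    # 3×1 Matrix: carries a singleton dimension you would have to drop later
size(v), size(M)  # ((3,), (3, 1))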
