Performance issue with Vectors vs. 1 Column Matrices

Buried inside a for loop of mine, I have a function which, morally, looks like:

function f!(a,b)
  @. a = b;
  a
end

It gets used inside the for loop like:

# construct vector x of length nx
# construct vector y of length ny > nx
for i in 1:iters
    # code
    rows = sample(1:ny, nx, replace = false); # grab random rows of y
    f!(x, y[rows]);
end

The above code works without issue. However, there are some cases where, instead of x and y being vectors, they are matrices with the same number of columns. The equivalent code is then:

# construct matrix x of size (nx,d)
# construct matrix y of size (ny,d)
for i in 1:iters
    # code
    rows = sample(1:ny, nx, replace = false); # grab random rows of y
    f!(x, y[rows,:]);
end

and notice that I have had to use y[rows,:]. Now, I could also use y[rows,:] in the vector case, but then I take a performance hit:

using Random, StatsBase, BenchmarkTools

Random.seed!(100);
x = zeros(100);
y = randn(1000);
rows = sample(1:1000, 100, replace=false);
@btime f!(x,y[rows]);

  124.491 ns (2 allocations: 928 bytes)

@btime f!(x,y[rows,:]);

  320.127 ns (2 allocations: 944 bytes)

Likewise, if I create x and y as zeros(100,1) and randn(1000,1), I take the same performance hit.

The case where I am only working with vectors is common enough for me that I don’t want to take this performance hit. To avoid it for now, I have two versions of my outer function (the one containing the for loop above): one for the vector case and one for the matrix case, selected via multiple dispatch, roughly as sketched below. But it seems really silly to duplicate the whole function when the only difference is y[rows] versus y[rows,:]. Is there a cleaner way to reconcile this?
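Concretely, the duplication looks something like this (a sketch rather than my actual code; the name outer! and the explicit iters argument are just for illustration):

function outer!(x::AbstractVector, y::AbstractVector, iters)
    for i in 1:iters
        # code
        rows = sample(1:size(y, 1), size(x, 1), replace = false);
        f!(x, y[rows]);
    end
    x
end

function outer!(x::AbstractMatrix, y::AbstractMatrix, iters)
    for i in 1:iters
        # code
        rows = sample(1:size(y, 1), size(x, 1), replace = false);
        f!(x, y[rows, :]);
    end
    x
end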

I think multiple dispatch is your answer here. There might be variants of selectdim which would work for a list of indices but I’m not aware of them.

About performance itself, two little tips:

  • don’t forget to interpolate global variables when you use BenchmarkTools.jl
  • try array views to avoid allocations (not always faster, see below)

const AV = AbstractVector
const AM = AbstractMatrix

partialcopy!(x::AV, y::AV, rows) = x .= y[rows]
partialcopy!(x::AM, y::AM, rows) = x .= y[rows, :]

partialcopy2!(x::AV, y::AV, rows) = x .= @view y[rows]
partialcopy2!(x::AM, y::AM, rows) = x .= @view y[rows, :]

Setup code:

using BenchmarkTools, StatsBase
x, X = zeros(100), zeros(100, 1)
y, Y = ones(1000), ones(1000, 1)
rows = sample(1:1000, 100, replace=false)

Results:

julia> @btime partialcopy!($x, $y, $rows);
  211.100 ns (1 allocation: 896 bytes)

julia> @btime partialcopy2!($x, $y, $rows);
  113.900 ns (0 allocations: 0 bytes)

julia> @btime partialcopy!($X, $Y, $rows);
  248.100 ns (1 allocation: 896 bytes)

julia> @btime partialcopy2!($X, $Y, $rows);
  332.664 ns (0 allocations: 0 bytes)

Interestingly, the array views speed things up for vectors and slow them down for matrices. I think you’d actually be better served by writing out the loop explicitly here.
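Something along these lines (just a sketch, not benchmarked; partialcopy3! is an illustrative name reusing the AV/AM aliases from above) copies the selected rows element by element, so there is no temporary array and no view wrapper at all:

function partialcopy3!(x::AV, y::AV, rows)
    @inbounds for (i, r) in enumerate(rows)
        x[i] = y[r]
    end
    x
end

function partialcopy3!(x::AM, y::AM, rows)
    # column-major order: fix a column, then walk down the selected rows
    @inbounds for j in axes(x, 2), (i, r) in enumerate(rows)
        x[i, j] = y[r, j]
    end
    x
end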

Inspired by this, I came up with the following variation, which also seems to accomplish the task:

function f!(a, b)
    @. a = b;
    a
end

function extract_rows(a::TA, rows) where {TA<:AbstractVector}
    return @view a[rows]
end

function extract_rows(a::TA, rows) where {TA<:AbstractMatrix}
    return @view a[rows,:]
end

using Random, StatsBase, BenchmarkTools

Random.seed!(100);
nx = 100;
ny = 1000;
x = zeros(nx);
y = randn(ny);
X = x[:, :];
Y = y[:, :];
rows = sample(1:ny, nx, replace=false);
@btime f!($(x), extract_rows($(y), $(rows)));
@btime f!($(X), extract_rows($(Y), $(rows)));

  81.176 ns (0 allocations: 0 bytes)
  123.551 ns (0 allocations: 0 bytes)

Thoughts?

The reason you’re getting better times is that you’re removing the time it takes to compute extract_rows from the benchmark (I think). If I understand correctly, BenchmarkTools only benchmarks the outer function, so to speak.

Is your assessment on timing the same for this?

function g!(a,b,rows)
    f!(a,extract_rows(b, rows))
    a
end

@btime g!($(x), $(y), $(rows));
@btime g!($(X), $(Y), $(rows));

  82.361 ns (0 allocations: 0 bytes)
  124.780 ns (0 allocations: 0 bytes)

I am pretty sure this one is the correct way to benchmark, so I’m very unsettled that it gives the exact same results ^^

I’m just happy that, based on your comments, I’ve got a viable solution for my code :)

Is the returned x a typo?

Yes. Edited