Below is my implementation of the block principal pivoting algorithm for solving a non-negative linear least squares problem. This function is the bottleneck in my code (running on v0.6.2), which calls this in an inner loop with square matrices `AtA`

of size 200-600 and appropriate vectors `Atb`

and `x0`

as inputs.

Is there any way to speed this up?

Based on profiling a Matlab version of this, it appears that essentially all the time is spent solving the linear system: `x[F]=AtA[F,F]\Atb[F]`

, where `F`

is a mask that changes significantly every iteration. I have tried prepending `@views`

and/or `@inbounds`

, without any effect on the timings (measured using `@btime`

). Also, `@code_warntype`

does not indicate any type instability.

Any advice would be much appreciated!

Here is the full function:

```
function nnls_bpp(AtA,Atb,x0)
n = length(Atb)
F = x0 .> 0
x = zeros(n)
x[F] = AtA[F,F]\Atb[F]
y = AtA*x - Atb
xFneg = (x .< 0) .& F
yGneg = (y .< 0) .& .!F
nInfeasible = sum(xFneg .| yGneg)
p = 3
t = n+1
while nInfeasible > 0
if nInfeasible < t
t = nInfeasible
p = 3
F[xFneg] = false
F[yGneg] = true
elseif p >= 1
p = p - 1
F[xFneg] = false
F[yGneg] = true
else
idx = findlast(xFneg .| yGneg)
F[idx] = .!F[idx]
end
x = zeros(n)
x[F] = AtA[F,F]\Atb[F]
y = AtA*x - Atb
xFneg = (x .< 0) .& F
yGneg = (y .< 0) .& .!F
nInfeasible = sum(xFneg .| yGneg)
end
return x
end
```