Well this is a whole different thing.
inds = shuffle(1:1_000_000)[1:800000];
that’s not contiguous, so
copy!(xtmp, view(x, inds))
copy!(Atmp, view(A, :, inds))
those are views, but they do not address the memory contiguously. Then you do
Atmp * xtmp
that multiplication is the operation which really matters here, but getting to it is really slow: copying through those views makes the reads jump wildly about in memory, so the copy can’t use SIMD or BLAS or anything to make it fast.
On the other hand,
A[:,inds]*x[inds]
that slices to allocate two new arrays, but when it puts the values in there they are laid out contiguously in memory, so * in this case will be really fast and will be a BLAS call.
Moral of the story: blindly avoiding allocations will not make code faster.
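To make the view-versus-slice distinction concrete, here is a small sketch. It assumes that being a `StridedArray` (fixed stride between elements in memory) is a reasonable proxy for "BLAS can work on this directly"; the variable names are made up for illustration and the arrays are tiny on purpose.

```julia
x = randn(10)
A = randn(3, 10)
inds = [7, 2, 9]                   # arbitrary, unsorted indices (illustrative)

range_view     = view(x, 2:6)      # view indexed by a range
scattered_view = view(A, :, inds)  # view indexed by a vector of integers
slice          = A[:, inds]        # slicing copies into a new Matrix

println(range_view isa StridedArray)      # true: fixed stride in memory
println(scattered_view isa StridedArray)  # false: no fixed stride
println(slice isa StridedArray)           # true: an ordinary dense Matrix
println(slice == scattered_view)          # true: same values, different layout
```

The slice and the scattered view hold the same values; only the slice stores them next to each other in memory.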
Edit
Just to drive this point home, notice how much better it does when the view is more sane. With your code I get:
using Random  # shuffle lives in the Random standard library on Julia 0.7 and later
x = randn(1_000_000);
inds = shuffle(1:1_000_000)[1:800000];
A = randn(50, 1_000_000);
xtmp = zeros(800_000);
Atmp = zeros(50, 800_000);
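function f(xtmp,x,Atmp,inds)
    # gather into the preallocated buffers, then multiply (A is read as a global)
    copy!(xtmp, view(x, inds))
    copy!(Atmp, view(A, :, inds))
    sum(Atmp * xtmp)
end
function g(xtmp,x,Atmp,inds)
    # slice into freshly allocated arrays, then multiply
    sum(A[:,inds]*x[inds])
end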
@time f(xtmp,x,Atmp,inds)
@time f(xtmp,x,Atmp,inds)
@time g(xtmp,x,Atmp,inds)
@time g(xtmp,x,Atmp,inds)
0.273511 seconds (36 allocations: 1.313 KiB)
0.272545 seconds (36 allocations: 1.313 KiB)
0.253679 seconds (13 allocations: 311.280 MiB, 9.58% gc time)
0.266512 seconds (13 allocations: 311.280 MiB, 9.40% gc time)
but if I sort the indices, then the view is better behaved and it’s much faster:
x = randn(1_000_000);
inds = sort(shuffle(1:1_000_000)[1:800000]);
A = randn(50, 1_000_000);
xtmp = zeros(800_000);
Atmp = zeros(50, 800_000);
function f(xtmp,x,Atmp,inds)
    copy!(xtmp, view(x, inds))
    copy!(Atmp, view(A, :, inds))
    sum(Atmp * xtmp)
end
function g(xtmp,x,Atmp,inds)
    sum(A[:,inds]*x[inds])
end
@time f(xtmp,x,Atmp,inds)
@time f(xtmp,x,Atmp,inds)
@time g(xtmp,x,Atmp,inds)
@time g(xtmp,x,Atmp,inds)
0.136312 seconds (36 allocations: 1.313 KiB)
0.145749 seconds (36 allocations: 1.313 KiB)
0.207015 seconds (13 allocations: 311.280 MiB, 11.87% gc time)
0.202954 seconds (13 allocations: 311.280 MiB, 12.78% gc time)
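If you want to see where the time goes, one option is to time the gather step and the multiplication separately. This is only a sketch: `gather` and `breakdown` are hypothetical helper names, `A` is passed as an argument rather than read as a global, and the problem is shrunk to 100_000 columns so it runs quickly. No timings are quoted here because they depend on the machine.

```julia
using Random

# Slicing step on its own: builds contiguous copies of the selected data.
gather(A, x, inds) = (A[:, inds], x[inds])

# Time the gather and the multiply separately for one index vector.
function breakdown(A, x, inds)
    gather(A, x, inds)                      # warm-up so compilation is not timed
    t_gather = @elapsed gather(A, x, inds)
    Asub, xsub = gather(A, x, inds)
    Asub * xsub                             # warm-up for the multiply
    t_mul = @elapsed Asub * xsub
    return (t_gather = t_gather, t_mul = t_mul, total = sum(Asub * xsub))
end

n = 100_000                                 # smaller than the 1_000_000 used above
A = randn(50, n); x = randn(n)
shuffled = shuffle(1:n)[1:80_000]
sorted_inds = sort(shuffled)

for (label, inds) in (("shuffled", shuffled), ("sorted", sorted_inds))
    r = breakdown(A, x, inds)
    println(label, ": gather ", r.t_gather, " s, multiply ", r.t_mul, " s")
end
```

Both index orders pick the same columns, so the multiplication does the same amount of work either way. Any difference between the two runs should therefore show up mostly in the gather time.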