Hi,
I have this simple code to add part of an array (contiguous rows / some columns) into another arrays (some rows / some columns). The first array is stored as a long vector so some reshape & other are necessary. In practice, I accumulate stuff at various places, but for the purpose of this I think we don’t care.
I’d like to understand the right way to make this as efficient as possible.
Some C code that’s pretty trivial, for the sake of comparison :
extern "C" {
extern void asc(double* A, double* B,
long long lda, long long ldb,
long long start,
long long rows, long long cols,
long long* ra, long long* ca, long long* cb);
}
void asc(double* A, double* B,
long long lda, long long ldb, // Int64 arrays
long long start, // Int64
long long rows, long long cols, // Int64
long long* ra, long long* ca, long long* cb) // Int64 arrays
{
for(int j = 0; j < cols; j++) {
for(int i = 0; i < rows; i++) {
A[ (ra[i]-1) + (ca[j]-1) * lda ] -=
B[ ((start-1)+i) + (cb[j]-1) * ldb ];
}
}
}
The comparison
function test(n::Integer)
A = zeros(Float64, 5000*5000)
B = ones(Float64, 2500, 2500)
A2 = copy(A)
B2 = copy(B)
ra = Array{Int64,1}(collect(1:2:2500))
ca = Array{Int64,1}(collect(1:2:2500))
cb = Array{Int64,1}(collect(1:2:2500))
pA = pointer(A)
pB = pointer(B)
lda = 2500
ldb = 2500
start = 50
rows = length(ra)
cols = length(ca)
pra = pointer(ra)
pca = pointer(ca)
pcb = pointer(cb)
shift = 1000*1000
len = 2500*2500
@time for i = 1:n
pA2 = pA + sizeof(Float64) * (shift-1)
ccall( (:asc, "libasc.dylib"),
Void,
(Ptr{Float64}, Ptr{Float64}, Clonglong, Clonglong, Clonglong, Clonglong, Clonglong, Ptr{Clonglong}, Ptr{Clonglong}, Ptr{Clonglong}),
pA2, pB, lda, ldb, start, rows, cols, pra, pca, pcb )
end
@time for i = 1:n
A2loc = view(A2, shift:shift+len-1)
A2mat = reshape(A2loc, 2500, 2500)
A2mat[ra,ca] -= B2[start:(start+rows-1),cb]
end
@show vecnorm(A-A2) # Check ok
@show vecnorm(B-B2)
end
test(1)
test(100)
I get
0.674902 seconds
1.701959 seconds (1.20 k allocations: 3.493 GiB, 22.06% gc time)
Basically, the first way (in C) is 3x faster. The other one allocates a lot of stuff as you can see.
Am I doing something wrong ? Any way to do this without allocating stuff ?
Thanks so much
Julia is great btw
PS: try to guess the application