Strided addition on 2 arrays in place

Hi,

I have this simple code to add part of an array (contiguous rows / some columns) into another arrays (some rows / some columns). The first array is stored as a long vector so some reshape & other are necessary. In practice, I accumulate stuff at various places, but for the purpose of this I think we don’t care.
I’d like to understand the right way to make this as efficient as possible.

Some C code that’s pretty trivial, for the sake of comparison :

extern "C" {
    extern void asc(double* A, double* B,
         long long lda, long long ldb,
         long long start,
         long long rows, long long cols,
         long long* ra, long long* ca, long long* cb);
}

void asc(double* A, double* B,
         long long lda, long long ldb, // Int64 arrays
         long long start, // Int64
         long long rows, long long cols, // Int64
         long long* ra, long long* ca, long long* cb) // Int64 arrays
{
    for(int j = 0; j < cols; j++) {
        for(int i = 0; i < rows; i++) {
            A[ (ra[i]-1) + (ca[j]-1) * lda ] -=
                B[ ((start-1)+i) + (cb[j]-1) * ldb ];
        }
    }
}

The comparison

function test(n::Integer)
    A = zeros(Float64, 5000*5000)
    B = ones(Float64, 2500, 2500)
    A2 = copy(A)
    B2 = copy(B)
    ra = Array{Int64,1}(collect(1:2:2500))
    ca = Array{Int64,1}(collect(1:2:2500))
    cb = Array{Int64,1}(collect(1:2:2500))
    pA = pointer(A)
    pB = pointer(B)
    lda = 2500
    ldb = 2500
    start = 50
    rows = length(ra)
    cols = length(ca)
    pra = pointer(ra)
    pca = pointer(ca)
    pcb = pointer(cb)

    shift = 1000*1000
    len   = 2500*2500
    @time for i = 1:n
	pA2 = pA + sizeof(Float64) * (shift-1)
	ccall( (:asc, "libasc.dylib"),
            Void,
            (Ptr{Float64}, Ptr{Float64}, Clonglong, Clonglong, Clonglong, Clonglong, Clonglong, Ptr{Clonglong}, Ptr{Clonglong}, Ptr{Clonglong}),
            pA2,           pB,           lda,       ldb,       start,     rows,      cols,      pra,            pca,            pcb             )
    end
    @time for i = 1:n
        A2loc = view(A2, shift:shift+len-1)
        A2mat = reshape(A2loc, 2500, 2500)
	A2mat[ra,ca] -= B2[start:(start+rows-1),cb]
    end
    @show vecnorm(A-A2) # Check ok
    @show vecnorm(B-B2)
end

test(1)
test(100)

I get
0.674902 seconds
1.701959 seconds (1.20 k allocations: 3.493 GiB, 22.06% gc time)

Basically, the first way (in C) is 3x faster. The other one allocates a lot of stuff as you can see.
Am I doing something wrong ? Any way to do this without allocating stuff ?

Thanks so much

Julia is great btw :heart: :smiley:

PS: try to guess the application :stuck_out_tongue:

You can try change to

@views A2mat[ra,ca] .-= B2[start:(start+rows-1), cb]

makes it take

  0.718644 seconds (900 allocations: 31.250 KiB)

instead of

  1.651046 seconds (1.20 k allocations: 3.493 GiB, 19.26% gc time)
1 Like

Perfect, it works very well. Thanks.