Multiply each slice of an array by a different number

I want to multiply each slice of an array by a different number. I want to do it for arbitrary array and slice dimensions, in place and efficiently if possible. How can I do this? Here is an example:

Multiply each row of a matrix by a different number:

function multiply_slices!(A, v; dims)
    # only one special case implemented
    @assert dims == 2
    @assert ndims(A) == 2
    @assert ndims(v) == 1
    
    @assert axes(A, 1) == axes(v,1)
    @inbounds @simd for j in axes(A, 2)
        for i in axes(A, 1)
            A[i, j] *= v[i]
        end
    end
    A
end

nrows = 2
ncols = 3
A = collect(reshape(1:(nrows*ncols), (nrows, ncols)))
v = 1:nrows
multiply_slices!(A, v; dims=2)
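# A is now [1 3 5; 4 8 12], i.e. row i has been scaled by v[i].

For this particular 2D case, ordinary in-place broadcasting would do the same thing (a length-nrows vector broadcasts down the columns), but I need it to work for arbitrary dims:

B = collect(reshape(1:(nrows*ncols), (nrows, ncols)))
B .*= v    # same result as multiply_slices!(B, v; dims=2)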

Does this work?

using LazyArrays: @~

function multiply_slices!(A, v; dims)
    select1 = ntuple(i -> i in dims ? axes(A, i) : Ref(:), ndims(A))
    select2 = ntuple(i -> i in dims ? axes(v, i) : Ref(:), ndims(v))
    RefA = Ref(A)
    Refv = Ref(v)
    slices = @~ tuple.(view.(RefA, select1...), view.(Refv, select2...))
    for (x, y) in slices
        x .*= y
    end
    return A
end
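
If I am reading the broadcasting right, for the 2D example above select1 becomes (Ref(:), axes(A, 2)) and select2 becomes (Ref(:),), so the lazy @~ broadcast pairs each column view view(A, :, j) with (a view of) the whole v, and the loop scales one column at a time without materialising an array of slices. A quick check on your example:

A = collect(reshape(1:(nrows*ncols), (nrows, ncols)))
multiply_slices!(A, v; dims=2)
A == [1 3 5; 4 8 12]    # true: row i scaled by v[i]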

Thanks! This does look good. It is not as fast as the hand-coded variant, but fast enough for my use case. It can be made a little faster by adding a function barrier.

using LazyArrays: @~

function multiply_slices_hand!(A, v; dims)
    # only one special case implemented
    @assert dims == 2
    @assert ndims(A) == 2
    @assert ndims(v) == 1
    
    @assert axes(A, 1) == axes(v,1)
    @inbounds @simd for j in axes(A, 2)
        for i in axes(A, 1)
            A[i, j] *= v[i]
        end
    end
    A
end


function multiply_slices_lazy!(A, v; dims)
    select1 = ntuple(i -> i in dims ? axes(A, i) : Ref(:), ndims(A))
    select2 = ntuple(i -> i in dims ? axes(v, i) : Ref(:), ndims(v))
    RefA = Ref(A)
    Refv = Ref(v)
    # @show select1
    # @show select2
    slices = @~ tuple.(view.(RefA, select1...), view.(Refv, select2...))
    for (x, y) in slices
        x .*= y
    end
    return A
end

function multiply_slices_barrier!(A, v; dims)
    select1 = ntuple(i -> i in dims ? axes(A, i) : Ref(:), ndims(A))
    select2 = ntuple(i -> i in dims ? axes(v, i) : Ref(:), ndims(v))
    RefA = Ref(A)
    Refv = Ref(v)
    # @show select1
    # @show select2
    slices = @~ tuple.(view.(RefA, select1...), view.(Refv, select2...))
    kernel!(slices)
    return A
end
@noinline function kernel!(slices)
    for (x, y) in slices
        x .*= y
    end
end

using BenchmarkTools
nrows = 100
ncols = 2000
A = randn(nrows, ncols)
v = randn(nrows)
@btime multiply_slices_hand!(A, v, dims=2)
@btime multiply_slices_lazy!(A, v, dims=2)
@btime multiply_slices_barrier!(A, v, dims=2)
  1.423 ms (1 allocation: 16 bytes)
  3.376 ms (18013 allocations: 594.00 KiB)
  2.095 ms (4013 allocations: 187.75 KiB)

That’s a good idea!

But, actually, I think the slowness might be coming from overflow/underflow. I think it’s better to use setup here.

@btime multiply_slices_hand!(A, v, dims=2) setup=(A = randn(nrows, ncols))
@btime multiply_slices_lazy!(A, v, dims=2) setup=(A = randn(nrows, ncols))
@btime multiply_slices_barrier!(A, v, dims=2) setup=(A = randn(nrows, ncols))

gives me

  65.471 μs (1 allocation: 16 bytes)
  1.423 ms (18013 allocations: 594.00 KiB)
  155.574 μs (4013 allocations: 187.75 KiB)

…so still 2+ times slower than the hand-coded version… (But we can observe that the function barrier does a great job here.)


Thanks! I can see that my timings above were far too long and that your timings are much more reasonable. But I don’t understand where the slowness from underflow/overflow comes from. Can you explain?

My guess was that the slowness was due to denormals (Wikipedia), since you keep multiplying the entries by numbers, many of which are smaller than one.

using BenchmarkTools
n = 2^15
@btime x .*= 0.5 setup = (x = ones(n))
@btime x .*= 0.5 setup = (x = nextfloat.(zeros(n)))

prints

  3.851 μs (0 allocations: 0 bytes)
  42.459 μs (0 allocations: 0 bytes)
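
If subnormals really are the culprit, telling the CPU to flush them to zero should remove most of the gap (at the cost of strict IEEE semantics); something like:

set_zero_subnormals(true)
@btime x .*= 0.5 setup = (x = nextfloat.(zeros(n)))   # subnormal inputs, now flushed to zero
set_zero_subnormals(false)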

Makes perfect sense, and frighteningly calls into question a lot of benchmarks I have done.

Yeah, benchmarking is hard…

BTW, in Julia 1.4.0-DEV.634, the broadcast-based version is much closer to the hand-coded version:

  63.656 μs (1 allocation: 16 bytes)       # multiply_slices_hand!
  82.218 μs (2013 allocations: 94.00 KiB)  # multiply_slices_barrier!
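
For a single slicing dimension, Base.eachslice (available since Julia 1.1) would also let you write the broadcast-based version without LazyArrays; a rough sketch, not benchmarked:

function multiply_slices_eachslice!(A, v; dims)
    # eachslice yields views, so the in-place broadcast mutates A
    for x in eachslice(A; dims = dims)
        x .*= v
    end
    return A
end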