Multiply each slice of an array by a different number

I want to multiply each slice of an array by a different number. I want to do it for arbitrary array and slice dimensions, in place and efficiently if possible. How can I do this? Here is an example:

Multiply each row of a matrix by a different number:

function multiply_slices!(A, v; dims)
    # only one special case implemented
    @assert dims == 2
    @assert ndims(A) == 2
    @assert ndims(v) == 1
    
    @assert axes(A, 1) == axes(v,1)
    @inbounds @simd for j in axes(A, 2)
        for i in axes(A, 1)
            A[i, j] *= v[i]
        end
    end
    A
end

nrows = 2
ncols = 3
A = collect(reshape(1:(nrows*ncols), (nrows, ncols)))
v = 1:nrows
multiply_slices!(A, v; dims=2)
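# A is now [1 3 5; 4 8 12], i.e. row i has been scaled by v[i].

For this particular 2D case, ordinary in-place broadcasting would do the same thing (a length-nrows vector broadcasts down the columns), but I need it to work for arbitrary dims:

B = collect(reshape(1:(nrows*ncols), (nrows, ncols)))
B .*= v    # same result as multiply_slices!(B, v; dims=2)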

Does this work?

using LazyArrays: @~

function multiply_slices!(A, v; dims)
    select1 = ntuple(i -> i in dims ? axes(A, i) : Ref(:), ndims(A))
    select2 = ntuple(i -> i in dims ? axes(v, i) : Ref(:), ndims(v))
    RefA = Ref(A)
    Refv = Ref(v)
    slices = @~ tuple.(view.(RefA, select1...), view.(Refv, select2...))
    for (x, y) in slices
        x .*= y
    end
    return A
end
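
If I am reading the broadcasting right, for the 2D example above select1 becomes (Ref(:), axes(A, 2)) and select2 becomes (Ref(:),), so the lazy @~ broadcast pairs each column view view(A, :, j) with (a view of) the whole v, and the loop scales one column at a time without materialising an array of slices. A quick check on your example:

A = collect(reshape(1:(nrows*ncols), (nrows, ncols)))
multiply_slices!(A, v; dims=2)
A == [1 3 5; 4 8 12]    # true: row i scaled by v[i]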

Thanks! This does look good. It is not as fast as the hand-coded variant, but fast enough for my use case. It can be made a little faster by adding a function barrier.

using LazyArrays: @~

function multiply_slices_hand!(A, v; dims)
    # only one special case implemented
    @assert dims == 2
    @assert ndims(A) == 2
    @assert ndims(v) == 1
    
    @assert axes(A, 1) == axes(v,1)
    @inbounds @simd for j in axes(A, 2)
        for i in axes(A, 1)
            A[i, j] *= v[i]
        end
    end
    A
end


function multiply_slices_lazy!(A, v; dims)
    select1 = ntuple(i -> i in dims ? axes(A, i) : Ref(:), ndims(A))
    select2 = ntuple(i -> i in dims ? axes(v, i) : Ref(:), ndims(v))
    RefA = Ref(A)
    Refv = Ref(v)
    # @show select1
    # @show select2
    slices = @~ tuple.(view.(RefA, select1...), view.(Refv, select2...))
    for (x, y) in slices
        x .*= y
    end
    return A
end

function multiply_slices_barrier!(A, v; dims)
    select1 = ntuple(i -> i in dims ? axes(A, i) : Ref(:), ndims(A))
    select2 = ntuple(i -> i in dims ? axes(v, i) : Ref(:), ndims(v))
    RefA = Ref(A)
    Refv = Ref(v)
    # @show select1
    # @show select2
    slices = @~ tuple.(view.(RefA, select1...), view.(Refv, select2...))
    kernel!(slices)
    return A
end
@noinline function kernel!(slices)
    for (x, y) in slices
        x .*= y
    end
end

using BenchmarkTools
nrows = 100
ncols = 2000
A = randn(nrows, ncols)
v = randn(nrows)
@btime multiply_slices_hand!(A, v, dims=2)
@btime multiply_slices_lazy!(A, v, dims=2)
@btime multiply_slices_barrier!(A, v, dims=2)
  1.423 ms (1 allocation: 16 bytes)
  3.376 ms (18013 allocations: 594.00 KiB)
  2.095 ms (4013 allocations: 187.75 KiB)

That’s a good idea!

But, actually, I think the slowness might be coming from overflow/underflow. I think it’s better to use setup here.

@btime multiply_slices_hand!(A, v, dims=2) setup=(A = randn(nrows, ncols))
@btime multiply_slices_lazy!(A, v, dims=2) setup=(A = randn(nrows, ncols))
@btime multiply_slices_barrier!(A, v, dims=2) setup=(A = randn(nrows, ncols))

gives me

  65.471 μs (1 allocation: 16 bytes)
  1.423 ms (18013 allocations: 594.00 KiB)
  155.574 μs (4013 allocations: 187.75 KiB)

…so still 2+ times slower than the hand-coded version… (But we can observe that the function barrier does a great job here.)


Thanks! I can see that my timings above were far too long and that your timings are much more reasonable. But I don’t understand where the slowness from underflow/overflow comes from. Can you explain?

My guess was that the slowness was due to denormals (Wikipedia), since you keep multiplying the entries by numbers, many of which are smaller than one.

using BenchmarkTools
n = 2^15
@btime x .*= 0.5 setup = (x = ones(n))
@btime x .*= 0.5 setup = (x = nextfloat.(zeros(n)))

prints

  3.851 μs (0 allocations: 0 bytes)
  42.459 μs (0 allocations: 0 bytes)
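
If subnormals really are the culprit, telling the CPU to flush them to zero should remove most of the gap (at the cost of strict IEEE semantics); something like:

set_zero_subnormals(true)
@btime x .*= 0.5 setup = (x = nextfloat.(zeros(n)))   # subnormal inputs, now flushed to zero
set_zero_subnormals(false)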

Makes perfect sense, and frighteningly calls into question a lot of benchmarks I have done.

Yeah, benchmarking is hard…

BTW, in Julia 1.4.0-DEV.634, the broadcast-based version is much closer to the hand-coded version:

  63.656 μs (1 allocation: 16 bytes)       # multiply_slices_hand!
  82.218 μs (2013 allocations: 94.00 KiB)  # multiply_slices_barrier!
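
For a single slicing dimension, Base.eachslice (available since Julia 1.1) would also let you write the broadcast-based version without LazyArrays; a rough sketch, not benchmarked:

function multiply_slices_eachslice!(A, v; dims)
    # eachslice yields views, so the in-place broadcast mutates A
    for x in eachslice(A; dims = dims)
        x .*= v
    end
    return A
end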