Scalar multiplication makes array reallocation

alequa · October 15, 2020, 10:12am

Hi everyone,

While testing the fastest approach to multiply a matrix to a vector a discovered a nasty fact about reallocation.
Multiplying a matrix for a constant-defined value (namely a scalar) makes Julia reallocate the array, even within the @views scope. If the multiplication does not happen, the memory is never reallocated.

Could you please explain to me why?

The code below shows what I mean

function network_test()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = ones(Float64,n)
    for tt in 1:2000
		state = rand([false,true],n)
		@views out = w[3,:,:]*state
	end
    return  w
end

function network_scalar()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = ones(Float64,n)
    for tt in 1:2000
		#

		state = rand([false,true],n)
		@views out = 0.1*w[3,:,:]*state
	end
    return  w
end

using BenchmarkTools
@btime network_test()
#  2.618 s (12006 allocations: 79.20 MiB)
@btime network_scalar()
#  7.480 s (14006 allocations: 14.98 GiB)

Raf · October 15, 2020, 10:25am

I think what you want is to create the view, then broadcast setindex!

@views out = w[3,:,:]
out .*= 0.1 .* state

That shouldn’t allocate. But you are intending to modify the original array?

Impressium · October 15, 2020, 10:26am

Maybe use . to broadcast

julia> w = rand(1_000_000);

julia> @time w = w + w;
  0.004330 seconds (2 allocations: 7.629 MiB)

julia> @time w = w + w;
  0.004495 seconds (2 allocations: 7.629 MiB)

julia> @time w .= w .+ w;
  0.000981 seconds (2 allocations: 48 bytes)

julia> @time w .= w .+ w;
  0.001092 seconds (2 allocations: 48 bytes)

Skoffer · October 15, 2020, 10:28am

It would be interesting to understand, what is going on behind the scene, but one way to solve it is to use parenthesis:

@views out .= 0.1 .* (w[3,:,:] * state)

On a side note, you can also reduce number of allocations by changing state generation to

state = rand((false,true),n)

Impressium · October 15, 2020, 10:32am

Please tell us about the speed up after fixing

mcabbott · October 15, 2020, 10:46am

Maybe the first thing to know is that * proceeds left to right:

julia> @which 0.1*w[3,:,:]*state  # @less will show the code
*(a, b, c, xs...) in Base at operators.jl:538

and so this takes the view, multiplies it by a scalar to make a matrix, and then multiplies that by the vector. It would be more efficient from this perspective to write 0.1 * (w[3,:,:]*state) which is the same as 0.1 .* w[3,:,:]*state, or better lmul!(0.1, w[3,:,:]*state). (Soon, perhaps, https://github.com/JuliaLang/julia/pull/37898 may automate such things.) If out was pre-allocated, you could also write mul!(out, w[3,:,:], state, 0.1, 0).

julia> using LinearAlgebra

julia> out = rand(1000); w = rand(3,1000,1000) .- 0.5; state = rand(Bool, 1000);

julia> @btime (0.1 * (@view $w[3,:,:])) * $state; # allocates a new matrix, then a vector
  1.853 ms (4 allocations: 7.64 MiB)

julia> @btime 0.1 * ((@view $w[3,:,:]) * $state); # allocates two vectors
  1.193 ms (5 allocations: 23.89 KiB)

julia> @btime mul!($out, $(@view w[3,:,:]), $state, 0.1, 0.0); # no allocation
  1.189 ms (0 allocations: 0 bytes)

But the other thing to know is that you aren’t hitting the fast * here at all, because the element types don’t match, and because the view of w isn’t a happy one. You might do better to re-arrange the dimensions of w, and to take state = rand((1.0, 0.0),n) instead.

julia> @btime mul!($out, $(@view w[3,:,:]), $(state .+ 0.0)); # both Float64
  1.186 ms (2 allocations: 80 bytes)

julia> @btime mul!($out, $(w[3,:,:]), $(state .+ 0.0)); # BLAS
  151.750 μs (0 allocations: 0 bytes)

julia> w2 = permutedims(w,(2,3,1)); size(w2)
(1000, 1000, 3)

julia> @btime mul!($out, $(@view w2[:,:,3]), $(state .+ 0.0)); # safe view
  151.850 μs (0 allocations: 0 bytes)

julia> strides(@view w[3,:,:]) # problem
(3, 3000)

julia> strides(@view w2[:,:,3]) # no problem
(1, 1000)

DNF · October 15, 2020, 10:53am

@views only affect indexing expressions. So the above is equivalent to

out = 0.1 * view(w, 3, :, :) * state

so you don’t allocate anything in the slicing operation. But both view(w,3,:,:) * state will allocate a completely new array, and multiplication with 0.1 will also allocate a new array. There is also no in-place copying into out going on here.

You have to make your operations in-place, the view only helps with the slicing.

A couple of other remarks:

These pre-allocations have no effect:

state = falses(n)
out = ones(Float64,n)

because you over-write them later (not in-place, just new assignment), so they are wasted. In other words, you re-use the labels state and out, but you don’t re-use the arrays that those labels were applied to.

Also, false(n) is very different from rand([false, true], n). The former creates a BitVector, while the latter makes a vector of Bools.

For a vector of random true/false use either state = rand(Bool, n) or state = Random.bitrand(n). Then, in order to update this vector in-place, write rand!(state).

alequa · October 15, 2020, 11:03am

Hi DNF,
Thanks for the reply.

How to make the operations in place?
I thought that pre-defining the variable out of the loop-cycle was enough!
Also ´cause this loop is much more inefficient (3x) although the allocations are less

function network_scalar()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = ones(Float64,n)
    for tt in 1:2000
		#

		for cc in r
			for dd in r
				@views out[cc] =w[3,cc,dd]*state[dd]
			end
		end
	end
    return  w
end

## with matrix multiplication
  2.357 s (12006 allocations: 79.20 MiB)

## with indexing
  12.710 s (7 allocations: 45.78 MiB)

DNF · October 15, 2020, 11:14am

I was writing an answer, but realized I don’t understand what you are trying to do. Why do you return w from your function? Nothing has happened to that.

DNF · October 15, 2020, 11:23am

The answer from @mcabbott shows how you can do this using mul!. Here’s an example without the benchmarking code:

function network_scalar()
	n = 1000
	w = rand(n, n, 3) .- 0.5  # this is better than rand(3, n, n)
    state = rand((0.0, 1.0), n)
	out = ones(Float64, n)
    for tt in 1:2000
        state .= (rand.() .< 0.5)
        wv = @view w[:, :, 3]
        mul!(out, wv, state, 0.1, 0.0)
	end
    return out
end

I changed w from rand(3, n, n) to rand(n, n, 3), that makes a huge difference in performance.

alequa · October 15, 2020, 11:23am

Yes, that’s right.
Itis because this is the minimal working example of 600 lines code… Which also have other trade-offs (like have a boolean state array)

So a meaningful example would be

function network_test()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = zeros(Float64,n)
    for tt in 1:2000
		for cc in n
			state[cc] = rand((0.,1.))
		end
		@views out = w[3,:,:]*state
	end
    return  out,w
end

alequa · October 15, 2020, 11:26am

Ok, that’s fine.

Can you explain to me why the (n,n,3) is better than (n,n,3) ?
I thought that being Julia column-major you always want to set the accessed dimensions at the outmost of the matrix.

Thanks

DNF · October 15, 2020, 11:28am

It’s because Julia arrays are column-major that (n,n,3) is better than (3,n,n). You want to slice along the innermost dimensions of the array, not the outermost.

It’s the same reason that it is faster to loop like this

for j in 1:n
    for i in 1:m
        x[i, j] = foo()
    end
end

than like this

for i in 1:m
    for j in 1:n
        x[i, j] = foo()
    end
end

alequa · October 15, 2020, 11:37am

Ok, I misunderstood it
Thanks

DNF · October 15, 2020, 11:39am

Here you can see the difference between row- and column-major, and why it is better to move down along the columns for column major:

Topic		Replies	Views
Passing views to function without allocation New to Julia	14	1534	December 18, 2020
Understanding major order performance when broadcasting in column vs row operations Performance question , array , benchmark	9	1005	June 21, 2021
Allocation when applying in place multiplication to matrix of SVector Performance question	2	192	March 15, 2023
Why does `mul!(u, A, v)` allocate when `A` is sparse and `u, v` are views? Performance linearalgebra , sparse	9	601	November 10, 2023
Common allocation mistakes Performance memory-allocation	47	7174	August 21, 2023

Scalar multiplication makes array reallocation

Related topics