Matrix permanent

Any idea on how to improve the following code for computing the permanent of a matrix?
I used the Glynn formula where delta is obtained via Gray code.

# from  https://stackoverflow.com/questions/34236332/c-code-for-generating-next-bit-to-flip-in-a-gray-code

function grayBitToFlip(n::Int)
	n1 = (n-1) ⊻ ((n-1)>>1)
	n2 = n ⊻ (n>>1)
	d = n1 ⊻ n2

	j = 0
	while d>0
		d >>= 1
		j += 1
	end
	j
end


function perm(A::AbstractMatrix{T}) where T
	n,m = size(A)
	if (n == m)
		D = Int8(1)
		δ = [Int8(2) for i ∈ 1:n]
		v = sum(A,2)
		p = prod(v)

		@inbounds for i ∈ 1:(2^(n-1)-1)
			a = grayBitToFlip(i)+1
			@views v .-= δ[a].*A[:,a]
			δ[a] = -δ[a]
			D = -D

			p += D*prod(v)
		end

		p * 2.0^(1-n)
	else
		throw(ArgumentError("perm: argument must be a square matrix"))
	end
end

A few details of the code. First of all, technically I compute perm(transpose(A)), which is equal to perm(A), but expressions are faster because of column-major ordering. D is the product of deltas. Because of the Gray code, at each step D changes its sign. The vector v is the sum of delta*A. The above formula comes from calculating how v changes after a step i, and including the resulting overall factor 2 into the definition of delta.

I got a nice factor three on 0.62 by removing the allocations and hoisting multiplications and branches out of the inner loops. I think quite a bit of that speed up will come for free on 0.7 without having to make the code ugly.

function perm2(A::AbstractMatrix{T}) where T
	n,m = size(A)
	if (n == m)
        AA = 2*A
		D = true
		rho = [true for i = 1:n]
		v = sum(A,2)
		p = prod(v)
		@inbounds for i = 1:(2^(n-1)-1)
			a = grayBitToFlip(i)+1
            if rho[a]
                @simd for j=1:n
                    v[j] -= AA[j,a]
                end
            else
                @simd for j=1:n
                    v[j] += AA[j,a]
                end
            end
            rho[a] = !rho[a]
            pv = one(typeof(p))
            @simd for j=1:n
                pv *= v[j] #most expensive part
            end
			D = !D
            if D
                p += pv
            else
                p -= pv
            end
		end

		return p * 2.0^(1-n)
	else
		throw(ArgumentError("perm: argument must be a square matrix"))
	end
end

what kind of matrices have you tried? integer elements? with real elements I get an error on A[j,a]<<1.

I tried with matrices of integer elements and my original version is about 5x faster on 0.6.2.
However, if I change A[j,a]<<1 with A[j,a]*2 then I also observe a factor 3 improvement (and no error with real matrices)… Thanks!

What is the status of @simd on 0.7? I though it was discouraged on the official documentation

I updated the above; no idea why it appeared to work in my notebook before. I now get the following timings post warnup:

A1 = rand(Int64, 24, 24);
A2 = rand(Int32, 24, 24);
A3 = rand(Float64, 24,24);
A4 = rand(Float32, 24, 24);

@time perm(A1);
@time perm2(A1);
println("--")
@time perm(A2);
@time perm2(A2);
println("--")
@time perm(A3);
@time perm2(A3);
println("--")
@time perm(A4);
@time perm2(A4);

for an output of:

  0.877816 seconds (8.39 M allocations: 384.001 MiB, 2.36% gc time)
  0.308223 seconds (13 allocations: 5.328 KiB)
--
  0.989661 seconds (8.39 M allocations: 384.001 MiB, 2.32% gc time)
  0.318777 seconds (13 allocations: 5.328 KiB)
--
  0.591410 seconds (8.39 M allocations: 384.001 MiB, 3.32% gc time)
  0.253649 seconds (13 allocations: 5.328 KiB)
--
  0.576536 seconds (8.39 M allocations: 384.000 MiB, 3.77% gc time)
  0.216791 seconds (13 allocations: 2.922 KiB)

edit, PS. For matrix size 26, I get

  3.627946 seconds (33.55 M allocations: 1.500 GiB, 2.34% gc time)
  1.394570 seconds (13 allocations: 6.141 KiB)
--
  3.933052 seconds (33.55 M allocations: 1.500 GiB, 2.14% gc time)
  1.396673 seconds (13 allocations: 6.141 KiB)
--
  2.465867 seconds (33.55 M allocations: 1.500 GiB, 3.46% gc time)
  1.101601 seconds (13 allocations: 6.141 KiB)
--
  2.254616 seconds (33.55 M allocations: 1.500 GiB, 3.70% gc time)
  0.912361 seconds (13 allocations: 3.375 KiB)

This was on versioninfo():

Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

Checking on 0.7 is currently inconvenient for me, because my 07-repl does not like non-ascii symbols and my jupyter does not like my 07.

Thanks, with the update code I can observe indeed the speedup. What is the major bottleneck of my previous approach, is it the prod()?

But really the main part to remember, after you got the usual gotchas (type stability, everything in-place), is imho the following:

Views and fancy mapreduce calls (sum, prod), as well as some broadcasts, can have nonzero overhead. Avoid them and write straight loops if these constructs appear in an inner loop. Your matrices for permanents can’t be very large anyway, and an iteration over 30 elements will never amortize all the big machinery.

Profile.clear()
@profile perm(A3)
Profile.print()

The major bottleneck was @views v .-= δ[a].*A[:,a].

Got it… thanks!