Irfft with A_mul_B! messes up its input

Using A_mul_B! seems to be messing up it’s input. I attach a .jl file (INPUT) and the output I get on my machine. The output’s last line is the one which is puzzling!

INPUT

# Code snippet to test

nx, ny = 8, 8
nk, nl = nx, ny
nkr = trunc(Int, round(nx/2+1))

# Initialize
f  = randn(nx, ny)
fh = rfft(f)

fcopy = deepcopy(f)
fhcopy = deepcopy(fh)
f3 = zeros(f)

# Plan FFT
effort = FFTW.MEASURE

rfftplan  =  plan_rfft(f; flags=effort)
irfftplan = plan_irfft(fh, nx; flags=effort)

# Restore original values without changing address in memory
f  .= fcopy
fh .= fhcopy

@printf("\nAfter deepcopy( ):
  norm(fh-fhcopy) = %e", norm(fh-fhcopy))

f2 = irfft(fh, nx)

@printf("\nInverse with irfft( ):
  norm(f-f2)      = %e
  norm(fh-fhcopy) = %e", norm(f-f2), norm(fh-fhcopy))

A_mul_B!(f3, irfftplan, fh)

@printf("\nInverse with A_mul_B!( ):
  norm(f-f3)      = %e
  norm(fh-fhcopy) = %e\n\n", norm(f-f3), norm(fh-fhcopy))

OUTPUT

After deepcopy( ):
  norm(fh-fhcopy) = 0.000000e+00
Inverse with irfft( ):
  norm(f-f2)      = 1.233205e-15
  norm(fh-fhcopy) = 0.000000e+00
Inverse with A_mul_B!( ):
  norm(f-f3)      = 1.233205e-15
  norm(fh-fhcopy) = 1.034644e+02

This is a limitation of FFTW. As explained in the FFTW manual,

As noted above, the c2r transform destroys its input array even for out-of-place transforms. This can be prevented, if necessary, by including FFTW_PRESERVE_INPUT in the flags, with unfortunately some sacrifice in performance. This flag is also not currently supported for multi-dimensional real DFTs (next section).

Should probably be documented in Julia (or the new FFTW.jl package) too.