I am puzzled by this.
(i) Why doesn’t fft() do this scaling internally? I have to do

X = (1/L)*fft(x,N); % Now we compute the coefficients X_k

in order to get the correct spectrum. Julia appears to do the same thing; there are no worked examples in the Julia FFT library’s docs, but it looks that way.
(ii) What are the reasons for writing the formulas as shown (Julia’s are identical, AFAICT)? Clearly, including the division by n in the first formula would make sense, as that would yield the correct amplitudes of the spectrum of the signal X…

The MATLAB documentation is misleading; there isn’t some spurious scaling factor in the DFT implementations. This is their signal before adding noise: S = 0.8 + 0.7*sin(2*pi*50*t) + sin(2*pi*120*t);, i.e. a constant plus a couple of sinusoids. Taking the continuous Fourier transform of that would give infinitely high, infinitesimally narrow peaks — scaled unit impulses — and it’s the integral under each peak that reflects a sinusoid’s amplitude. Since the DFT works on a sampled, finite-length signal, the peaks are merely large instead of infinite, and they grow with the input length, just as the rest of the DFT scales relative to a corresponding continuous FT. To normalize across input lengths, which better approximates the continuous FT, DFT outputs are often scaled by 1/L, but that isn’t always necessary. Note that the MATLAB docs actually scale most of the values by 2/L (in separate steps) to make the peaks match the sinusoid amplitudes: the continuous Fourier transform of a real sinusoid has two peaks, and the integral under each is half the amplitude.
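To see the length scaling concretely, here’s a small NumPy sketch (the thread’s code is MATLAB, but NumPy’s fft uses the same unnormalized convention; the 1 kHz sampling rate and lengths are my own choices): the raw |fft| peak of a 0.7-amplitude sinusoid grows linearly with the length L, while (2/L)·|fft| stays pinned at 0.7.

```python
import numpy as np

fs = 1000  # assumed sampling rate in Hz
for L in (1000, 4000):
    t = np.arange(L) / fs
    s = 0.7 * np.sin(2 * np.pi * 50 * t)     # 50 Hz falls exactly on a bin
    peak = np.abs(np.fft.fft(s)).max()        # raw peak: 0.7 * L / 2
    print(L, peak, 2 * peak / L)              # second column grows, third stays 0.7
```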

Tracked down a Stack Exchange post with some math relating a derived continuous FT to the DFT, so that’ll explain how the DFT is used to approximate FTs. The DFT input being sampled and implicitly periodic adds a lot of approximation headaches.

An FFT computes a discrete Fourier transform (DFT). You can interpret it as an approximation for a Fourier series or continuous Fourier transform or DTFT, but by itself the DFT knows nothing about sample period (L?) or “sampling rates” etc.

The normalizations of fft and ifft are somewhat arbitrary (along with the choice of ± signs in their phase exponents) — the only mathematical requirement, if you want them to be inverses of one another, is that the product of their normalization factors should be 1/n.
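NumPy happens to follow the same pair of conventions, so the inverse requirement is easy to check: all that matters is that the two normalization factors multiply to 1/n, however you split them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# Default split: fft is unnormalized, ifft carries the whole 1/n.
X = np.fft.fft(x)
print(np.allclose(np.fft.ifft(X), x))                  # True

# Alternative split: 1/sqrt(n) on each side; still an inverse pair.
Xo = np.fft.fft(x, norm="ortho")
print(np.allclose(np.fft.ifft(Xo, norm="ortho"), x))   # True
```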

Conceptually, it can be attractive to give them both a normalization of 1/\sqrt{n}, so that each of the transformations is unitary. However, this has a couple of disadvantages: first, you pay the (slight) cost of a normalization pass twice, while in practical situations the unitary normalization is rarely needed. Second, the convolution theorem would then need an additional factor of \sqrt{n}.
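Both properties are easy to verify numerically; a NumPy sketch (its norm="ortho" option is exactly the 1/\sqrt{n} unitary convention): the 2-norm is preserved, but the convolution theorem picks up a stray \sqrt{n}.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
x = rng.standard_normal(n)
y = rng.standard_normal(n)

# Unitary transform: the 2-norm is preserved (Parseval).
Xo = np.fft.fft(x, norm="ortho")
print(np.allclose(np.linalg.norm(Xo), np.linalg.norm(x)))   # True

# ...but the convolution theorem now needs an extra factor of sqrt(n).
conv = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real      # cyclic convolution
conv_o = np.fft.ifft(np.fft.fft(x, norm="ortho") * np.fft.fft(y, norm="ortho"),
                     norm="ortho").real
print(np.allclose(np.sqrt(n) * conv_o, conv))               # True
```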

I’m not sure where the convention that it is the ifft that gets the factor of 1/n originated, but it seems likely to have been popularized by Matlab, from which it was copied into Julia, SciPy, and probably others. This choice has the advantages that the normalization is done only once, and the convolution theorem then has a simple form: ifft(fft(x) .* fft(y)) is the cyclic convolution of x and y (whereas other normalizations would require an additional scaling).
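A quick check of that simple form, in NumPy (which uses the same convention), against the cyclic convolution computed directly from its definition:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Cyclic convolution straight from the definition.
direct = np.array([sum(x[j] * y[(k - j) % n] for j in range(n))
                   for k in range(n)])

# With the whole 1/n living in ifft, no extra scaling is needed.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real
print(np.allclose(direct, via_fft))   # True
```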

The underlying FFTW library (used by both Matlab and Julia) computes transforms with no normalization factor at all. (In Julia, you can compute the lower-level unnormalized “inverse” transform with bfft.) The rationale was that (a) many normalization conventions are common and we didn’t want to pay the price of picking one (possibly wrong) for the user and (b) in practice, you invariably do some pre- or post-processing of the FFT inputs/outputs, and it is more efficient if you combine the scale factor into that computation (rather than doing a separate scaling loop, as ifft does).
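NumPy exposes the same unnormalized inverse building block via norm="forward" (which puts the 1/n on fft and leaves ifft with no scaling at all, analogous to Julia’s bfft). A sketch of point (b), folding the 1/n into post-processing you were going to do anyway — here a Hann window, chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
X = np.fft.fft(rng.standard_normal(n))
w = np.hanning(n)   # some post-processing step we'd apply regardless

# Two passes over the data: ifft's internal 1/n loop, then the window.
separate = w * np.fft.ifft(X)

# One pass: unnormalized inverse, with the 1/n fused into the window.
fused = (w / n) * np.fft.ifft(X, norm="forward")
print(np.allclose(separate, fused))   # True
```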

Matching some other arbitrary scaling choice ≠ “right”. All of these normalizations are just conventions.

The scaling factor you seem to consider “right” is the one that makes the fft outputs equal the amplitudes of the sinusoids e^{\frac{2\pi i}{n}(j-1)(k-1)} in the trigonometric expansion of the original signal. But why normalize your basis functions this way? Why not normalize them by 1/\sqrt{n} to make them orthonormal? Why not normalize them by 1/n?

(Furthermore, if you really want this scaling convention, you can just interpret ifft as your Fourier transform, and fft as its inverse, which merely corresponds to flipping the sign convention of the sinusoid’s exponent.)
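Concretely, for a real signal the unnormalized “inverse” transform differs from the forward transform only by the sign of the exponent, i.e. a complex conjugate — a one-line NumPy check:

```python
import numpy as np

x = np.random.default_rng(4).standard_normal(8)
n = len(x)

# n * ifft undoes ifft's 1/n, leaving the transform with e^{+2*pi*i...};
# for real x that is just the conjugate of fft(x).
print(np.allclose(n * np.fft.ifft(x), np.conj(np.fft.fft(x))))   # True
```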

There is no “right” choice. There is only what is convenient for particular applications, and/or what is conventional in certain contexts.

Some of the FFT scaling conventions also stem from DSP processing with integer hardware/data and a group exponent (for the whole vector).

Division by \sqrt{n} was not practical. Instead, dividing by 2 in stages (a right shift whenever needed) ensured that there was no overflow. This also played well with the fact that most algorithms supported only N that are powers of 2.
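A toy arithmetic illustration of that staged scaling: one right-shift (a divide by 2) at each of the log2(n) radix-2 stages accumulates exactly the 1/n normalization, with no division by \sqrt{n} ever required.

```python
n = 1024
stages = n.bit_length() - 1   # log2(1024) = 10 butterfly stages
scale = 1.0
for _ in range(stages):
    scale /= 2                # one right-shift per stage keeps values in range
print(scale == 1 / n)         # True
```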

Doubtless what you say is mathematically sound, but it is still nice when 2 * abs(fhat) produces the “right” peaks of the DFT of f (where fhat = DFT(f)).
So, my “ideal” DFT would be (1/n) * fft(f, n). When I manufacture a signal, for instance

f = sin(2*pi*50*t) + sin(2*pi*120*t); % Sum of 2 frequencies
f = f + 2.5e-1*randn(size(t));        % Add some noise
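The same experiment sketched in NumPy (the snippet above is MATLAB; the 1 kHz sampling rate and length 1500 are my assumptions, chosen so both frequencies fall on exact bins): with fhat = (1/n)*fft(f), the peaks of 2*abs(fhat) land at the sinusoid amplitudes, both 1 here, up to the noise.

```python
import numpy as np

fs = 1000                                   # assumed sampling rate
n = 1500
t = np.arange(n) / fs
f = np.sin(2*np.pi*50*t) + np.sin(2*np.pi*120*t)             # sum of 2 frequencies
f = f + 0.25 * np.random.default_rng(0).standard_normal(n)   # add some noise

fhat = np.fft.fft(f) / n                    # the "(1/n) * fft" convention
spectrum = 2 * np.abs(fhat)                 # 2x: each real sinusoid has two peaks
freqs = np.fft.fftfreq(n, d=1/fs)
for target in (50, 120):
    k = np.argmin(np.abs(freqs - target))
    print(target, spectrum[k])              # both close to 1, noise aside
```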