Your machine B may run into some of the following issues:
-
The CPU has hyperthreading (2 threads per core). When you do the FFT with 2 threads, how can you be sure that two physical cores are used, and not two threads on the same core? Ideally, the threads should first be distributed over the physical cores, and only if these are exhausted, the 2nd thread on each core should be used. But I’m afraid I don’t know whether FFTW takes care to pin the threads to physical cores. If not, the behavior depends on the OS and maybe on what else is running on the machine.
-
Modern CPUs have “turbo-boost”, i.e. when just one thread is running, the CPU frequency can be (much) higher than when multiple threads are running (due to the thermal budget). So by using more threads/cores, the CPU might clock down, i.e. the performance won’t scale.
-
FFT of size N has a runtime of O(N log N), i.e. the work per data element is actually not large. If an optimized implementation (like FFTW) is used, such an algorithm’s performance may actually not be limited by CPU resources, but by memory bandwidth. Note that your machine A has two CPU sockets, i.e. when you use two threads it can probably use twice the memory bandwidth because each CPU has its own memory controller.
For a proper scaling benchmark, it is often recommended to turn off turbo-boost and hyperthreading, though this may require changing settings in the BIOS. Modern Linux kernels may offer tools to change that behavior for a running system, but I’m not up-to-date on that.