That paper (and the rest of the information bottleneck work) has generated a fair amount of controversy (see this ICLR paper and the open reviews). For example, the “compression phase”, where the mutual information decreases with more training, doesn’t show up when you use ReLU activations instead of tanh.
Interesting. First, despite the extensive discussion of it in the original paper, I would not have expected fitting and compression to form distinct, identifiable phases in the general case. That’s an interesting detail, but it hardly seems like a crucial one. Second, I’d certainly expect there to be cases where it’s possible to get good representations of the target variable without compression, but (perhaps?) the interesting cases are those where achieving this is too difficult as a practical matter. That said, the examples where they show over-fitting in spite of compression seem extremely worrying for the whole information bottleneck picture, so that’s definitely very interesting.