Nonlinear activation functions (and Transformers) on the way out?! And "SimpleGate" the replacement

our SimpleGate could be implemented by an element-wise multiplication, that’s all:
SimpleGate(X, Y) = X ⊙ Y
where X and Y are feature maps of the same size.
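
As I read Fig. 4 of the paper, the two operands X and Y come from splitting the incoming feature map in half along the channel dimension, so a PyTorch-style sketch could look like this (my own illustration, not code from the paper):

```python
import torch

def simple_gate(x: torch.Tensor) -> torch.Tensor:
    """SimpleGate sketch: split the feature map along the channel dimension
    into two equal halves X and Y and multiply them element-wise."""
    x1, x2 = x.chunk(2, dim=1)  # (N, C, H, W) -> two (N, C/2, H, W) tensors
    return x1 * x2
```

Note that the output has half the channels of the input, and there is no explicit nonlinearity anywhere in it.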

What caught my eye first was:

https://arxiv.org/pdf/2204.04676.pdf

NAFNet: Nonlinear Activation Free Network for Image Restoration

Although there have been significant advances in the field of image restoration recently, the system complexity of the state-of-the-art (SOTA) methods is increasing as well, which may hinder the convenient analysis and comparison of methods. In this paper, we propose a simple baseline that exceeds the SOTA methods and is computationally efficient. To further simplify the baseline, we reveal that the nonlinear activation functions, e.g. Sigmoid, ReLU, GELU, Softmax, etc. are not necessary: they could be replaced by multiplication or removed.

As far as I understand, non-linearity in the activation function is necessary: without it, e.g. with the identity function, consecutive layers collapse into a single linear layer, i.e. deep learning wouldn’t be possible. I’ve been on the lookout for the best activation function (which may or may not be application-specific), and there are better ones than ReLU, GELU, etc. out there.
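
To make the "layers collapse" point concrete, here is a tiny NumPy check (my own illustration) that two affine layers with no activation in between compose into a single affine layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" without any nonlinearity in between...
W1, b1 = rng.normal(size=(5, 10)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)),  rng.normal(size=3)

x = rng.normal(size=10)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one layer with W = W2 W1 and b = W2 b1 + b2,
# which is why depth buys nothing without a nonlinearity somewhere.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layers, one_layer)
```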

Doing away with them may be specific to computer vision and might not translate to other NN tasks. I’ve seen Transformers take over basically everything (from convolutional networks), then some of the older methods improved to compete again, or perhaps hybrids too if I recall correctly, but this SOTA network also does away with Transformers. Would that not translate to NLP too, where they originated?

See e.g. in the paper:

Fig. 3: Intra-block structure comparison. ⊗:matrix multiplication, ⊙/⊕:elementwise multiplication/addition. dconv: Depthwise convolution. Nonlinear activation functions are represented by yellow boxes. (a) Restormer’s block[37], some details are omitted for simplicity, e.g. reshaping the feature maps. […] Besides, ReLU is replaced by GELU. (d) Our proposed Nonlinear Activation Free Network’s block. It replaces CA/GELU with Simplified Channel Attention(SCA) and SimpleGate respectively. The details of these components are shown in Fig 4
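
Before the next quote, a quick note on the Simplified Channel Attention (SCA) that caption mentions: as I read Fig. 4, it is just global average pooling followed by a single pointwise (1×1) convolution whose output re-weights the channels, with no Sigmoid or ReLU anywhere. A rough PyTorch-style sketch of my reading (not code from the paper):

```python
import torch
import torch.nn as nn

class SimplifiedChannelAttention(nn.Module):
    """Sketch of SCA as I read Fig. 4: global average pooling, then a single
    pointwise convolution, used to re-weight the channels. Unlike ordinary
    channel attention, there is no Sigmoid/ReLU in the branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # (N, C, H, W) -> (N, C, 1, 1)
        self.pw = nn.Conv2d(channels, channels, 1)   # pointwise conv, no nonlinearity

    def forward(self, x):
        return x * self.pw(self.pool(x))             # channel-wise re-weighting
```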

and:

3.2 A Plain Block
Neural Networks are stacked by blocks. We have determined how to stack blocks in the above (i.e. stacked in a UNet architecture), but how to design the internal structure of the block is still a problem. We start from a plain block with the most common components, i.e. convolution, ReLU, and shortcut[13], and the arrangement of these components follows [12,21], as shown in Figure 3a. We will note it as PlainNet for simplicity. Using a convolution network instead of a transformer is based on the following considerations. First, although transformers show good performance in computer vision, some works[12,22] claim that they may not be necessary for achieving SOTA results. Second, depthwise convolution is simpler than the self-attention[32] mechanism. Third, this paper is not intended to discuss the advantages and disadvantages of transformers and convolutional neural networks, but just to provide a simple baseline. The discussion of the attention mechanism is proposed in the subsequent subsection.
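
For intuition, "convolution, ReLU, and shortcut" arranged in such a block could look roughly like this PyTorch-style sketch (the exact arrangement in the paper’s Fig. 3 differs in detail; this is only my illustration):

```python
import torch
import torch.nn as nn

class PlainBlockSketch(nn.Module):
    """Rough illustration of a 'plain block': pointwise conv, depthwise conv,
    ReLU, pointwise conv, plus an identity shortcut. Not the paper's exact
    arrangement, just the general idea."""
    def __init__(self, channels: int):
        super().__init__()
        self.pw1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.dw = nn.Conv2d(channels, channels, kernel_size=3,
                            padding=1, groups=channels)  # depthwise convolution
        self.act = nn.ReLU()
        self.pw2 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x + self.pw2(self.act(self.dw(self.pw1(x))))  # shortcut connection
```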

3.4 Activation
The activation function in the plain block, Rectified Linear Unit[27] (ReLU), is extensively used in computer vision. However, there is a tendency to replace ReLU with GELU[14] in SOTA methods[22,37,30,21,11]. […]

3.5 Attention
Considering the recent popularity of the transformer in computer vision, its attention mechanism is an unavoidable topic in the design of the internal structure of the block. There are many variants of attention mechanisms, and we discuss only a few of them here. […]

From Eqn. 1 and Eqn. 2, it can be noticed that GELU is a special case of GLU, […] Through the similarity, we conjecture from another perspective that GLU may be regarded as a generalization of activation functions, and it might be able to replace the nonlinear activation functions.
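
For reference, the two equations that passage refers to are, as I read the paper:

```latex
% Eqn. 1: Gated Linear Unit, with maps f, g and a nonlinearity \sigma
\mathrm{GLU}(\mathbf{X}, f, g, \sigma) = f(\mathbf{X}) \odot \sigma\big(g(\mathbf{X})\big)

% Eqn. 2: GELU, with \Phi the standard normal CDF
\mathrm{GELU}(x) = x \, \Phi(x)

% Taking f and g as the identity and \sigma = \Phi turns GLU into GELU;
% dropping \sigma altogether leaves SimpleGate(X, Y) = X \odot Y.
```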


SimpleGate is nonlinear in that case, because it is nonlinear in X0, with X = A X0 + b0 and Y = B X0 + b1. @Palli
PS: even for a multilayer perceptron it works better than any other activation for me, no matter the number of neurons used. It’s a pity they did not make a separate paper for it, as I suspect it is robust to vanishing gradients ([1805.11604] How Does Batch Normalization Help Optimization?).
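
If I follow that correctly, a SimpleGate "layer" in an MLP would compute two affine projections of the same input and multiply them, which is quadratic, hence nonlinear, in X0. A minimal NumPy sketch of my reading (hypothetical names and shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

def simplegate_layer(x0, A, b0, B, b1):
    # X = A x0 + b0, Y = B x0 + b1, output = X ⊙ Y (quadratic, hence nonlinear, in x0)
    return (A @ x0 + b0) * (B @ x0 + b1)

A, b0 = rng.normal(size=(4, 8)), rng.normal(size=4)
B, b1 = rng.normal(size=(4, 8)), rng.normal(size=4)
x0 = rng.normal(size=8)

# Quick nonlinearity check: even with zero biases, scaling the input by 2
# scales the output by 4, not 2, so the layer is not linear in x0.
f = lambda v: (A @ v) * (B @ v)
assert not np.allclose(f(2 * x0), 2 * f(x0))
```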
