##### Blog by Author : Herumb

- I love ML and Data Science or anything remotely related to it.

- Data Science Intern at CrowdANALYTIX | Ex-NLP Research Intern at CAIR, DRDO | Crework | Ex-Jr. ML Engineer at Omdena.

- If you have something to talk about AI/ML/DS, feel free to message me.

**1. Introduction**

Deep Learning is probably one of those things that everyone thinks is magical, and I usually take immense pleasure in seeing people’s reaction when I tell them it’s essentially matrices rather than the nodes they expect. But neural networks aren’t limited to that, and neither is that the thing that makes them special.

For all the beautiful minds out there, the quote below is modified from The Lord of the Rings. Sadly, it couldn’t be true.

“One Activation to Rule them all

One Activation to find them

One Activation to bring them all

And in Layer bind them”

Behind the scenes, neural networks learn by a process known as **backpropagation**, which is essentially propagating the loss backward through the neural network or, as a few like to say, the chain rule coming to life. On the other hand, the output of a neural network is generated by the process known as *feed-forward*, **which is basically nothing but a set of linear operations.**

The mathematics behind it is intriguing but these statements lead to one interesting question:

- If feed-forward is linear then why are neural networks so hyped?

Well, my friend, I’ll ask you to hold your horses and fasten your seatbelt because it’s gonna be one hell of a ride!

P.S. I expect you are familiar with Feed Forward and Backpropagation. If not – no issue, we’ll be getting a bit into their working before completely understanding Activation Functions.

**2. What are Activation Functions?**

Activation Function, also known as Transfer Function is nothing but a simple mathematical function and like any function, it takes an input and gives some output corresponding to it.

In neural networks, the activation function of a node defines the output of that node given an input or set of inputs. An integrated circuit could be considered as a digital network of activation functions that can be “ON” or “OFF”, based on the input.

Yes, I’m indeed saying that it could be any function: sigmoid, tanh, Softmax, ReLU, Leaky ReLU, etc. But the questions below may arise:

- Does using a particular activation function have any effect on performance?
- Why use activation functions anyways?
- What are the popular ones?

Let’s see all of them one by one.

**3. Why use Activation Functions?**

Before diving into the answer let’s take a walk down the memory lane and understand the feed-forward process again. Take a look at the picture below.

Although that pretty much sums it up, let’s go through it briefly. In the image on top, there is a single node of the layer with w1 and w2 being the weights associated with the inputs x1 and x2. This node does nothing but multiply the inputs by their corresponding weights and sum the results.

*For example*, x1 has weight w1 and x2 has weight w2 associated with them, so they are multiplied and summed, resulting in x1 × w1 + x2 × w2.

Now let’s talk about the network at the bottom which is a simple neural network. In the hidden layer, we have 2 nodes, let’s name the top node as node A and the bottom node of the hidden layer as node B.

The output from node A is the result of summing the products of x1 and x2 with their weights associated to node A i.e. w1 and w3 respectively, let’s refer to the output of node A as o1. Similarly, the output from node B is the result of summing the products of x1 and x2 with their weights associated to node B i.e. w4 and w2 respectively, let’s refer to the output of node B as o2.

Now that we have the outputs of both nodes in the hidden layer, we can now calculate the output of the neural network by summing the products of o1 and o2 with their weights associated with the output node i.e. w5 and w6 respectively. But we know that this operation is basically a matrix multiplication and we have the matrix W or weight matrix and O the hidden layer output matrix so the output becomes W x O. If we add the bias term it becomes W x O + B.
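To make the feed-forward step concrete, here is a minimal sketch in plain Python. The input and weight values are made up purely for illustration, the weight names follow the description above, and `matvec` is a hypothetical helper, not a library function.

```python
# Hypothetical inputs and weights, chosen only for illustration
x1, x2 = 1.0, 2.0
w1, w3 = 0.5, -1.0   # weights into node A
w4, w2 = 1.5, 0.25   # weights into node B
w5, w6 = 2.0, -3.0   # weights into the output node

# Hidden layer: each node is a weighted sum of the inputs
o1 = x1 * w1 + x2 * w3   # node A
o2 = x1 * w4 + x2 * w2   # node B

# Output node: weighted sum of the hidden outputs
y = o1 * w5 + o2 * w6


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# The same computation written as two matrix multiplications
W_hidden = [[w1, w3], [w4, w2]]
W_out = [[w5, w6]]
hidden = matvec(W_hidden, [x1, x2])
y_matrix = matvec(W_out, hidden)[0]

print(y, y_matrix)
```

Both routes give the same number, which is the point: node-by-node sums and matrix multiplication are the same operation.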

The problem is, this output is linear in nature so it’s not able to capture the complex patterns in the dataset. If somehow we can *add non-linearity in the result* we might be able to do that and *that’s exactly what activation functions do*. We apply activation functions on the outputs of the hidden layers to introduce non-linearity so that our model is able to capture complex patterns in the data.
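One way to see why purely linear layers are limiting: without an activation, two stacked layers collapse into a single one. The sketch below uses made-up weights and hypothetical helper names (`matvec`, `matmul`, `relu_vec`), with biases left out for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu_vec(vector):
    """Apply ReLU to every element of a vector."""
    return [max(0.0, v) for v in vector]


W1 = [[0.5, -1.0], [1.5, 0.25]]   # hypothetical hidden-layer weights
W2 = [[2.0, -3.0]]                # hypothetical output-layer weights
x = [1.0, 2.0]

# Two linear layers applied one after the other
two_layers = matvec(W2, matvec(W1, x))

# A single layer whose weight matrix is the product W2 * W1
one_layer = matvec(matmul(W2, W1), x)

# With an activation in between, the collapse no longer happens
with_relu = matvec(W2, relu_vec(matvec(W1, x)))

print(two_layers, one_layer, with_relu)
```

The first two results match, so the extra layer added nothing. The third differs because the activation sits between the two multiplications.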

Well, that was elaborate but hopefully, this is clear now. Let’s go ahead and explore 3 activation functions, the problems associated with them, and how other activation functions overcome them.

**4. Three Activation Functions**

4.1 Sigmoid Function

4.2 Hyperbolic Tangent Function

4.3 ReLU – Rectified Linear Unit

**4.1 Sigmoid Function**

If you are familiar with Logistic Regression then chances are you already know about the sigmoid, or logistic, function. But just for formality let’s revisit it:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The function above is the sigmoid function. It basically takes an input and maps it to a value between 0 and 1. Now, this sigmoid function can be used as an activation function for our hidden layer and it’ll work fine, kinda. I mean it does introduce non-linearity leading to complex decision boundaries, but as the number of hidden layers increases, we’ll run into the problem of Vanishing Gradients. Just for a refresher, a Decision Boundary is a line that separates the points into different regions, where each region corresponds to a different class.
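A quick sketch of the squishing behaviour in plain Python, using only the standard `math` module. The sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Large negative inputs land near 0, large positive inputs near 1
for x in (-10, -1, 0, 1, 10):
    print(x, round(sigmoid(x), 5))
```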

In the Vanishing Gradient problem, the gradients become extremely small, leading to almost no change in weights, and no change in weights means no learning. Before understanding why it happens let’s take a look at the derivative of the sigmoid function. With σ(x) denoting the sigmoid, the derivative is:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

Interesting isn’t it? The more interesting thing is that the maximum value attained by the derivative of the sigmoid function is 0.25, I know you’ll be all like “what’s so interesting in that?”.

To that, I’ll say that the product of 2 numbers, both less than 1, results in a lesser value and where does that multiplication happen?

Chain rule! And like I said before, backpropagation is nothing but chain rule coming to life. So as the network back propagates the error deeper, the values of gradients keep getting smaller since the no. of gradients being multiplied increases. To all those who forgot chain rule and backpropagation, the following image should help.

That’s the reason that in deep networks with sigmoid activation the layers near the output keep learning, but the learning gets weaker as the error is propagated back toward the earlier layers, until it practically stops.

For shallow networks and not-so-deep neural networks, it’s kind of ok to use sigmoid but as you go to neural nets with 10 or 20 or 50 layers things get really serious.
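To put rough numbers on this, the sketch below finds the peak of the sigmoid derivative and then raises it to the number of layers. This is only an upper bound on the activation part of the chain rule product; it deliberately ignores the weights, which also get multiplied in, so treat it as an illustration rather than an exact gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 with a value of 0.25
peak = max(sigmoid_derivative(i / 100) for i in range(-500, 501))


def gradient_upper_bound(num_layers):
    """Best case for the product of sigmoid derivatives across layers."""
    return peak ** num_layers


for n in (2, 10, 20, 50):
    print(n, gradient_upper_bound(n))
```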

So, what is the solution for this? Well, the simple solution is to use an activation function that does not squish the inputs and ReLU does exactly that. But before that let’s take a look at another function loved, or used, by LSTMs i.e. tanh function.

**4.2 Hyperbolic Tangent Function**

If you have studied Trigonometry then chances are you’ve already heard about it. I apologize in advance because I’ll have to take you through that trauma again. Equation-wise, the tanh function looks like this:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The Hyperbolic Tangent function, or tanh function, is a rescaled version of the sigmoid function and, like the sigmoid function, it squishes the input into a range, but this range is a bit bigger than sigmoid’s, i.e. (-1, 1). Not just that, in comparison it is also much steeper near zero than sigmoid. But these things make much more sense when you look at the following equation, where σ is the sigmoid function:

$$\tanh(x) = 2\,\sigma(2x) - 1$$

This equation justifies the point that tanh is basically a stretched and readjusted sigmoid function, which is probably why tanh is quite similar to the sigmoid function property-wise. In fact, just like the sigmoid function, its derivative can be expressed in terms of its own value.
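Both claims are easy to check numerically. The sketch below rebuilds tanh from the sigmoid using the rescaling tanh(x) = 2·sigmoid(2x) − 1 and writes the derivative in terms of tanh itself (1 − tanh²(x)); the function names are my own.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh written as a stretched and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_derivative(x):
    """Derivative of tanh expressed through its own value."""
    return 1.0 - math.tanh(x) ** 2


samples = [i / 10 for i in range(-30, 31)]
max_gap = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in samples)
print(max_gap)

# Slope at zero: 1.0 for tanh versus 0.25 for sigmoid
print(tanh_derivative(0.0), sigmoid(0.0) * (1.0 - sigmoid(0.0)))
```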

These things aside, tanh actually performs better than the sigmoid function when used as an activation function. But why? For that let’s take a look at a research paper by LeCun named “Efficient BackProp” written in 1998 the time when deep learning was usually limited to paper. Here is an excerpt from it:-

Sigmoids that are symmetric about the origin are preferred for the same reason that inputs should be normalized, namely, because they are more likely to produce outputs (which are inputs to the next layer) that are on average close to zero.

This is in contrast, say, to the logistic function whose outputs are always positive and so must have a mean that is positive.

Let’s dissect it piece by piece. The first point says that symmetric sigmoids, i.e. tanh in this case, are preferred because they produce outputs that are on average close to zero. That’s true: the logistic function’s output range is (0, 1), so its outputs are always positive, whereas the tanh function has a range of (-1, 1), hence the mean value of a layer’s output comes very close to 0. When values are centered around 0 the gradients are larger. That holds for sigmoid too, except that with tanh the gradients are much stronger and hence convergence is faster.
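A tiny experiment makes the zero-mean argument visible. The inputs below are an arbitrary set that is symmetric around zero, which is an assumption for illustration; real pre-activations won’t be this tidy.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical pre-activations, symmetric around zero
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

mean_sigmoid = sum(sigmoid(x) for x in inputs) / len(inputs)
mean_tanh = sum(math.tanh(x) for x in inputs) / len(inputs)

# Sigmoid outputs are all positive, so their mean sits well above zero;
# tanh outputs cancel out around zero
print(mean_sigmoid, mean_tanh)
```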

But sadly it suffers from the problem of vanishing gradient too. So we still need a fix for that. There is another problem of the error surface being flat near zero but, as LeCun said, adding a linear term can fix that. By that, he meant using tanh(x) + ax. And as a bonus fact, LeCun actually recommended the following sigmoid:

$$f(x) = 1.7159 \tanh\!\left(\tfrac{2}{3}x\right)$$
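For the curious, here is a small sketch of both ideas. The constants 1.7159 and 2/3 are the ones recommended in Efficient BackProp, chosen so that f(1) is roughly 1 and f(-1) is roughly -1. The value `a=0.1` for the linear term is just a placeholder I picked, not a recommendation from the paper.

```python
import math


def lecun_tanh(x):
    """Scaled tanh recommended in Efficient BackProp."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)


def tanh_with_linear_term(x, a=0.1):
    """tanh plus a small linear term; a=0.1 is an arbitrary placeholder."""
    return math.tanh(x) + a * x


print(lecun_tanh(1.0), lecun_tanh(-1.0))

# Far from zero plain tanh is flat, but the linear term keeps a slope of about a
step = 0.001
slope = (tanh_with_linear_term(10.0 + step) - tanh_with_linear_term(10.0)) / step
print(slope)
```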

**4.3 ReLU – Rectified Linear Unit**

Of all the activation functions I have known, this was the most simple…

– Spock

We talked about what a vanishing gradient is and how it can hinder the learning process of initial layers, so a good question to ask would be: is there any activation function that won’t let the vanishing gradient problem happen? Well, that’s exactly what ReLU helps us with. If squishing the output is making the gradients disappear then let’s not! ReLU is a simple guy, he sees negative and he makes it zero. To understand it better let’s see its equation:

$$\text{ReLU}(x) = \max(0, x)$$

So how does ReLU solve the vanishing gradient problem? Well, the derivative of ReLU can only be either 0 or 1, so for active units the gradient does not shrink as it passes through more layers. Functions with pointy edges or breaks, like ReLU at x = 0, are non-differentiable at that point, but for ReLU we can simply define the gradient at x = 0 as either 0 or 1; in practice it won’t matter.
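In code, both the function and its derivative are one-liners. The sketch below picks 0 as the gradient at x = 0, which is just one of the two conventions mentioned above, and multiplies the derivative along a hypothetical path of 50 active units.

```python
def relu(x):
    """Keep positive values, turn everything else into zero."""
    return max(0.0, x)


def relu_derivative(x):
    # The derivative is undefined at x = 0; 0 is chosen here by convention
    return 1.0 if x > 0 else 0.0


# Product of derivatives along 50 layers where every unit is active
grad = 1.0
for _ in range(50):
    grad *= relu_derivative(2.0)

print(relu(-3.0), relu(2.5), grad)
```

Compare that with the sigmoid, where the same product can be at most 0.25 raised to the 50th power.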

I mean this is as simple as it can get, and that’s basically the principle behind ReLU: if the output is negative make it 0, else keep it as is. Wait wait wait! What? If ReLU just lets the input be when it is positive and only makes it 0 when negative, then isn’t it basically a linear function with a restriction? Well, you are not wrong in that statement; in fact, if you look at the graph it is the same as a linear function for the +ve x-axis.

**Why does ReLU work?**

If ReLU is basically a linear function for the +ve x-axis then how does it introduce non-linearity to the net? Well, to keep it simple, the ReLU function is non-linear overall, so it does introduce non-linearity. But you won’t be satisfied with that sort of explanation, right? So let’s do a bit of math to understand. A function is said to be non-linear if it does not satisfy the following property:

$$f(a + b) = f(a) + f(b)$$

For ReLU the above statement fails when one of a and b is negative and the other is positive. For example, with a = 1 and b = -1, ReLU(a + b) = ReLU(0) = 0, while ReLU(a) + ReLU(b) = 1 + 0 = 1. So that does prove that ReLU is not linear and hence can introduce non-linearity to the net. But if that doesn’t seem convincing to you, take a look at the following picture:

So let’s understand the above picture, but before that let’s establish some basic understanding of graph manipulation. When you add some value to the input of a function, you shift its graph along the x-axis. So for the identity function, i.e. f(x) = x, if we add a value of 5, making it f(x) = x + 5, the graph will shift by 5 units in the direction of the -ve x-axis.

That means adding a value shifts the graph in the -ve x-axis direction and subtracting a value shifts it in the +ve x-axis direction. Similarly, multiplying the function by a +ve value stretches it in the +ve y-axis direction, and multiplying it by a -ve value stretches it in the -ve y-axis direction.

Now that we have some basic idea about graph manipulation, we can go ahead and ask ourselves: where have we seen this multiplying and adding? The answer is Feed-Forward! Yep, we multiply the input value with weights and add bias, if any, to the result.

Doing this to a ReLU function stretches and shifts it, and since the same thing happens at other nodes too, we’ll have a bunch of transformed ReLU functions which will be added together and passed to the next layer. As seen in the picture, when you add a bunch of transformed ReLUs you get a non-linear function. That’s why, even though ReLU may seem linear, it creates a non-linear function when transformed and added together.
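Both arguments can be checked with a few lines of Python. The first part tests whether ReLU satisfies f(a + b) = f(a) + f(b), which any linear function would. The second part adds three shifted and scaled ReLUs; the particular shifts and scales are my own pick, chosen because they produce an easy-to-recognise triangular bump.

```python
def relu(x):
    return max(0.0, x)


# Part 1: a linear function would satisfy f(a + b) == f(a) + f(b)
a, b = 1.0, -1.0
lhs = relu(a + b)          # relu(0.0)
rhs = relu(a) + relu(b)    # relu(1.0) + relu(-1.0)
print(lhs, rhs)


# Part 2: a sum of shifted and scaled ReLUs (hypothetical weights and biases)
def hat(x):
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)


points = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
values = [hat(x) for x in points]
print(values)
```

The result rises from 0 to 1 and falls back to 0, a shape no single straight line can produce.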

**So ReLU is perfect?**

ReLU is widely used in neural networks and even now you’ll see it being used a lot, like a lot. So that’s good, right? We finally have our perfect activation function. YAY! Well, NO. Sorry about that, really. Even though it works in most cases, in some cases it won’t, due to something known as the Dying ReLU Problem.

**5. Dying ReLU Problem**

It was going so well that we felt like we finally met the chosen one but all the good things eventually came to an end and in the case of ReLU the reason was Dying ReLU Problem, which is a really cool name but what makes ReLU have this problem? Or even a more basic question, what is the Dying ReLU problem? Do we have a fix? Let’s discuss that.

So as discussed earlier, we know that ReLU is zero for all negative inputs. That is actually what worked for us by adding non-linearity to the output, and for the most part it’ll work fine. But in some cases many nodes start giving negative output, due to either a large negative weight or a large negative bias term, which leads to the nodes outputting 0 when ReLU is applied. Now if this happens to a lot of nodes then the gradient also becomes zero, and if there is no gradient there is no learning.

This doesn’t happen every time though so ReLU is not useless. So why does this happen? Well a couple of reasons:-

- **High Learning Rate:** High Learning Rate can lead to the weights becoming negative which may make the output term negative. Setting the learning rate to a lower value might fix the problem.
- **Large Negative Bias Term:** If the bias term is largely negative then it could lead to the final output becoming negative.
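Here is a minimal sketch of a single "dead" node. The weight, the bias and the inputs are all hypothetical values, with the bias deliberately set to a large negative number so that the pre-activation is negative for every input.

```python
def relu(x):
    return max(0.0, x)


def relu_derivative(x):
    return 1.0 if x > 0 else 0.0


weight, bias = 0.5, -10.0                  # hypothetical, large negative bias
inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]       # hypothetical data

outputs = []
weight_grads = []
for x in inputs:
    z = weight * x + bias                  # pre-activation, always negative here
    outputs.append(relu(z))
    # Gradient of relu(z) with respect to the weight is relu'(z) * x
    weight_grads.append(relu_derivative(z) * x)

print(outputs)
print(weight_grads)
```

Every output is zero and so is every gradient, which means gradient descent has nothing to push the weight with.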

So we know what the Dying ReLU problem is, we know what causes it and we know one way to fix it. Well, actually there is another way to fix it i.e. New Activation Function. Yay?

**6. Modifying ReLU**

The main cause of the dying ReLU problem is negative output becoming zero, so to cure it we can simply not make the negative output zero. Cool! So what should we replace it with? Well, plenty of things, as it seems, and that’s where variations of ReLU come in, the most popular one being Leaky-ReLU. So what are these variants and what’s the difference? Let’s take a look at 5 variants:

- Shifted ReLU
- Leaky ReLU
- P-ReLU
- ELU
- GELU

**6.1 Shifted ReLU**

Shifted ReLU is an adjusted ReLU which, instead of turning negative values into zero, replaces them with another value. In fact, equation-wise it is pretty similar to ReLU. Since the negative value doesn’t become zero, the nodes don’t get switched off and the Dying ReLU Problem doesn’t occur.

**6.2 Leaky ReLU**

Leaky ReLU is another ReLU variant that replaces the negative part of ReLU with a line of a small slope, commonly 0.01 or 0.001. Since negative values don’t become 0, the nodes don’t get switched off and the dying ReLU problem doesn’t occur.

**6.3 P-ReLU**

P-ReLU, or Parametric ReLU, is a variation of Leaky ReLU that replaces the negative part of ReLU with a line of slope α, where α is learned during training instead of being fixed. Since negative values don’t become 0, the nodes don’t get switched off and the dying ReLU problem doesn’t occur.
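In plain Python, Leaky ReLU and P-ReLU share the same formula; the only difference is where the slope comes from. In the sketch below the slope is a plain argument named `negative_slope` (my own naming), so the P-ReLU case is shown with a hand-picked value rather than a learned one.

```python
def leaky_relu(x, negative_slope=0.01):
    """Pass positive values through, scale negative values by a small slope."""
    return x if x >= 0 else negative_slope * x


# Leaky ReLU with the common default slope
print(leaky_relu(5.0), leaky_relu(-5.0))

# Same formula with a different slope, standing in for a learned P-ReLU alpha
print(leaky_relu(-5.0, negative_slope=0.2))
```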

**6.4 Exponential Linear Unit – ELU**

ELU is another activation function and it is known for being able to converge faster and produce better results. It replaces the negative part with a modified exponential function. There is an issue of added computational cost but at least we don’t have a dying ReLU problem.
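As a rough sketch, ELU keeps positive inputs unchanged and uses α·(eˣ − 1) for negative ones. The default `alpha=1.0` below is a common choice but still an assumption on my part.

```python
import math


def elu(x, alpha=1.0):
    """Identity for positive x, alpha * (e^x - 1) for the rest."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


# Negative inputs saturate towards -alpha instead of being cut to zero
for x in (-100.0, -1.0, 0.0, 2.0):
    print(x, elu(x))
```

The call to `math.exp` is where the added computational cost comes from.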

That’s pretty much all the ReLU variants you should be familiar with. Each of them has its own set of advantages and disadvantages, but one thing they have in common is their ability to prevent the dying ReLU problem.

**6.5 GELU – Gaussian Error Linear Unit**

I know, and yes, dying ReLU is fixed, but this is an activation function I wanted to talk about because it is used in SOTA models. GELU, or Gaussian Error Linear Unit, is an activation function that has become a popular choice in Transformer models. A recent use case is the CMT model, from a paper published in July 2021. The GELU function is a modification of ReLU, well, sort of: it multiplies the input by the standard normal CDF Φ, i.e. GELU(x) = x·Φ(x). For large positive values the output is practically the same as ReLU’s, for small positive values it is slightly smaller than the input, and for negative values it gives a small negative output instead of a hard zero.
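A minimal sketch of the exact form, GELU(x) = x·Φ(x), where Φ is the standard normal CDF. It uses `math.erf` from the standard library; deep learning libraries often ship faster approximations, which are not shown here.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0, 10.0):
    print(x, gelu(x))
```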

**7. Why Activation on CNN?**

In CNN-based networks, you must have often seen ReLU being used over feature maps generated by CNN Blocks but why do we use that? I mean in normal neural nets it is to introduce non-linearity but what about CNN? Well, let’s find out.

CNN aims to find various features in an image and even though the process may seem complex the operations in CNN are linear in nature. Thus the reason activation is used in CNN is the same as that of a normal layer i.e. to introduce non-linearity to make CNN capture better features that can generalize to a particular class.
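The linearity claim can be checked on a toy 1-D example. The sketch below implements the sliding dot product used by convolution layers (without padding or stride) and verifies that convolving a sum of two signals equals the sum of their convolutions. The signals and the kernel are made-up integers, and `conv1d` is a hypothetical helper.

```python
def conv1d(signal, kernel):
    """Sliding dot product of a kernel over a signal, no padding."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


a = [1, 2, 3, 4, 5]        # hypothetical signal
b = [0, -1, 2, -3, 4]      # another hypothetical signal
kernel = [1, 0, -1]        # hypothetical filter

summed_inputs = [p + q for p, q in zip(a, b)]
conv_of_sum = conv1d(summed_inputs, kernel)
sum_of_convs = [p + q for p, q in zip(conv1d(a, kernel), conv1d(b, kernel))]

print(conv_of_sum, sum_of_convs)
```

Since the two lists match, stacking such layers without an activation would again collapse into one linear operation.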

**8. Activation over Output Layer – Is it necessary?**

We’ve learned about the reason we use activation functions over hidden layers but their use is not just limited to hidden layers. We can use the activation function over output layers too. But what is the point? Let’s take a look at the common problems:-

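- **Regression Problem:** The output is typically left as it is, i.e. no squishing activation is applied, since the target is a real value rather than a probability.
- **Binary Classification Problem:** You need to pass the output values to a sigmoid activation to convert them into a probability, which is then classified as 0 or 1 depending on the threshold.
- **Multi-Class Classification Problem:** The outputs are typically passed through Softmax to convert them into one probability per class, and the class with the highest probability is picked.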
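Here is a small sketch of the two classification cases in plain Python. The raw outputs and the 0.5 threshold are hypothetical values, and the softmax subtracts the maximum score before exponentiating, a common trick to avoid overflow.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Binary classification: sigmoid plus a threshold
raw_binary_output = 1.2                      # hypothetical network output
probability = sigmoid(raw_binary_output)
label = 1 if probability >= 0.5 else 0

# Multi-class classification: softmax plus argmax
raw_class_scores = [2.0, 1.0, 0.1]           # hypothetical network outputs
class_probs = softmax(raw_class_scores)
predicted_class = class_probs.index(max(class_probs))

print(probability, label)
print(class_probs, predicted_class)
```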

**9. Custom Activation Function Layer in PyTorch**

Activation functions are just normal mathematical functions so you can create a python function to define the equation and use it in the feedforward. However, you can also create an activation layer in PyTorch using nn.Module. If you are familiar with PyTorch then this will feel very much like defining a feed-forward, probably because it is. Let’s go ahead and create our own activation class for Parametric ReLU.

```python
# Code Block 1:
import torch
from torch import nn

class ParaReLU(nn.Module):
    def __init__(self, alpha=0.01):
        super().__init__()
        # alpha parameter for P-ReLU, stored as a learnable parameter
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))

    def forward(self, x):
        # Element-wise: keep non-negative values, scale negative ones by alpha
        return torch.where(x >= 0, x, x * self.alpha)
```

There you go: you basically define the return value of the activation function in the forward function and you are ready to use this as a layer. Let’s go ahead and plot this to see if it’s actually correct.

```python
# Code Block 2:
import torch
from torch import nn
import matplotlib.pyplot as plt

# Activation Layer
class ParaReLU(nn.Module):
    def __init__(self, alpha=0.01):
        super().__init__()
        # alpha parameter for P-ReLU, stored as a learnable parameter
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))

    def forward(self, x):
        # Element-wise: keep non-negative values, scale negative ones by alpha
        return torch.where(x >= 0, x, x * self.alpha)

# Layer Instance
prelu = ParaReLU(alpha=0.2)

# Creating Data Samples to plot (float tensor so the layer can process it)
sample = torch.arange(-5, 6, dtype=torch.float32)

# Making Graph Look Nice
plt.figure(figsize=(8, 5))
ax = plt.gca()
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['right'].set_color('none')

# Plotting The Samples and their corresponding P-ReLU value
# detach() drops the gradient tracking that the learnable alpha adds
plt.plot(sample.numpy(), prelu(sample).detach().numpy())
plt.show()
```

**10. Conclusion**

Wow, that was a lot of information to process but I hope it was fun and interesting for you to read. *Activation Function is a concept that almost everyone knows but the impact it creates is something very few like to ponder over. The reason they are used, the effect each has on the results, etc. such questions may seem simple but the dynamics behind them could be much more complex, hence it becomes important to understand them and I hope you were able to understand it.*

If you liked this Blog, leave your thoughts and feedback in the comments section, See you again in the next interesting read!

Happy Learning!

Until Next Time, Take care!

– By Herumb.
