Rectified Linear Unit, Sigmoid and Tanh are three activation functions that play an important role in how neural networks work. In fact, if we do not use these functions, and instead use *no* function, our model will be unable to learn from nonlinear data.

This article zooms into ReLU, Sigmoid and Tanh specifically tailored to the PyTorch ecosystem. With simple explanations and code examples you will understand how they can be used within PyTorch and its variants. In short, after reading this tutorial, you will...

- Understand what activation functions are and why they are required.
- Know the shape, benefits and drawbacks of ReLU, Sigmoid and Tanh.
- Have implemented ReLU, Sigmoid and Tanh with PyTorch, PyTorch Lightning and PyTorch Ignite.

All right, let's get to work! 🔥

Neural networks have boosted the field of machine learning in the past few years. However, they do not work well with nonlinear data natively - we need an activation function for that. Activation functions take any number as input and map inputs to outputs. As any function can be used as an activation function, we can also use nonlinear functions for that goal.

As results have shown, using nonlinear functions for that purpose ensure that the neural network as a whole can learn from nonlinear datasets such as images.

The **Rectified Linear Unit (ReLU), Sigmoid and Tanh** **activation functions** are the most widely used activation functions these days. From these three, ReLU is used most widely. All functions have their benefits and their drawbacks. Still, ReLU has mostly stood the test of time, and generalizes really well across a wide range of deep learning problems.

In this tutorial, we will cover these activation functions in more detail. Please make sure to read the rest of it if you want to understand them better. Do the same if you're interested in better understanding the implementations in PyTorch, Ignite and Lightning. Next, we'll show code examples that help you get started immediately.

In classic PyTorch and PyTorch Ignite, you can choose from one of two options:

- Add the activation functions
`nn.Sigmoid()`

,`nn.Tanh()`

or`nn.ReLU()`

to the neural network itself e.g. in`nn.Sequential`

. - Add the
*functional equivalents*of these activation functions to the forward pass.

The first is easier, the second gives you more freedom. Choose what works best for you!

```
import torch.nn.functional as F
# (1). Add to __init__ if using nn.Sequential
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(28 * 28, 256),
nn.Sigmoid(),
nn.Linear(256, 128),
nn.Tanh(),
nn.Linear(128, 10),
nn.ReLU()
)
# (2). Add functional equivalents to forward()
def forward(self, x):
x = F.sigmoid(self.lin1(x))
x = F.tanh(self.lin2(x))
x = F.relu(self.lin3(x))
return x
```

With Ignite, you can now proceed and finalize the model by adding Ignite specific code.

In Lightning, too, you can choose from one of the two options:

- Add the activation functions to the neural network itself.
- Add the functional equivalents to the forward pass.

```
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl
# (1) IF USED SEQUENTIALLY
class SampleModel(pl.LightningModule):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(28 * 28, 256),
nn.Sigmoid(),
nn.Linear(256, 128),
nn.Tanh(),
nn.Linear(128, 56),
nn.ReLU(),
nn.Linear(56, 10)
)
self.ce = nn.CrossEntropyLoss()
def forward(self, x):
return self.layers(x)
# (2) IF STACKED INDEPENDENTLY
class SampleModel(pl.LightningModule):
def __init__(self):
super().__init__()
self.lin1 = nn.Linear(28 * 28, 256)
self.lin2 = nn.Linear(256, 128)
self.lin3 = nn.Linear(128, 56)
self.lin4 = nn.Linear(56, 10)
self.ce = nn.CrossEntropyLoss()
def forward(self, x):
x = F.sigmoid(self.lin1(x))
x = F.tanh(self.lin2(x))
x = F.relu(self.lin3(x))
x = self.lin4(x)
return x
```

Neural networks are composed of *layers* of *neurons*. They represent a system that together learns to capture patterns hidden in a dataset. Each individual neuron here processes data in the form `Wx + b`

. Here, `x`

represents the input vector - which can either be the input data (in the first layer) or any subsequent and partially processed data (in the downstream layers). `b`

is the bias and `W`

the weights vector, and they represent the trainable components of a neural network.

Performing `Wx + b`

equals making a linear operation. In other words, the mapping from an input value to an output value is always linear. While this works perfectly if you need a model to generate a linear decision boundary, it becomes problematic when you don't. In fact, when you need to learn a decision boundary that is *not* linear (and there are many such use cases, e.g. in computer vision), you can't if only performing the operation specified before.

Activation functions come to the rescue in this case. Stacked directly after the neurons, they take the neuron output values and map this linear input to a nonlinear output. By consequence, each neuron, and the system as a whole, becomes capable of learning nonlinear patterns. The exact flow of data flowing through one neuron is visualized below and can be represented by these three steps:

**Input data flows through the neuron, performing the operation**`Wx + b`

.**The output of the neuron flows through an activation function, such as ReLU, Sigmoid and Tanh.****What the activation function outputs is either passed to the next layer or returned as model output.**

There are many activation functions. In fact, any activation function can be used - even \(f(x) = x\), the linear or identity function. While you don't gain anything compared to using no activation function with that function, it shows that pretty much anything is possible when it comes to activation functions.

The key consideration that you have to make when creating and using an activation function is the function's computational efficiency. For example, if you would design an activation function that trumps any such function in performance, it doesn't really matter if it is *really* slow to compute. In those cases, it's more likely that you can gain similar results in the same time span, but then with more iterations and fewer resources.

That's why today, three key activation functions are most widely used in neural networks:

**Rectified Linear Unit (ReLU)****Sigmoid****Tanh**

Click the link above to understand these in more detail. We'll now take a look at each of them briefly.

The **Tanh** and **Sigmoid** activation functions are the oldest ones in terms of neural network prominence. In the plot below, you can see that Tanh converts all inputs into the `(-1.0, 1.0)`

range, with the greatest slope around `x = 0`

. Sigmoid instead converts all inputs to the `(0.0, 1.0`

) range, also with the greatest slope around `x = 0`

. **ReLU** is different. This function maps all inputs to `0.0`

if `x <= 0.0`

. In all other cases, the input is mapped to `x`

.

While being very prominent, all of these functions come with drawbacks. These are the benefits and drawbacks for ReLU, Sigmoid and Tanh:

- Sigmoid and Tanh suffer greatly from the vanishing gradients problem. This problem occurs because the derivatives of both functions have a peak value at
`x < 1.0`

. Neural networks use the chain rule to compute errors backwards through layers. This chain rule effectively*chains*and thus*multiplies*gradients. You can imagine what happens when, where`g`

is some gradient for a layer, you perform`g * g * g * ...`

. The result for the most upstream layers is then very small. In other words, larger networks struggle or even fail learning when Sigmoid or Tanh is used. - In addition, with respect to Sigmoid, the middle point in terms of the
`y`

value does not lie around`x = 0`

. This makes the process somewhat unstable. On the other hand, Sigmoid is a good choice for binary classification problems. Use at your own caution. - Finally with respect to these two, the functions are more complex than that of ReLU, which essentially boils down to
`[max(x, 0)](https://www.machinecurve.com/index.php/question/why-does-relu-equal-max0-x/)`

. Computing them is thus slower than when using ReLU. -
While it seems to be the case that ReLU trumps all activation functions - and it surely generalizes to many problems and is really useful, partially due to its computational effectiveness - it has its own unique set of drawbacks. It's not smooth and therefore not fully differentiable, neural networks can start to explode because there is no upper limit on the output, and using ReLU also means opening up yourself to the dying ReLU problem. Many activation functions attempting to resolve these problems have emerged, such as Swish, PReLU and Leaky ReLU - and there are many more. But for some reason, they haven't been able to dethrone ReLU yet, and it is still widely used.

Now that we understand how ReLU, Sigmoid and Tanh work, we can take a look at how we can implement with PyTorch. In this tutorial, you'll learn to implement these activation functions with three flavors of PyTorch:

**Classic PyTorch.**This is where it all started and it is PyTorch as we know it.**PyTorch Ignite.**Ignite is a PyTorch-supported approach to streamline your models in a better way.**PyTorch Lightning**. The same is true for Lightning, which focuses on model organization and automation even more.

Let's start with classic PyTorch.

In classic PyTorch, the suggested way to create a neural network is using a class that utilizes `nn.Module`

, the neural networks module provided by PyTorch.

```
class Model(nn.Module):
def __init__(self):
super(Model, self).__init__()
self.lin1 = nn.Linear(28 * 28, 256),
self.lin2 = nn.Linear(256, 128)
self.lin3 = nn.Linear(128, 10)
```

You can also choose to already stack the layers on top of each other, like this, using `nn.Sequential`

:

```
import torch.nn.functional as F
class Model(nn.Module):
def __init__(self):
super(Model, self).__init__()
self.layers = nn.Sequential(
nn.Linear(28 * 28, 256),
nn.Linear(256, 128),
nn.Linear(128, 10)
)
```

As you can see, this way of working resembles that of the `tensorflow.keras.Sequential`

API, where you add layers on top of each other using `model.add`

.

In a `nn.Module`

, you can then add a `forward`

definition for the forward pass. The implementation differs based on the choice of building your neural network from above:

```
# If stacked on top of each other
def forward(self, x):
return self.layers(x)
# If stacked independently
def forward(self, x):
x = self.lin1(x)
x = self.lin2(x)
return self.lin3(x)
```

Adding **Sigmoid, Tanh or ReLU** to a classic PyTorch neural network is really easy - but it is also dependent on the way that you have constructed your neural network above. When you are using `Sequential`

to stack the layers, whether that is in `__init__`

or elsewhere in your network, it's best to use `nn.Sigmoid()`

, `nn.Tanh()`

and `nn.ReLU()`

. An example can be seen below.

If instead you are specifying the layer composition in `forward`

- similar to the Keras Functional API - then you must use `torch.nn.functional`

, which we import as `F`

. You can then wrap the layers with the activation function of your choice, whether that is `F.sigmoid()`

, `F.tanh()`

or `F.relu()`

. Quite easy, isn't it? :D

```
import torch.nn.functional as F
# Add to __init__ if using nn.Sequential
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(28 * 28, 256),
nn.Sigmoid(),
nn.Linear(256, 128),
nn.Tanh(),
nn.Linear(128, 10),
nn.ReLU()
)
# Add functional equivalents to forward()
def forward(self, x):
x = F.sigmoid(self.lin1(x))
x = F.tanh(self.lin2(x))
x = F.relu(self.lin3(x))
return x
```

You can use the classic PyTorch approach from above for adding Tanh, Sigmoid or ReLU to PyTorch Ignite. Model creation in Ignite works in a similar way - and you can then proceed adding all Ignite specific functionalities.

In Lightning, you can pretty much repeat the classic PyTorch approach - i.e. use `nn.Sequential`

and specify calling the whole system in the `forward()`

definition, or create the forward pass yourself. The first is more restrictive but easy, whereas the second gives you more freedom for creating exotic models at the cost of increasing difficulty.

Here's an example of using ReLU, Sigmoid and Tanh when you stack all layers independently and configure data flow yourself in `forward`

:

```
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl
class SampleModel(pl.LightningModule):
# IF STACKED INDEPENDENTLY
def __init__(self):
super().__init__()
self.lin1 = nn.Linear(28 * 28, 256)
self.lin2 = nn.Linear(256, 128)
self.lin3 = nn.Linear(128, 56)
self.lin4 = nn.Linear(56, 10)
self.ce = nn.CrossEntropyLoss()
def forward(self, x):
x = F.sigmoid(self.lin1(x))
x = F.tanh(self.lin2(x))
x = F.relu(self.lin3(x))
x = self.lin4(x)
return x
def training_step(self, batch, batch_idx):
x, y = batch
x = x.view(x.size(0), -1)
y_hat = self.layers(x)
loss = self.ce(y_hat, y)
self.log('train_loss', loss)
return loss
def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
return optimizer
```

Do note that the functional equivalents of Tanh and Sigmoid are deprecated and may be removed in the future:

```
UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
```

The solution would be as follows. You can also choose to use `nn.Sequential`

and add the activation functions to the model itself:

```
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl
class SampleModel(pl.LightningModule):
# IF USED SEQUENTIALLY
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(28 * 28, 256),
nn.Sigmoid(),
nn.Linear(256, 128),
nn.Tanh(),
nn.Linear(128, 56),
nn.ReLU(),
nn.Linear(56, 10)
)
self.ce = nn.CrossEntropyLoss()
def forward(self, x):
return self.layers(x)
def training_step(self, batch, batch_idx):
x, y = batch
x = x.view(x.size(0), -1)
y_hat = self.layers(x)
loss = self.ce(y_hat, y)
self.log('train_loss', loss)
return loss
def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
return optimizer
```

That's it, folks! As you can see, adding ReLU, Tanh or Sigmoid to any PyTorch, Ignite or Lightning model is *a piece of cake*. 🍰

If you have any comments, questions or remarks - please feel free to leave a comment in the comments section 💬 I'd love to hear from you. You can also leave your question here.

Thanks for reading MachineCurve today and happy engineering! 😎

PyTorch Ignite. (n.d.). *Ignite your networks! — ignite master documentation*. PyTorch. https://pytorch.org/ignite/

PyTorch Lightning. (2021, January 12). https://www.pytorchlightning.ai/

PyTorch. (n.d.). https://pytorch.org

PyTorch. (n.d.). *ReLU — PyTorch 1.7.0 documentation*. https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU

PyTorch. (n.d.). *Sigmoid — PyTorch 1.7.0 documentation*. https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html#torch.nn.Sigmoid

PyTorch. (n.d.). *Tanh — PyTorch 1.7.0 documentation*. https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html#torch.nn.Tanh

PyTorch. (n.d.). *Torch.nn.functional — PyTorch 1.7.0 documentation*. https://pytorch.org/docs/stable/nn.functional.html

Learn how large language models are working and how you can train open source ones yourself.

Keras is a high-level API for TensorFlow. It is one of the most popular deep learning frameworks.

Read about the fundamentals of machine learning, deep learning and artificial intelligence.

To get in touch with me, please connect with me on LinkedIn. Make sure to write me a message saying hi!

The content on this website is written for educational purposes. In writing the articles, I have attempted to be as correct and precise as possible. Should you find any errors, please let me know by creating an issue or pull request in this GitHub repository.

All text on this website written by me is copyrighted and may not be used without prior permission. Creating citations using content from this website is allowed if a reference is added, including an URL reference to the referenced article.

If you have any questions or remarks, feel free to get in touch.

TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

PyTorch, the PyTorch logo and any related marks are trademarks of The Linux Foundation.

Montserrat and Source Sans are fonts licensed under the SIL Open Font License version 1.1.

Mathjax is licensed under the Apache License, Version 2.0.