Training a neural network means striking a balance between optimization and over-optimization. Over-optimized models work really well on your training set, but because of their complexity - they treat the oddities within a training dataset as part of the mapping that is to be performed - they can fail really hard when the model is used in production.
Regularization techniques can be used to mitigate these issues. In this article, we're going to take a look at L1, L2 and Elastic Net Regularization. We only touch on the theory briefly, because this article focuses on their implementation in PyTorch. After reading it, you will understand what regularization is and why it helps, know how L1, L2 and Elastic Net (L1 + L2) Regularization work, and be able to implement each of them with PyTorch.
Ready? Let's take a look! 😎
Training a neural network involves creating a mapping between an array of input variables \(\textbf{x}\) and a dependent variable, often called \(\text{y}\). Recall that a mapping between such variables can be expressed mathematically, and that a mapping is represented by a function - say, \(f\). In this case, the mapping of the actual function is \(\text{y} = f(\textbf{x})\).
How the mapping is performed depends on the way that you create it, or fit it. For example, in the image below, we generated two such mappings using exactly the same input data - the set of points. The first is a polynomial fit with three degrees of freedom, producing the yellow line. The second has ten degrees of freedom, producing the blue line.
Which mapping is more realistic, you say? Yellow or blue?
If you said yellow, you're right. The extremities visible in the blue mapping are very unlikely to be true, and typically occur because the model is excessively sensitive to oddities in your data set.
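As an aside, curves like these can be generated with a simple polynomial fit. The sketch below is a minimal, hypothetical reconstruction using NumPy - the synthetic data is an assumption for illustration, not the exact data behind the image:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical noisy data, for illustration only
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 20)
y = 0.5 * x + np.sin(x) + rng.normal(scale=0.5, size=x.shape)

# Fit polynomials with three and ten degrees of freedom
fit_simple = np.poly1d(np.polyfit(x, y, deg=3))    # the "yellow" mapping
fit_complex = np.poly1d(np.polyfit(x, y, deg=10))  # the "blue" mapping

xs = np.linspace(0, 10, 200)
plt.scatter(x, y, label='data')
plt.plot(xs, fit_simple(xs), label='3 degrees of freedom')
plt.plot(xs, fit_complex(xs), label='10 degrees of freedom')
plt.legend()
plt.show()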
Training a neural network involves using your input data (the set of \(\textbf{x}\)s) to generate predictions for each sample (the corresponding set of \(\text{y}\)s). The network has trainable components that jointly attempt to approximate the mapping \(\text{y} = f(\textbf{x})\). The approximation is then written \(\hat{\text{y}} = f(\textbf{x})\), pronounced y hat.
When feeding forward our samples and optimizing our model, we do not know whether the model will learn a mapping like the one in yellow or the one in blue. Rather, it will learn whatever mapping minimizes the loss value. This can lead to a situation where a mapping like the one in blue is learned, even though such extremities are unwanted.
Adding regularization to your neural network, and specifically to the computed loss values, can help you guide the model towards learning a mapping that looks more like the yellow one. After the loss (i.e., the model error) is computed for a forward pass, regularization adds another value to it - a value that is higher when the model is more complex and lower when it is less complex. In other words, the model is punished for complexity. This leads to a trained model that is as good as possible while at the same time being as simple as possible.
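Conceptually, that looks as follows. This is only a sketch: the tiny model, the random data and the regularization weight are illustrative assumptions, and the complexity measure is a placeholder for the concrete regularizers discussed below.

import torch
from torch import nn

# Hypothetical tiny setup, just to show where the penalty enters the loss
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))

# Placeholder complexity measure: here, the sum of absolute weight values
def complexity_penalty(m):
    return sum(p.abs().sum() for p in m.parameters())

reg_weight = 0.01  # assumed value; controls how strongly complexity is punished

optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets) + reg_weight * complexity_penalty(model)  # error + penalty
loss.backward()
optimizer.step()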
Beyond Dropout, which is another mechanism for regularization, there are three main candidates that are used frequently: L1 Regularization, L2 Regularization, and Elastic Net (L1 + L2) Regularization.
Now that we understand what regularization is and which key regularizers there are, we'll take a closer look at each of them - including examples for implementing them with PyTorch.
Let's get to work! 😎
Suppose that you are using binary crossentropy loss with your PyTorch based classifier. You want to implement L1 Regularization, which effectively means that \(\sum_{i=1}^{n} | w_i |\) is added to the loss.
Here, \(n\) represents the number of individual weights, and you can see that we iterate over these weights. We then take the absolute value for each value \(w_i\) and sum everything together.
In other words, L1 Regularization loss can be implemented as follows:
\(\text{full\_loss} = \text{original\_loss} + \sum_{i=1}^{n} | w_i |\)
Here, original_loss is binary crossentropy, but it can be pretty much any loss function that you desire!
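For example, with binary crossentropy as the original loss, the combination could look like the minimal sketch below. The model, data and regularization weight are illustrative assumptions, not a recommended configuration:

import torch
from torch import nn

# Hypothetical binary classifier and data, for illustration only
model = nn.Linear(10, 1)
criterion = nn.BCEWithLogitsLoss()  # binary crossentropy on raw logits

inputs = torch.randn(16, 10)
targets = torch.randint(0, 2, (16, 1)).float()

l1_weight = 0.001  # assumed value

original_loss = criterion(model(inputs), targets)
l1_penalty = l1_weight * sum(p.abs().sum() for p in model.parameters())
full_loss = original_loss + l1_penalty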
Implementing L1 Regularization with PyTorch can be done in the following way.
- We define an MLP class that extends PyTorch's nn.Module class. In other words, it's a neural network using PyTorch.
- Inside this class, we add a def called compute_l1_loss. This is an implementation of taking the absolute value and summing all values for w in a particular trainable parameter.
- In the training loop, we flatten and concatenate all trainable parameters, compute the L1 loss over them, and add it to the original loss before the backward pass.

Here is the full example for L1 Regularization with PyTorch:
import os
import torch
from torch import nn
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms


class MLP(nn.Module):
    '''
      Multilayer Perceptron.
    '''
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28 * 1, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 10)
        )

    def forward(self, x):
        '''Forward pass'''
        return self.layers(x)

    def compute_l1_loss(self, w):
        return torch.abs(w).sum()


if __name__ == '__main__':

    # Set fixed random number seed
    torch.manual_seed(42)

    # Prepare MNIST dataset
    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1)

    # Initialize the MLP
    mlp = MLP()

    # Define the loss function and optimizer
    loss_function = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)

    # Run the training loop
    for epoch in range(0, 5):  # 5 epochs at maximum

        # Print epoch
        print(f'Starting epoch {epoch+1}')

        # Iterate over the DataLoader for training data
        for i, data in enumerate(trainloader, 0):

            # Get inputs
            inputs, targets = data

            # Zero the gradients
            optimizer.zero_grad()

            # Perform forward pass
            outputs = mlp(inputs)

            # Compute loss
            loss = loss_function(outputs, targets)

            # Compute L1 loss component
            l1_weight = 1.0
            l1_parameters = []
            for parameter in mlp.parameters():
                l1_parameters.append(parameter.view(-1))
            l1 = l1_weight * mlp.compute_l1_loss(torch.cat(l1_parameters))

            # Add L1 loss component
            loss += l1

            # Perform backward pass
            loss.backward()

            # Perform optimization
            optimizer.step()

            # Print statistics
            minibatch_loss = loss.item()
            if i % 500 == 499:
                print('Loss after mini-batch %5d: %.5f (of which %.5f L1 loss)' %
                      (i + 1, minibatch_loss, l1))

    # Process is complete.
    print('Training process has finished.')
Implementing L2 Regularization with PyTorch is also easy. In this case, we don't take the absolute value of the weight values, but rather their squares. In other words, we add \(\sum_{i=1}^{n} w_i^2\) to the loss component. In the example below, you can find how L2 Regularization can be used with PyTorch:
import os
import torch
from torch import nn
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms


class MLP(nn.Module):
    '''
      Multilayer Perceptron.
    '''
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28 * 1, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 10)
        )

    def forward(self, x):
        '''Forward pass'''
        return self.layers(x)

    def compute_l2_loss(self, w):
        return torch.square(w).sum()


if __name__ == '__main__':

    # Set fixed random number seed
    torch.manual_seed(42)

    # Prepare MNIST dataset
    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1)

    # Initialize the MLP
    mlp = MLP()

    # Define the loss function and optimizer
    loss_function = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)

    # Run the training loop
    for epoch in range(0, 5):  # 5 epochs at maximum

        # Print epoch
        print(f'Starting epoch {epoch+1}')

        # Iterate over the DataLoader for training data
        for i, data in enumerate(trainloader, 0):

            # Get inputs
            inputs, targets = data

            # Zero the gradients
            optimizer.zero_grad()

            # Perform forward pass
            outputs = mlp(inputs)

            # Compute loss
            loss = loss_function(outputs, targets)

            # Compute L2 loss component
            l2_weight = 1.0
            l2_parameters = []
            for parameter in mlp.parameters():
                l2_parameters.append(parameter.view(-1))
            l2 = l2_weight * mlp.compute_l2_loss(torch.cat(l2_parameters))

            # Add L2 loss component
            loss += l2

            # Perform backward pass
            loss.backward()

            # Perform optimization
            optimizer.step()

            # Print statistics
            minibatch_loss = loss.item()
            if i % 500 == 499:
                print('Loss after mini-batch %5d: %.5f (of which %.5f L2 loss)' %
                      (i + 1, minibatch_loss, l2))

    # Process is complete.
    print('Training process has finished.')
L2 based weight decay can also be implemented by setting a value for the weight_decay parameter of the optimizer. In that case, the optimizer applies the L2 penalty itself during the parameter update, so you don't have to add a term to the loss manually.
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
PyTorch (n.d.)
For example:
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4, weight_decay=1.0)
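A minimal sketch of that approach, reusing mlp, trainloader and loss_function from the examples above; the weight_decay value of 1e-5 is an illustrative assumption, so tune it for your own problem:

# Rely on the optimizer's built-in L2 penalty instead of adding it to the loss manually
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4, weight_decay=1e-5)

for inputs, targets in trainloader:
    optimizer.zero_grad()
    loss = loss_function(mlp(inputs), targets)  # no explicit L2 term in the loss
    loss.backward()
    optimizer.step()  # the L2 penalty (weight decay) is applied here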
It is also possible to perform Elastic Net Regularization with PyTorch. This type of regularization essentially computes a weighted combination of L1 and L2 loss, with the weights of both summing to 1.0. In other words, we add \(\lambda_{L1} \times \sum_{i=1}^{n} | w_i | + \lambda_{L2} \times \sum_{i=1}^{n} w_i^2\) to the loss component:

\(\text{full\_loss} = \text{original\_loss} + \lambda_{L1} \times \sum_{i=1}^{n} | w_i | + \lambda_{L2} \times \sum_{i=1}^{n} w_i^2\)
In the example below, Elastic Net (L1 + L2) Regularization is implemented with PyTorch. Compared to the previous examples, the MLP class now contains two defs, which are used to compute the L1 and L2 loss, respectively:

import os
import torch
from torch import nn
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms


class MLP(nn.Module):
    '''
      Multilayer Perceptron.
    '''
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28 * 1, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 10)
        )

    def forward(self, x):
        '''Forward pass'''
        return self.layers(x)

    def compute_l1_loss(self, w):
        return torch.abs(w).sum()

    def compute_l2_loss(self, w):
        return torch.square(w).sum()


if __name__ == '__main__':

    # Set fixed random number seed
    torch.manual_seed(42)

    # Prepare MNIST dataset
    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1)

    # Initialize the MLP
    mlp = MLP()

    # Define the loss function and optimizer
    loss_function = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)

    # Run the training loop
    for epoch in range(0, 5):  # 5 epochs at maximum

        # Print epoch
        print(f'Starting epoch {epoch+1}')

        # Iterate over the DataLoader for training data
        for i, data in enumerate(trainloader, 0):

            # Get inputs
            inputs, targets = data

            # Zero the gradients
            optimizer.zero_grad()

            # Perform forward pass
            outputs = mlp(inputs)

            # Compute loss
            loss = loss_function(outputs, targets)

            # Specify L1 and L2 weights
            l1_weight = 0.3
            l2_weight = 0.7

            # Compute L1 and L2 loss components
            parameters = []
            for parameter in mlp.parameters():
                parameters.append(parameter.view(-1))
            l1 = l1_weight * mlp.compute_l1_loss(torch.cat(parameters))
            l2 = l2_weight * mlp.compute_l2_loss(torch.cat(parameters))

            # Add L1 and L2 loss components
            loss += l1
            loss += l2

            # Perform backward pass
            loss.backward()

            # Perform optimization
            optimizer.step()

            # Print statistics
            minibatch_loss = loss.item()
            if i % 500 == 499:
                print('Loss after mini-batch %5d: %.5f (of which %.5f L1 loss; %.5f L2 loss)' %
                      (i + 1, minibatch_loss, l1, l2))

    # Process is complete.
    print('Training process has finished.')
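If you use these penalties in several projects, it can be convenient to factor them into a small helper. The function below is a hypothetical refactoring of the inline computation above, not part of PyTorch:

import torch

def elastic_net_penalty(model, l1_weight=0.3, l2_weight=0.7):
    '''Weighted L1 + L2 penalty over all trainable parameters.
       Hypothetical helper, equivalent to the inline computation above.'''
    params = torch.cat([p.view(-1) for p in model.parameters()])
    return l1_weight * params.abs().sum() + l2_weight * params.square().sum()

# Usage inside the training loop, reusing mlp and loss_function from above:
# loss = loss_function(outputs, targets) + elastic_net_penalty(mlp)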
By reading this article, you have learned what regularization is, why it can help your neural network generalize better, and how L1, L2 and Elastic Net Regularization can be implemented with PyTorch.
I hope that this article was useful for you! :) If it was, please feel free to let me know through the comments section 💬 Please let me know as well if you have any questions or other remarks. Where necessary, I will make sure to adapt the article.
What remains is to thank you for reading MachineCurve today. Happy engineering! 😎
PyTorch. (n.d.). Adam — PyTorch 1.9.0 documentation. https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam
StackOverflow. (n.d.). L1 norm as regularizer in Pytorch. Stack Overflow. https://stackoverflow.com/questions/46797955/l1-norm-as-regularizer-in-pytorch