In the old days of deep learning, practitioners ran into many problems - vanishing gradients, exploding gradients, scarce compute resources, and so forth. In addition, not much was known about the theoretical behavior of neural networks, and as a consequence people frequently didn't know why their model worked.
While that is still the case for many models today, much has improved. This article takes a practical look at an older fix that remains useful even now: greedy layer-wise training of a PyTorch neural network. Firstly, we'll briefly explore greedy layer-wise training, so that you can get a feeling for what it involves. Then, we continue with a Python example - building and training a neural network greedily and layer-wise ourselves.
Are you ready? Let's take a look! 😎
In the early days of deep learning, compute resources were not abundantly available for training a deep learning model. In addition, deep learning practitioners suffered from the vanishing gradients problem and the exploding gradients problem.
This was an unfortunate combination when one wanted to train a model with increasing depth. What depth would be best? From what depth would we suffer from vanishing and/or exploding gradients? And how can we try to find out without wasting a lot of resources?
Greedy layer-wise training of a neural network is one of the answers that was proposed for solving this problem. By adding a hidden layer every time the model has finished training, it becomes possible to find what depth is adequate given your training set.
It works quite simply. You start with a simple neural network - an input layer, a hidden layer, and an output layer. You train it for a fixed number of epochs - say, 25. Then, after training, you freeze all the layers except for the last one, and you cut that last layer off the network. At the tail of your cut-off network, you now add a new layer - for example, a densely-connected one. You then re-add the trained final layer, and you end up with a network that is one layer deeper. In addition, because all layers except for the last two are frozen, the progress made so far helps you train those final two layers better.
The idea behind this strategy is to find an optimum number of layers for training your neural network.
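To make this freeze-cut-add-re-add cycle concrete, here is a minimal, self-contained sketch. The add_hidden_layer helper and the toy four-layer model below are purely illustrative; the article's actual implementation follows later.

from torch import nn

def add_hidden_layer(model, hidden_dim=256):
    """ Freeze the trained layers, insert a fresh hidden layer, and re-attach the output layer. """
    layers = list(model.children())
    output_layer = layers.pop()                       # cut the trained output layer off the network
    for layer in layers:                              # freeze everything that remains
        for param in layer.parameters():
            param.requires_grad = False
    layers.append(nn.Linear(hidden_dim, hidden_dim))  # brand-new hidden layer at the tail
    layers.append(output_layer)                       # re-add the trained output layer
    return nn.Sequential(*layers)

# Toy usage: a base MLP that grows by one hidden layer.
model = nn.Sequential(nn.Flatten(), nn.Linear(3072, 256), nn.ReLU(), nn.Linear(256, 10))
model = add_hidden_layer(model)
print(model)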
Let's now take a look at how you can implement greedy layer-wise training with PyTorch. Even though the strategy is quite old (as of 2022, it was proposed 15 years ago!), there are cases where it can still be really useful today.
Implementing greedy layer-wise training with PyTorch involves multiple steps:

- Defining your nn.Module structure; in other words, your PyTorch model.
- Specifying the global and model configuration.
- Loading the CIFAR-10 dataset through a DataLoader.
- Writing functionality that adds a layer to an existing model, freezing the layers that were already trained.
- Writing the training loop and wrapping everything together.

Let's begin writing some code. Open up a Python-supporting IDE, create a file - say, greedy.py - or a Jupyter Notebook, and add the following imports:
import os
import torch
from torch import nn
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
from torchvision import transforms
from collections import OrderedDict
from accelerate import Accelerator
You will use the following dependencies:

- os, which is a Python dependency for Operating System calls. For this reason, you'll need to make sure that you have a recent version of Python installed, too.
- The torch package. Besides the package itself, you will also import the CIFAR10 dataset (which you will train today's model with) and the DataLoader, which is used for loading the training data.
- From torchvision, a sub-package that must be installed jointly with PyTorch, you will import transforms, which is used for transforming the input data into Tensor format and allows you to perform additional transformations out of the box.
- From collections, you import an ordered dictionary - OrderedDict. You will see that it plays a big role in structuring the layers of your neural network. It is a default Python API, so if you have installed Python, nothing else needs to be installed.
- Accelerator, which comes from the HuggingFace Accelerate package. It can be used to relieve you from all the .to(cuda) calls, moving your data and your model to your CUDA device if available. It handles everything out of the box! Check the HuggingFace Accelerate documentation if you want to understand it in more detail.

Samples from the CIFAR-10 dataset, which is what you will use for training today's model.
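As a small standalone illustration (not part of today's model) of why that ordered dictionary matters: when you build an nn.Sequential from an OrderedDict, the string keys name the layers and fix their order, which is exactly what you'll rely on when inserting layers later.

import torch
from torch import nn
from collections import OrderedDict

# The string keys become the layer names and determine the order inside nn.Sequential.
block = nn.Sequential(OrderedDict([
    ("0", nn.Linear(8, 4)),
    ("1", nn.ReLU()),
    ("2", nn.Linear(4, 2)),
]))
print(block)                      # lists layers (0), (1) and (2) in order
print(block(torch.randn(1, 8)))   # quick forward pass with dummy data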
Now that you know what you will use, it's time to actually define your neural network. Here's the full code, which you'll learn more about after the code segment:
class LayerConfigurableMLP(nn.Module):
    '''
    Layer-wise configurable Multilayer Perceptron.
    '''
    def __init__(self, added_layers = 0):
        super().__init__()

        # Retrieve model configuration
        config = get_model_configuration()
        shape = config.get("width") * config.get("height") * config.get("channels")
        layer_dim = config.get("layer_dim")
        num_classes = config.get("num_classes")

        # Create layer structure
        layers = [
            (str(0), nn.Flatten()),
            (str(1), nn.Linear(shape, layer_dim)),
            (str(2), nn.ReLU())
        ]

        # Create output layers
        layers.append((str(3), nn.Linear(layer_dim, num_classes)))

        # Initialize the Sequential structure
        self.layers = nn.Sequential(OrderedDict(layers))

    def forward(self, x):
        '''Forward pass'''
        return self.layers(x)

    def set_structure(self, layers):
        self.layers = nn.Sequential(OrderedDict(layers))
Let's break this class apart by its definitions - __init__, forward and set_structure.

- First, the __init__ definition. In ours, which is the constructor for the nn.Module (the base PyTorch class for a neural network), the constructor does the following:
  - It retrieves the model configuration, from which it computes the input shape (width times height times channels) and reads layer_dim, which is the dimensionality of each hidden layer - including the layers that we will add later, during greedy layer-wise training. num_classes represents the number of output classes; in the case of the CIFAR-10 dataset, that's ten classes.
  - The layer structure starts with a Flatten layer, which flattens each three-dimensional input Tensor (width, height, channels) into a one-dimensional Tensor (hence the multiplication). This is bad practice in neural networks, because we have convolutional layers for learning features from such image-like data, but for today's model we simply flatten it - because it's about the greedy layer-wise training rather than convolutions.
  - After the Flatten layer, you will add a Linear layer. This layer has shape inputs and produces layer_dim outputs. It is followed by a ReLU activation function for nonlinearity.
  - The output layer converts the layer_dim input dimensionality into num_classes outputs - after which a Softmax activation can be applied by the loss function.
  - Finally, the nn.Sequential structure is built up from an OrderedDict, created from the layers list. Normally, using such dictionaries is not necessary, but to preserve order when adding layers later, we do need it now.
- Then, the forward definition - which represents the forward pass, to speak in deep learning language. It simply passes the input Tensor x through the layers and returns the result.
- Finally, set_structure - which you don't see in neural networks often. It simply takes a new layers structure, creates an OrderedDict from it, and replaces the layers with the new structure. You will see later how this is used!

First, however, let's create a definition with global settings.
def get_global_configuration():
    """ Retrieve configuration of the training process. """
    global_config = {
        "num_layers_to_add": 10,
    }
    return global_config
It's pretty simple - the global configuration specifies the number of layers that must be added. For your model, this means that a base model will be trained at first, after which another layer will be added and training will be continued; another; another, and so forth, until 10 such iterations have been performed.
The model configuration is a bit more complex - it specifies all the settings that are necessary for successfully training your model. In addition, these settings are model specific rather than specific to the training process.
For example, through the width, height and channels, the shape of your image Tensor is represented. Indeed, a CIFAR-10 sample is a 32 x 32 pixel image with 3 channels. The number of classes in the output is 10, and we use a 250-sample batch size when training. We also specify (but not initialize!) the loss function and optimizer. We use CrossEntropyLoss for computing how poorly the model performs.
This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class. - PyTorch docs
Using CrossEntropyLoss is also why we don't use a Softmax activation in our layer structure! This PyTorch loss function combines log-softmax and NLL loss, and hence pushes the Softmax computation into the loss function, which is numerically more stable.
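You can verify this yourself with a tiny standalone check (random logits, so the numbers differ per run, but the equality holds):

import torch
from torch import nn

logits = torch.randn(4, 10)              # 4 samples, 10 classes
targets = torch.tensor([1, 0, 3, 9])
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))           # True: CrossEntropyLoss fuses log-softmax and NLL loss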
For optimization, we use Adam, which is an adaptive optimizer and one of the default optimizers that are used in neural networks these days.
For educational purposes, we set num_epochs to 1 - to allow you to walk through greedy layer-wise training quickly. However, a better setting would be num_epochs = 5, or num_epochs = 25.
Finally, you set the layer_dim to 256. This is the dimensionality of all hidden layers. Obviously, if you want to have a varying layer dimensionality or a different approach, you can alter layer construction and have it your way - see the sketch below - but for today's example, having hidden layers with equal dimensionality is the simplest choice :)
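If you do want varying widths, one possible approach - purely a sketch, not used in the rest of this article - is to keep a list of hidden dimensionalities and read the next one each time a layer is added, constructing the new Linear layer with the previous width as its input size:

# Hypothetical variation: give each newly added hidden layer its own width.
hidden_dims = [256, 192, 128, 96, 64]

def next_hidden_dim(num_added_layers):
    """ Return the width for the next hidden layer, clamping at the last entry. """
    index = min(num_added_layers, len(hidden_dims) - 1)
    return hidden_dims[index]

# When adding layer number n, you would then use
# nn.Linear(next_hidden_dim(n - 1), next_hidden_dim(n)) instead of
# nn.Linear(layer_dim, layer_dim).
print([next_hidden_dim(i) for i in range(7)])  # [256, 192, 128, 96, 64, 64, 64]

Note that with varying widths you would also need to rebuild the output layer when adding a new hidden layer, because its input dimensionality must match the width of the last hidden layer.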
def get_model_configuration():
    """ Retrieve configuration for the model. """
    model_config = {
        "width": 32,
        "height": 32,
        "channels": 3,
        "num_classes": 10,
        "batch_size": 250,
        "loss_function": nn.CrossEntropyLoss,
        "optimizer": torch.optim.Adam,
        "num_epochs": 1,
        "layer_dim": 256
    }
    return model_config
Now that you have specified global and model configurations, it's time to retrieve the DataLoader.

Its functionality is pretty simple - it initializes the CIFAR10 dataset with a simple ToTensor() transform applied, and inits a DataLoader which constructs shuffled batches per your batch size configuration.
def get_dataset():
    """ Load and convert dataset into inputs and targets """
    config = get_model_configuration()
    dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor())
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=config.get("batch_size"), shuffle=True, num_workers=1)
    return trainloader
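As a quick optional sanity check - assuming the configuration and get_dataset definitions above are in place - you can pull one batch from the loader and inspect its shape:

# Optional sanity check: inspect one batch produced by get_dataset().
trainloader = get_dataset()
inputs, targets = next(iter(trainloader))
print(inputs.shape)   # torch.Size([250, 3, 32, 32]) with the configuration above
print(targets.shape)  # torch.Size([250])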
Next up is adding a layer to an existing model.
Recall that greedy layer-wise training involves training a model for a full set of epochs, after which a layer is added while all trained layers (except for the last layer) are set to nontrainable.

This means that you will need functionality which freezes the existing layers, cuts off the last layer, inserts a new hidden layer, and re-adds the trained final layer.

Here's the definition which performs precisely that. It first retrieves the current layers, prints them to your terminal, saves the last layer, and defines a new layer structure to which all existing layers (except for the last one) are added. These layers are also made nontrainable by setting requires_grad to False.
When these have been added, a brand new hidden layer that respects the layer_dim configuration is added to your new layer structure. Finally, the last layer is re-added, and the model structure is changed (indeed, via set_structure). Now, you hopefully realize too why we're using the OrderedDict - the keys of this dictionary simply specify the layer order of your new nn.Sequential structure, allowing the layers to be added properly.
Finally, after restructuring your model, you simply return it for later usage.
def add_layer(model):
    """ Add a new layer to a model, setting all others to nontrainable. """
    config = get_model_configuration()

    # Retrieve current layers
    layers = model.layers
    print("="*50)
    print("Old structure:")
    print(layers)

    # Save last layer for adding later
    last_layer = layers[-1]

    # Define new structure
    new_structure = []

    # Iterate over all except last layer
    for layer_index in range(len(layers) - 1):

        # For old layer, set all parameters to nontrainable
        old_layer = layers[layer_index]
        for param in old_layer.parameters():
            param.requires_grad = False

        # Append old layer to new structure
        new_structure.append((str(layer_index), old_layer))

    # Append new layer to the final intermediate layer
    new_structure.append((str(len(new_structure)), nn.Linear(config.get("layer_dim"), config.get("layer_dim"))))

    # Re-add last layer
    new_structure.append((str(len(new_structure)), last_layer))

    # Change the model structure
    model.set_structure(new_structure)

    # Print and return the model
    print("="*50)
    print("New structure:")
    print(model.layers)
    return model
The next definition is a pretty standard PyTorch training loop.
Do note that you're using the HuggingFace Accelerate way of optimization: you first prepare the model, optimizer and trainloader with accelerator.prepare(...), and then perform the backward pass with accelerator, too.

In the end, you return the trained model as well as the loss value at the end of training, so that you can compare it with the loss value of the next set of epochs, with yet another layer added. This allows you to see whether adding layers yields better performance or whether you've reached layer saturation for your training scenario.
def train_model(model):
    """ Train a model. """
    config = get_model_configuration()
    loss_function = config.get("loss_function")()
    optimizer = config.get("optimizer")(model.parameters(), lr=1e-4)
    trainloader = get_dataset()
    accelerator = Accelerator()

    # Set current loss value
    end_loss = 0.0

    # Accelerate model
    model, optimizer, trainloader = accelerator.prepare(model, optimizer, trainloader)

    # Iterate over the number of epochs
    for epoch in range(config.get("num_epochs")):

        # Print epoch
        print(f'Starting epoch {epoch+1}')

        # Set current loss value
        current_loss = 0.0

        # Iterate over the DataLoader for training data
        for i, data in enumerate(trainloader, 0):

            # Get inputs
            inputs, targets = data

            # Zero the gradients
            optimizer.zero_grad()

            # Perform forward pass
            outputs = model(inputs)

            # Compute loss
            loss = loss_function(outputs, targets)

            # Perform backward pass
            accelerator.backward(loss)

            # Perform optimization
            optimizer.step()

            # Print statistics every 100 mini-batches (there are 200 per epoch at a batch size of 250)
            current_loss += loss.item()
            if i % 100 == 99:
                print('Loss after mini-batch %5d: %.3f' %
                      (i + 1, current_loss / 100))
                end_loss = current_loss / 100
                current_loss = 0.0

    # Return trained model and the loss over the last set of mini-batches
    return model, end_loss
Finally, it's time to wrap all the definitions together into a working whole.
In the greedy_layerwise_training def, you load the global config, initialize your MLP, and iterate over the number of layers that must be added, adding one more at each step. Then, for each layer configuration, you train the model and compare loss.

When you run your Python script, you call greedy_layerwise_training() for training your neural network in a greedy layer-wise fashion.
def greedy_layerwise_training():
    """ Perform greedy layer-wise training. """
    global_config = get_global_configuration()
    torch.manual_seed(42)

    # Initialize the model
    model = LayerConfigurableMLP()

    # Loss comparison
    loss_comparable = 0.0

    # Iterate over the number of layers to add
    for num_layers in range(global_config.get("num_layers_to_add")):

        # Print which model is trained
        print("="*100)
        if num_layers > 0:
            print(f">>> TRAINING THE MODEL WITH {num_layers} ADDITIONAL LAYERS:")
        else:
            print(">>> TRAINING THE BASE MODEL:")

        # Train the model
        model, end_loss = train_model(model)

        # Compare loss
        if num_layers > 0 and end_loss < loss_comparable:
            print("="*50)
            print(f">>> RESULTS: Adding this layer has improved the model loss from {loss_comparable} to {end_loss}")
            loss_comparable = end_loss
        elif num_layers > 0:
            print("="*50)
            print(">>> RESULTS: Adding this layer did not improve the model loss.")
        elif num_layers == 0:
            loss_comparable = end_loss

        # Add layer to model
        model = add_layer(model)

    # Process is complete
    print("Training process has finished.")


if __name__ == '__main__':
    greedy_layerwise_training()
If you want to get started immediately, this is the full code for greedy layer-wise training with PyTorch:
import os
import torch
from torch import nn
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
from torchvision import transforms
from collections import OrderedDict
from accelerate import Accelerator
class LayerConfigurableMLP(nn.Module):
    '''
    Layer-wise configurable Multilayer Perceptron.
    '''
    def __init__(self, added_layers = 0):
        super().__init__()

        # Retrieve model configuration
        config = get_model_configuration()
        shape = config.get("width") * config.get("height") * config.get("channels")
        layer_dim = config.get("layer_dim")
        num_classes = config.get("num_classes")

        # Create layer structure
        layers = [
            (str(0), nn.Flatten()),
            (str(1), nn.Linear(shape, layer_dim)),
            (str(2), nn.ReLU())
        ]

        # Create output layers
        layers.append((str(3), nn.Linear(layer_dim, num_classes)))

        # Initialize the Sequential structure
        self.layers = nn.Sequential(OrderedDict(layers))

    def forward(self, x):
        '''Forward pass'''
        return self.layers(x)

    def set_structure(self, layers):
        self.layers = nn.Sequential(OrderedDict(layers))


def get_global_configuration():
    """ Retrieve configuration of the training process. """
    global_config = {
        "num_layers_to_add": 10,
    }
    return global_config


def get_model_configuration():
    """ Retrieve configuration for the model. """
    model_config = {
        "width": 32,
        "height": 32,
        "channels": 3,
        "num_classes": 10,
        "batch_size": 250,
        "loss_function": nn.CrossEntropyLoss,
        "optimizer": torch.optim.Adam,
        "num_epochs": 1,
        "layer_dim": 256
    }
    return model_config


def get_dataset():
    """ Load and convert dataset into inputs and targets """
    config = get_model_configuration()
    dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor())
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=config.get("batch_size"), shuffle=True, num_workers=1)
    return trainloader


def add_layer(model):
    """ Add a new layer to a model, setting all others to nontrainable. """
    config = get_model_configuration()

    # Retrieve current layers
    layers = model.layers
    print("="*50)
    print("Old structure:")
    print(layers)

    # Save last layer for adding later
    last_layer = layers[-1]

    # Define new structure
    new_structure = []

    # Iterate over all except last layer
    for layer_index in range(len(layers) - 1):

        # For old layer, set all parameters to nontrainable
        old_layer = layers[layer_index]
        for param in old_layer.parameters():
            param.requires_grad = False

        # Append old layer to new structure
        new_structure.append((str(layer_index), old_layer))

    # Append new layer to the final intermediate layer
    new_structure.append((str(len(new_structure)), nn.Linear(config.get("layer_dim"), config.get("layer_dim"))))

    # Re-add last layer
    new_structure.append((str(len(new_structure)), last_layer))

    # Change the model structure
    model.set_structure(new_structure)

    # Print and return the model
    print("="*50)
    print("New structure:")
    print(model.layers)
    return model


def train_model(model):
    """ Train a model. """
    config = get_model_configuration()
    loss_function = config.get("loss_function")()
    optimizer = config.get("optimizer")(model.parameters(), lr=1e-4)
    trainloader = get_dataset()
    accelerator = Accelerator()

    # Set current loss value
    end_loss = 0.0

    # Accelerate model
    model, optimizer, trainloader = accelerator.prepare(model, optimizer, trainloader)

    # Iterate over the number of epochs
    for epoch in range(config.get("num_epochs")):

        # Print epoch
        print(f'Starting epoch {epoch+1}')

        # Set current loss value
        current_loss = 0.0

        # Iterate over the DataLoader for training data
        for i, data in enumerate(trainloader, 0):

            # Get inputs
            inputs, targets = data

            # Zero the gradients
            optimizer.zero_grad()

            # Perform forward pass
            outputs = model(inputs)

            # Compute loss
            loss = loss_function(outputs, targets)

            # Perform backward pass
            accelerator.backward(loss)

            # Perform optimization
            optimizer.step()

            # Print statistics every 100 mini-batches (there are 200 per epoch at a batch size of 250)
            current_loss += loss.item()
            if i % 100 == 99:
                print('Loss after mini-batch %5d: %.3f' %
                      (i + 1, current_loss / 100))
                end_loss = current_loss / 100
                current_loss = 0.0

    # Return trained model and the loss over the last set of mini-batches
    return model, end_loss


def greedy_layerwise_training():
    """ Perform greedy layer-wise training. """
    global_config = get_global_configuration()
    torch.manual_seed(42)

    # Initialize the model
    model = LayerConfigurableMLP()

    # Loss comparison
    loss_comparable = 0.0

    # Iterate over the number of layers to add
    for num_layers in range(global_config.get("num_layers_to_add")):

        # Print which model is trained
        print("="*100)
        if num_layers > 0:
            print(f">>> TRAINING THE MODEL WITH {num_layers} ADDITIONAL LAYERS:")
        else:
            print(">>> TRAINING THE BASE MODEL:")

        # Train the model
        model, end_loss = train_model(model)

        # Compare loss
        if num_layers > 0 and end_loss < loss_comparable:
            print("="*50)
            print(f">>> RESULTS: Adding this layer has improved the model loss from {loss_comparable} to {end_loss}")
            loss_comparable = end_loss
        elif num_layers > 0:
            print("="*50)
            print(">>> RESULTS: Adding this layer did not improve the model loss.")
        elif num_layers == 0:
            loss_comparable = end_loss

        # Add layer to model
        model = add_layer(model)

    # Process is complete
    print("Training process has finished.")


if __name__ == '__main__':
    greedy_layerwise_training()
When you run your script, you should see a base model being trained first (for 1 epoch given our settings, or for the number of epochs you have configured), after which another layer is added and the same process is repeated. Then, loss is compared, and yet another layer is added.
Hopefully, this allows you to get a feeling for empirically finding the number of layers that is likely adequate for your PyTorch neural network! :)
====================================================================================================
>>> TRAINING THE BASE MODEL:
Files already downloaded and verified
Starting epoch 1
==================================================
Old structure:
Sequential(
(0): Flatten(start_dim=1, end_dim=-1)
(1): Linear(in_features=3072, out_features=256, bias=True)
(2): ReLU()
(3): Linear(in_features=256, out_features=10, bias=True)
)
==================================================
New structure:
Sequential(
(0): Flatten(start_dim=1, end_dim=-1)
(1): Linear(in_features=3072, out_features=256, bias=True)
(2): ReLU()
(3): Linear(in_features=256, out_features=256, bias=True)
(4): Linear(in_features=256, out_features=10, bias=True)
)
====================================================================================================
>>> TRAINING THE MODEL WITH 1 ADDITIONAL LAYERS:
Files already downloaded and verified
Starting epoch 1
==================================================
>>> RESULTS: Adding this layer did not improve the model loss.
==================================================
Old structure:
Sequential(
(0): Flatten(start_dim=1, end_dim=-1)
(1): Linear(in_features=3072, out_features=256, bias=True)
(2): ReLU()
(3): Linear(in_features=256, out_features=256, bias=True)
(4): Linear(in_features=256, out_features=10, bias=True)
)
==================================================
New structure:
Sequential(
(0): Flatten(start_dim=1, end_dim=-1)
(1): Linear(in_features=3072, out_features=256, bias=True)
(2): ReLU()
(3): Linear(in_features=256, out_features=256, bias=True)
(4): Linear(in_features=256, out_features=256, bias=True)
(5): Linear(in_features=256, out_features=10, bias=True)
)
..........
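If you want to go one step further than reading the printed comparisons, a small and purely illustrative extension - not part of the code above - is to record the end loss per depth in a dictionary inside greedy_layerwise_training (for example, losses_per_depth[num_layers] = end_loss after each call to train_model) and pick the best depth afterwards:

# Hypothetical helper: return the number of added layers that reached the lowest end loss.
def best_depth(losses_per_depth):
    return min(losses_per_depth, key=losses_per_depth.get)

losses_per_depth = {0: 2.01, 1: 1.95, 2: 1.97}  # example values only
print(best_depth(losses_per_depth))             # 1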
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in neural information processing systems (pp. 153-160).
MachineCurve. (2022, January 9). Greedy layer-wise training of deep networks, a TensorFlow/Keras example. https://www.machinecurve.com/index.php/2022/01/09/greedy-layer-wise-training-of-deep-networks-a-tensorflow-keras-example/