When you're training a neural network, you're learning a mapping from some input value to a corresponding expected output value. This is great, because it allows you to create predictive models, but what guarantees that the mapping is correct for data points that aren't part of your data set?
That is, how do you ensure that your learnt mapping does not oscillate wildly when you actually want a smooth function?
Regularization can help here. With techniques that take the complexity of your weights into account during optimization, you may steer the network towards a more general mapping, instead of a very data-specific one.
In this blog, we cover these aspects. First, we'll discuss the need for regularization during model training. We then continue by showing how regularizers can be added to the loss value, and subsequently used in optimization. This is followed by a discussion on the three most widely used regularizers, being L1 regularization (or Lasso), L2 regularization (or Ridge) and L1+L2 regularization (Elastic Net). Finally, we provide a set of questions that may help you decide which regularizer to use in your machine learning project.
Are you ready? Let's go! 😎
When you are training a machine learning model, at a high level you're learning a function \(\hat{y} = f(\textbf{x})\) which transforms some input value \(\textbf{x}\) (often a vector) into some output value \(\hat{y}\) (often a scalar, such as a class when classifying or a real number when regressing).
Contrary to a regular mathematical function, the exact mapping to \(y\) is not known in advance, but is learnt from the input-output pairs present in your training data, so that \(\hat{y} \approx y\) - hence the name, machine learning :)
This understanding brings us to the need for regularization.
Say that you've got a dataset that contains points in a 2D space, like this small one:
Now suppose that these numbers are reported by some bank, which loans out money (the values on the x axis, in thousands of dollars). Loaning out money impacts the bank's weekly cash flow, through the loan itself and other factors (together represented by the y values).
The bank suspects that this interrelationship means that it can predict its cash flow based on the amount of money it spends on new loans. In practice, this relationship is likely much more complex, but that's not the point of this thought exercise.
Machine learning is used to generate a predictive model - a regression model, to be precise, which takes some input (amount of money loaned) and returns a real-valued number (the expected impact on the cash flow of the bank).
After training, the model is brought to production, but soon enough the bank employees find out that it doesn't work. Upon analysis, the bank employees find that the actual function learnt by the machine learning model is this one:
The employees instantly know why their model does not work, using nothing more than common sense:
The function is way too extreme for the data. It's nonsense that spending $2.5k on loans would return $5k and spending $3.5k would return $4.75k, but spending $3.25k would return minus $5k and counting.
They'd rather have wanted something like this:
Which, as you can see, makes a lot more sense:
But how do we get there?
The two functions are generated based on the same data points, aren't they?
For me, it was simple, because I used a polyfit on the data points to generate either a polynomial function of the third degree or one of the tenth degree. Obviously, the tenth-degree polynomial produces the wildly oscillating function; a minimal sketch of this experiment follows below.
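Here is what such an experiment might look like with NumPy's polyfit. The data points are made-up stand-ins, since the exact values behind the figures aren't reproduced here:

```python
import numpy as np

# Made-up data points standing in for the figure's dataset (not the original values).
x = np.linspace(0.5, 5.5, 11)
y = np.array([0.9, 1.4, 2.2, 2.6, 3.3, 3.1, 3.9, 4.3, 4.4, 5.2, 5.4])

# Third-degree fit: smooth, behaves reasonably between the points.
smooth_coeffs = np.polyfit(x, y, deg=3)

# Tenth-degree fit: with 11 points it interpolates them exactly and
# oscillates wildly in between - the behavior the bank employees observed.
wild_coeffs = np.polyfit(x, y, deg=10)

# Evaluate both fits on a fine grid to inspect their behavior between data points.
grid = np.linspace(x.min(), x.max(), 200)
smooth = np.polyval(smooth_coeffs, grid)
wild = np.polyval(wild_coeffs, grid)
```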
Machine learning, however, does not work this way. Besides not even having the certainty that your ML model will learn the mapping correctly, you also don't know if it will learn a highly specialized mapping or a more generic one.
Or can you? Let's explore a possible route.
From our article about loss and loss functions, you may recall that a supervised model is trained following the high-level supervised machine learning process:
This means that optimizing a model equals minimizing the loss function that was specified for it.
You can imagine that if you train the model for too long, minimizing the loss function is done based on loss values that are entirely adapted to the dataset it is training on, generating the highly oscillating curve plot that we've seen before.
This is not what you want. However, you also don't know exactly the point where you should stop.
This is why you may wish to add a regularizer to your neural network.
Regularizers, which are often attached to your loss value, impose a penalty on large weights or on weights that do not contribute to learning. This way, we may get sparser models and weights that are not too adapted to the data at hand. If done well, adding a regularizer should result in models that produce better results for data they haven't seen before.
Let's take a look at some foundations of regularization, before we continue to the actual regularizers.
Before we do so, however, we must first deepen our understanding of the concept of regularization in conceptual and mathematical terms.
Say, for example, that you are training a machine learning model, which is essentially a function \(\hat{y} = f(\textbf{x})\) which maps some input vector \(\textbf{x}\) to some output \(\hat{y}\).
From previously, we know that during training, there exists a true target \(y\) to which \(\hat{y}\) can be compared.
Say that some function \(L\) computes the loss between \(y\) and \(\hat{y}\) (or \(f(\textbf{x})\)). For one sample \(\textbf{x}_i\) with corresponding target \(y_i\), loss can then be computed as \(L(\hat{y}_i, y_i) = L(f(\textbf{x}_i), y_i)\).
Total loss can be computed by summing over all the input samples \(\textbf{x}_1 ... \textbf{x}_n\) in your training set, and subsequently performing a minimization operation on this value:
\(\min_f \sum_{i=1}^{n} L(f(\textbf{x}_i), y_i) \)
Before, we wrote that regularizers "are often attached to your loss value". Indeed, adding some regularizer \(R(f)\) - "regularization for some function \(f\)" - is easy:
\( L = \sum_{i=1}^{n} L_{losscomponent}(f(\textbf{x}_i), y_i) + \lambda R(f) \)
...where \(\lambda\) is a hyperparameter, to be configured by the machine learning engineer, that determines the relative importance of the regularization component compared to the loss component.
The above means that the loss and the regularization components are minimized together, not the loss component alone. Let's take a look at some scenarios: if \(\lambda\) is zero, the regularizer plays no role and we minimize the plain loss; if \(\lambda\) is very large, the regularization component dominates and the weights are pushed towards zero regardless of the data; values in between balance the two.
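To make this concrete, here is a minimal NumPy sketch of computing such a regularized loss. The mean squared error and the \(\lambda\) value are just example choices, and the penalty function passed in is one possible instantiation, discussed next:

```python
import numpy as np

def regularized_loss(y_true, y_pred, weights, lam, penalty):
    """Data loss (MSE here, as an example) plus the weighted regularization term."""
    data_loss = np.mean((y_true - y_pred) ** 2)
    return data_loss + lam * penalty(weights)

loss = regularized_loss(
    y_true=np.array([1.0, 2.0]),
    y_pred=np.array([1.1, 1.8]),
    weights=np.array([0.5, -0.3, 2.0]),
    lam=0.01,                               # example value for lambda
    penalty=lambda w: np.sum(np.abs(w)),    # one possible choice of R(f), covered below
)
```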
Now, you likely understand that you'll want to have your outputs for \(R(f)\) to minimize as well. But what is this function? What does it look like? It turns out that there is a wide range of possible instantiations for the regularizer.
In the machine learning community, three regularizers are very common: L1 regularization (Lasso), L2 regularization (Ridge) and Elastic Net regularization. Next, we'll cover each of them.
L1 Regularization (or Lasso) adds the so-called L1 Norm to the loss value. A "norm" tells you something about a vector in space and can be used to express useful properties of this vector (Wikipedia, 2004).
The L1 norm of a vector, which is also called the taxicab norm, computes the absolute value of each vector dimension and adds them together (Wikipedia, 2004). As computing the norm effectively means travelling the full distance along each dimension, one after the other, the travel pattern resembles that of a taxicab driver who has to drive the blocks of e.g. New York City; hence the name (Wikipedia, 2004).
In terms of maths, this can be expressed as \( R(f) = \sum_{i=1}^{n} | w_i |\), where this is an iteration over the \(n\) dimensions of some vector \(\textbf{w}\).
Visually, and hence intuitively, the process goes as follows. Suppose that we have the two-dimensional vector \([2, 4]\). Our formula then produces a computation over the two dimensions - first the \(2\), then the \(4\) - and the L1 norm for our vector is thus 6:
\( \sum_{i=1}^{n} | w_i | = | 2 | + | 4 | = 2 + 4 = 6\)
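You can verify this with a quick NumPy check:

```python
import numpy as np

w = np.array([2, 4])
print(np.sum(np.abs(w)))        # 6, the L1 norm of [2, 4]
# Equivalently: np.linalg.norm(w, ord=1)
```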
Adding L1 Regularization to our loss value thus produces the following formula:
\( L = \sum_{i=1}^{n} L_{losscomponent}(f(\textbf{x}_i), y_i) + \lambda \sum_{j=1}^{m} | w_j | \)
...where \(w_j\) are the \(m\) individual weights of your model.
This way, our loss function - and hence our optimization problem - now also includes information about the complexity of our weights.
Say we had a negative vector instead, e.g. \([-1, -2.5]\). As you can derive from the formula above, L1 Regularization takes some value related to each weight and sums these values over all weights. As you know, that "some value" is the absolute value of the weight, \(| w_j |\), and we take it for a reason: taking the absolute value ensures that negative weights contribute to the regularization loss component as well, as the sign is removed and only the, well, absolute value remains. This way, L1 Regularization natively supports negative vectors, such as the one above: its L1 norm is \(|-1| + |-2.5| = 3.5\).
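If you use Keras, attaching an L1 penalty to a layer's weights is a one-liner. A minimal sketch, where the \(\lambda\) of 0.01 and the layer configuration are just example choices:

```python
from tensorflow.keras import layers, regularizers

# L1 (Lasso) penalty on this layer's kernel; 0.01 is an example value for lambda.
dense = layers.Dense(
    64,
    activation='relu',
    kernel_regularizer=regularizers.l1(0.01),
)
```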
Next up: model sparsity. L1 Regularization produces sparse models, i.e. models where unnecessary features don't contribute to their predictive power, which - as an additional benefit - may also speed up models during inference (Google Developers, n.d.).
But why is this the case? Let's take a closer look (Caspersen, n.d.; Neil G., n.d.).
This is the derivative for L1 Regularization: \( \frac{d}{dx} |x| = \text{sign}(x) \). It's either \(-1\) or \(+1\), and is undefined at \(x = 0\).
Now suppose that we have trained a neural network for the first time. We have a loss value which we can use to compute the weight change. Obviously, this weight change will be computed with respect to the loss component, but this time, the regularization component (in our case, L1 loss) would also play a role.
If our loss component were static for some reason (just a thought experiment), our obvious goal would be to bring the regularization component to zero. As you can see, this would be done in small but constant steps, eventually allowing the value to reach minimum regularization loss, at \(x = 0\). This would essentially "drop" a weight from participating in the prediction, as it's set at zero. This is also known as the "model sparsity" principle of L1 loss.
This theoretical scenario is however not necessarily true in real life. Besides the regularization loss component, the normal loss component participates as well in generating the loss value, and subsequently in gradient computation for optimization. This means that the theoretically constant steps in one direction, i.e. sparse models, are less "straight" in practice. Nevertheless, since the regularization loss component still plays a significant role in computing loss and hence optimization, L1 loss will still tend to push weights to zero and hence produce sparse models (Caspersen, n.d.; Neil G., n.d.).
Unfortunately, besides the benefits that can be gained from using L1 regularization, the technique also comes at a cost: when your data is high-dimensional and many features are correlated, Lasso tends to keep one of them and drop the others, removing possibly relevant information from your model (Kochede, n.d.; Tripathi, n.d.).
Therefore, always make sure to decide whether you need L1 regularization based on your dataset, before blindly applying it.
Another type of regularization is L2 Regularization, also called Ridge, which utilizes the L2 norm of the vector:
\( R(f) = \sum_{i=1}^{n} w_i^2\)
When added to the loss value, you get this:
\( L = \sum_{i=1}^{n} L_{losscomponent}(f(\textbf{x}_i), y_i) + \lambda \sum_{j=1}^{m} w_j^2 \)
Visually, it looks as follows:
As you can see, L2 regularization also stimulates your values to approach zero (as the loss for the regularization component is zero when \(x = 0\)), and hence stimulates them towards being very small values.
However, unlike L1 regularization, it does not push the values to be exactly zero.
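Before we look at why, here is the Keras counterpart of the earlier L1 sketch; again, the 0.01 is merely an example value for \(\lambda\):

```python
from tensorflow.keras import layers, regularizers

# L2 (Ridge) penalty on this layer's kernel; 0.01 is an example value for lambda.
dense = layers.Dense(
    64,
    activation='relu',
    kernel_regularizer=regularizers.l2(0.01),
)
```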
Let's recall the gradient for L1 regularization:
Regardless of the value of \(x\), the gradient is a constant - either plus or minus one.
This is also true for very small values, and hence, the expected weight update suggested by the regularization component is quite static over time. This, combined with the fact that the normal loss component will ensure some oscillation, stimulates the weights to take zero values whenever they do not contribute significantly enough.
However, the situation is different for L2 loss, where the derivative is \(2x\):
From this plot, you can see that the closer the weight value gets to zero, the smaller the gradient will become.
And the smaller the gradient value, the smaller the weight update suggested by the regularization component.
Much like how you'll never reach zero when you keep dividing 1 by 2, then 0.5 by 2, then 0.25 by 2, and so on, you won't reach zero in this case either. This is due to the nature of L2 regularization, and especially the way its gradient works. Thus, while L2 regularization will nevertheless produce very small values for non-important weights, the models will not be stimulated to be sparse. This is a very important difference between L1 and L2 regularization.
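The difference is easy to see in a tiny simulation of the thought experiment above: apply only the regularization gradient to a single weight and watch what happens. The learning rate and \(\lambda\) are arbitrary example values:

```python
import numpy as np

lr, lam = 0.1, 0.1     # example learning rate and regularization strength
w_l1 = w_l2 = 1.0      # start both weights at the same value

for _ in range(100):
    # L1: constant-size steps towards zero (gradient is sign(w)).
    w_l1 -= lr * lam * np.sign(w_l1)
    # L2: steps shrink as w shrinks (gradient is 2w), so zero is never reached.
    w_l2 -= lr * lam * 2 * w_l2

print(w_l1)  # ~0.0: the weight has been driven to zero
print(w_l2)  # ~0.13: small, but clearly non-zero (0.98 ** 100)
```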
Why choose L2 over L1, then? Primarily due to the L1 drawback discussed before: with high-dimensional data where many features are correlated, L1 can lead to ill-performing models, because relevant information is removed from them (Tripathi, n.d.).
Unfortunately, L2 regularization also comes with a disadvantage due to the nature of the regularizer (Gupta, 2017). It is model interpretability: due to the fact that L2 regularization does not promote sparsity, you may end up with an uninterpretable model if your dataset is high-dimensional.
This may not always be avoidable (e.g. in the case where you have a correlative dataset), but once again, take a look at your data first before you choose whether to use L1 or L2 regularization.
Now that we have identified how L1 and L2 regularization work, we know the following: L1 regularization produces sparse models, but may remove relevant information when features are correlated, while L2 regularization keeps all weights small without pushing them to exactly zero, at the cost of model interpretability.
But what if we can combine them?
Say hello to Elastic Net Regularization (Zou & Hastie, 2005). It's a linear combination of L1 and L2 regularization, and produces a regularizer that has both the benefits of the L1 (Lasso) and L2 (Ridge) regularizers. Let's take a look at how it works - by taking a look at a naïve version of the Elastic Net first, the Naïve Elastic Net.
In their work "Regularization and variable selection via the elastic net", Zou & Hastie (2005) introduce the Naïve Elastic Net as a linear combination between L1 and L2 regularization. With hyperparameters \(\lambda_1 = (1 - \alpha) \) and \(\lambda_2 = \alpha\), the elastic net penalty (or regularization loss component) is defined as:
\((1 - \alpha) | \textbf{w} |_1 + \alpha | \textbf{w} |_2^2 \)
Here, the first part is the L1 penalty \( \sum_{i=1}^{n} | w_i | \), while the second part is the L2 penalty \( \sum_{i=1}^{n} w_i^2 \). The hyperparameter to be tuned in the Naïve Elastic Net is the value for \(\alpha\), where \(\alpha \in [0, 1]\).
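In code, the naïve penalty is just a weighted sum of the two terms. A minimal NumPy sketch, with an example weight vector:

```python
import numpy as np

def naive_elastic_net_penalty(w, alpha):
    """(1 - alpha) * L1 penalty + alpha * L2 penalty, with alpha in [0, 1]."""
    return (1 - alpha) * np.sum(np.abs(w)) + alpha * np.sum(w ** 2)

w = np.array([0.5, -1.0, 2.0])
print(naive_elastic_net_penalty(w, alpha=0.0))  # pure L1 penalty: 3.5
print(naive_elastic_net_penalty(w, alpha=1.0))  # pure L2 penalty: 5.25
```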
With Elastic Net Regularization, the total value that is to be minimized thus becomes:
\( L = \sum_{i=1}^{n} L_{losscomponent}(f(\textbf{x}_i), y_i) + (1 - \alpha) \sum_{j=1}^{m} | w_j | + \alpha \sum_{j=1}^{m} w_j^2 \)
As you can see, for \(\alpha = 1\), Elastic Net performs Ridge (L2) regularization, while for \(\alpha = 0\) Lasso (L1) regularization is performed. Tuning the alpha parameter allows you to balance between the two regularizers, possibly based on prior knowledge about your dataset. Visually, we can see this here:
Do note that frameworks often allow you to specify \(\lambda_1\) and \(\lambda_2\) manually. The penalty term then equals:
\(\lambda_1 | \textbf{w} |_1 + \lambda_2 | \textbf{w} |_2^2 \)
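In Keras, for example, this \(\lambda_1\)/\(\lambda_2\) variant is directly available; the values below are example choices:

```python
from tensorflow.keras import layers, regularizers

# Elastic Net style penalty: l1 corresponds to lambda_1, l2 to lambda_2.
dense = layers.Dense(
    64,
    activation='relu',
    kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01),
)
```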
The Elastic Net works well in many cases, especially when the final outcome is close to either L1 or L2 regularization alone (i.e., \(\alpha \approx 0\) or \(\alpha \approx 1\)), but performs less adequately in between. That's why the authors call this version naïve (Zou & Hastie, 2005). The cause is "double shrinkage": both the L1 and the L2 penalty tend to make the weights as small as possible, which may introduce unwanted side effects and lower performance. Fortunately, the authors also provide a fix that resolves this problem; I'd like to point you to the Zou & Hastie (2005) paper for the discussion about correcting it.
If you want to add a regularizer to your model, it may be difficult to decide which one you'll need. Generally speaking, it's wise to start with Elastic Net Regularization, because it combines L1 and L2 and generally performs better, since it cancels out the disadvantages of the individual regularizers (StackExchange, n.d.). However, you may wish to make a more informed choice - in that case, read on :)
Fortunately, there are three questions that you can ask yourself which help you decide where to start. We'll cover these questions in more detail next, but here they are: how much prior knowledge do you have about your dataset, do you have the resources to validate first, and what are the computational requirements of your problem?
The first thing that you'll have to inspect is the amount of prior knowledge that you have about your dataset.
Knowing some crucial details about the data may guide you towards a correct choice, which can be L1, L2 or Elastic Net regularization, no regularizer at all, or a regularizer that we didn't cover here.
For example, it may be the case that your model does not improve significantly when applying regularization - due to sparsity already introduced to the data, as well as good normalization up front (StackExchange, n.d.). In those cases, you may wish to avoid regularization altogether.
If you don't, you'll have to estimate the sparsity and pairwise correlations of and within the dataset (StackExchange, n.d.). The references at the end of this article, such as the Duke University (n.d.) lecture notes on sparsity, may help you get started.
Depending on your analysis, you might have enough information to choose a regularizer. If your dataset turns out to be very sparse already, L2 regularization may be your best choice. The same is true if the dataset has a large amount of pairwise correlations. If it doesn't, and is dense, you may choose L1 regularization instead. If you don't know for sure, or when your metrics don't favor one approach, Elastic Net may be the best choice for now.
However, before actually starting the training process with a large dataset, you might wish to validate first. You could do the same if you're still unsure.
If you have some resources to spare, you may also perform some validation activities first, before you start a large-scale training process. These validation activities especially boil down to the following two aspects:
Firstly, and obviously, if you choose to validate, it's important to validate the method you want to use. If, when using a representative dataset, you find that some regularizer doesn't work, the odds are that it won't work for the larger dataset either.
Secondly, when you find a method about which you're confident, it's time to estimate the impact of the hyperparameter. The hyperparameter, which is \(\lambda\) in the case of L1 and L2 regularization and \(\alpha \in [0, 1]\) in the case of Elastic Net regularization (or \(\lambda_1\) and \(\lambda_2\) separately), effectively determines the impact of the regularizer on the loss value that is optimized during training. The stronger you regularize, the sparser your model will get (with L1 and Elastic Net), but this comes at the cost of underperformance when the regularization is too strong (Yadav, 2018).
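A simple way to estimate this impact is to sweep over a few values and compare validation scores. A minimal sketch using scikit-learn's ElasticNet on made-up data; note that scikit-learn's alpha parameter is the overall strength (our \(\lambda\)), while l1_ratio balances the L1 and L2 parts:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Made-up regression data; substitute your own (representative) dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# Sweep the regularization strength and inspect cross-validated performance.
for strength in [0.001, 0.01, 0.1, 1.0]:
    model = ElasticNet(alpha=strength, l1_ratio=0.5)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(strength, round(scores.mean(), 3))
```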
Thirdly, and finally, you may wish to inform yourself of the computational requirements of your machine learning problem.
Often, and especially with today's movement towards commoditization of hardware, this is not a problem, but Elastic Net regularization is more expensive than Lasso or Ridge regularization applied alone (StackExchange, n.d.). Hence, if your machine learning problem already balances at the edge of what your hardware supports, it may be a good idea to perform additional validation work and/or to try and identify additional knowledge about your dataset, in order to make an informed choice between L1 and L2 regularization.
Now that you have answered these three questions, it's likely that you have a good understanding of what the regularizers do - and when to apply which one. With this understanding, we conclude today's blog :)
In this article, you've found a discussion about a couple of things: the need for regularization during model training, how regularizers can be added to the loss value, the L1 (Lasso), L2 (Ridge) and Elastic Net regularizers and their differences, and a set of questions that help you decide which regularizer to use in your machine learning project.
If you have any questions or remarks - feel free to leave a comment 😊 I will happily answer those questions and will improve my blog if you found mistakes. Thank you for reading MachineCurve today and happy engineering! 😎
Wikipedia. (2004, September 16). Norm (mathematics). Retrieved from https://en.wikipedia.org/wiki/Norm_(mathematics)
Chioka. (n.d.). Differences between L1 and L2 as Loss Function and Regularization. Retrieved from http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
Google Developers. (n.d.). Regularization for Sparsity: L1 Regularization. Retrieved from https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization
Neil G. (n.d.). Why L1 regularization can "zero out the weights" and therefore leads to sparse models? Retrieved from https://stats.stackexchange.com/questions/375374/why-l1-regularization-can-zero-out-the-weights-and-therefore-leads-to-sparse-m
Wikipedia. (2011, December 11). Elastic net regularization. Retrieved from https://en.wikipedia.org/wiki/Elastic_net_regularization
Khandelwal, R. (2019, January 10). L1 L2 Regularization. Retrieved from https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2
Caspersen, K. M. (n.d.). Why L1 norm for sparse models. Retrieved from https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models/159379
Kochede. (n.d.). What are disadvantages of using the lasso for variable selection for regression? Retrieved from https://stats.stackexchange.com/questions/7935/what-are-disadvantages-of-using-the-lasso-for-variable-selection-for-regression
cbeleites. (2013, December 3). Answer to: What are disadvantages of using the lasso for variable selection for regression? Retrieved from https://stats.stackexchange.com/q/77975
Tripathi, M. (n.d.). Are there any disadvantages or weaknesses to the L1 (LASSO) regularization technique? Retrieved from https://www.quora.com/Are-there-any-disadvantages-or-weaknesses-to-the-L1-LASSO-regularization-technique/answer/Manish-Tripathi
Duke University. (n.d.). Sparsity and p >> n - Duke Statistical Science [PDF]. Retrieved from http://www2.stat.duke.edu/~banks/218-lectures.dir/dmlect9.pdf
Gupta, P. (2017, November 16). Regularization in Machine Learning. Retrieved from https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2), 301-320.
StackExchange. (n.d.). What is elastic net regularization, and how does it solve the drawbacks of Ridge ($L^2$) and Lasso ($L^1$)? Retrieved from https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge
Yadav, S. (2018, December 25). All you need to know about Regularization. Retrieved from https://towardsdatascience.com/all-you-need-to-know-about-regularization-b04fc4300369