What is regularization in machine learning

What are regularities and regularization?

Regularization is used in almost all machine learning algorithms where we try to learn from finite samples of training data.

I will try to answer your specific questions indirectly by explaining how the concept of regularization came about. The full theory is much more detailed, and this explanation should not be read as complete; it is only meant to point you toward further exploration. Since your primary goal is to gain an intuitive understanding of regularization, I have summarized and greatly simplified the following explanation from Chapter 7 of "Neural Networks and Learning Machines", 3rd Edition by Simon Haykin (and omitted some details).

Let us revisit the supervised learning problem with independent variables $x_i$ and dependent variable $y_i$, where the goal is to find a function $f$ that maps the input $X$ to an output $Y$.

To pursue this further, let's understand Hadamard's terminology of a "well-posed" problem - a problem is well posed when it meets the following three conditions:

  1. For every input $x_i$, an output $y_i$ exists.
  2. For a pair of inputs $x_1$ and $x_2$, $f(x_1) = f(x_2)$ if and only if $x_1 = x_2$.
  3. The mapping $f$ is continuous (stability criterion).

These conditions can be violated for supervised learning because:

  1. A distinct output may not exist for a given input.
  2. The training patterns may not contain enough information to create a unique input-output mapping (since running the learning algorithm for different training patterns will result in different mapping functions).
  3. Noise in the data increases the uncertainty of the reconstruction process, which can affect its stability.

To solve such "ill-posed" problems, Tikhonov proposed a regularization method to stabilize the solution by including a non-negative function that embeds prior information about the solution.

The most common form of prior information involves the assumption that the input-output mapping function is smooth - that is, similar inputs produce similar outputs.
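
As a rough sketch of the general form (my paraphrase rather than a quotation from Haykin), the regularized cost is a data-fit term plus a weighted penalty. The symbols $N$ (number of training samples), $\lambda$ (regularization parameter), and $D$ (a linear differential operator that encodes the smoothness assumption) are notation introduced here for illustration:

```latex
E(f) = \underbrace{\frac{1}{2}\sum_{i=1}^{N}\left[y_i - f(x_i)\right]^2}_{\text{fit to the training data}}
     + \underbrace{\frac{\lambda}{2}\,\lVert D f \rVert^2}_{\text{penalty from prior information}},
\qquad \lambda \ge 0
```

With $\lambda = 0$ the solution is determined by the data alone; as $\lambda$ grows, the smoothness prior increasingly dominates.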



Some examples of such regularized cost functions are:

Linear regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[h_\theta(x_i) - y_i\right]^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Logistic regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y_i\log(h_\theta(x_i)) - (1 - y_i)\log(1 - h_\theta(x_i))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

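where $\theta$ denotes the model coefficients, $x$ the input, $h_\theta(x)$ the model's prediction, and $y$ the target output.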
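
To make the linear regression cost concrete, here is a minimal sketch in plain Python. The function name `ridge_cost` and the convention that `theta[0]` is an unpenalized intercept (with a leading 1 in every input row) are assumptions of this sketch, chosen to match the penalty sum starting at j = 1.

```python
def ridge_cost(theta, X, y, lam):
    """Regularized squared-error cost for a linear hypothesis.

    Assumptions of this sketch: each row of X starts with 1.0 so that
    theta[0] acts as the intercept, and theta[0] is not penalized
    (the penalty sum runs over j = 1..n).
    """
    m = len(y)
    # Hypothesis h_theta(x) = theta . x
    preds = [sum(t * xj for t, xj in zip(theta, row)) for row in X]
    # Data-fit term: (1/m) * sum of squared residuals
    data_term = sum((p - yi) ** 2 for p, yi in zip(preds, y)) / m
    # Penalty term: (lambda / 2m) * sum of squared non-intercept coefficients
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    return data_term + penalty


X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]
perfect_fit = [0.0, 1.0]                      # reproduces y exactly
no_penalty = ridge_cost(perfect_fit, X, y, lam=0.0)
with_penalty = ridge_cost(perfect_fit, X, y, lam=2.0)
```

Even a perfect fit has a nonzero cost once `lam > 0`, which is what pushes the optimizer toward smaller coefficients.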


The net effect of regularization is to reduce the complexity of the model, which reduces overfitting. Other approaches to regularization (not listed in the examples above) include modifications to structural models such as regression/classification trees, boosted trees, etc. by dropping nodes to make simpler trees. More recently, this has been applied in so-called "deep learning" by dropping connections between neurons in a neural network.

A specific answer to Q3 is that some ensembling methods such as random forest (or similar voting schemes) achieve regularization through their inherent method, that is, voting on and selecting the response from a collection of unregularized trees. Even though the individual trees overfit, the process of "averaging" their outcomes stops the ensemble from overfitting to the training set.
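
The averaging argument can be illustrated with a toy sketch. It is not a random forest: as a stand-in for unregularized trees it uses 1-nearest-neighbour predictors, each fitted on a bootstrap resample of made-up noisy data. All names and numbers here are hypothetical.

```python
import random


def fit_1nn(sample):
    """Return a deliberately overfit predictor that memorizes its sample."""
    def predict(x):
        nearest = min(sample, key=lambda pt: abs(pt[0] - x))
        return nearest[1]
    return predict


def ensemble_predict(models, x):
    """Average the predictions of all models (the 'voting' step)."""
    return sum(model(x) for model in models) / len(models)


rng = random.Random(0)
# Made-up training data: noisy samples of y = 2x on [0, 1]
data = [(i / 20, 2 * i / 20 + rng.gauss(0, 0.5)) for i in range(21)]
# Each model sees a different bootstrap resample of the data
models = [fit_1nn([rng.choice(data) for _ in data]) for _ in range(25)]

x_test, y_true = 0.5, 1.0
individual_errors = [(model(x_test) - y_true) ** 2 for model in models]
mean_individual_error = sum(individual_errors) / len(individual_errors)
ensemble_error = (ensemble_predict(models, x_test) - y_true) ** 2
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual predictors, whatever the random seed.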


The concept of regularity belongs to axiomatic set theory. This article offers some pointers: de.wikipedia.org/wiki/Axiom_of_regularity. You can explore the topic in more depth if you're interested in the details.

On regularization for neural networks: when the weights are adjusted by the backpropagation algorithm, the regularization term is added to the cost function in the same way as in the linear and logistic regression examples. The added regularization term stops backpropagation from reaching the global minimum of the unregularized cost.
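
As a minimal sketch of what this means for a single weight update, assume the same penalty as in the regression examples, (λ/2m)·Σw². Its gradient adds (λ/m)·w to whatever gradient backpropagation supplies. The function name and arguments below are hypothetical.

```python
def l2_regularized_step(weights, grads, lam, m, lr):
    """One gradient-descent update with an L2 penalty (lam / (2m)) * sum(w^2).

    grads is assumed to hold the gradient of the unregularized cost for
    each weight, as backpropagation would compute it.
    """
    # d/dw of (lam / (2m)) * w^2 is (lam / m) * w
    return [w - lr * (g + (lam / m) * w) for w, g in zip(weights, grads)]


# With zero data gradient, the penalty alone shrinks every weight toward zero.
updated = l2_regularized_step([1.0, -2.0], [0.0, 0.0], lam=1.0, m=10, lr=0.5)
```

This shrinking effect is why the L2 penalty is often called weight decay.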

Batch normalization for neural networks is described in: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe, Szegedy, 2015. It has long been known that training a neural network with backpropagation works better when the input variables are normalized. In this paper, the authors apply normalization within each mini-batch used in Stochastic Gradient Descent to avoid the "vanishing gradients" problem when training many layers of a neural network. The algorithm normalizes each activation using the mean and variance computed over the mini-batch, and learns a scale and a shift parameter per activation that are optimized by mini-batch SGD (in addition to the NN weights). After training, the activations are normalized using statistics of the entire training set. See their paper for full details on this algorithm. With this method, they were able to avoid using dropout for regularization, hence their claim that this is another type of regularization.
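
The per-mini-batch normalization step can be sketched for a single activation as follows. This covers the training-time forward pass only. `gamma` and `beta` stand for the learned scale and shift from the paper, while the function name and the default `eps` are assumptions of this sketch.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation across a mini-batch, then scale and shift.

    batch: values of a single activation, one per example in the mini-batch.
    gamma, beta: scale and shift (learned during training in the paper).
    eps: small constant for numerical stability (assumed default).
    """
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m  # biased mini-batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

With the default `gamma` and `beta`, the output has mean close to 0 and variance close to 1 within the mini-batch.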