## Statistical learning theory

I want to review some key concepts from statistical learning theory. My knowledge of it is very much influenced by the regularization viewpoint (due to the MIT class with Poggio).

Let $S=(x_1,y_1), \ldots, (x_n,y_n)$ be a training set of labeled examples. We want to learn a function $f_S: X \rightarrow Y$ that predicts well on unseen data. A learning algorithm is a map $A: S \rightarrow f_S$. We are also given a real-valued loss function $V(f_S(x),y)$.

The empirical loss of $f_S$ on $S$ is $I_S[f_S] = \frac{1}{n}\sum_{i=1}^{n} V(f_S(x_i),y_i).$ This is the training loss that we observe. What we really care about is the true (expected) loss $I[f_S] = E_{(x,y)}[V(f_S(x),y)]$. We cannot observe $I[f_S]$, but if we can bound the generalization error $D[f_S]=I[f_S]-I_S[f_S]$, then we know how far the observed training loss can be from the true loss. The goal of regularization and the following analysis is to bound this generalization error.
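As a small numerical sketch of these quantities (the data model, predictor, and squared loss are my own assumptions for illustration): we can observe $I_S[f_S]$ on a finite sample, and approximate the unobservable $I[f_S]$ by Monte Carlo on a very large fresh sample.

```python
import random

random.seed(0)

# Hypothetical data model for illustration: y = 2x + Gaussian noise.
def sample(n):
    return [(x := random.uniform(-1, 1), 2 * x + random.gauss(0, 0.1))
            for _ in range(n)]

def V(pred, y):
    # Squared loss V(f(x), y) = (f(x) - y)^2.
    return (pred - y) ** 2

def empirical_loss(f, S):
    # I_S[f] = (1/n) * sum_i V(f(x_i), y_i)
    return sum(V(f(x), y) for x, y in S) / len(S)

f = lambda x: 2 * x  # a fixed predictor (here, the true slope)

S = sample(50)                               # training set
I_S = empirical_loss(f, S)                   # observable training loss
I_true = empirical_loss(f, sample(100_000))  # Monte Carlo estimate of I[f]
gap = I_true - I_S                           # generalization error D[f]
```

Since this $f$ is fixed (it ignores $S$), both losses concentrate around the noise variance $0.01$, and the gap is small, matching the discussion below.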

One way to achieve good generalization is for the learning algorithm to be stable: its output should not change much if the training set $S$ changes a bit. An extreme case is when $f_S$ does not depend on $S$ at all; then $I_S[f_S]$ converges to $I[f_S]$ by the law of large numbers. But such an algorithm does not do any “learning” and would have poor empirical error. At the other extreme, an algorithm that only minimizes empirical error can be very fragile and change significantly when the training samples change. Regularization is one way to balance these two extremes: fitting the data while remaining stable.