Deep Learning Course v2.0
Master in Fundamental Principles of Data Science
Jordi Vitrià, Universitat de Barcelona, 2020
[fast_rewind Previous Page](deep0.html) [Next Page fast_forward](deep2.html)
We are living in an increasingly data-rich world. These data is useful if we can extract from them insights/knowledge that are useful for humans. The first step is to transform data into information, but we will terms in an interchangeble way.
!!! Note Definition Data is an individual unit that contains raw material, corresponding to some observation, which is agnostic of any specific meaning. Information is a group of data that collectively carry a meaning for the observer.
Information is useful for humans only if there are associations, patterns and relationships present in the data. If not, it is only noise.
!!! Note Definition The process of inferring stable relationships from data is called learning. Learning is a critical component of (human) intelligence and also is necessary for adaptation of all living organisms to a changing environment.
Human knowledge is mainly based on observations of repeatable events. It is a result of generalization from many observed instances. This process is called induction.
Some kind of knowledge can be purely data-driven (empirical) or it can be scientific. Data-driven knowledge can predict some kind of events. Scientific knowledge can predict and explain (in terms of a small number of basic concepts) many seemingly unrelated events.
Knowledge is represented by models. Data-driven modeling poses a set of challenging questions:
- How to measure the quality of explanations and predictions?
- Under what conditions is it possible to produce good predictions?
- How to decide what is the most plausible model when several models are equally good at predicting some kind of events?
!!! Note In general Learning from Data is a scientific discipline that is concerned with the design and development of algorithms that allow computers to infer (from data) a model that allows compact representation (unsupervised learning) and/or good generalization (supervised learning).
Learning from data can be approached from two distinct (and sometimes complementary) disciplines: statistical modeling and statistical learning theory.
In classical statistics, regularity in data from an unkwnown system is described via a probabilistic model or statistical distribution that represents the real world. Thus, given observations of data samples, our goal is to estimate this distribution. The main assumption in this case is that the distribution can be used for both explaining the system and making predictions.
In practice, this approach can be too limited because probabilities depend on too many unknown factors or becasue we don't know how the world works. In this case, the alternative approach is to make predictions/decisions based on a model devised to minimize the known risk of past events, resulting in a function that is estimated from noisy samples. This approach does not explain why things happen, but imitates certain properties of the unknown statistical system that are useful for succesful predictions.
!!! Note The risk minimization (machine learning) approach is also the main practical alternative when the number of data samples is reduced by comparison with the number of features of the data. In this scenario the classical approach falls apart.
Deep learning can be used in both approaches and in this course we will see several examples of its application.
This is an important technology because it enables computational systems to adaptively improve their performance with experience accumulated from the observed data.
!!! Warning Learning from Data is not the answer to all questions. Observed data represents the world as it is. If we want to explore alternative worlds we must rely in other disciplines: causality, imagination, etc.
**Learning from Data** is the answer to *"How"* questions. *"Why"* and *"What if"* questions are beyond its scope.
In the next we will interpret learning as function estimation from noisy samples. This estimation is based in past observations, known as training data. The estimated function is used to predict values for new inputs.
The most common learning tasks are supervised learning and regression.
Consider an unknown joint probability distribution
!!! Note Notation
Simple letters such as
Capital letters such as $X$ represent random variables.
Assume we have a training dataset
!!! Note In probability theory and statistics, a collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
Supervised learning is usually concerned with the two following (inference) problems:
-
Classification: Given a set of
$(\mathbf{x}_i, y_i) \in \mathcal{X}\times\mathcal{Y} = \mathbb{R}^p \times {1, ..., C}$ , for$i=1, ..., N$ , we want to estimate for any new$\mathbf{x}$ ,$\operatorname{argmax}_y P(Y=y|X=\mathbf{x})$ . Informally, we can say that Classification consists in identifying a decision boundary between elements of distinct classes. -
Regression: Given a set of
$(\mathbf{x}_i, y_i) \in \mathcal{X}\times\mathcal{Y} = \mathbb{R}^p \times \mathbb{R}$ , for$i=1, ..., N$ , we want to estimate for any new$\mathbf{x}$ , a sample statistic such as the expectation:$\mathbb{E}\left[ Y|X=\mathbf{x} \right]$ . Informally we can say that regression aims at estimating numerical relationships among several variables.
!!! Note Definition A sample statistic is any quantity computed from values in a sample that is used for a statistical purpose such as representing the sample.
In general, inference is concerned with the conditional estimation
!!! Warning
We are in front of an interpolation problem. In order to produce a good estimation for any new
Consider a function
Let's suppose the predictions of this function can be evaluated through a loss function
These are some examples:
!!! Note Regression Loss Function
The most traditional regression loss function is the squared error loss function:
!!! Note Classification Loss Function
The most simple classification loss function is the zero-one loss function:
As we said before, there are two goals of learning: explanation and prediction. Using the loss function we can measure the quality of these goals.
Given a set of traning data, the empirical risk can be defined as:
It measures how well a model fits or explains training data.
The quality of prediction is more difficult to evaluate, but if test data (data that have not been used to build the model) is available, we can assume that the quality of the prediction is well approximated simply by average test error. This is called prediction risk:
!!! Note Definition
The goal of learning is estimating a function
The process of estimating a predictive model involves two components:
- Input samples
$(\mathbf{x},y)$ , including training and test data, generated according to some unknown conditional probability desity$P(Y|\mathbf{X})$ . - A predictive estimator, implemented by an algorithm that selects the best model from a given set of functions (or admissible models)
$f(\mathbf{x}, \omega), \omega \in \Omega$ , where$\Omega$ is a set of parameters used to index the set of functions. This set of functions must be set a priori.
Let
The capacity of an hypothesis space induced by a learning algorithm intuitively represents the ability to find a good model
!!! Warning
The linear decision function
In practice, capacity can be controlled through hyper-parameters of the learning algorithm. For example:
- The degree of the family of polynomials;
- The number of units and layers in a neural network;
- Etc.
Statistical Learning Theory states that our best option when learning from data is to look for a function
This means that given
Unfortunately, since
However, if we have i.i.d. training data
It can be shown that this estimate is unbiased and can be used for finding a good enough approximation of
Under some mild assumptions, empirical risk minimizers converge:
Most machine learning algorithms, including neural networks, implement empirical risk minimization.
Most of these algorithms are based on the iterative solution of a mathematical problem that involves data and model. If there was an analytical solution to the problem, this should be the adopted one, but this is not the case in general.
So, the most common strategy for learning from data is based on finding a series of parameters of the model that minimizes a mathematical problem. This is called optimization.
The most important technique for solving optimization problems is gradient descend.
!!! Note Homework Reading: How do I evaluate a model?
Reading: [How to measure classification performance?](https://becominghuman.ai/understand-classification-performance-metrics-cad56f2da3aa)
!!! Warning The maximum or minimum over the entire function is called a global maximum or minimum. There is only one global maximum (and one global minimum) but there can be more than one local maximum or minimum. The most common optimization algorithms, such as Gradient Descend, are not devised to find global extrema and can be "attracted" by local extrema. Deep Learning circumvents this problem by using Stochastic Gradient Descend, a noisy version of Gradient Descend.
In the following we will only consider minimization of $f(x)$. Maximization can be implemented by considering the minimization of $-f(x)$.
The most simple thing we can try to minimize a function
This simple algorithm has a severe limitation: it can't get closer to the true minima than the step size we consider to implement "not far from".
The Nelder-Mead method dynamically adjusts the step size based on the loss of the new point. If the new point is better than any previously seen value, it expands the step size to accelerate towards the bottom. Likewise if the new point is worse it contracts the step size to converge around the minima. The usual settings are to half the step size when contracting and double the step size when expanding.
This method can be easily extended into higher dimensional examples, all that's required is taking one more point than there are dimensions. Then, the simplest approach is to replace the worst point with a point reflected through the centroid of the remaining n points. If this point is better than the best current point, then we can try stretching exponentially out along this line. On the other hand, if this new point isn't much better than the previous value, then we are stepping across a valley, so we shrink the step towards a better point.
!!! Note Interactive: Nelder Mead Optimization Method See "An Interactive Tutorial on Numerical Optimization"
Let's suppose that we have a function
Our objective is to find the argument
The derivative of
The derivative specifies how to scale a small change in the input in order to obtain the corresponding change in the output:
The limit as
!!! Note Numerical derivatives
It can be shown that the “centered difference formula" is better when computing numerical derivatives:
$$ \lim_{h \rightarrow 0} \frac{f(x + h) - f(x - h)}{2h} $$
The error in the "finite difference" approximation can be derived from Taylor's theorem and, assuming that
The derivative tells how to chage
- Start from a random
$x$ value. - Compute the derivative
$f'(x) = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x - h)}{2h}$ . - Walk a small step (possibly weighted by the derivative module) in the opposite direction of the derivative, because we know that
$f(x - h \mbox{ sign}(f'(x))$ is less than$f(x)$ for small enough$h$ .
The search for the minima ends when the derivative is zero because we have no more information about which direction to move.
- A minimum (maximum) is a critical point where
$f(x)$ is lower (higher) than at all neighboring points. - There is a third class of critical points: saddle points.
If
There are two problems with numerical derivatives:
- They are approximate.
- They are slower than necessary to evaluate (we need two function evaluations: $f(x + h) , f(x - h)$).
Our knowledge from Calculus could help!
We know that we can get an analytical expression of the derivative for some functions.
For example, let's suppose we have a simple quadratic function,
As a first approach, we can solve this analytically using Calculus, by finding the derivate
\begin{equation} \begin{split} 2x-6 & = & 0 \ 2x & = & 6 \ x & = & 3 \ \end{split} \end{equation}
But you can also find the local minimum by using gradient descend:
- Start from a random
$x$ value. - Compute the derivative
$f'(x)$ analitically. - Walk a small step in the opposite direction of the derivative.
!!! Note Exercise Open this notebook in Colab and solve exercise 1.
Let's consider a
Our objective is to find the argument
The gradient of
The gradient points in the direction of the greatest rate of increase of the function.
Then, we can follow this steps to maximize (or minimize) the function:
- Start from a random
$\mathbf{x}$ vector. - Compute the gradient vector.
- Walk a small step in the opposite direction of the gradient vector.
!!! Note
It is important to be aware that this gradient computation is very expensive: if
Implementation in Python:
def fin_dif_partial_centered(x,
f,
i,
h=1e-6):
'''
This method returns the partial derivative of the i-th component of f at x
by using the centered finite difference method
'''
w1 = [x_j + (h if j==i else 0) for j, x_j in enumerate(x)]
w2 = [x_j - (h if j==i else 0) for j, x_j in enumerate(x)]
return (f(w1) - f(w2))/(2*h)
def gradient_centered(x,
f,
h=1e-6):
'''
This method returns the gradient vector of f at x
by using the centered finite difference method
'''
return[round(fin_dif_partial_centered(x,f,i,h), 10) for i,_ in enumerate(x)]
def f(x):
return sum(x_i**2 for x_i in x)
x = [1.0,1.0,1.0]
gradient_centered(x,f)
>>> 3.000000 [2.0000000001, 2.0000000001, 2.0000000001]The step size, alpha or
There are several policies to follow when selecting the step size:
- Constant size steps. In this case, the size step determines the precision of the solution.
- Decreasing step sizes.
- At each step, select the optimal step.
The last policy is good, but too expensive. In this case we would consider a fixed set of values.
!!! Note Exercise Open this Notebook in Colab.
In general, we have:
- A dataset
$(\mathbf{x},y)$ of$N$ examples. - A target function
$f_\mathbf{w}$ , that we want to minimize, representing the discrepancy between our data and the model we want to fit. The model is indexed by a set of parameters$\mathbf{w}$ . - The gradient of the target function,
$g_f$ .
In the most common case
For example,
-
$\mathbf{x}$ : the behavior of a "Candy Crush" player;$y$ : monthly payments. -
$\mathbf{x}$ : sensor data about your car engine;$y$ : probability of engine error. -
$\mathbf{x}$ : finantial data of a bank customer;$y$ : customer rating.
Let's suppose that our model is a one-dimensional linear model
We can implement gradient descend in the following way (batch gradient descend):
import numpy as np
import random
# f = 2x
x = np.arange(10)
y = np.array([2*i for i in x])
# f_target = 1/n Sum (y - wx)**2
def target_f(x,y,w):
return np.sum((y - x * w)**2.0) / x.size
# gradient_f = 2/n Sum 2wx**2 - 2xy
def gradient_f(x,y,w):
return 2 * np.sum(2*w*(x**2) - 2*x*y) / x.size
def step(w,grad,alpha):
return w - alpha * grad
def BGD(target_f,
gradient_f,
x,
y,
toler = 1e-6,
alpha=0.01):
'''
Batch gradient descend by using a given step
'''
w = random.random()
val = target_f(x,y,w)
i = 0
while True:
i += 1
gradient = gradient_f(x,y,w)
next_w = step(w, gradient, alpha)
next_val = target_f(x,y,next_w)
if (abs(val - next_val) < toler):
return w
else:
w, val = next_w, next_val
BGD(target_f, gradient_f, x, y)
>>> 2.000093We can also implement the multi-step version:
def BGD_multi_step(target_f,
gradient_f,
x,
y,
toler = 1e-6):
'''
Batch gradient descend by using a multi-step approach
'''
alphas = [100, 10, 1, 0.1, 0.001, 0.00001]
w = random.random()
val = target_f(x,y,w)
i = 0
while True:
i += 1
gradient = gradient_f(x,y,w)
next_ws = [step(w, gradient, alpha) for alpha in alphas]
next_vals = [target_f(x,y,w) for w in next_ws]
min_val = min(next_vals)
next_w = next_ws[next_vals.index(min_val)]
next_val = target_f(x,y,next_w)
if (abs(val - next_val) < toler):
return w
else:
w, val = next_w, next_val
BGD_multi_step(target_f, gradient_f, x, y)
>>> 1.999616It is called (batch) gradient descend because at every step, when computing
next_val = target_f(x,y,next_w)we are using the whole dataset!
The last function evals the whole dataset
When learning from data, the cost function is additive: it is computed by adding sample reconstruction errors. In this case, it can be shown that we can compute a good estimate of the gradient (and move towards the minimum) by using only one data sample (or a small data sample).
Thus, we will find the minimum by iterating this gradient estimation over the dataset.
If we apply this method we have some theoretical guarantees to find a good minimum:
- SGD essentially uses the inaccurate gradient per iteration. Since there is no free food, what is the cost by using approximate gradient? The answer is that the convergence rate is slower than the gradient descent algorithm.
- The convergence of SGD has been analyzed using the theories of convex minimization and of stochastic approximation: it converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.
!!! Note Definition A full iteration over the dataset is called epoch.
!!! Warning In order to fullfill the guarantees of SGD, at every epoch, data must be processed in a random order.
In general, batch gradient descent looks something like this:
nb_epochs = 100
for i in range(nb_epochs):
grad = evaluate_gradient(target_f, data, w)
w = w - learning_rate * gradFor a pre-defined number of epochs, we first compute the gradient vector of the target function for the whole dataset w.r.t. our parameter vector.
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example and label:
nb_epochs = 100
for i in range(nb_epochs):
np.random.shuffle(data)
for sample in data:
grad = evaluate_gradient(target_f, sample, w)
w = w - learning_rate * gradMini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of
nb_epochs = 100
for i in range(nb_epochs):
np.random.shuffle(data)
for batch in get_batches(data, batch_size=50):
grad = evaluate_gradient(target_f, batch, w)
w = w - learning_rate * gradMinibatch SGD has the advantage that it works with a slightly less noisy estimate of the gradient. However, as the minibatch size increases, the number of updates done per computation done decreases (eventually it becomes very inefficient, like batch gradient descent).
There is an optimal trade-off (in terms of computational efficiency) that may vary depending on the data distribution and the particulars of the class of function considered, as well as how computations are implemented.
Loss functions
In classification this function is often the zero-one loss, that is,
In regression problems, the most common loss function is the square loss function:
The square loss function can be re-written and utilized for classification:
The hinge loss function is defined as:
The hinge loss provides a relatively tight, convex upper bound on the 0–1 Loss.
This function displays a similar convergence rate to the hinge loss function, and since it is continuous, simple gradient descent methods can be utilized.
Cross-Entropy is a loss function that is very used for training multiclass problems. We'll focus on models that assume that classes are mutually exclusive.
In this case, our labels have this form
C.Shannon showed that if you want to send a series of messages composed of symbols from an alphabet with distribution
The optimal number of bits is known as entropy:
Cross entropy is the number of bits we'll need if we encode symbols by using a wrong distribution
In our case, the real distribution is
Cross entropy is used in combination with Softmax classifier. In order to classify
where
SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum.
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction of the update vector of the past time step to the current update vector:
The momentum
However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.
Nesterov accelerated gradient (NAG) is a way to give our momentum term this kind of prescience. We know that we will use our momentum term
All previous approaches manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter.
Adagrad is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters.
RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving:
where
(Image credit: Alec Radford)
(Image credit: Alec Radford)
Vapnik, V. (1992). Principles of risk minimization for learning theory. In Advances in neural information processing systems (pp. 831-838).
[fast_rewind Previous Page](deep0.html) [Next Page fast_forward](deep2.html)



