Similarly, we may want to predict more than one variable from the data that we're given. This can easily be accommodated by adding more output units (Fig. 2). The loss function for a network with multiple outputs is obtained simply by adding the loss for each output unit together. The network now has a typical layered structure: a layer of input units (and the bias), connected by a layer of weights to a layer of output units.

(Fig. 1) | (Fig. 2) |

In order to train neural networks such as the ones shown above by gradient
descent, we need to be able to compute the gradient *G* of the loss
function with respect to each weight *w _{ij}* of the network.
It tells us how a small change in that weight will affect the overall
error

(1) |

where *o* ranges over the output units of the network. (Note that
we use the superscript *p* to denote the training point - this is
not an exponentiation!) Since differentiation and summation are
interchangeable, we can likewise split the gradient into separate
components for each training point:

(2) |

In what follows, we describe the computation of the gradient for a single
data point, omitting the superscript *p* in order to make the notation
easier to follow.

First use the **chain rule** to decompose the gradient into two factors:

(3) |

The first factor can be obtained by differentiating Eqn. 1 above:

(4) |

Using , the second factor becomes

(5) |

Putting the pieces (equations 3-5) back together, we obtain

(6) |

To find the gradient *G* for the entire data set, we sum at each
weight the contribution given by equation 6 over all the data points.
We can then subtract a small proportion *µ* (called the
**learning rate**) of *G* from the weights to perform gradient
descent.

The algorithm terminates once we are at, or sufficiently near to, the minimum of the error function, where G = 0. We say then that the algorithm has

- Initialize all weights to small random values.

- REPEAT until done

- For each weight
wset_{ij}^{}- For each data point (
x, t)^{p}

- set input units to
x- compute value of output units
- For each weight
wset_{ij}^{}

- For each weight
wset_{ij}^{}converged.In summary:

general case linear network Training data ( x,t)( x,t)Model parameters wwModel y=g(w,x)Error function E( y,t)Gradient with respect to w_{ij}- (t_{i}- y_{i}) y_{j}Weight update rule

## The Learning Rate

An important consideration is the learning rate

µ, which determines by how much we change the weightswat each step. Ifµis too small, the algorithm will take a long time to converge (Fig. 3).

(Fig. 3) Conversely, if

µis too large, we may end up bouncing around the error surface out of control - the algorithmdiverges(Fig. 4). This usually ends with an overflow error in the computer's floating-point arithmetic.

(Fig. 4)

## Batch vs. Online Learning

Above we have accumulated the gradient contributions for all data points in the training set before updating the weights. This method is often referred to as

batchlearning. An alternative approach isonlinelearning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradientG(Fig. 5), this is also calledstochastic(noisy) gradient descent.

(Fig. 5) Online learning has a number of advantages:

These advantages are, however, bought at a price: many powerful optimization techniques (such as: conjugate and second-order gradient methods, support vector machines, Bayesian methods, etc.) - which we will not talk about in this course! - are batch methods that cannot be used online. (Of course this also means that in order to implement batch learning really well, one has to learn an awful lot about these rather complicated methods!)

- it is often much faster, especially when the training set is
redundant(contains many similar data points),- it can be used when there is no fixed training set (new data keeps coming in),
- it is better at tracking
nonstationaryenvironments (where the best model gradually changes over time),- the noise in the gradient can help to escape from
local minima(which are a problem for gradient descent in nonlinear models).A compromise between batch and online learning is the use of "mini-batches": the weights are updated after every

ndata points, wherenis greater than 1 but smaller than the training set size.In order to keep things simple, we will focus very much on online learning, where plain gradient descent is among the best available techniques. Online learning is also highly suitable for implementing things such as reactive control strategies in adapative agents, and should thus fit in well with the rest of your course.

goto top of page [Next: Multi-layer networks] [Back to the first page]