Neural Network Notational Symbols
x - input vector
h - hidden layer output vector
y - ground truth vector
p - predicted output probability vector
n - # of neurons (in a layer)
l - layer superscript
i - index subscript
s - index of a neuron in the current layer
t - index of a neuron in the previous layer
W - weight matrix (of a layer)
b - bias vector
σ - activation function (applied element-wise on a layer); a sigmoid is used here, but more generally the activation function need not be a sigmoid.
z - cumulative weighted input going into a layer, on which the activation function is applied.
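Putting these symbols together, the forward computation of layer l (as used in the backprop walk-through below) is: z[l] = W[l] . h[l-1] + b[l] and h[l] = σ(z[l]), with h[0] = x and the final layer's h being the prediction p.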
Backpropagation "Intuition"
For a 4-layer network, i.e., input layer + 2 hidden layers + output layer (layers 0, 1, 2, 3), the gradients are as follows (a NumPy sketch of these steps follows the list):
- dz[3], i.e., the derivative of the loss function w.r.t. z for layer 3 (output), = (p - y), i.e., the difference between the predicted output and the actual output (ground truth).
- dW[3], i.e., the derivative of the loss function w.r.t. the weights W for layer 3, = dz[3] . (h[2])T, i.e., dz[3] scaled by the (activated) output of layer 2.
  - dh[2], i.e., the derivative of the loss function w.r.t. the (activated) output of layer 2, = (W[3])T . dz[3], i.e., the weights going into layer 3 multiplied by dz[3].
- dz[2], i.e., the derivative of the loss function w.r.t. z for layer 2, = dh[2] ⓧ σ'(z[2]), i.e., dh[2] multiplied element-wise with the derivative of layer 2's (activated) output w.r.t. z[2].
- dW[2], i.e., the derivative of the loss function w.r.t. the weights for layer 2, = dz[2] . (h[1])T, i.e., dz[2] scaled by the (activated) output of layer 1.
  - dh[1], i.e., the derivative of the loss function w.r.t. the (activated) output of layer 1, = (W[2])T . dz[2], i.e., the layer-2 weights multiplied by dz[2].
- dz[1], i.e., the derivative of the loss function w.r.t. z for layer 1, = dh[1] ⓧ σ'(z[1]), i.e., dh[1] multiplied element-wise with the derivative of layer 1's (activated) output w.r.t. z[1].
- dW[1], i.e., the derivative of the loss function w.r.t. the weights for layer 1, = dz[1] . (x)T, i.e., dz[1] scaled by the input x.
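To make the above concrete, here is a minimal NumPy sketch of these per-example steps. It assumes sigmoid activations in every layer and a loss whose gradient w.r.t. z[3] comes out to (p - y) (as with a sigmoid or softmax output paired with cross-entropy loss); the layer sizes and helper names (sigmoid, sigmoid_prime, forward, backward) are illustrative, not part of the original derivation. Bias gradients are omitted here, mirroring the walk-through above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, W, b):
    """Forward pass: collect pre-activations z[l] and activations h[l], with h[0] = x.

    Note: Python list index l here corresponds to layer l+1 in the post's notation."""
    hs, zs = [x], []
    for l in range(len(W)):
        z = W[l] @ hs[-1] + b[l]
        zs.append(z)
        hs.append(sigmoid(z))
    return zs, hs

def backward(x, y, W, b):
    """Per-example weight gradients, following the steps listed above (biases skipped)."""
    zs, hs = forward(x, W, b)
    p = hs[-1]                                  # predicted output vector
    dW = [None] * len(W)
    dz = p - y                                  # dz[3] = p - y
    for l in range(len(W) - 1, -1, -1):
        dW[l] = np.outer(dz, hs[l])             # dW[l] = dz[l] . (h[l-1])T
        if l > 0:
            dh = W[l].T @ dz                    # dh[l-1] = (W[l])T . dz[l]
            dz = dh * sigmoid_prime(zs[l - 1])  # dz[l-1] = dh[l-1] ⓧ σ'(z[l-1])
    return dW

# Example: layer sizes 5 (input), 4, 3, 2 (output), i.e., a 4-layer network.
sizes = [5, 4, 3, 2]
rng = np.random.default_rng(0)
W = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
b = [rng.standard_normal(sizes[i + 1]) for i in range(3)]
x, y = rng.standard_normal(5), np.array([1.0, 0.0])
print([g.shape for g in backward(x, y, W, b)])  # [(4, 5), (3, 4), (2, 3)]
```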
Backpropagation Algo (batch-based)
Firstly, note that the loss is computed over a batch. In a way, the batch acts as a proxy for the whole dataset. Hence, for a batch size of m, the average loss is:

average loss = (1/m) Σ(i=1..m) L(p(i), y(i))

This is the average loss over the m data points of the batch; p(i) is the network output (the vector p) for the i-th data point, and y(i) is the corresponding ground truth. Let's denote a batch of input data points by the matrix X, with each data point x as a column of X. The backpropagation algorithm for a batch is:

- for l in 1, 2, 3 (forward pass, with H[0] = X):
  - Z[l] = W[l] . H[l-1] + b[l] (the bias is added to every column)
  - H[l] = σ(Z[l])
- dZ[3] = P - Y
- for l in 3, 2, 1 (backward pass):
  - dW[l] = (1/m) dZ[l] . (H[l-1])T
  - db[l] = (1/m) Σ (columns of dZ[l])
  - dH[l-1] = (W[l])T . dZ[l]
  - dZ[l-1] = dH[l-1] ⓧ σ'(Z[l-1])

(At l = 1, the last two steps, dH[0] and dZ[0], are not needed and can be skipped.) Note that earlier z represented a vector, while now Z represents the matrix consisting of the z vectors of all the data points in X (each being a column of Z). Similarly, H, P, and Y are all matrices of the corresponding individual vectors stacked side by side.
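As a companion to the per-example sketch above, here is a minimal NumPy version of the batch algorithm, under the same assumptions (sigmoid activations in every layer, dZ[3] = P - Y); X and Y hold one data point per column, and the helper names are again illustrative.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_prime(Z):
    S = sigmoid(Z)
    return S * (1.0 - S)

def batch_backprop(X, Y, W, b):
    """Batch gradients dW[l], db[l]; X and Y hold one data point per column."""
    m = X.shape[1]                              # batch size
    Hs, Zs = [X], []                            # forward pass, H[0] = X
    for l in range(len(W)):
        Z = W[l] @ Hs[-1] + b[l][:, None]       # bias added to every column
        Zs.append(Z)
        Hs.append(sigmoid(Z))
    P = Hs[-1]                                  # predictions, one column per example
    dW, db = [None] * len(W), [None] * len(W)
    dZ = P - Y                                  # dZ[3] = P - Y
    for l in range(len(W) - 1, -1, -1):
        dW[l] = (dZ @ Hs[l].T) / m              # dW[l] = (1/m) dZ[l] . (H[l-1])T
        db[l] = dZ.sum(axis=1) / m              # db[l] = (1/m) Σ columns of dZ[l]
        if l > 0:
            dH = W[l].T @ dZ                    # dH[l-1] = (W[l])T . dZ[l]
            dZ = dH * sigmoid_prime(Zs[l - 1])  # dZ[l-1] = dH[l-1] ⓧ σ'(Z[l-1])
    return dW, db
```

The 1/m factors come from differentiating the average loss over the batch, which is also why db[l] works out to the column-wise mean of dZ[l].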