Neural Network Notational Symbols
x - input vector
h - hidden layer output vector
y - ground truth vector
p - predicted output probability vector
c - # of output classes
n - # of neurons (in a layer)
l - layer superscript
i - index subscript
s - index for a neuron in the current layer
t - index for a neuron in the previous layer
W - weight matrix (of a layer)
b - bias vector
σ - sigmoid (activation) function (applied element-wise on a layer). More generally, the activation function need not be a sigmoid.
z - cumulative weighted input going into a layer, on which the activation function is applied.
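As a quick illustration of how these symbols fit together for a single layer, here is a minimal NumPy sketch; the layer sizes and variable names are made up for illustration and are not part of the notes.

```python
# Minimal sketch (illustrative sizes/names, not from the notes): one layer of the network.
import numpy as np

n_prev, n = 4, 3                       # previous-layer size and current-layer size (arbitrary)

h_prev = np.random.randn(n_prev)       # h of the previous layer (or the input x for layer 1)
W = np.random.randn(n, n_prev)         # W: weight matrix of the layer, shape (n, n_prev)
b = np.random.randn(n)                 # b: bias vector, shape (n,)

z = W @ h_prev + b                     # z: cumulative weighted input going into the layer
h = 1.0 / (1.0 + np.exp(-z))           # h: layer output, sigma(z) with a sigmoid activation
print(z.shape, h.shape)                # both (n,)
```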
Backpropagation "Intuition"
For a 4-layer network, i.e., input + 2 hidden layers + output layer (layers 0, 1, 2, 3). A NumPy sketch of these equations follows the list.
- dz[3] (the differential of the loss function w.r.t. z for layer 3, the output layer) = p − y, i.e., the difference between the predicted output and the actual output (ground truth); this holds for a softmax output trained with cross-entropy loss.
- dW[3] (the differential of the loss function w.r.t. the weights for layer 3) = dz[3] · (h[2])^T, i.e., the differential of the loss w.r.t. z for layer 3 affected (multiplied) by the (activated) output of layer 2.
----
- dh[2] (the differential of the loss function w.r.t. the (activated) output of layer 2) = (W[3])^T · dz[3], i.e., the weights going into layer 3 affected (multiplied) by the differential of the loss w.r.t. z for layer 3.
- dz[2] (the differential of the loss function w.r.t. z for layer 2) = dh[2] ⊗ σ'(z[2]), i.e., the differential of the loss w.r.t. the layer-2 output multiplied element-wise with the differential of the layer-2 (activated) output w.r.t. z for layer 2.
- dW[2] (the differential of the loss function w.r.t. the weights for layer 2) = dz[2] · (h[1])^T, i.e., the (activated) output of layer 1 affected (multiplied) by the differential of the loss w.r.t. z for layer 2.
----
- dh[1] (the differential of the loss function w.r.t. the (activated) output of layer 1) = (W[2])^T · dz[2], i.e., the weights going into layer 2 affected (multiplied) by the differential of the loss w.r.t. z for layer 2.
- dz[1] (the differential of the loss function w.r.t. z for layer 1) = dh[1] ⊗ σ'(z[1]), i.e., the differential of the loss w.r.t. the layer-1 output multiplied element-wise with the differential of the layer-1 (activated) output w.r.t. z for layer 1.
- dW[1] (the differential of the loss function w.r.t. the weights for layer 1) = dz[1] · x^T, i.e., the input affected (multiplied) by the differential of the loss w.r.t. z for layer 1.
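These equations are easy to check numerically. Below is a minimal single-example NumPy sketch; the layer sizes, the random initialisation, and the use of a softmax output with cross-entropy loss (which is what makes dz[3] = p − y) are assumptions for illustration, not part of the notes.

```python
# Single-example sketch of the 4-layer forward pass and the backward-pass equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Assumed sizes: input 5, hidden 4, hidden 3, output c = 2 classes.
W1, b1 = rng.standard_normal((4, 5)), np.zeros(4)
W2, b2 = rng.standard_normal((3, 4)), np.zeros(3)
W3, b3 = rng.standard_normal((2, 3)), np.zeros(2)

x = rng.standard_normal(5)
y = np.array([1.0, 0.0])               # one-hot ground truth

# Forward pass
z1 = W1 @ x + b1;  h1 = sigmoid(z1)
z2 = W2 @ h1 + b2; h2 = sigmoid(z2)
z3 = W3 @ h2 + b3; p = softmax(z3)     # softmax output, cross-entropy loss assumed

# Backward pass, mirroring the bullets above
dz3 = p - y                            # dz[3] = p - y
dW3 = np.outer(dz3, h2)                # dW[3] = dz[3] · (h[2])^T
dh2 = W3.T @ dz3                       # dh[2] = (W[3])^T · dz[3]
dz2 = dh2 * h2 * (1 - h2)              # dz[2] = dh[2] ⊗ σ'(z[2]); σ'(z) = σ(z)(1 - σ(z))
dW2 = np.outer(dz2, h1)                # dW[2] = dz[2] · (h[1])^T
dh1 = W2.T @ dz2                       # dh[1] = (W[2])^T · dz[2]
dz1 = dh1 * h1 * (1 - h1)              # dz[1] = dh[1] ⊗ σ'(z[1])
dW1 = np.outer(dz1, x)                 # dW[1] = dz[1] · x^T
```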
Backpropagation Algorithm (batch-based)
First, note that the loss L is computed over a batch; the batch acts as a proxy for the whole dataset. Hence, for a batch size of m, the average loss is:
$\frac{1}{m}\sum_{x \in B} L(N(x), y)$
This is the average loss over the m data points of the batch. N(x) is the network output (the vector p). Let's denote a batch of input data points by the matrix B. The backpropagation algorithm for a batch is:
- $H^0 = B$
- for $l$ in $[1, 2, \ldots, L]$:
    - $H^l = \sigma(W^l \cdot H^{l-1} + b^l)$
- $P = \text{normalize}(\exp(W^o \cdot H^L + b^o))$, i.e., a softmax over the output layer
- $L = -\frac{1}{m}\, Y^T \cdot \log(P)$
- $DZ^{L+1} = P - Y$
- $DW^{L+1} = DZ^{L+1} \cdot (H^L)^T$
- for $l$ in $[L, L-1, \ldots, 1]$:
    - $dH^l = (W^{l+1})^T \cdot DZ^{l+1}$ (with $W^{L+1} \equiv W^o$)
    - $DZ^l = dH^l \otimes \sigma'(Z^l)$
    - $DW^l = DZ^l \cdot (H^{l-1})^T$
    - $Db^l = DZ^l \cdot \mathbf{1}_m$, i.e., the columns of $DZ^l$ summed over the batch
Note that earlier $dz^l$ represented a vector, while $DZ^l$ now represents the matrix consisting of the $dz^l$ vectors of all the data points in $B$ (each $dz^l$ being a column of $DZ^l$). Similarly, $dH^l$, $Y$, $P$ and $H^l$ are all matrices of the corresponding individual vectors stacked side by side.
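For concreteness, here is a NumPy sketch of the batched algorithm; the layer sizes, the column-per-data-point batch layout, the explicit bias gradient for the output layer, and the placement of the 1/m factor inside the gradients are assumptions made for this illustration, not something the notes fix.

```python
# Batched sketch of the algorithm above (illustrative sizes/names).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_cols(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)     # normalize(exp(.)) column-wise

rng = np.random.default_rng(0)
m, L = 8, 2                                     # batch size, number of hidden layers
sizes = [5, 4, 3, 2]                            # input, hidden layers, output (c classes)

# Ws[l] maps layer l to layer l+1, so Ws[L] plays the role of W^o.
Ws = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(L + 1)]
bs = [np.zeros((sizes[l + 1], 1)) for l in range(L + 1)]

B = rng.standard_normal((sizes[0], m))          # batch: one data point per column
Y = np.eye(sizes[-1])[:, rng.integers(0, sizes[-1], m)]   # one-hot labels as columns

# Forward pass
H = [B]                                         # H^0 = B
for l in range(L):
    H.append(sigmoid(Ws[l] @ H[-1] + bs[l]))    # H^l = σ(W^l · H^{l-1} + b^l)
P = softmax_cols(Ws[L] @ H[L] + bs[L])          # P = normalize(exp(W^o · H^L + b^o))
loss = -np.sum(Y * np.log(P)) / m               # average cross-entropy over the batch

# Backward pass (the 1/m from the averaged loss is folded into the gradients here)
dWs = [None] * (L + 1)
dbs = [None] * (L + 1)
DZ = P - Y                                      # DZ^{L+1}
dWs[L] = DZ @ H[L].T / m                        # DW^{L+1} = DZ^{L+1} · (H^L)^T
dbs[L] = DZ.sum(axis=1, keepdims=True) / m      # Db^{L+1}: DZ^{L+1} summed over the batch
for l in range(L, 0, -1):                       # l = L, L-1, ..., 1
    dH = Ws[l].T @ DZ                           # dH^l = (W^{l+1})^T · DZ^{l+1}
    DZ = dH * H[l] * (1 - H[l])                 # DZ^l = dH^l ⊗ σ'(Z^l), using σ' = H^l(1 - H^l)
    dWs[l - 1] = DZ @ H[l - 1].T / m            # DW^l = DZ^l · (H^{l-1})^T
    dbs[l - 1] = DZ.sum(axis=1, keepdims=True) / m  # Db^l: DZ^l summed over the batch
```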