Thursday, October 24, 2019

Neural Network Notational Symbols



x  - input vector
h - hidden layer output vector
y - ground truth vector
p - predicted output probability vector

c - # of output classes
n - # of neurons (in a layer)

l - layer superscript
i - index subscript
s - index for a neuron in the current layer
t - index for a neuron in the previous layer

W - weight matrix (of a layer)
b - bias vector
σ - sigmoid (activation) function (applied to a layer). More generally, the activation function need not be a sigmoid.
z - cumulative weighted input going into a layer, on which the activation function is applied.
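
To make the notation concrete, here is a minimal sketch (in NumPy, which the post itself does not use) of the forward computation for a single layer: z = W.h + b followed by h = σ(z). The shapes, variable names and helper function below are illustrative assumptions, not part of the notation above.

import numpy as np

def sigmoid(z):
    # σ: element-wise sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, h_prev):
    # z: cumulative weighted input going into the layer
    z = W @ h_prev + b
    # h: (activated) output of the layer
    h = sigmoid(z)
    return z, h

# Example: a layer of n = 4 neurons fed by a 3-dimensional input vector x,
# so W has shape (4, 3) and b has shape (4,).
x = np.array([0.5, -1.0, 2.0])
W = np.random.randn(4, 3)
b = np.zeros(4)
z, h = layer_forward(W, b, x)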

Backpropagation "Intuition"

For a 4-layer network,
i.e., input + 2 hidden layers + output layer, numbered as layers 0, 1, 2, 3 (see the code sketch after this list):
  1. dz[3] i.e., differential of the loss function w.r.t. z for layer 3 (output) = (p - y), i.e., the delta between the predicted output and the actual output (ground truth).
  2. dW[3] i.e., differential of the loss function w.r.t. the weights (W) for layer 3 (output) = dz[3] . (h[2])T, i.e., differential of the loss function w.r.t. z for layer 3 affected (multiplied) by the (activated) output of layer 2.
    ----
  3. dh[2] i.e., differential of the loss function w.r.t. the (activated) output of layer-2 = (W[3])T . dz[3], i.e., weights going into layer-3 affected (multiplied) by the differential of the loss function w.r.t. z for layer 3.
  4. dz[2] i.e., differential of the loss function w.r.t. z for layer-2 = dh[2] ⓧ σ'(z[2]), i.e., differential of the loss function w.r.t. the layer-2 output multiplied element-wise with the differential of the layer-2 (activated) output w.r.t. z for layer-2.
  5. dW[2] i.e., differential of the loss function w.r.t. the weights for layer 2 = dz[2] . (h[1])T, i.e., differential of the loss function w.r.t. z for layer-2 affected (multiplied) by the (activated) output of layer 1.
    ----
  6. dh[1] i.e., differential of the loss function w.r.t. the (activated) output of layer-1 = (W[2])T . dz[2], i.e., weights for layer-2 affected (multiplied) by the differential of the loss function w.r.t. z for layer 2.
  7. dz[1] i.e., differential of the loss function w.r.t. z for layer-1 = dh[1] ⓧ σ'(z[1]), i.e., differential of the loss function w.r.t. the layer-1 output multiplied element-wise with the differential of the layer-1 (activated) output w.r.t. z for layer-1.
  8. dW[1] i.e., differential of the loss function w.r.t. the weights for layer-1 = dz[1] . (x)T, i.e., differential of the loss function w.r.t. z for layer-1 affected (multiplied) by the input x.
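
Here is a minimal NumPy sketch of steps 1-8 for a single example. The dictionary layout of W, z and h (with h[0] = x), and the helper sigmoid_prime for σ'(z), are assumptions made for illustration; bias gradients are omitted to mirror the list above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # σ'(z) = σ(z) * (1 - σ(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_single(W, z, h, p, y):
    # W, z, h are dicts keyed by layer index 1..3; h[0] is the input x.
    grads = {}
    dz = p - y                                   # step 1: dz[3] = (p - y)
    for l in (3, 2, 1):
        grads[l] = np.outer(dz, h[l - 1])        # steps 2, 5, 8: dW[l] = dz[l] . (h[l-1])T
        if l > 1:
            dh = W[l].T @ dz                     # steps 3, 6: dh[l-1] = (W[l])T . dz[l]
            dz = dh * sigmoid_prime(z[l - 1])    # steps 4, 7: dz[l-1] = dh[l-1] ⓧ σ'(z[l-1])
    return grads                                 # grads[l] holds dW[l]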

Backpropagation Algo (batch-based)

Firstly, note that the loss is computed over a batch. In a way, the batch acts as a proxy for the whole dataset. Hence, for a batch size of m, the average loss is:

L = (1/m) · Σ_i ℓ(p_i, y_i),   for i = 1, ..., m

This is the average loss over the m data points of the batch, where p_i is the network output (the vector p) for the i-th data point, y_i is its ground truth, and ℓ is the per-example loss. Let's denote a batch of input data points by the matrix X. The backpropagation algorithm for a batch is:
  1. for l in 1, 2, 3:   (forward pass, with H[0] = X)
  2.     Z[l] = (W[l] . H[l-1]) + b[l]
  3.     H[l] = σ(Z[l])
  4. dZ[3] = P - Y   (with P = H[3] and Y the matrix of ground-truth vectors)
  5. for l in 3, 2, 1:   (backward pass)
    1. dW[l] = (1/m) · dZ[l] . (H[l-1])T
    2. db[l] = (1/m) · (row-wise sum of dZ[l])
    3. dH[l-1] = (W[l])T . dZ[l]   (skip for l = 1)
    4. dZ[l-1] = dH[l-1] ⓧ σ'(Z[l-1])   (skip for l = 1)

Note that x earlier represented a vector, while X now represents the matrix consisting of the x vectors of all the data points in the batch (each being a column of X). Similarly, Z[l], H[l], P and Y are all matrices of the corresponding individual vectors stacked side by side.
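
Below is a minimal NumPy sketch of this batched version for the 4-layer network, under a few assumptions not stated above: the biases b[l] are stored as column vectors so they broadcast across the batch, the per-example loss is taken to be cross-entropy (consistent with dZ[3] = P - Y for a softmax/sigmoid output), and the (1/m) factors implement the averaging over the batch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def batch_backprop(W, b, X, Y):
    # W, b are dicts keyed by layer index 1..3; b[l] has shape (n_l, 1).
    # X is (n_0 x m), Y is (c x m); each column is one data point.
    m = X.shape[1]
    # Forward pass: Z[l] = W[l] . H[l-1] + b[l], H[l] = σ(Z[l]), with H[0] = X
    H, Z = {0: X}, {}
    for l in (1, 2, 3):
        Z[l] = W[l] @ H[l - 1] + b[l]
        H[l] = sigmoid(Z[l])
    P = H[3]
    # Average (assumed cross-entropy) loss over the batch
    loss = -np.sum(Y * np.log(P + 1e-12)) / m
    # Backward pass
    grads = {}
    dZ = P - Y                                            # dZ[3] = P - Y
    for l in (3, 2, 1):
        grads[('W', l)] = (dZ @ H[l - 1].T) / m           # dW[l]
        grads[('b', l)] = np.sum(dZ, axis=1, keepdims=True) / m   # db[l]
        if l > 1:
            dZ = (W[l].T @ dZ) * sigmoid_prime(Z[l - 1])  # dZ[l-1] = dH[l-1] ⓧ σ'(Z[l-1])
    return loss, grads

The returned gradients would then drive a single parameter update, e.g. W[l] := W[l] - η · dW[l] for some learning rate η.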



