Neural Network Notational Symbols
x - input vector
h - hidden layer output vector
y - ground truth vector
p - predicted output probability vector
n - # of neurons (in a layer)
l - layer superscript
i - index subscript
s - index of a neuron in the current layer
t - index of a neuron in the previous layer
W - weight matrix (of a layer)
b - bias vector
σ - activation function (applied element-wise on a layer); a sigmoid is used here, but more generally the activation function need not be a sigmoid.
z - cumulative weighted input going into a layer, on which the activation function is applied.
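Putting these symbols together, the forward computation of layer l (as used in the backprop walk-through below) is: z[l] = W[l] . h[l-1] + b[l] and h[l] = σ(z[l]), with h[0] = x and the final layer's h being the prediction p.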
Backpropagation "Intuition"
For a 4-layer network, i.e., input layer + 2 hidden layers + output layer (layers 0, 1, 2, 3), the gradients are as follows (a NumPy sketch of these steps follows the list):
- dz[3], i.e., the derivative of the loss function w.r.t. z for layer 3 (output), = (p - y), i.e., the difference between the predicted output and the actual output (ground truth).
- dW[3], i.e., the derivative of the loss function w.r.t. the weights W for layer 3, = dz[3] . (h[2])T, i.e., dz[3] scaled by the (activated) output of layer 2.
  - dh[2], i.e., the derivative of the loss function w.r.t. the (activated) output of layer 2, = (W[3])T . dz[3], i.e., the weights going into layer 3 multiplied by dz[3].
- dz[2], i.e., the derivative of the loss function w.r.t. z for layer 2, = dh[2] ⓧ σ'(z[2]), i.e., dh[2] multiplied element-wise with the derivative of layer 2's (activated) output w.r.t. z[2].
- dW[2], i.e., the derivative of the loss function w.r.t. the weights for layer 2, = dz[2] . (h[1])T, i.e., dz[2] scaled by the (activated) output of layer 1.
  - dh[1], i.e., the derivative of the loss function w.r.t. the (activated) output of layer 1, = (W[2])T . dz[2], i.e., the layer-2 weights multiplied by dz[2].
- dz[1], i.e., the derivative of the loss function w.r.t. z for layer 1, = dh[1] ⓧ σ'(z[1]), i.e., dh[1] multiplied element-wise with the derivative of layer 1's (activated) output w.r.t. z[1].
- dW[1], i.e., the derivative of the loss function w.r.t. the weights for layer 1, = dz[1] . (x)T, i.e., dz[1] scaled by the input x.
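To make the above concrete, here is a minimal NumPy sketch of these per-example steps. It assumes sigmoid activations in every layer and a loss whose gradient w.r.t. z[3] comes out to (p - y) (as with a sigmoid or softmax output paired with cross-entropy loss); the layer sizes and helper names (sigmoid, sigmoid_prime, forward, backward) are illustrative, not part of the original derivation. Bias gradients are omitted here, mirroring the walk-through above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, W, b):
    """Forward pass: collect pre-activations z[l] and activations h[l], with h[0] = x.

    Note: Python list index l here corresponds to layer l+1 in the post's notation."""
    hs, zs = [x], []
    for l in range(len(W)):
        z = W[l] @ hs[-1] + b[l]
        zs.append(z)
        hs.append(sigmoid(z))
    return zs, hs

def backward(x, y, W, b):
    """Per-example weight gradients, following the steps listed above (biases skipped)."""
    zs, hs = forward(x, W, b)
    p = hs[-1]                                  # predicted output vector
    dW = [None] * len(W)
    dz = p - y                                  # dz[3] = p - y
    for l in range(len(W) - 1, -1, -1):
        dW[l] = np.outer(dz, hs[l])             # dW[l] = dz[l] . (h[l-1])T
        if l > 0:
            dh = W[l].T @ dz                    # dh[l-1] = (W[l])T . dz[l]
            dz = dh * sigmoid_prime(zs[l - 1])  # dz[l-1] = dh[l-1] ⓧ σ'(z[l-1])
    return dW

# Example: layer sizes 5 (input), 4, 3, 2 (output), i.e., a 4-layer network.
sizes = [5, 4, 3, 2]
rng = np.random.default_rng(0)
W = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
b = [rng.standard_normal(sizes[i + 1]) for i in range(3)]
x, y = rng.standard_normal(5), np.array([1.0, 0.0])
print([g.shape for g in backward(x, y, W, b)])  # [(4, 5), (3, 4), (2, 3)]
```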
Backpropagation Algo (batch-based)
Firstly, note that the loss is computed over a batch. In a way, the batch acts as a proxy for the whole dataset. Hence, for a batch size of m, the average loss is:

average loss = (1/m) Σ(i=1..m) L(p(i), y(i))

This is the average loss over the m data points of the batch; p(i) is the network output (the vector p) for the i-th data point, and y(i) is the corresponding ground truth. Let's denote a batch of input data points by the matrix X, with each data point x as a column of X. The backpropagation algorithm for a batch is:

- for l in 1, 2, 3 (forward pass, with H[0] = X):
  - Z[l] = W[l] . H[l-1] + b[l] (the bias is added to every column)
  - H[l] = σ(Z[l])
- dZ[3] = P - Y
- for l in 3, 2, 1 (backward pass):
  - dW[l] = (1/m) dZ[l] . (H[l-1])T
  - db[l] = (1/m) Σ (columns of dZ[l])
  - dH[l-1] = (W[l])T . dZ[l]
  - dZ[l-1] = dH[l-1] ⓧ σ'(Z[l-1])

(At l = 1, the last two steps, dH[0] and dZ[0], are not needed and can be skipped.) Note that earlier z represented a vector, while now Z represents the matrix consisting of the z vectors of all the data points in X (each being a column of Z). Similarly, H, P, and Y are all matrices of the corresponding individual vectors stacked side by side.
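As a companion to the per-example sketch above, here is a minimal NumPy version of the batch algorithm, under the same assumptions (sigmoid activations in every layer, dZ[3] = P - Y); X and Y hold one data point per column, and the helper names are again illustrative.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_prime(Z):
    S = sigmoid(Z)
    return S * (1.0 - S)

def batch_backprop(X, Y, W, b):
    """Batch gradients dW[l], db[l]; X and Y hold one data point per column."""
    m = X.shape[1]                              # batch size
    Hs, Zs = [X], []                            # forward pass, H[0] = X
    for l in range(len(W)):
        Z = W[l] @ Hs[-1] + b[l][:, None]       # bias added to every column
        Zs.append(Z)
        Hs.append(sigmoid(Z))
    P = Hs[-1]                                  # predictions, one column per example
    dW, db = [None] * len(W), [None] * len(W)
    dZ = P - Y                                  # dZ[3] = P - Y
    for l in range(len(W) - 1, -1, -1):
        dW[l] = (dZ @ Hs[l].T) / m              # dW[l] = (1/m) dZ[l] . (H[l-1])T
        db[l] = dZ.sum(axis=1) / m              # db[l] = (1/m) Σ columns of dZ[l]
        if l > 0:
            dH = W[l].T @ dZ                    # dH[l-1] = (W[l])T . dZ[l]
            dZ = dH * sigmoid_prime(Zs[l - 1])  # dZ[l-1] = dH[l-1] ⓧ σ'(Z[l-1])
    return dW, db
```

The 1/m factors come from differentiating the average loss over the batch, which is also why db[l] works out to the column-wise mean of dZ[l].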