Data Science (Incl AI/ML)

Saturday, August 1, 2020

Hadoop

Typically, a cloud distribution is just the open-source implementation offered on a particular cloud, e.g., the Amazon distribution.
Commercial packages can also run on the cloud, e.g., MapR on AWS.

Sunday, February 23, 2020

UpGrad Notebooks Index - Code Examples

Working_With_Flowers_Images.ipynb

Image normalisation techniques
Image Augmentation techniques
Using LossHistory to determine the best initial LR (learning rate)

Monday, January 6, 2020

Reinforcement Learning

Markov Decision Process

In a Reinforcement Learning problem,

An agent learns how to behave in an environment by taking actions.
It then observes the consequences (reward and next state) of the action taken.
The control objective of the agent is to learn a policy that maximises the cumulative reward over a period of time.
All RL problems are based on the Markov assumption: the current state contains all the information relevant to choosing the current action.
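
The interaction loop described above can be sketched in a few lines of plain Python. The corridor environment, the reward of +1 at the last cell and the names step, run_episode and always_right are all made up for illustration; this is a toy under those assumptions, not a reference implementation.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells. The agent starts in
# cell 0 and gets a reward of +1 only when it reaches cell 4 (terminal state).
N_STATES = 5
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def run_episode(policy, max_steps=100, seed=0):
    """Let the agent act in the environment and return the cumulative reward."""
    rng = random.Random(seed)
    state, total_reward = 0, 0.0
    for _ in range(max_steps):
        # Markov assumption: the action depends on the current state only
        action = policy(state, rng)
        state, reward, done = step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state, rng):
    return +1


def random_policy(state, rng):
    return rng.choice(ACTIONS)


print(run_episode(always_right))  # 1.0
```

A learning algorithm would sit on top of this loop and adjust the policy from the observed rewards; here the two policies are fixed so that only the agent/environment exchange is visible.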

Running log of questions to ask ...

1. The segment at https://learn.upgrad.com/v/course/272/session/60574/segment/336735 defines Policy Evaluation and Policy Improvement. The Policy Evaluation description states "Say you know a policy and you want to evaluate how good it is, i.e., compute the state-value functions for the existing policy". While it is clear how to compute the state-value functions, what is the actual evaluation being done here? Are we comparing different state-value functions?
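
To make the question concrete, here is a minimal sketch of iterative policy evaluation on a hypothetical 4-state corridor (the environment, GAMMA and the two policies are assumptions for illustration). In this reading, the "evaluation" is the computation of the state values for one fixed policy; comparing the values of two policies, as done at the end, is the kind of comparison that policy improvement builds on.

```python
# Hypothetical 4-state corridor: states 0..3, state 3 is terminal.
# Moving into state 3 gives reward +1; every other move gives 0.
N_STATES, TERMINAL, GAMMA = 4, 3, 0.9


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def evaluate_policy(policy, tol=1e-10):
    """Iterative policy evaluation; policy[s] maps action -> probability."""
    v = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # the terminal state keeps value 0
            new_v = 0.0
            for action, prob in policy[s].items():
                s2, r = step(s, action)
                new_v += prob * (r + GAMMA * v[s2])  # Bellman expectation backup
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


always_right = {s: {+1: 1.0} for s in range(N_STATES)}
uniform = {s: {-1: 0.5, +1: 0.5} for s in range(N_STATES)}

v_right = evaluate_policy(always_right)   # approximately [0.81, 0.9, 1.0, 0.0]
v_uniform = evaluate_policy(uniform)

# A policy is at least as good as another if its value is >= in every state
print(all(a >= b for a, b in zip(v_right, v_uniform)))  # True
```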

Sunday, December 8, 2019

Industry Use Cases of RNN

(From an UpGrad sponsored webinar by Manish Kumar on 8th Dec)

Machine Translation
Q&A
Language Modeling
Text Generation
Named Entity Recognition
Text Summarization

Speech Recognition
Speech Verification
Speech Enhancement
Text-To-Speech

Gesture Recognition
Stock Market Prediction
Code Generation
SQL Chatbots

Some other creative use cases

Generate HTML code for a web page based on the visual design

Perusing https://paperswithcode.com/sota will help understand the type of use cases/problem statements using RNN (or any other type of neural network)

Wednesday, December 4, 2019

RNN Code Snippets (using Keras)

SOURCE: UpGrad

Vanilla RNN

# import libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN

# define parameters
n_output = 3  # example value: number of classes in case of classification, 1 in case of regression
output_activation = "softmax"  # example value: "softmax" or "sigmoid" in case of classification, "linear" in case of regression

# ---- build RNN architecture ----

# instantiate sequential model
model = Sequential()

# add the first hidden layer
n_cells = 32  # example value: number of neurons to add in the hidden layer
time_steps = 10  # example value: length of sequences
features = 8  # example value: number of features of each entity in the sequence

model.add(SimpleRNN(n_cells, input_shape=(time_steps, features)))

# add output layer
model.add(Dense(n_output, activation=output_activation))
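
For intuition about what the SimpleRNN layer computes, here is a rough NumPy sketch of the recurrence h_t = tanh(x_t W_x + h_(t-1) W_h + b). The function name simple_rnn_forward, the sizes and the random weights are assumptions for illustration only; Keras initialises and trains its own weights.

```python
import numpy as np


def simple_rnn_forward(x, W_x, W_h, b, return_sequences=False):
    """x has shape (time_steps, features).

    Returns the last hidden state, or all hidden states when
    return_sequences=True (mirroring the Keras argument of the same name).
    """
    h = np.zeros(W_h.shape[0])          # initial hidden state
    states = []
    for x_t in x:                       # one input per timestep
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states) if return_sequences else h


rng = np.random.default_rng(0)
time_steps, features, n_cells = 5, 3, 4   # illustrative sizes
x = rng.normal(size=(time_steps, features))
W_x = rng.normal(size=(features, n_cells))
W_h = rng.normal(size=(n_cells, n_cells))
b = np.zeros(n_cells)

last_state = simple_rnn_forward(x, W_x, W_h, b)
all_states = simple_rnn_forward(x, W_x, W_h, b, return_sequences=True)
print(last_state.shape, all_states.shape)  # (4,) (5, 4)
```

The last hidden state is what a Dense output layer consumes in the many-to-one case; the full sequence of states is what a per-timestep output layer needs.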

Many-to-One

# import libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN

# define parameters
n_output = 3  # example value: number of classes in case of classification, 1 in case of regression
output_activation = "softmax"  # example value: "softmax" or "sigmoid" in case of classification, "linear" in case of regression

# instantiate model
model = Sequential()

# time_steps: multiple input, that is, one input at each timestep
model.add(SimpleRNN(n_cells, input_shape=(time_steps, features)))

# single output at output layer
model.add(Dense(n_output, activation=output_activation))

Many-to-Many (equal length input & output)

# import the TimeDistributed wrapper
from keras.layers import TimeDistributed

# instantiate model
model = Sequential()

# time_steps: multiple input, that is, one input at each timestep
# return_sequences=True makes the layer emit its hidden state at every timestep
model.add(SimpleRNN(n_cells, input_shape=(time_steps, features), return_sequences=True))

# TimeDistributed(): applies the Dense layer at each timestep, which gives the one output per timestep that a many-to-many RNN needs.
# n_classes: number of output classes (n_output in the snippets above)
model.add(TimeDistributed(Dense(n_classes, activation=output_activation)))

Encoder-Decoder

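# import the layers used below
from keras.layers import LSTM, RepeatVector, TimeDistributed

# instantiate model
model = Sequential()

# encoder with multiple inputs
model.add(LSTM(n_cells_input, input_shape=(input_timesteps, ...)))

# repeat the encoded vector once per output timestep
model.add(RepeatVector(output_timesteps))

# decoder: return_sequences=True gives one hidden state per output timestep
model.add(LSTM(n_cells_output, return_sequences=True))

# TimeDistributed(): multiple outputs at the output layer
model.add(TimeDistributed(Dense(n_classes, activation=output_activation)))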
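
The role of RepeatVector in the encoder-decoder snippet can be seen from shapes alone. The NumPy function below only mimics the shape behaviour of the layer (it is not the Keras implementation), and the encoded values are made up.

```python
import numpy as np


def repeat_vector(encoded, n):
    """Mimic RepeatVector(n): (batch, units) -> (batch, n, units)."""
    return np.repeat(encoded[:, np.newaxis, :], n, axis=1)


# One input sequence already encoded into 3 units (made-up values)
encoded = np.array([[0.1, 0.2, 0.3]])
decoder_input = repeat_vector(encoded, 4)   # 4 output timesteps
print(decoder_input.shape)  # (1, 4, 3)
```

Every output timestep of the decoder therefore sees the same encoded vector as its input.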

One-to-Many

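# instantiate model
model = Sequential()

# time_steps is one in this case because the input consists of only one entity
# return_sequences=True keeps the (batch, timesteps, units) shape that TimeDistributed needs
model.add(SimpleRNN(n_cells, input_shape=(1, features), return_sequences=True))

# TimeDistributed(): outputs at the output layer, one per timestep
# note: with a single input timestep this yields a single output timestep; a longer output needs the input repeated (e.g. RepeatVector, as in the Encoder-Decoder snippet)
model.add(TimeDistributed(Dense(n_classes, activation=output_activation)))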

Bidirectional

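# import the Bidirectional wrapper
from keras.layers import Bidirectional

# instantiate model
model = Sequential()

# bidirectional RNN layer (input_shape is passed to the wrapper)
model.add(Bidirectional(SimpleRNN(n_cells), input_shape=(time_steps, features)))

# output layer
model.add(Dense(n_classes, activation=output_activation))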

LSTM

# import LSTM layer
from keras.layers import LSTM

# instantiate model
model = Sequential()

# replace the SimpleRNN() layer with LSTM() layer
model.add(LSTM(n_cells, input_shape=(time_steps, features)))

# output layer
model.add(Dense(n_classes, activation=output_activation))

GRU

from keras.layers import GRU

# instantiate model
model = Sequential()

# replace the LSTM() layer with GRU() layer
model.add(GRU(n_cells, input_shape=(time_steps, features)))

# output layer
model.add(Dense(n_classes, activation=output_activation))

Saturday, November 30, 2019

CNN: Working With Images: Summary Steps

Data Preparation:

Make sure all images are of the same resolution.
Organize images into folders based on the class being predicted, i.e., one folder for each class.

Data Pre-processing: Morphological Operations

Thresholding on the image - convert it from a grey image to a binary image.
Look at Erosion, Dilation, Opening, Closing.
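
A rough NumPy sketch of these operations on a tiny synthetic image. The 3x3 square structuring element, the zero padding at the border and the threshold of 128 are assumptions made for illustration; libraries such as OpenCV or scikit-image offer tuned versions.

```python
import numpy as np


def threshold(gray, t=128):
    """Grey image -> binary image (1 where the pixel is >= t)."""
    return (gray >= t).astype(np.uint8)


def _windows(binary):
    """Stack the 9 shifted copies of the image seen through a 3x3 window."""
    padded = np.pad(binary, 1, constant_values=0)
    h, w = binary.shape
    return np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])


def erode(binary):
    return _windows(binary).min(axis=0)   # 1 only if the whole window is 1


def dilate(binary):
    return _windows(binary).max(axis=0)   # 1 if any pixel in the window is 1


def opening(binary):
    return dilate(erode(binary))          # removes small bright specks


def closing(binary):
    return erode(dilate(binary))          # fills small dark holes


gray = np.zeros((7, 7), dtype=np.uint8)
gray[2:5, 2:5] = 200      # a bright 3x3 square
gray[0, 6] = 255          # one isolated noisy pixel

binary = threshold(gray)
opened = opening(binary)
print(binary.sum(), erode(binary).sum(), opened.sum())  # 10 1 9
```

Opening keeps the 3x3 square and drops the isolated pixel, which is the usual reason for applying it before feeding images to a model.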

Data Pre-processing: Normalisation

Divide by 255 (or)
Divide by max-min (or)
Divide based on percentile (to account for outliers)
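
The three options can be sketched with NumPy as below. Subtracting the minimum before dividing by max-min, and the 1st/99th percentile cut-offs, are assumptions on my part (the notes only say "divide"); a constant image would need a guard against division by zero.

```python
import numpy as np


def scale_by_255(img):
    """Assumes 8-bit pixel values in [0, 255]."""
    return img / 255.0


def scale_by_range(img):
    """Min-max scaling: result lies in [0, 1]."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)


def scale_by_percentile(img, low=1, high=99):
    """Like min-max scaling, but outliers beyond the percentiles are clipped."""
    lo, hi = np.percentile(img, [low, high])
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)


img = np.array([[0.0, 50.0], [100.0, 250.0]])   # made-up pixel values
print(scale_by_255(img).max() < 1.0)            # True: 250/255
print(scale_by_range(img).min(), scale_by_range(img).max())  # 0.0 1.0
```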

Data Pre-Processing: Augmentation

Two types of transformations for augmentation - linear and affine.
Different ways to augment - translation, rotation, scaling, etc.
Adds variability to the training data so that the model trains better.
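
A few of these transformations can be sketched with plain NumPy. The helper names are hypothetical, translate assumes the shift is smaller than the image, and rotation is restricted to multiples of 90 degrees to keep the sketch short; Keras' ImageDataGenerator or similar tools cover arbitrary angles and interpolation.

```python
import numpy as np


def translate(img, dy, dx):
    """Shift the image by (dy, dx) pixels, filling exposed pixels with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out


def flip_horizontal(img):
    return np.fliplr(img)


def rotate_90(img, k=1):
    """Rotate counter-clockwise by k * 90 degrees."""
    return np.rot90(img, k)


def scale_up(img, factor=2):
    """Nearest-neighbour upscaling by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)


img = np.arange(9).reshape(3, 3)
print(translate(img, 1, 0))
# [[0 0 0]
#  [0 1 2]
#  [3 4 5]]
```

Each augmented copy keeps the label of the original image, so the training set grows without new annotation effort.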

Model Building

Run ablation experiments
Overfit on a smaller version of the training set
Hyperparameter tuning
Model training and evaluation

Friday, November 29, 2019

Custom Data Generator Code

To start with, we have the training data stored in n directories (if there are n classes). For a given batch size, we want to generate batches of data points and feed them to the model.
The first for loop 'globs' through each of the classes (directories). For each class, it stores the path of each image in the list paths. In training mode, it subsets paths to the first 80% of the images; in validation mode, it subsets the last 20%. In the special case of an ablation experiment, it keeps only the first ablation images of each class (ablation being the number passed to the constructor).
We store the paths of all the images (of all classes) in a combined list self.list_IDs. The dictionary self.labels contains the labels (as key:value pairs of path: class_number (0/1)).
After the loop, we call the method on_epoch_end(), which creates an array self.indexes with one index per entry of self.list_IDs and shuffles it. Keras calls the same method at the end of each epoch, so the data points are reshuffled every epoch.
The __getitem__ method uses the (shuffled) array self.indexes to select a batch_size number of entries (paths) from the path list self.list_IDs.
Finally, the method __data_generation returns the batch of images as the pair X, y, where X is of shape (batch_size, height, width, channels) and y holds the class labels, one-hot encoded by keras.utils.to_categorical into shape (batch_size, n_classes). Note that __data_generation also does some preprocessing: it normalises the images (divides by 255) and crops the centre 100 x 100 portion of each image. Thus, each image has the shape (100, 100, num_channels). If either dimension (height or width) of an image is 100 pixels or less, that image is dropped from the batch.

import glob
import os

import numpy as np
import keras
from skimage import io  # assumption: io.imread below is scikit-image's reader

# DATASET_PATH is assumed to be defined elsewhere: the root folder with one sub-directory per class

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'

    def __init__(self, mode='train', ablation=None, flowers_cls=('daisy', 'rose'),
                 batch_size=32, dim=(100, 100), n_channels=3, shuffle=True):
        """
        Initialise the data generator
        """
        self.dim = dim
        self.batch_size = batch_size
        self.labels = {}
        self.list_IDs = []

        # glob through directory of each class
        for i, cls in enumerate(flowers_cls):
            # sorted() keeps the train/validation split the same across runs
            paths = sorted(glob.glob(os.path.join(DATASET_PATH, cls, '*')))
            brk_point = int(len(paths)*0.8)
            if mode == 'train':
                paths = paths[:brk_point]
            else:
                paths = paths[brk_point:]
            if ablation is not None:
                paths = paths[:ablation]
            self.list_IDs += paths
            self.labels.update({p: i for p in paths})

        self.n_channels = n_channels
        self.n_classes = len(flowers_cls)
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)

        # rows of images that are too small to crop; removed before returning
        delete_rows = []

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            img = io.imread(ID)
            img = img/255
            if img.shape[0] > 100 and img.shape[1] > 100:
                # crop the centre 100 x 100 portion
                h, w, _ = img.shape
                img = img[int(h/2)-50:int(h/2)+50, int(w/2)-50:int(w/2)+50, :]
            else:
                delete_rows.append(i)
                continue

            X[i,] = img

            # Store class
            y[i] = self.labels[ID]

        X = np.delete(X, delete_rows, axis=0)
        y = np.delete(y, delete_rows, axis=0)
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
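
The batching arithmetic and the centre crop used by the generator can be checked in isolation, without Keras or image files. The sizes below are illustrative and center_crop is a hypothetical helper that restates the slicing from __data_generation; it is a sketch, not part of the original class.

```python
import numpy as np

n_samples, batch_size = 10, 4          # illustrative sizes
indexes = np.arange(n_samples)
rng = np.random.default_rng(0)
rng.shuffle(indexes)                   # what on_epoch_end does when shuffle=True

# Same formula as __len__: incomplete trailing batches are not served
n_batches = int(np.floor(n_samples / batch_size))

# Same slicing as __getitem__ for index = 0, 1, ..., n_batches - 1
batches = [indexes[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]
print(n_batches)  # 2: the remaining 2 samples are skipped in this epoch


def center_crop(img, size=100):
    """Crop the central size x size window of an (h, w, channels) image."""
    h, w = img.shape[:2]
    top, left = h // 2 - size // 2, w // 2 - size // 2
    return img[top:top + size, left:left + size]


print(center_crop(np.zeros((120, 150, 3))).shape)  # (100, 100, 3)
```

Because the indexes are reshuffled every epoch, the samples that are skipped differ from one epoch to the next.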