Thursday, November 7, 2019

Model Selection



Key Points on Model Selection

  • The central question for any ML model is how to extrapolate learnings from a finite amount of data to explain or predict all possible inputs of the same kind.
  • Domain knowledge is a very important factor in making decisions on ML Models for a given problem.
  • The usefulness of a model is determined by how well it performs on unseen data.
  • Occam's razor is perhaps the most important rule of thumb in machine learning, and incredibly 'simple' at the same time: when in a dilemma, choose the simpler model.
  • Logistic regression is simple to run, with no hyperparameters to tune. It can be used as a benchmark to compare the performance of other models.
  • Support vector machines can take quite a bit of time to run because of their resource-intensive nature. They also take multiple runs to choose the best kernel for a particular problem.
  • A decision tree generally does not perform well on a dataset with a lot of continuous variables. So if a tree is performing well on a dataset, it is highly unlikely that the data has only continuous attributes.
  • If the difference between training and validation accuracy is significant, you can conclude that the tree has overfitted the data (see the sketch after this list).
  • A logistic regression model can also work with nonlinearly separable datasets, but its performance will not be at par with other machine learning models such as decision trees, SVMs etc.
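
As a rough illustration of the benchmark, overfitting and nonlinearity points above, here is a minimal sketch (assuming scikit-learn and a synthetic make_moons dataset): a plain logistic regression serves as the baseline, and an unconstrained decision tree shows the kind of train/validation gap that signals overfitting.

# A minimal sketch (scikit-learn and synthetic data assumed): logistic regression
# as a benchmark, and the train/validation gap as an overfitting check.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Benchmark: plain logistic regression, no tuning.
lr = LogisticRegression().fit(X_train, y_train)

# Unconstrained decision tree: expect a large train/validation gap (overfitting).
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

for name, model in [("logistic regression", lr), ("decision tree", tree)]:
    print(name,
          "| train:", round(model.score(X_train, y_train), 3),
          "| validation:", round(model.score(X_val, y_val), 3))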

Comparison of Models

Logistic Regression (LR), Decision Trees & SVMs

Logistic Regression

Pros:
  1. It is convenient for generating probability scores.
  2. Efficient implementations are available across different tools.
  3. The issue of multicollinearity can be countered with regularisation.
  4. It has widespread industry use.

Cons:
  1. It does not perform well when the feature space is too large.
  2. It does not perform well when there are a lot of categorical variables in the data.
  3. Nonlinear features have to be transformed into linear features in order to use them efficiently in a logistic model.
  4. It relies on the entire dataset, i.e. even a small change in the data can change the logistic model significantly.

Decision Trees

Pros:
  1. Intuitive decision rules make them easy to interpret.
  2. Trees handle nonlinear features well.
  3. Variable interactions are taken into account.

Cons:
  1. Trees are highly biased towards the training set and overfit it more often than not.
  2. There is no meaningful probability score as the output.

SVMs

Pros:
  1. SVMs can handle a large feature space.
  2. They can handle nonlinear feature interactions.
  3. They do not rely on the entire dimensionality of the data for the transformation.

Cons:
  1. SVMs are not efficient in terms of computational cost when the number of observations is large.
  2. It is tricky and time-consuming to find the appropriate kernel for a given dataset (a kernel-search sketch follows).
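
To illustrate the kernel-selection point, here is a hedged sketch (scikit-learn and a synthetic dataset assumed) of the grid search that makes SVMs time-consuming to tune: every candidate kernel and regularisation value is evaluated with cross-validation.

# A minimal sketch (scikit-learn assumed) of the kernel search that makes SVMs
# time-consuming to tune: each kernel/C combination is evaluated with 5-fold CV.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

param_grid = {"kernel": ["linear", "poly", "rbf", "sigmoid"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 4 kernels x 3 C values x 5 folds = 60 fits
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))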

CART and CHAID Trees

  • Use CART for forecasting/prediction, whereas CHAID is better suited for driver analysis, i.e. understanding the key variables/features driving the behaviour of the data/target.
    e.g., suppose you are working with the Indian cricket team and you want to predict whether the team will win a particular tournament or not. In this case, CART would be preferable because it is more suitable for prediction tasks. Whereas, if you want to look at the factors that influence the win/loss of the team, then a CHAID tree would be preferable (a minimal CART sketch follows).
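
For the prediction use case, a minimal CART-style sketch is shown below. Scikit-learn's DecisionTreeClassifier implements an optimised version of CART; CHAID is not available in scikit-learn and needs a separate package. The match features here are made up purely for illustration.

# Minimal CART-style prediction sketch (scikit-learn's DecisionTreeClassifier
# implements an optimised CART algorithm; the match features are hypothetical).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical match-level features: [home_ground, opposition_rank, recent_win_rate]
X = np.array([[1, 3, 0.7], [0, 1, 0.4], [1, 5, 0.6], [0, 2, 0.5],
              [1, 1, 0.8], [0, 4, 0.3], [1, 2, 0.6], [0, 5, 0.7]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 1])  # 1 = win, 0 = loss

cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(cart.predict([[1, 4, 0.65]]))  # predicted outcome for an upcoming match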


Decision Trees v/s Random Forest 

  • Individual decision trees have a tendency to overfit the training data, whereas a random forest is much harder to overfit because it uses bagging along with random sampling of the features at each node split. This aggregation prevents it from overfitting the data the way a single decision tree does.
  • There is no need to prune the trees in a random forest because even if some trees overfit the training set, it does not matter once the results of all the trees are aggregated.
  • While building a decision tree, at every node we introduce a condition (for example, age >= 20) on a feature, which creates a "linear" boundary perpendicular to that feature (age) to split the dataset into two. The number of such boundaries increases if the data is not linearly separable, and more than one node has to be created for each feature. Hence, creating a large number of linear boundaries for highly nonlinear data may not be efficient enough to classify the data points correctly.
  • With any decision tree (even when using random forests), it is not possible to predict beyond the range of the response variable in the training data in a regression problem. Suppose you want to predict house prices using a decision tree and the range of the house price (response variable) is $5000 to $35000. While predicting, the output of the decision tree will always be within that range. If unseen data has values outside this range, the model can be inaccurate.
  • With a random forest, the OOB (out-of-bag) error can be calculated from the training data itself, which gives a good estimate of the model's performance on unseen data (see the sketch after this list).
  • A random forest is not affected by outliers as much because of the aggregation strategy.
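
A minimal sketch of the OOB point (scikit-learn and synthetic data assumed): with oob_score=True, the forest estimates its generalisation accuracy from the training data alone, and that estimate is usually close to the accuracy on a held-out set.

# A minimal sketch (scikit-learn assumed): the OOB score estimates generalisation
# performance from the training data alone, without a separate validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print("OOB accuracy :", round(rf.oob_score_, 3))              # estimated from training data
print("Test accuracy:", round(rf.score(X_test, y_test), 3))   # held-out check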

Limitations of Random Forest

  • Owing to their origin in decision trees, random forests have the same problem of not predicting beyond the range of the response variable in the training set.
  • Extreme values are often not predicted because of the aggregation strategy.
    To illustrate this, let’s take the house prices example, where the response variable is the price of a house.
    • Suppose the range of the price variable is between $5000 and $35000. 
    • You train the random forest and then make predictions. While making predictions for an expensive house, there will be some trees in the forest which predict the price of the house as $35000, but there will be other trees in the same forest with values close to $35000 but not exactly $35000. 
    • In the end, when the final price is decided by aggregating using the mean of all the predictions of the trees of the forest, the predicted value will be close to the extreme value of $35000 but not exactly $35000. 
    • Unless all the trees of the forest predict the house price to be $35000, this extreme value will not be predicted (see the sketch below).
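
The dollar figures above are illustrative; the sketch below (scikit-learn and synthetic prices assumed) shows the same effect numerically: predictions never leave the training range of the target, and the extreme values are rarely reached because the trees' outputs are averaged.

# A minimal sketch (scikit-learn, synthetic prices assumed): random forest
# predictions stay inside the training range of the target, and extreme values
# are rarely reached because the trees' outputs are averaged.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 5000 + 3000 * X[:, 0]                    # prices roughly in the $5000-$35000 range

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

X_new = rng.uniform(0, 15, size=(200, 3))    # includes inputs beyond the training range
preds = rf.predict(X_new)

print("training target range:", round(y.min()), "-", round(y.max()))
print("prediction range     :", round(preds.min()), "-", round(preds.max()))
# preds.max() will not exceed y.max(), and usually falls a little below it.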

Directed v/s Undirected (Probability) Graphs

  • If the relationship (between variables) that we are trying to model needs to be asymmetric (i.e. one variable influences the other but not the other way around), then go for a directed model.
    • e.g. disease & symptom, drug & cure.
  • If the relationship is symmetric, then use an undirected model (a toy sketch follows this list).
    • e.g. pixels in an image.
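
A toy numpy sketch of the distinction (all numbers are made up): a directed model factorises the joint asymmetrically as P(disease) * P(symptom | disease), while an undirected model starts from a symmetric potential between the two variables and normalises it into a joint distribution.

# A toy sketch (numpy assumed; probabilities and potentials are made up).
import numpy as np

# Directed: disease -> symptom, factorised as P(disease) * P(symptom | disease).
p_disease = np.array([0.99, 0.01])             # P(disease = no / yes)
p_symptom_given = np.array([[0.95, 0.05],      # P(symptom | disease = no)
                            [0.20, 0.80]])     # P(symptom | disease = yes)
joint_directed = p_disease[:, None] * p_symptom_given

# Undirected: a symmetric compatibility potential between two neighbouring pixels,
# normalised to obtain the joint distribution.
phi = np.array([[5.0, 1.0],                    # equal pixel values are more compatible
                [1.0, 5.0]])
joint_undirected = phi / phi.sum()

print(joint_directed, joint_directed.sum())    # sums to 1 by construction
print(joint_undirected, joint_undirected.sum())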

How to build different models and choose the best

  • Start with logistic regression. Using a logistic regression model serves two purposes: 
    • It acts as a baseline (benchmark) model. 
    • It gives you an idea about the important variables.
  • Then, go for decision trees and compare their performance with the logistic regression model. If there is no significant improvement in their performance, then just use the important variables drawn from the logistic regression model.
  • While building a decision tree, you should choose the appropriate method: CART for prediction and CHAID for driver analysis.
  • Finally, if you still do not meet the performance requirements, and you have sufficient time & resources on hand, then go ahead and build more complex models like random forests & support vector machines (a rough sketch of this workflow follows this list).
  • In general, starting from a basic model helps in two ways: 
    1. If the model performs as per requirement, there is no need to go to complex models. This saves time and resources. 
    2. If it does not perform well, it can be used to benchmark the performance of other models.
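
The workflow above can be sketched as a simple loop (scikit-learn assumed; the 0.90 accuracy target, the dataset and the candidate order are illustrative): models are tried in increasing order of complexity and the search stops as soon as the requirement is met.

# A rough sketch of the "start simple, escalate only if needed" workflow
# (scikit-learn assumed; the accuracy target and the candidate list are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = [("logistic regression", LogisticRegression(max_iter=1000)),
              ("decision tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
              ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
              ("SVM (rbf kernel)", SVC())]

TARGET = 0.90
for name, model in candidates:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {score:.3f}")
    if score >= TARGET:          # good enough: no need to try a more complex model
        print("selected:", name)
        break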

Restricted Boltzmann Machine

A restricted Boltzmann machine (RBM) is an "energy-based" generative stochastic model. It was originally invented by Paul Smolensky in 1986 under the name "Harmonium". After Geoffrey Hinton developed efficient training algorithms in the mid-2000s, Boltzmann machines became more prominent. They gained wide popularity in the context of the Netflix Prize, where RBMs achieved state-of-the-art performance in collaborative filtering and beat most of the competition.

RBMs are useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning and topic modeling.
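
scikit-learn ships a BernoulliRBM that can be used for unsupervised feature learning. The sketch below (digits data assumed; hyperparameters are illustrative) learns RBM features and feeds them to a logistic regression classifier.

# A minimal sketch (scikit-learn assumed): a BernoulliRBM learns features in an
# unsupervised way, and a logistic regression is trained on top of them.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)

model = Pipeline([
    ("scale", MinMaxScaler()),       # the RBM expects inputs in [0, 1]
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("training accuracy:", round(model.score(X, y), 3))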
