All notes below are from Upgrad Course material
Pros
- Decision tree models are among the most popular ways of addressing classification problems
- Ease of interpretation:
Decision trees are easy to interpret. Almost always, you can identify the various factors that lead to the decision. In fact, trees are often underestimated for their ability to relate the predictor variables to the predictions. As a rule of thumb, if interpretability by a layperson is what you're looking for in a model, decision trees should be at the top of your list.
- Other models, such as SVMs and logistic regression, make certain assumptions about the data in order to work effectively:
- They need strictly numeric data
- Categorical data cannot be handled in a natural way
- Decision trees don't have the above limitation of needing strictly numeric data.
- Decision trees don't need normalization of the data.
- Decision trees often give us an idea of the relative importance of the explanatory attributes that are used for prediction.
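To make the interpretability and feature-importance points concrete, here is a minimal sketch (not from the course material): a fitted sklearn DecisionTreeClassifier can be printed as human-readable rules and exposes feature_importances_. The dataset (iris) and the max_depth value are illustrative assumptions.

```python
# Illustrative sketch: readable rules and relative feature importance from a fitted tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True, as_frame=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # shallow tree, chosen for readability
tree.fit(X, y)

# Human-readable splits ("if petal length <= ... then class ...")
print(export_text(tree, feature_names=list(X.columns)))

# Relative importance of each explanatory attribute
for name, importance in zip(X.columns, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```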
Cons
- Decision trees tend to overfit the data. If allowed to grow with no check on its complexity, a tree will keep splitting till it has correctly classified (or rather, mugged up) all the data points in the training set (a short sketch illustrating this follows this list).
- Decision trees tend to be very unstable, which is an implication of overfitting. A few changes in the data can change a tree considerably.
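As a quick illustration of the overfitting point above, the sketch below (not from the notes; the synthetic dataset and split are illustrative assumptions) fits an unconstrained tree and compares training and test accuracy.

```python
# Illustrative sketch: an unconstrained tree "mugs up" the training set
# (near-100% train accuracy) but generalizes noticeably worse on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", full_tree.score(X_train, y_train))  # typically 1.0
print("test accuracy:", full_tree.score(X_test, y_test))     # noticeably lower
```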
Dealing with Overfitting
There are two ways to control overfitting in trees:
- Truncation - Stop the tree while it is still growing so that it may not end up with leaves containing very few data points.
- Pruning - Let the tree grow to any complexity. Then, cut the branches of the tree in a bottom-up fashion, starting from the leaves. It is more common to use pruning strategies to avoid overfitting in practical implementations.
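sklearn exposes one such bottom-up post-pruning mechanism through cost-complexity pruning (the ccp_alpha parameter and the cost_complexity_pruning_path method of DecisionTreeClassifier). The sketch below is illustrative only; the dataset and the choice to pick alpha on a held-out set are assumptions, not the course's prescribed method.

```python
# Illustrative sketch: grow the tree fully, then prune it via cost-complexity
# pruning by selecting a ccp_alpha along the pruning path.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values along the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)  # larger alpha -> more aggressive pruning
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha = {best_alpha:.5f}, held-out accuracy = {best_score:.3f}")
```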
Truncation
Though there are various ways to truncate or prune trees, the DecisionTreeClassifier class in sklearn provides the following hyperparameters, which you can control:
- criterion (gini or entropy): It defines the function used to measure the quality of a split. sklearn supports the “gini” criterion for the Gini index and “entropy” for information gain. By default, it takes the value “gini”.
- max_features: It defines the number of features to consider when looking for the best split. It accepts an integer, a float, a string or None.
- If an integer is passed, that many features are considered at each split.
- If a float is passed, it is treated as a fraction of the features to consider at each split.
- If “auto” or “sqrt” is passed, then max_features=sqrt(n_features).
- If “log2” is passed, then max_features=log2(n_features).
- If None, then max_features=n_features. By default, it takes the value None.
- max_depth: The max_depth parameter denotes the maximum depth of the tree. It can take any integer value or None. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the value None.
- min_samples_split: The minimum number of samples required to split an internal node. If an integer is passed, it is used directly as the minimum number. If a float is passed, it is treated as a fraction of the total number of samples. By default, it takes the value 2.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer is passed, it is used directly as the minimum number. If a float is passed, it is treated as a fraction of the total number of samples. By default, it takes the value 1.
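A common way to apply these truncation hyperparameters in practice is to tune them with cross-validation. The sketch below is a hedged example; the parameter grid values and the dataset are illustrative assumptions.

```python
# Illustrative sketch: tune the truncation hyperparameters above with
# cross-validated grid search.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 20, 50],
    "min_samples_leaf": [1, 10, 25],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```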