Winning Kaggle Competitions
Effective Ways to Compete in Kaggle data contests
Competition Mechanics
- Results are measured solely on prediction metrics. Usually source code or models are not required to be submitted.
- The kind of metric used depends on the competition. It could be accuracy, recall, etc. Read the competition rules carefully.
- Often, the absolute prediction score itself matters less than your performance relative to other competitors.
- Which sources of data can or cannot be used is specified in the competition rules.
- Organizers often use the following techniques to prevent competitors from overfitting to the test set via leaderboard scores (see the sketch after this list)
- Split test data into public/private parts. You can test against the public part and self-evaluate, but the final evaluation is done on the private part, whose scores become visible only after the competition ends
- Shuffle the order of rows in train & test sets
- Add non-relevant (unscored) examples to the test set
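As a rough illustration of the public/private idea, here is a minimal sketch of how an organizer might prepare a test set: shuffle the rows, then mark part of them as public (scored on the live leaderboard) and the rest as private (scored only after the deadline). The file name, the split column, and the 30/70 ratio are illustrative assumptions, not any competition's actual setup.

```python
import pandas as pd

# Illustrative file name; a real competition's raw test file would differ.
test = pd.read_csv("test.csv")

# Shuffle the row order so that row position carries no information.
test = test.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Mark roughly 30% of rows as "public" (live leaderboard) and the rest
# as "private" (revealed only after the competition ends).
n_public = int(0.3 * len(test))
test["leaderboard_split"] = ["public"] * n_public + ["private"] * (len(test) - n_public)
```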
- Don't be afraid to tweak source code of open source models to get better results.
Recap of Common ML Algos
- Linear family of algos e.g. Logistic Regression, Support Vector Machine (SVM)
- Especially good for sparse, high-dimensional data
- Implementations to consider - Scikit Learn, Vowpal Wabbit (esp. good at handling large data sets); see the sketch below
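As a rough illustration of why linear models suit sparse, high-dimensional data, here is a minimal scikit-learn sketch: TF-IDF text features form a sparse matrix that Logistic Regression consumes directly. The toy texts and labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy text-classification data (illustrative only).
texts = ["great product", "terrible service", "loved it", "would not recommend"]
labels = [1, 0, 1, 0]

# TfidfVectorizer yields a sparse, high-dimensional matrix;
# LogisticRegression handles it directly without densifying.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["really great product"]))
```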
- Tree-based algos e.g. Decision Tree, Random Forest (RF), GBDT (Gradient Boosted Decision Tree)
- Use a divide-and-conquer approach, recursively splitting the feature space into sub-spaces and fitting a simple prediction in each
- Powerful and a good default model for tabular data
- In almost every competition, winners are known to use these models.
- Tree-based models find it hard to capture linear dependencies, since approximating a straight line requires many splits
- Prediction can be inaccurate near decision borders
- Scikit Learn is good for Decision Trees; XGBoost and Microsoft's LightGBM are good for GBDT (see the sketch below)
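Below is a minimal sketch of a GBDT baseline on synthetic tabular data using XGBoost's scikit-learn wrapper. The hyper-parameters are illustrative defaults rather than tuned competition values; LightGBM's LGBMClassifier could be swapped in with a near-identical interface.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic tabular data stands in for a real competition data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative (untuned) hyper-parameters for a gradient boosted tree model.
model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_valid, y_valid))
```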
- k-NN (k nearest neighbours) algos
- Points closer to each other are likely to have the same label/classification
- Usually squared Euclidean distance is used to measure proximity; however, this can be semantically inadequate for images, where visually similar pictures may be far apart pixel-wise
- Despite its simple approach, features based on this algo can be quite informative.
- Scikit Learn is recommended since it uses efficient data structures (KD-trees / ball trees) to speed up neighbour searches, provides several predefined distance functions, and also lets you implement your own distance function (see the sketch below).
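A minimal sketch of scikit-learn's k-NN, once with a predefined metric and once with a user-defined distance function. The synthetic data and the scaled distance are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Predefined metric: fast tree-based neighbour search is used automatically.
knn = KNeighborsClassifier(n_neighbors=5, metric="manhattan").fit(X, y)

# Custom distance function: any callable over two 1-D arrays works,
# at the cost of slower (brute-force) neighbour search.
def scaled_euclidean(a, b):
    return np.sqrt(np.sum(2.0 * (a - b) ** 2))

knn_custom = KNeighborsClassifier(n_neighbors=5, metric=scaled_euclidean,
                                  algorithm="brute").fit(X, y)
print(knn.predict(X[:3]), knn_custom.predict(X[:3]))
```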
- Neural Networks
- Smooth separation curves unlike decision trees
- Good for images, sounds
- TensorFlow, Keras, PyTorch, MXNet, Lasagne - PyTorch provides a user-friendly way to tweak NNs (see the sketch below).
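As a rough illustration of how a small feed-forward net produces a smooth non-linear decision boundary, here is a minimal PyTorch sketch on synthetic data. The layer sizes, learning rate, and epoch count are illustrative assumptions, not tuned values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)               # synthetic features
y = (X[:, 0] * X[:, 1] > 0).float()    # synthetic non-linear binary target

# A small feed-forward network for binary classification.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X).squeeze(1), y)
    loss.backward()
    optimizer.step()

print("Final training loss:", loss.item())
```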
- General Note / Recap
- First of all, there is no silver-bullet algorithm that outperforms all others on every task.
- Next, Linear Models can be imagined as splitting the space into two sub-spaces separated by a hyperplane.
- Tree-Based Methods split the space into boxes and use constant predictions in every box.
- k-NN methods are based on the assumption that close objects are likely to have the same labels, so we need to find the closest objects and pick their labels. The k-NN approach also relies heavily on how point closeness is measured.
- Feed-forward Neural Nets are harder to interpret, but they produce smooth non-linear decision boundaries.
- The most powerful methods are Gradient Boosted Decision Trees and Neural Networks. But we shouldn't underestimate Linear Models and k-NN because sometimes, they may be better.