Thursday, October 24, 2019

Winning Kaggle Competitions


Effective Ways to Compete in Kaggle Data Contests

Based on Coursera's How to Win a Data Science Competition: Learn from Top Kagglers

Competition Mechanics

  1. Results are measured solely on prediction metrics; usually, source code or trained models are not required, only the submitted predictions. 
  2. The evaluation metric depends on the competition. It could be accuracy, recall, etc. Read the competition rules carefully.
  3. Often, the absolute prediction score itself is less important than your performance relative to the other competitors.
  4. What sources of data can or cannot be used is specified in the competition rules.
  5. Organizers often use the following techniques to prevent competitors from overfitting to the test set via leaderboard scores:
    1. Split the test data into public and private parts. You can self-evaluate against the public part, but the final evaluation is done on the private part, whose scores are revealed only at the end of the competition (a local validation split, sketched right after this list, helps avoid overfitting to the public leaderboard).
    2. Shuffle the order of rows in train & test sets
    3. Add non-relevant examples to the data set
  6. Don't be afraid to tweak source code of open source models to get better results. 
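
To make point 5.1 concrete, a common habit is to keep a local hold-out set and rely on it, rather than the public leaderboard, for model selection. A minimal sketch with scikit-learn; the synthetic dataset and the choice of model here are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in a real competition X, y come from the train file.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Keep a local hold-out set and trust it more than the public leaderboard,
# which scores only a portion of the hidden test data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Evaluate locally; use leaderboard submissions sparingly, as a sanity check.
print("local validation accuracy:",
      accuracy_score(y_valid, model.predict(X_valid)))
```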

Recap of Common ML Algos

    1. Linear family of algos, e.g. Logistic Regression, Support Vector Machine (SVM)
      • Especially good for sparse, high-dimensional data
      • Implementations to consider: scikit-learn, Vowpal Wabbit (especially good at handling large data sets); see the sparse-data sketch after this list
    2. Tree-based algos, e.g. Decision Tree, Random Forest (RF), GBDT (Gradient Boosted Decision Tree)
      • Recursively split the feature space into sub-spaces, a divide-and-conquer approach
      • Powerful and a good default model for tabular data
      • In almost every competition, winners are known to use these models.
      • Hard for tree-based models to capture linear dependencies, since they can only approximate them with many splits
      • Prediction can be inaccurate near decision borders
      • scikit-learn is good for Decision Trees; XGBoost and Microsoft's LightGBM are good for GBDT (a minimal GBDT sketch follows this list)
    3. k-NN (k nearest neighbours) algos
      • Points closer to each other are likely to have the same label/classification
      • Usually squared Euclidean distance is used to measure proximity; however, this approach can be semantically inadequate for images
      • Despite its simple approach, features based on this algo can be quite informative. 
      • scikit-learn is recommended since it uses efficient data structures (such as KD-trees and ball trees) to speed up neighbour searches and offers several predefined distance functions. It also allows you to implement your own distance function (see the custom-metric sketch after this list). 
    4. Neural Networks
      • Smooth separation curves unlike decision trees
      • Good for images and sound
      • Frameworks: TensorFlow, Keras, PyTorch, MXNet, Lasagne. PyTorch provides a user-friendly way to tweak NNs.
    5. General Note / Recap
      • First of all, there is no silver-bullet algorithm that outperforms all the others in every task. 
      • Next, Linear Models can be imagined as splitting the space into two sub-spaces separated by a hyperplane. 
      • Tree-Based Methods split the space into boxes and use constant predictions in every box.
      • k-NN methods are based on the assumption that close objects are likely to have the same labels, so we need to find the closest objects and pick their labels. The k-NN approach heavily relies on how point closeness is measured. 
      • Feed-forward Neural Nets are harder to interpret but they produce smooth non-linear decision boundary. 
      • The most powerful methods are Gradient Boosted Decision Trees and Neural Networks. But we shouldn't underestimate Linear Models and k-NN because sometimes, they may be better.
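
As a toy illustration of the sparse-data point in item 1, here is a minimal scikit-learn sketch using TF-IDF features and Logistic Regression; the tiny inline corpus is purely hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical corpus; a real competition would have far more rows.
texts = ["great movie, loved it", "terrible plot, boring",
         "loved the acting", "boring and terrible"]
labels = [1, 0, 1, 0]

# TF-IDF yields a sparse, high-dimensional matrix, exactly the regime
# where linear models tend to do well.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

clf = LogisticRegression()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["boring movie"])))
```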

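For the tree-based family in item 2, a minimal GBDT sketch with XGBoost on synthetic tabular data (assuming the xgboost package is installed); the hyperparameter values are illustrative, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for a tabular competition dataset.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameters; in practice they are tuned per competition.
model = XGBClassifier(n_estimators=300, learning_rate=0.05,
                      max_depth=6, subsample=0.8)
model.fit(X_train, y_train)

pred = model.predict_proba(X_valid)[:, 1]
print("validation AUC:", roc_auc_score(y_valid, pred))
```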
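
Finally, for k-NN in item 3, a sketch of scikit-learn's KNeighborsClassifier with a user-defined distance function; the Manhattan-style metric below is just an arbitrary example of plugging in your own closeness measure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

def manhattan(a, b):
    # Custom distance: sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Passing a callable as the metric is allowed; it is slower than the
# built-in metrics but gives full flexibility over "closeness".
knn = KNeighborsClassifier(n_neighbors=5, metric=manhattan)
knn.fit(X, y)
print(knn.predict(X[:3]))
```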