ML Diaries: Day 4
Data Splitting, Choosing a Model and Training a Model
— a daily log of my learning and projects built as I take up Machine Learning. Welcome to The Mind Palace by Dayo.
Date: Aug 21, 2022
About
Continuing from Day 3, I went through what a dataset should be split into in order to train a model, standard practices for splitting data, which models should be chosen for certain types of data, and the generalization process.
The Good Stuff — Splitting the Data
Standard practice is that the machine learning engineer splits the data into three sets (hence the name '3-way split'): the training dataset, the validation dataset and the test dataset, typically in a 70–80%, 10–15%, 10–15% ratio. The exact ratio depends on the use case, and some cases may not even require all three sets to develop the machine learning model.
Nonetheless, why the split? The data is split so that the model can be trained and its performance in real-world applications assessed honestly. The training set is used to fit the model, the validation set is used to tune and improve it, and the test set is used to compare versions of the developed model and estimate final performance.
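The 70/15/15 split above can be sketched in a few lines. This is my own minimal sketch using NumPy (the function name `three_way_split` and the toy data are made up for illustration); in practice you might reach for a library helper instead.

```python
import numpy as np

def three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the data once, then carve it into train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order, so splits are unbiased
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]       # everything left over (~70%)
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

# Toy data: 100 samples, 3 features each
X = np.arange(300).reshape(100, 3)
y = np.arange(100)
train, val, test = three_way_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 70 15 15
```

Shuffling before slicing matters: if the data is ordered (say, by date or class), slicing without shuffling would give each set a different distribution.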
Is this [splitting the data] necessary? Very important, actually. Think of it like this: to ace an exam, you have to study course materials, take practice exams, and then take the main exam. Studying course materials gives you the required knowledge and taking practice exams checks if you have learned correctly. Both steps give you the ability and confidence to take on the main exam with the expectation that you’ll perform well.
For a machine learning model, the analogy maps directly. Studying course materials is going through the training data to learn the patterns that convert the inputs to the given outputs. Taking practice exams is applying the extracted patterns to the validation dataset, checking that the model has actually learned something general and tuning it where it hasn't. Taking the main exam is the final evaluation on the test set, which tells you how the model can be expected to perform elsewhere.
The training data, validation data and test data must not overlap. If your main exam and your practice exam have the same questions, you'll get a good result, but with no guarantee that you would perform well on a different set of questions. For the same reason, none of the split datasets should share examples, or your machine learning model will essentially be an 'expert memorization technique'. Also, just as you need to study plenty of material, the training dataset contains the bulk of the data: having a successful model depends directly on how well it has been trained.
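A quick habit that follows from this (my own sketch, not a standard API): after splitting, verify that the splits share no rows before training. Here the index sets are made-up placeholders for however you partitioned your data.

```python
# Hypothetical row indices for each split; in practice these would come
# from however you shuffled and partitioned your dataset.
train_ids = set(range(0, 70))
val_ids = set(range(70, 85))
test_ids = set(range(85, 100))

# No example may appear in more than one split, or the test score
# just measures memorization rather than generalization.
assert train_ids.isdisjoint(val_ids)
assert train_ids.isdisjoint(test_ids)
assert val_ids.isdisjoint(test_ids)
print("no overlap between splits")
```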
The Good Stuff — Picking a Model
Different data, different models. Since different types of data exist, no single model works best on all of them; models differ in how they extract and apply patterns from data. A number of machine learning models exist today, so a machine learning engineer can leverage these existing models to solve problems. But which model should be used?
For now, a general tip is that structured data (think rows and columns) works well with tree-based ensembles such as Random Forest and gradient boosting algorithms such as CatBoost and XGBoost, while unstructured data (think images, natural-language text, etc.) works well with deep learning, i.e. neural networks.
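As a concrete illustration of the structured-data half of that tip, here is a minimal sketch that fits a Random Forest on a synthetic tabular dataset. It assumes scikit-learn is installed; the dataset is generated, not real.

```python
# Fit a tree ensemble on structured (tabular) data.
# Assumes scikit-learn is available; the data below is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 500 rows, 10 numeric columns: exactly the rows-and-columns shape
# that tree ensembles tend to handle well.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

For images or free text, you would instead reach for a neural network, since tree ensembles have no built-in way to learn features from raw pixels or characters.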
And that’s it for Day 4. Now onto Day 5 :)
Stories on The Mind Palace still continue weekly.
You can read the latest story ‘Pride, Humility and Small-mindedness’ or subscribe to the newsletter.
See ya!