Why building a Machine Learning Model is like cooking


This post was originally published by Vicky Yu at Towards Data Science

The first step in building a machine learning model is to prepare the data. Depending on the data infrastructure, this may involve pulling raw data from a variety of sources and loading it into a database. In a mature company, the data is already in a database and the data scientist just needs to find the data they need for the model.

Likewise, the first step in cooking is to get the ingredients (the data). You may need to go to the grocery store to buy ingredients you don’t have at home (pull from a variety of sources).

Next, the data scientist explores the data to analyze trends, removes bad data such as duplicates and missing values, and transforms it into a usable form for modeling. By usable form I mean that if the model expects one row per user, the data needs to be aggregated and transposed to meet that requirement.

Similarly, you have to explore the recipe ingredients to decide whether to use fresh or frozen, or to substitute an alternative if you can’t find an ingredient at the store. The recipe may call for ingredients to be pre-mixed or pre-cooked before adding them to the dish (transform into a usable form).
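Back on the data side, here is a minimal sketch of what that cleanup and reshaping could look like in plain Python. The event rows, the column names, and the `prepare` helper are all made up for illustration, and treating identical rows as duplicates is an assumption that would need checking against real data.

```python
from collections import defaultdict

# Hypothetical raw event rows: (user_id, event_type, amount).
raw_events = [
    ("u1", "purchase", 20.0),
    ("u1", "purchase", 20.0),   # exact duplicate row
    ("u1", "refund", 5.0),
    ("u2", "purchase", None),   # missing value
    ("u2", "purchase", 12.5),
    ("u3", "refund", 7.0),
]

def prepare(events):
    """Drop duplicates and missing values, then build one row per user."""
    seen = set()
    clean = []
    for row in events:
        if row in seen or row[2] is None:
            continue
        seen.add(row)
        clean.append(row)

    # Aggregate and transpose: each event type becomes its own column.
    per_user = defaultdict(lambda: {"purchase": 0.0, "refund": 0.0})
    for user_id, event_type, amount in clean:
        per_user[user_id][event_type] += amount
    return dict(per_user)

user_rows = prepare(raw_events)
for user_id, columns in user_rows.items():
    print(user_id, columns)
```

Each user ends up with exactly one row, which is the shape a one-row-per-user model would expect.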

The third step is to choose a model algorithm depending on the modeling problem.

This is analogous to choosing the cooking method — broiling, frying, steaming, etc.

Now the data scientist trains the model using the selected algorithm. To measure model accuracy, a portion of the data is held out from training so the model can be evaluated on how well it predicts the outcome for data it has never seen.

Similarly, you cook your dish according to the recipe and compare the taste with similar dishes you’ve cooked in the past.
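To make the hold-out idea concrete, here is a small sketch using only the standard library. The toy data, the 80/20 split, and the one-feature "threshold" model are assumptions chosen to keep the example short. They are not the algorithm any particular project would use.

```python
import random

def make_data(n, seed):
    """Toy data: one feature x; the label is 1 when x (plus noise) exceeds 5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        label = 1 if x + rng.gauss(0, 1) > 5 else 0
        rows.append((x, label))
    return rows

def train_threshold(rows):
    """'Train' by placing a cutoff halfway between the two class means."""
    ones = [x for x, y in rows if y == 1]
    zeros = [x for x, y in rows if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2

def accuracy(threshold, rows):
    """Share of rows where 'x > threshold' matches the label."""
    correct = sum((x > threshold) == (y == 1) for x, y in rows)
    return correct / len(rows)

data = make_data(200, seed=0)
random.Random(1).shuffle(data)

# Hold out 20% of the rows; the model never sees them during training.
split = int(len(data) * 0.8)
train, holdout = data[:split], data[split:]

threshold = train_threshold(train)
holdout_accuracy = accuracy(threshold, holdout)
print(f"cutoff={threshold:.2f}  holdout accuracy={holdout_accuracy:.2f}")
```

The accuracy that matters is the one measured on `holdout`, because those rows played no part in choosing the cutoff.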

The data scientist reviews the model results and, depending on the outcome, will repeat steps 3 and 4 (choosing an algorithm and training the model) with another model algorithm. Occasionally the data scientist may need to start over from step 1 (data preparation) to evaluate whether any new data can be introduced to improve the model results.

Depending on how satisfied you are with the taste of your dish, you may adjust your cooking method or start over with different ingredients.
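On the modeling side, that loop of trying another algorithm can be sketched as follows. The three candidate "algorithms" here (a majority-class baseline, a simple cutoff, and 1-nearest-neighbor) are illustrative stand-ins written from scratch, and the toy data is made up.

```python
import random

# Toy data (hypothetical): one feature x in [0, 10];
# the label is 1 when x plus some noise exceeds 5.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, int(x + rng.gauss(0, 1) > 5)))
train, holdout = data[:160], data[160:]

def fit_majority(rows):
    """Baseline: always predict the most common training label."""
    majority = int(sum(y for _, y in rows) * 2 >= len(rows))
    return lambda x: majority

def fit_threshold(rows):
    """Cutoff halfway between the two class means."""
    ones = [x for x, y in rows if y == 1]
    zeros = [x for x, y in rows if y == 0]
    cutoff = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: int(x > cutoff)

def fit_nearest_neighbor(rows):
    """1-nearest-neighbor: copy the label of the closest training point."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

candidates = {
    "majority": fit_majority,
    "threshold": fit_threshold,
    "nearest_neighbor": fit_nearest_neighbor,
}

# Train every candidate on the same rows and score it on the same hold-out set.
scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Because the hold-out set is used here to pick the winner, its score is slightly optimistic. A separate validation set is the usual way around that.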

Model parameters (often called hyperparameters) can be adjusted to improve the accuracy, but this step is optional if the model results are already acceptable.

In cooking, the ratio of ingredients or seasonings can be adjusted if you’re not happy with the taste of the dish.
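For the model, adjusting a parameter might look like the sketch below. It tries several values of `k` for a hand-written k-nearest-neighbors classifier and keeps the one that scores best on a validation set. The data, the candidate values, and the helper names are all assumptions for illustration.

```python
import random

# Toy data (hypothetical): label is 1 when x plus noise exceeds 5.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    data.append((x, int(x + rng.gauss(0, 1) > 5)))
train, validation = data[:200], data[200:]

def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)

def accuracy(train_rows, eval_rows, k):
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# Odd values of k avoid tied votes.
results = {k: accuracy(train, validation, k) for k in (1, 3, 5, 9, 15)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

If every value of `k` gives roughly the same score, that is a sign the tuning step can be skipped.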

A model is trained using historical data. After the model is trained, it is put into production, where it’s used to predict future outcomes using current data.

Likewise, once you’ve prepared your dish and are happy with the taste you’re ready to serve it to your family and friends.

This is a rarely mentioned last step. Models that have been in production occasionally need to be retrained with more current data. A model uses data as of a point in time to predict future user behavior. If user behavior changes over time, the model cannot capture the change and its prediction accuracy will drop. The pandemic is a clear example: user behavior changed dramatically in 2020, so predictions from models built using data from a year earlier will be impacted.

The closest cooking analogy I can think of is a dish that requires a seasonal or key ingredient that is not readily available. In this case, you use an alternative that’s close, but the dish is not as good as the original recipe.
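As a rough sketch of how a team might notice that a model needs retraining, the example below compares accuracy recorded on older hold-out data with accuracy on newer data and retrains when the drop exceeds a tolerance. The simulated shift in behavior, the `MAX_DROP` value, and the one-feature cutoff model are all assumptions made for illustration.

```python
import random

def make_data(n, boundary, seed):
    """Toy data: the label is 1 when x plus noise exceeds `boundary`."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, int(x + rng.gauss(0, 1) > boundary)))
    return rows

def accuracy(cutoff, rows):
    """Share of rows where 'x > cutoff' matches the observed label."""
    return sum((x > cutoff) == bool(y) for x, y in rows) / len(rows)

def fit_cutoff(rows):
    """Pick the cutoff, among observed x values, with the best training accuracy."""
    return max((x for x, _ in rows), key=lambda c: accuracy(c, rows))

last_year = make_data(300, boundary=5, seed=0)
this_year = make_data(300, boundary=8, seed=1)   # simulated change in behavior

# Train on older data and record accuracy on a hold-out slice at launch.
old_cutoff = fit_cutoff(last_year[:200])
baseline = accuracy(old_cutoff, last_year[200:])

# Later: score the same model on newer labeled data.
current = accuracy(old_cutoff, this_year)

MAX_DROP = 0.05  # hypothetical tolerance before retraining is triggered
needs_retraining = baseline - current > MAX_DROP

new_cutoff = fit_cutoff(this_year) if needs_retraining else old_cutoff
print(f"baseline={baseline:.2f} current={current:.2f} retrain={needs_retraining}")
```

The retraining itself is just the original training step run again on more recent data.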
