*This post was originally published by Luke Sun at Towards Data Science*

## Creation and Evaluation of a Handful of Machine Learning Models for Leave Prediction

Photo by Olivier Collet from Unsplash

In previous posts, I tried to predict whether a bank customer is likely to leave, or whether an app user is likely to churn or subscribe. Here I will share recent work in the human resources domain to bring some predictive power to any firm struggling to retain its employees.

In this second post, I aim to evaluate and contrast the performances of a handful of different models. As always, it is split into:

- Data Engineering
- Data Processing
- Model Creation & Evaluation
- Takeaways

**1. Data Engineering**

Having completed a brief data exploration in the first post, let’s proceed with feature engineering and data encoding. Feature engineering involves creating new features and relationships from current features.

To start off, let's separate the categorical variables from the numerical ones. We can use the **dtype** attribute to find the categorical variables, as their dtype would be *'object'*. You may notice that the data types are already shown when using *employee_df.info()*.

Then, we can encode the categorical variables. Two methods are available: one is **OneHotEncoder from sklearn**, and the other is *get_dummies()* from *pandas*. I prefer the latter, as it returns a DataFrame, which makes the following step easy. Specifically,

`employee_df_cat = pd.get_dummies(employee_df_cat)`
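The categorical/numerical split can also be done programmatically with `select_dtypes`. A minimal sketch on a toy DataFrame (the column names here are illustrative, not the full dataset):

```python
import pandas as pd

# Toy frame standing in for employee_df (illustrative columns only)
df = pd.DataFrame({
    "Age": [30, 41, 36],
    "Department": ["Sales", "R&D", "Sales"],
    "MonthlyIncome": [5000, 7200, 6100],
})

# Columns with dtype 'object' are the categorical ones
df_cat = df.select_dtypes(include="object")
df_num = df.select_dtypes(exclude="object")
```

This avoids listing column names by hand and stays correct if columns are added later.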

Then, concatenate the encoded categorical and numerical variables together. Specifically,

`X_all = pd.concat([X_cat, X_numerical], axis = 1)`

One final step is to generate the target variable.

```
employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
y = employee_df['Attrition']
```
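The lambda mapping can be sanity-checked on a tiny Series before applying it to the full column:

```python
import pandas as pd

attrition = pd.Series(["Yes", "No", "No", "Yes"])
y = attrition.apply(lambda x: 1 if x == "Yes" else 0)
print(y.tolist())  # → [1, 0, 0, 1]
```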

**2. Data Processing**

Now we are ready to process the data, including data split, scaling, and balancing.

To make the data ready for training, we need to scale the features so that no variable dominates the others, i.e., takes on higher weights and exerts an outsized influence on model learning. Specifically,

```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X_all)
```
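MinMaxScaler rescales each feature independently to the [0, 1] range, which is what neutralizes differences in raw magnitude. A quick check on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])
X_scaled = MinMaxScaler().fit_transform(X_toy)
# Each column now runs from 0 to 1, regardless of its original range
print(X_scaled[:, 0].tolist())  # → [0.0, 0.5, 1.0]
```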

Now let’s partition the dataset into a training set and a test set. To split data,

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
```
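With `test_size = 0.25`, a quarter of the rows go to the test set. A small sketch (`random_state` is added here only to make the example reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)  # → (6, 2) (2, 2)
```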

We have noted a severe imbalance between employees who leave and employees who stay. So let's apply the **SMOTE** method to oversample the minority class. Specifically,

```
from imblearn.over_sampling import SMOTE

oversampler = SMOTE(random_state=0)
X_smote_train, y_smote_train = oversampler.fit_resample(X_train, y_train)
```

Great! Now the data is ready for the model 📣📣.

**3. Model Creation & Evaluation**

As alluded at the beginning of the post, we aim to evaluate and compare the performance of a handful of models.

3.1 Logistic Regression

**Simply put, logistic regression applies the logistic (sigmoid) function to a linear combination of the independent variables, so the log-odds of the outcome are modeled linearly; this lets a linear method handle a nonlinear classification problem.** It is commonly used for binary classification problems where some correlation between the predictors and the response variable is assumed.

To create a logistic regression classifier, we use sklearn as below.

```
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_smote_train, y_smote_train)
```

To evaluate the performance, we use the confusion matrix.

```
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true labels, columns: predicted labels
sns.heatmap(cm, annot = True)
```

As indicated in Fig.1, the logistic regression classifier gives an accuracy of 0.75 and an F1 score of 0.52.
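Both accuracy and the F1 score fall out of the confusion-matrix counts. A small worked example (the labels here are made up, not the article's data):

```python
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)  # (tp + tn) / total
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(acc, round(f1, 3))  # → 0.75 0.667
```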

3.2 Random Forest

Random Forest is an ensemble model with the decision tree as its building block. It creates a group of decision trees and aggregates their predictions to obtain relatively strong performance. For a really good read that drives home the basics of Random Forest, refer to this CitizenNet blog.

To create a Random Forest classifier,

```
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_smote_train, y_smote_train)
```

Using the same method to evaluate the performance, we obtained an accuracy of 0.85 and an F1 score of 0.39.
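One practical bonus of Random Forest in an HR setting is `feature_importances_`, which ranks the drivers behind the predictions. A sketch on synthetic data (not the employee dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_toy, y_toy)

# One importance score per feature; the scores sum to 1
print(rf.feature_importances_.round(2))
```

On the real data, pairing these scores with the column names would show which employee attributes the forest leans on most.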

3.3 Artificial Neural Network

The final attempt is to create and train an artificial neural network. Here we will build a sequential model with a few dense layers and dropout technique to reduce overfitting. Specifically,

```
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(units = 50, activation = 'relu', input_shape = (50, )))
model.add(Dense(units = 500, activation = 'relu'))
model.add(Dropout(0.3))
model.add(Dense(units = 500, activation = 'relu'))
model.add(Dropout(0.3))
model.add(Dense(units = 50, activation = 'relu'))
model.add(Dense(units = 1, activation = 'sigmoid'))
```

To compile the neural network, we use the *'adam'* optimizer and binary cross-entropy as the loss function.

```
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
epochs_hist = model.fit(X_smote_train, y_smote_train, epochs = 50, batch_size = 50)
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)
```

Note that above, we set the threshold probability for the sigmoid output at 0.5: any output greater than 0.5 is taken as 'leave', and anything else as 'stay'. Fig.3 shows the model loss during training. It seems the model converged after about 20 epochs.
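The thresholding step can be checked in isolation: the sigmoid output is a probability per row, and the comparison turns it into a hard label.

```python
import numpy as np

# Fake sigmoid outputs standing in for model.predict(X_test)
probs = np.array([[0.20], [0.70], [0.50], [0.91]])
labels = (probs > 0.5).astype(int)
print(labels.ravel().tolist())  # → [0, 1, 0, 1]
```

Note that exactly 0.5 falls on the 'stay' side, since the comparison is strict.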

Finally, the confusion matrix heat map as shown in Fig.4 gives an accuracy of 0.88 and an F1 score of 0.41.

**4. Takeaways**

At last, let's compile the performance in Table 1. To better understand the metrics, take one step back: we are tasked with predicting whether an employee is likely to leave. Due to the high imbalance between the classes, accuracy is not a good indicator. In my view, reducing false-negative errors is more meaningful than reducing false positives, because it means the model identifies more of the people who are actually going to leave 🤔. By this logic, the logistic regression model is the winner. But obviously, there is still quite some room for improvement.

**Great! Hopefully, this post laid a good foundation on different EDA and machine learning techniques. As usual, if you are interested in the code, check my GitHub repository here 🤞🤞.**
