Combining logistic regression and decision tree

This post was originally published by Andrzej Szymanski, PhD at Towards Data Science

Making logistic regression less linear

Logistic regression is one of the most widely used machine learning techniques. Its main advantages are clarity of results and its ability to explain the relationship between the dependent and independent features in a simple manner. It requires comparatively little processing power and is, in general, faster than Random Forest or Gradient Boosting.

However, it also has some serious drawbacks, the main one being its limited ability to resolve non-linear problems. In this article, I will demonstrate how we can improve the prediction of non-linear relationships by incorporating a decision tree into a regression model.

The idea is quite similar to weight of evidence (WoE), a method widely used in finance for building scorecards. WoE takes a feature (continuous or categorical) and splits it into bands to maximise the separation between goods and bads (positives and negatives). A decision tree carries out a very similar task, splitting the data into nodes to achieve maximum segregation between positives and negatives. The main difference is that WoE is built separately for each feature, while the nodes of a decision tree can combine multiple features at the same time.

Knowing that the decision tree is good at identifying non-linear relationships between dependent and independent features, we can transform the output of the decision tree (nodes) into a categorical variable and then deploy it in a logistic regression, by transforming each of the categories (nodes) into dummy variables.

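As a quick, self-contained illustration of this idea (a sketch on toy data, not the banking pipeline used later in this article), scikit-learn's apply method returns the leaf (node) index for every row, which can then be one-hot encoded and fed to a logistic regression:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Toy data with a non-linear target, for illustration only
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f1', 'f2', 'f3'])
y = (X['f1'] * X['f2'] > 0).astype(int)

# Shallow tree -> leaf index per row -> dummy variables -> logistic regression
dt = DecisionTreeClassifier(max_depth=3).fit(X, y)
leaf_ids = dt.apply(X)  # node (leaf) index for each observation
node_dummies = pd.get_dummies(pd.Series(leaf_ids, index=X.index), prefix='node', drop_first=True)
X_with_nodes = pd.concat([X, node_dummies], axis=1)
lr = LogisticRegression(max_iter=1000).fit(X_with_nodes, y)
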
In my professional projects, using decision tree nodes in the model would outperform both the logistic regression and the decision tree results in about a third of cases. However, I have struggled to find any publicly available data which could replicate this. This is probably because the available data contain only a handful of variables, pre-selected and cleansed. There is simply not much to squeeze! It is much easier to find additional dimensions of the relationship between dependent and independent features when we have hundreds or thousands of variables at our disposal.

In the end, I decided to use data from a banking campaign. Using these data I managed to get a minor, but still real, improvement of the combined logistic regression and decision tree over both of these methods used separately.

After importing the data I did some cleansing. The code used in this article is available on GitHub. I have saved the cleansed data into a separate file.

Because of the low frequency of positives, I decided to oversample the training data using the SMOTE technique.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv('banking_cleansed.csv')
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# Split first, then oversample the training data only
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
os_data_X, os_data_y = os.fit_resample(X_train, y_train)  # fit_sample() in older imblearn releases
os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
os_data_y = pd.DataFrame({'y': os_data_y})

In the next steps, I built three models:

  • decision tree
  • logistic regression
  • logistic regression with decision tree nodes

Decision tree

It is important to keep the decision tree depth to a minimum if you want to combine it with logistic regression. I would keep the decision tree to a maximum depth of 4, which already gives up to 16 categories. Too many categories may cause cardinality problems and overfit the model. In our example, the incremental increase in predictability between a depth of 3 and a depth of 4 was minor, so I opted for a maximum depth of 3.

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import roc_auc_score

dt = DecisionTreeClassifier(criterion='gini', min_samples_split=200, min_samples_leaf=100, max_depth=3)
dt.fit(os_data_X, os_data_y)
y_pred3 = dt.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred3).sum())
print(metrics.classification_report(y_test, y_pred3))
print(roc_auc_score(y_test, y_pred3))

The next step is to convert the nodes into a new variable. To do so, we need to code up the decision tree rules. Luckily, a small helper function can do it for us: it produces a piece of code which replicates the decision tree split rules.

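A minimal version of such a helper, based on the widely used scikit-learn tree-traversal recipe, might look like the sketch below; it prints a ready-to-paste tree(row) function that returns a node label for each row.

from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    # Walk the fitted tree and print Python if/else rules,
    # labelling each leaf with a 'node_<id>' string.
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree(row):")

    def recurse(node, depth):
        indent = "    " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("%sif row['%s'] <= %s:" % (indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("%selse:  # row['%s'] > %s" % (indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("%sreturn 'node_%d'" % (indent, node))

    recurse(0, 1)
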
Now run the code:

tree_to_code(dt,columns)

and the output will be a ready-to-paste Python function that reproduces the tree's split rules.

We can now copy and paste the output into our script, which gives us the function we can use to create our new categorical variable.

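The exact rules depend on the splits chosen by the fitted tree; the example below is purely illustrative, with hypothetical feature names and thresholds, and only shows the shape such a pasted function takes.

def tree(row):
    # Illustrative only: replace these conditions with the rules
    # printed by tree_to_code for your own fitted tree.
    if row['duration'] <= 210.5:
        if row['nr_employed'] <= 5087.65:
            return 'node_2'
        else:
            return 'node_3'
    else:
        if row['euribor3m'] <= 1.19:
            return 'node_5'
        else:
            return 'node_6'
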
Now we can quickly create a new variable ('nodes') and convert it into dummies.

df['nodes']=df.apply(tree, axis=1)
df_n= pd.get_dummies(df['nodes'],drop_first=True)
df_2=pd.concat([df, df_n], axis=1)
df_2=df_2.drop(['nodes'],axis=1)

After adding the nodes variable, I re-ran the split into train and test groups and oversampled the training data using SMOTE.

X = df_2.iloc[:, 1:]
y = df_2.iloc[:, 0]

os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
os_data_X, os_data_y = os.fit_resample(X_train, y_train)  # keeps the DataFrame, so node columns can be dropped by name later

Now we can run logistic regressions and compare the impact of node dummies on predictability.

Logistic regression excluding node dummies

I have created a list of all features excluding the node dummies:

nodes=df_n.columns.tolist()
Init = os_data_X.drop(nodes,axis=1).columns.tolist()

and ran the logistic regression using the Init list:

from sklearn.linear_model import LogisticRegression

lr0 = LogisticRegression(C=0.001, random_state=1)
lr0.fit(os_data_X[Init], os_data_y)
y_pred0 = lr0.predict(X_test[Init])

print('Misclassified samples: %d' % (y_test != y_pred0).sum())
print(metrics.classification_report(y_test, y_pred0))
print(roc_auc_score(y_test, y_pred0))

Logistic regression with node dummies

In the next step I re-ran the regression, but this time I included the node dummies.

from sklearn.linear_model import LogisticRegression

lr1 = LogisticRegression(C=0.001, random_state=1)
lr1.fit(os_data_X, os_data_y)
y_pred1 = lr1.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred1).sum())
print(metrics.classification_report(y_test, y_pred1))
print(roc_auc_score(y_test, y_pred1))

Results comparison

The logistic regression with node dummies has the best performance. Although the incremental improvement is not massive (especially compared with the decision tree), as I said before, it is hard to squeeze anything extra out of data which contain only a handful of pre-selected variables, and I can reassure you that in real life the differences can be bigger.

We can scrutinise the models a little more by comparing the distribution of positives and negatives across score deciles using Model Lift, which I presented in my previous article.

The first step is to obtain the predicted probabilities:

y_pred0b=lr0.predict_proba(X_test[Init])
y_pred1b=lr1.predict_proba(X_test)

Next we need a function that computes the lift table.

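A minimal sketch of such a lift function (assuming it takes the true labels, the full predict_proba output and the number of bands, and reports the response rate and Kolmogorov-Smirnov statistic per band) might look like this:

import numpy as np
import pandas as pd

def lift(y_true, y_proba, bands=10):
    # Rank observations by the predicted probability of the positive class,
    # cut them into `bands` equal-sized groups (band 1 = highest scores)
    # and report the response rate and cumulative KS per band.
    scores = pd.DataFrame({
        'actual': np.asarray(y_true).ravel(),
        'proba': np.asarray(y_proba)[:, 1],   # probability of class 1
    })
    scores['band'] = pd.qcut(scores['proba'].rank(method='first'), bands, labels=False)
    scores['band'] = bands - scores['band']   # band 1 = highest probabilities
    table = scores.groupby('band').agg(
        total=('actual', 'size'),
        positives=('actual', 'sum'),
    )
    table['response_rate'] = table['positives'] / table['total']
    table['negatives'] = table['total'] - table['positives']
    cum_pos = table['positives'].cumsum() / table['positives'].sum()
    cum_neg = table['negatives'].cumsum() / table['negatives'].sum()
    table['KS'] = (cum_pos - cum_neg).abs() * 100
    return table
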
Now we can check the differences between these two models. First, let's evaluate the performance of the initial model without the decision tree nodes.

ModelLift0 = lift(y_test,y_pred0b,10)
ModelLift0

Model Lift before applying decision tree nodes…

…and next the model with decision tree nodes

ModelLift1 = lift(y_test,y_pred1b,10)
ModelLift1

The response in the top 2 deciles of the model with decision tree nodes has improved, and so has the Kolmogorov-Smirnov statistic (KS). Once we translate the lift into financial value, it may turn out that this minimal incremental improvement generates a substantial return in our marketing campaign.

To summarise, combining logistic regression and a decision tree is not a well-known approach, but it may outperform the individual results of both methods. In the example presented in this article, the differences between the decision tree and the second logistic regression are negligible. However, in real life, when working on unpolished data, combining a decision tree with logistic regression may produce far better results; that was rather the norm in the projects I have run in the past. The node variable may not be a magic wand, but it is definitely something worth knowing and trying out.
