Hybrid Rule-Based Machine Learning with scikit-learn

This post was originally published by Lukas Haas at Towards Data Science

There are many ways to integrate deterministic rules into a machine learning pipeline. Adding rules as data pre-processing steps might seem intuitive, but it would not suit our goal. Instead, we leverage the concept of abstraction and use object-oriented programming (OOP) to create a new ML model class. This hybrid model encapsulates all deterministic rules, allowing us to train it like any other machine learning model.

Conveniently, scikit-learn provides a BaseEstimator class from which we can inherit to build our own scikit-learn models without much effort. The advantage of building a new estimator is that we can blend our rules directly with the model logic while leveraging an underlying machine learning model for all data to which the rules don’t apply.

Let us start by building our new hybrid model class and adding an __init__ method to it. As the underlying model, we will use the scikit-learn implementation of a GradientBoostingClassifier; we will call it the base_model.

import numpy as np
import pandas as pd
from typing import Dict, Tuple
from sklearn.base import BaseEstimator


class RuleAugmentedGBC(BaseEstimator):

    def __init__(self, base_model: BaseEstimator, rules: Dict, **base_params):
        self.rules = rules
        self.base_model = base_model
        # Forward any additional parameters directly to the underlying model.
        self.base_model.set_params(**base_params)


We created the RuleAugmentedGBC class, which inherits from BaseEstimator. Our class is not complete yet and is still missing some essential methods, but it is now technically a scikit-learn estimator. The __init__ method initializes our estimator with a base_model and a dictionary of rules. Any additional parameters passed to __init__ are forwarded directly to the underlying base_model. In our case, we will use a GradientBoostingClassifier as the base_model.

A Common Format for Rules

In this article’s implementation, we will supply rules to the model in the following format:

{"House Price": [
("<", 1000.0, 0.0),
(">=", 500000.0, 1.0)
],
"...": [
...
...
]}

As illustrated above, we format rules as a Python dictionary. The dictionary keys represent the feature column names to which we want to apply our rules. The values of the dictionary are lists of tuples, each tuple representing a unique rule. The first element of the tuple is the logical operator of the rule, the second the split criterion, and the last object is the value which the model should return if the rule is applicable.

For instance, the first rule in the example above would indicate that if any value in the House Price feature column is less than 1000.0, the model should return the value 0.0.
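To make this concrete, here is a minimal sketch of how the example rules above could be passed to our estimator (the feature name, thresholds, and the n_estimators value are illustrative assumptions):

from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical rules: house prices below 1000.0 predict class 0.0,
# prices of 500000.0 or more predict class 1.0.
rules = {
    "House Price": [
        ("<", 1000.0, 0.0),
        (">=", 500000.0, 1.0)
    ]
}

# Extra keyword arguments such as n_estimators are forwarded to the
# underlying GradientBoostingClassifier via set_params.
model = RuleAugmentedGBC(GradientBoostingClassifier(), rules, n_estimators=50)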

The fit Method

We proceed to code a fit method (within our RuleAugmentedGBC class) to allow our model to train on data. The important point is that we want to apply our deterministic rules wherever possible and train the base_model only on the data not covered by the rules. We will factor this step out into a private helper method called _get_base_model_data, which extracts the data needed to train our base_model.

def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs):
    """Train the underlying base_model on the rows not covered by any rule."""
    train_x, train_y = self._get_base_model_data(X, y)
    self.base_model.fit(train_x, train_y, **kwargs)

The fit method is straightforward: it first applies the _get_base_model_data helper (implemented below) to distill the training features and labels for our underlying base_model, and then fits the model to that data. As before, we can pass additional parameters, which are forwarded to the fit method of the base_model. Let us now implement the _get_base_model_data method:

def _get_base_model_data(self, X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
    """Filter out all rows covered by a rule and return the remainder."""
    train_x = X

    for category, rules in self.rules.items():
        if category not in train_x.columns.values:
            continue
        for rule in rules:
            # Keep only the rows the rule does *not* apply to.
            if rule[0] == "=":
                train_x = train_x.loc[train_x[category] != rule[1]]
            elif rule[0] == "<":
                train_x = train_x.loc[train_x[category] >= rule[1]]
            elif rule[0] == ">":
                train_x = train_x.loc[train_x[category] <= rule[1]]
            elif rule[0] == "<=":
                train_x = train_x.loc[train_x[category] > rule[1]]
            elif rule[0] == ">=":
                train_x = train_x.loc[train_x[category] < rule[1]]
            else:
                print("Invalid rule detected: {}".format(rule))

    # Match the remaining rows with their labels by index label.
    indices = train_x.index.values
    train_y = y.loc[indices]
    train_x = train_x.reset_index(drop=True)
    train_y = train_y.reset_index(drop=True)

    return train_x, train_y

Our private _get_base_model_data method iterates over the keys of the rules dictionary and then over every individual rule. For each rule, depending on its logical operator, it narrows down the train_x pandas dataframe to only the data points not affected by that rule. Once all rules have been applied, we match the remaining rows with their labels via the index and return the residual data for the base_model.
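As a quick sanity check, the following hypothetical snippet (continuing the example above) shows which rows survive the filtering:

import pandas as pd

# Hypothetical training data.
X = pd.DataFrame({"House Price": [500.0, 2000.0, 600000.0, 300000.0, 150000.0, 80000.0]})
y = pd.Series([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])

# Rows 0 (500.0 < 1000.0) and 2 (600000.0 >= 500000.0) are covered by the
# rules, so only the remaining four rows are returned for the base model.
train_x, train_y = model._get_base_model_data(X, y)
print(train_x["House Price"].tolist())  # [2000.0, 300000.0, 150000.0, 80000.0]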

The predict Method

The predict method works analogously to the fit method: wherever possible, rules are applied; where no rule applies, the base_model produces the prediction.

def predict(self, X: pd.DataFrame) -> np.ndarray:
    """Apply rules where possible; defer to the base model otherwise."""
    p_X = X.copy()
    p_X['prediction'] = np.nan

    for category, rules in self.rules.items():
        if category not in p_X.columns.values:
            continue
        for rule in rules:
            if rule[0] == "=":
                p_X.loc[p_X[category] == rule[1], 'prediction'] = rule[2]
            elif rule[0] == "<":
                p_X.loc[p_X[category] < rule[1], 'prediction'] = rule[2]
            elif rule[0] == ">":
                p_X.loc[p_X[category] > rule[1], 'prediction'] = rule[2]
            elif rule[0] == "<=":
                p_X.loc[p_X[category] <= rule[1], 'prediction'] = rule[2]
            elif rule[0] == ">=":
                p_X.loc[p_X[category] >= rule[1], 'prediction'] = rule[2]
            else:
                print("Invalid rule detected: {}".format(rule))

    # For any rows no rule matched, fall back to the base model.
    if len(p_X.loc[p_X['prediction'].isna()].index) != 0:
        base_X = p_X.loc[p_X['prediction'].isna()].copy()
        base_X.drop('prediction', axis=1, inplace=True)
        p_X.loc[p_X['prediction'].isna(), 'prediction'] = self.base_model.predict(base_X)

    return p_X['prediction'].values

The predict method copies the input pandas dataframe so as not to modify the input data. We then add a prediction column in which we gather all of our hybrid model’s predictions. Just as in the _get_base_model_data method, we iterate through all rules and, wherever a rule applies, record its return value in the prediction column. Once all rules have been applied, we check whether any predictions are still missing; if so, we fall back on our base_model to generate the remaining predictions.
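Putting everything together, a minimal end-to-end usage sketch (continuing the hypothetical rules and data from the earlier snippets) could look like this:

# Train on the rule-free rows, then predict on new, hypothetical data.
model.fit(X, y)

X_new = pd.DataFrame({"House Price": [800.0, 750000.0, 150000.0]})
print(model.predict(X_new))
# 800.0 and 750000.0 are answered by the rules (0.0 and 1.0 respectively);
# 150000.0 falls through to the trained GradientBoostingClassifier.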

Other Required Methods

To get a working model that inherits from the BaseEstimator class, we need to implement two more simple methods: get_params and set_params. These allow us to set and read the parameters of our new model. As these two methods are not integral to the topic of this article, we only cover them briefly.
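As a rough sketch, and not necessarily the author’s original implementation, the two methods could look as follows; here we choose to forward any unknown parameters to the base_model:

def get_params(self, deep: bool = True) -> Dict:
    """Return this estimator's parameters, including the base model's."""
    params = {"base_model": self.base_model, "rules": self.rules}
    params.update(self.base_model.get_params(deep=deep))
    return params

def set_params(self, **params):
    """Set parameters; unknown keys are forwarded to the base model."""
    if "base_model" in params:
        self.base_model = params.pop("base_model")
    if "rules" in params:
        self.rules = params.pop("rules")
    if params:
        self.base_model.set_params(**params)
    return self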
