*This post was originally published by Lukas Haas at Towards Data Science*

There are many ways in which we can integrate deterministic rules into our machine learning pipeline. Adding rules progressively as data pre-processing steps might seem intuitive, but this would not suit our goal. Preferably, we aim to leverage the concept of abstraction by adopting object-oriented programming (OOP) to generate a novel ML model class. This hybrid model will then encompass all deterministic rules, enabling us to train it like any other machine learning model.

Conveniently, *scikit-learn* provides a *BaseEstimator* class which we can inherit to build scikit-learn models ourselves without much effort. The advantage of building a new estimator is that we can blend our rules directly with the model logic while leveraging an underlying machine learning model for all data to which the rules don’t apply.

Let us start by building our new hybrid model class and adding an *init* method to it. As an underlying model, we will use the *scikit-learn* implementation of a *GradientBoostingClassifier*; we will call it the *base_model*.

```
import numpy as np
import pandas as pd
from typing import Dict, Tuple

from sklearn.base import BaseEstimator


class RuleAugmentedGBC(BaseEstimator):

    def __init__(self, base_model: BaseEstimator, rules: Dict, **base_params):
        self.rules = rules
        self.base_model = base_model
        self.base_model.set_params(**base_params)
```

We created the *RuleAugmentedGBC* class which inherits from *BaseEstimator*. Our class is not complete yet and is still missing some essential methods, but it is now technically a *scikit-learn* estimator. The *init* method initializes our estimator utilizing a *base_model* and a dictionary of rules. We can set additional parameters in the *init* method which are then directly passed to the underlying *base_model*. In our case, we will use a *GradientBoostingClassifier* as the *base_model*.

### A Common Format for Rules

In this article’s implementation, we will supply rules to the model in the following format:

```
{"House Price": [
("<", 1000.0, 0.0),
(">=", 500000.0, 1.0)
],
"...": [
...
...
]}
```

As illustrated above, we format rules as a *Python* dictionary. The dictionary keys represent the feature column names to which we want to apply our rules. The values of the dictionary are lists of tuples, each tuple representing a unique rule. The first element of the tuple is the logical operator of the rule, the second the split criterion, and the last object is the value which the model should return if the rule is applicable.

For instance, the first rule in the example above would indicate that if any value in the *House Price* feature column is less than 1000.0, the model should return the value 0.0.
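To make the semantics concrete, here is a small, hypothetical helper (not part of the model class) showing how a single rule tuple could be evaluated against a value:

```python
# Hypothetical helper illustrating how one rule tuple is interpreted;
# the model class itself inlines this logic instead.
def rule_applies(value: float, rule: tuple) -> bool:
    op, threshold, _return_value = rule
    if op == "=":
        return value == threshold
    if op == "<":
        return value < threshold
    if op == ">":
        return value > threshold
    if op == "<=":
        return value <= threshold
    if op == ">=":
        return value >= threshold
    raise ValueError("Invalid rule: {}".format(rule))

rules = {"House Price": [("<", 1000.0, 0.0), (">=", 500000.0, 1.0)]}
first_rule = rules["House Price"][0]
```

A house price of 500.0 matches the first rule, so the model would return 0.0 for that row without consulting the underlying estimator.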

### The fit Method

We proceed to code a *fit* method (within our *RuleAugmentedGBC* class) to allow our model to train on data. What is important to notice here is that we want to use our deterministic rules wherever possible, and train the *base_model* only on data which is not affected by the rules. We will decompose this step by formulating a private helper method called *_get_base_model_data* to filter out the data necessary to train our *base_model.*

```
    def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs):
        train_x, train_y = self._get_base_model_data(X, y)
        self.base_model.fit(train_x, train_y, **kwargs)
```

The *fit* method is straightforward: it first applies the yet-to-be-implemented *_get_base_model_data* method to distill the training features and labels for our underlying *base_model*, and then fits the model to that data. As before, we can pass additional parameters, which are forwarded to the *fit* method of the *base_model*. Let us now implement the *_get_base_model_data* method:

```
    def _get_base_model_data(self, X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
        train_x = X

        for category, rules in self.rules.items():

            if category not in train_x.columns.values:
                continue

            for rule in rules:
                if rule[0] == "=":
                    train_x = train_x.loc[train_x[category] != rule[1]]
                elif rule[0] == "<":
                    train_x = train_x.loc[train_x[category] >= rule[1]]
                elif rule[0] == ">":
                    train_x = train_x.loc[train_x[category] <= rule[1]]
                elif rule[0] == "<=":
                    train_x = train_x.loc[train_x[category] > rule[1]]
                elif rule[0] == ">=":
                    train_x = train_x.loc[train_x[category] < rule[1]]
                else:
                    print("Invalid rule detected: {}".format(rule))

        indices = train_x.index.values
        train_y = y.iloc[indices]
        train_x = train_x.reset_index(drop=True)
        train_y = train_y.reset_index(drop=True)

        return train_x, train_y
```

Our private *_get_base_model_data* method iterates through the rule dictionary keys and then through every individual rule. Depending on each rule's logical operator, it narrows down the *train_x* pandas dataframe to only the data points not covered by that rule. Once all rules have been applied, we match the corresponding labels via the remaining indices and return the residual data for the *base_model*.
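As a quick illustration of this filtering, here is the same logic applied by hand to a tiny dataset, using the example *House Price* rules from above (a standalone sketch, not part of the class):

```python
import pandas as pd

X = pd.DataFrame({"House Price": [500.0, 2000.0, 600000.0]})
y = pd.Series([0.0, 1.0, 1.0])

train_x = X
train_x = train_x.loc[train_x["House Price"] >= 1000.0]   # rows covered by ("<", 1000.0, 0.0) are dropped
train_x = train_x.loc[train_x["House Price"] < 500000.0]  # rows covered by (">=", 500000.0, 1.0) are dropped

# Match the surviving labels via the remaining indices, then reset.
train_y = y.iloc[train_x.index.values]
train_x = train_x.reset_index(drop=True)
train_y = train_y.reset_index(drop=True)
```

Only the middle row (2000.0) survives: it is the one data point the rules do not cover, so it is the only one the *base_model* would be trained on.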

### The predict Method

The *predict* method works analogously to the *fit* method: wherever possible, the rules are applied; for all remaining rows, the *base_model* produces the prediction.

```
    def predict(self, X: pd.DataFrame) -> np.array:
        p_X = X.copy()
        p_X['prediction'] = np.nan

        for category, rules in self.rules.items():

            if category not in p_X.columns.values:
                continue

            for rule in rules:
                if rule[0] == "=":
                    p_X.loc[p_X[category] == rule[1], 'prediction'] = rule[2]
                elif rule[0] == "<":
                    p_X.loc[p_X[category] < rule[1], 'prediction'] = rule[2]
                elif rule[0] == ">":
                    p_X.loc[p_X[category] > rule[1], 'prediction'] = rule[2]
                elif rule[0] == "<=":
                    p_X.loc[p_X[category] <= rule[1], 'prediction'] = rule[2]
                elif rule[0] == ">=":
                    p_X.loc[p_X[category] >= rule[1], 'prediction'] = rule[2]
                else:
                    print("Invalid rule detected: {}".format(rule))

        if len(p_X.loc[p_X['prediction'].isna()].index) != 0:
            base_X = p_X.loc[p_X['prediction'].isna()].copy()
            base_X.drop('prediction', axis=1, inplace=True)
            p_X.loc[p_X['prediction'].isna(), 'prediction'] = self.base_model.predict(base_X)

        return p_X['prediction'].values
```

The *predict* method copies our input pandas dataframe in order not to change the input data. We then add a *prediction* column in which we gather all our hybrid model’s predictions. Just as in the *_get_base_model_data* method, we iterate through all rules and, wherever applicable, record the corresponding return value in the *prediction* column. Once we have applied all rules, we check whether any predictions are still missing. If this is the case, we revert to our *base_model* to generate the remaining predictions.
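To see this merge logic in isolation, here is a small self-contained sketch of the prediction step, with a stand-in function replacing the *base_model* (the data and fallback values are purely illustrative):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"House Price": [500.0, 2000.0, 600000.0]})
p_X = X.copy()
p_X["prediction"] = np.nan

# Apply the example rules from above.
p_X.loc[p_X["House Price"] < 1000.0, "prediction"] = 0.0      # rule ("<", 1000.0, 0.0)
p_X.loc[p_X["House Price"] >= 500000.0, "prediction"] = 1.0   # rule (">=", 500000.0, 1.0)

# Fill remaining rows with the fallback model's predictions.
remaining = p_X["prediction"].isna()
if remaining.any():
    base_X = p_X.loc[remaining].drop("prediction", axis=1)
    # Stand-in for self.base_model.predict(base_X); always predicts 1.0 here.
    p_X.loc[remaining, "prediction"] = np.ones(len(base_X))

predictions = p_X["prediction"].values
```

The first and last rows are answered by the rules; only the middle row ever reaches the fallback model.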

### Other Required Methods

To get a working model that inherits from the *BaseEstimator* class, we need to implement two more simple methods — *get_params* and *set_params*. These allow us to set and read the parameters of our new model. As these two methods are not integral to the topic of this article, please have a look at the fully documented implementation below if you want to know more.
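For completeness, here is one possible sketch of these two methods (the fully documented implementation referenced above may differ): parameters belonging to our estimator are handled directly, and everything else is forwarded to the *base_model*.

```python
from typing import Dict

from sklearn.base import BaseEstimator
from sklearn.ensemble import GradientBoostingClassifier


class RuleAugmentedGBC(BaseEstimator):

    def __init__(self, base_model: BaseEstimator, rules: Dict, **base_params):
        self.rules = rules
        self.base_model = base_model
        self.base_model.set_params(**base_params)

    def get_params(self, deep: bool = True) -> Dict:
        # Expose our own parameters alongside those of the underlying model.
        params = {"base_model": self.base_model, "rules": self.rules}
        params.update(self.base_model.get_params(deep=deep))
        return params

    def set_params(self, **params):
        # Consume our own parameters, forward the rest to the base_model.
        if "base_model" in params:
            self.base_model = params.pop("base_model")
        if "rules" in params:
            self.rules = params.pop("rules")
        if params:
            self.base_model.set_params(**params)
        return self
```

With this in place, calls like `model.set_params(n_estimators=50)` transparently reach the underlying *GradientBoostingClassifier*, which is what makes the hybrid model usable in scikit-learn tooling such as grid search.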
