This post was originally published by Mahmoud Albardan at Medium [AI]
As the arsenal of classification algorithms has grown dramatically, it has become more and more tempting to use several classifiers and combine their decisions, both to gain accuracy and to avoid the burden of choosing the right one. Note that a combination of classifiers is itself a classifier, so the no-free-lunch theorem applies to it as well. There is no hope for a universally optimal classifier, but one can achieve other forms of robustness, which we discuss later. One of the simplest fusion methods is to perform majority voting on the set of predicted classes delivered by the individual classifiers.
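To make the simplest case concrete, here is a minimal sketch of majority voting over elementary (single-label) outputs; the `majority_vote` helper and the example labels are illustrative, not part of any particular library:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label predictions by plurality vote.

    predictions: list of labels, one per base classifier.
    Ties are broken in favor of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers predict a label for the same sample.
votes = ["cat", "dog", "cat"]
print(majority_vote(votes))  # -> cat
```

In practice one would also decide how to break ties (e.g., by prior class frequency) and whether some classifiers' votes should carry more weight.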
In this article, I present some generalities to bear in mind when combining classifiers. Specifically, I investigate the taxonomy of the fusion scheme concerning the nature of the combined information, the combination rule, the types of individual classifiers, and the structure of the combination. Then, we discuss the importance of diversity in a multi-classifier system.
Four questions are helpful to determine the perspective from which we are attacking the combination problem:
Q1: What nature of output information is to be combined?
Q2: What types of classifiers are to be combined?
Q3: What combination rule is used to combine decisions?
Q4: What is the structure of this combination?
- A relevant categorization of classifier output natures (Q1) is presented in the literature: elementary outputs (the output of the classifier is a single label), ranked outputs (the output of the classifier is a list of labels ranked from most probable to least probable), and scored outputs (the classifier assigns a degree of confidence to each class label).
- Fusion methods relying on classifiers derived from the same classification algorithm are called homogeneous combination approaches (e.g., a random forest combines decision trees), or ensemble methods. A standard approach in this category is bagging (bootstrap aggregating), first introduced by Breiman and developed further since. Another very popular ensemble method is boosting, whose most widely used implementation is AdaBoost. Bagging and boosting have been thoroughly explored by the machine learning community over the last two decades; recently, a new boosting design based on linear programming support vector machines (LPSVM) was proposed. If the fusion panel contains different types of classifiers, we speak of heterogeneous combination approaches, e.g., combining a neural network and a decision tree. These two types of panels (homogeneous and heterogeneous) are the main categories of multiple classifier systems arising from (Q2).
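A homogeneous combination can be sketched in a few lines. The toy `train_stump` base learner (a one-dimensional threshold classifier) and the data are invented for illustration; only the bootstrap-resample-then-vote structure is the point:

```python
import random

def train_stump(sample):
    """Fit a 1-D threshold classifier: threshold halfway between class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:          # degenerate bootstrap: only one class drawn
        label = sample[0][1]
        return lambda x: label
    thr = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 0 if x < thr else 1

def bagging(data, n_estimators=11, seed=0):
    """Homogeneous combination: train each stump on a bootstrap resample
    of the data, then aggregate the stumps by majority vote."""
    rng = random.Random(seed)
    stumps = [train_stump([rng.choice(data) for _ in data])
              for _ in range(n_estimators)]
    return lambda x: round(sum(s(x) for s in stumps) / n_estimators)

# Toy 1-D two-class dataset: (feature value, label).
data = [(0.05, 0), (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1), (0.95, 1)]
predict = bagging(data)
print(predict(0.0), predict(1.0))
```

The resampling is what injects diversity: each stump sees a slightly different dataset, so their thresholds differ, and the vote averages out their individual variance.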
- Beyond the types of combined classifiers, combination methods are also categorized with respect to the rule governing decision fusion. The two main categories, deterministic versus probabilistic approaches, are two possible answers to (Q3). By probabilistic approaches, we mean that the classification function relies on the maximization of the class probability distribution. Classifier combination can also be performed within other uncertainty frameworks, which may be alternatives to probability theory, such as fuzzy logic, or related to it, such as the belief function framework. In Dempster's original paper, belief functions are obtained by combining a probability measure with a multi-valued mapping. Other authors advocate that the theory is self-contained and need not be built upon probabilistic objects.
- Concerning combination architecture (Q4), we mean how classifiers are organized as a network and connected to the fusion process. There are three main categories of topology: parallel, sequential or hybrid combinations. In parallel fusion, the base classifiers work independently, they may be trained with inputs living in different feature spaces and the feature vectors may or may not be derived from the same raw training examples. However, the output of a given classifier cannot serve as input for another classifier. In sequential (serial) fusion, elementary classifiers are stacked in a sequential way and the decision of one classifier depends on a previous decision. Some class labels are eliminated at each classification step until one class is left. Hybrid hierarchical fusion consists in a mix of parallel and sequential architectures.
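The sequential (serial) topology can be sketched as a cascade in which each stage removes class labels until one survives. The animal features, labels, and stage rules below are entirely hypothetical; only the label-elimination structure reflects the scheme described above:

```python
# Hypothetical two-stage cascade over toy animal features (illustrative only).
def stage_fly(x, remaining):
    # First classifier: flying vs. non-flying; returns labels to eliminate.
    return ({"cat", "dog"} if x["can_fly"] else {"bird"}) & remaining

def stage_size(x, remaining):
    # Second classifier: separates the remaining mammals by weight.
    return ({"dog"} if x["weight_kg"] < 8 else {"cat"}) & remaining

def cascade_classify(x, stages, labels):
    """Sequential fusion: each stage eliminates labels until one is left."""
    remaining = set(labels)
    for stage in stages:
        if len(remaining) == 1:
            break
        remaining -= stage(x, remaining)
    # Deterministic fallback if more than one label survives the cascade.
    return remaining.pop() if len(remaining) == 1 else sorted(remaining)[0]

labels = ["bird", "cat", "dog"]
stages = [stage_fly, stage_size]
print(cascade_classify({"can_fly": False, "weight_kg": 4}, stages, labels))
```

Note how the second stage's decision depends on what the first stage eliminated; in a parallel topology, by contrast, both classifiers would run independently and only their final outputs would be fused.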
This figure, related to (Q4), presents the three different schemes of classifier combination. It is extracted from my own PhD report.
Designing a multiple classifier system (MCS) requires special care in the choice of individual classifiers so as to achieve higher classification accuracy or obtain more robust algorithms. For instance, a panel of linear classifiers trained on the same dataset will surely converge to very close separating hyperplanes, and combining very similar decision rules is doomed to achieve merely average results. Thus, diversity is a guiding rule when designing an efficient multiple classifier system.
The need for diversity in responses
The notions of diversity and independence are closely related, but the line separating them is still unclear. However, we cannot expect a highly dependent classifier pool to output very diverse responses, since dependent classifiers will almost certainly produce correlated responses. In this section, we discuss ways of measuring diversity as well as methods to induce diversity in classifier responses.
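One simple way to quantify diversity, shown here as a minimal sketch, is the pairwise disagreement measure: the fraction of samples on which two classifiers output different labels (this is one of several diversity measures discussed in the ensemble literature; the prediction vectors below are invented):

```python
def disagreement(preds_a, preds_b):
    """Pairwise disagreement: fraction of samples where two classifiers
    output different labels (0 = identical behavior, 1 = always differ)."""
    assert len(preds_a) == len(preds_b)
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Predictions of two hypothetical classifiers on five test samples.
c1 = [1, 0, 1, 1, 0]
c2 = [1, 1, 1, 0, 0]
print(disagreement(c1, c2))  # -> 0.4
```

A pool average of such pairwise scores gives a rough picture of how much the ensemble members complement each other, although high disagreement alone does not guarantee that the combination will be accurate.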
Brown et al. advocate that diversity in a classifier ensemble can be induced in two ways. The first is explicit: it optimizes a diversity metric over a pool of classifiers. The second is implicit: it generates classifiers using different mechanisms in the hope of obtaining significant independence in their responses. This is usually done by training them on different learning data samples or in different regions of the feature space.
Implicit diversity induction schemes
The second category of diversity induction involves some implicit approaches based on the manipulation of either the input data, the output decisions, or the types of classifiers.
In their survey, Wozniak et al. formulated the following taxonomy for data manipulation:
• Partitioning the data points: this allows training individual classifiers on different training sets. Even though the learning samples are somewhat correlated, a certain level of diversity in classifier responses is preserved by using different training data, which may be obtained by splitting or by randomly sampling the learning samples, as in bagging.
• Selecting subsets of features: base classifiers are again trained on different datasets, in the sense that they have access to different pieces of information for each training example. The decision functions thus have different domains, and if the features are weakly correlated, we achieve diversity. A popular strategy in this vein is to randomly select the subset of features used to train each base classifier; this is known as the random subspaces method. Another strategy consists in selecting the classifier with the best performance for each partition of the feature space.
• Modifying classifier outputs: an MCS can enjoy another form of diversity by asking each classifier to discriminate only a subset of classes. To recover decisions at the granularity of the whole set of classes, a specific information fusion strategy must be designed. A famous approach in this category is the one-vs-all scheme, in which the multi-class classification problem is decomposed into binary classification problems: each classifier chooses between a given label and all remaining labels. The error-correcting output codes (ECOC) method also belongs to this family of approaches. In this setting, each label has a codeword (a sequence of binary values), and binary classifiers are built to produce a codeword for each test sample. The chosen label is the one whose codeword has the smallest Hamming distance to the test sample's codeword.
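The ECOC decoding step in the last bullet can be sketched directly. The three-class codebook and the bit pattern below are made up for illustration; in a real system each column of the codebook defines one binary classification problem:

```python
def hamming(a, b):
    """Number of positions at which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(predicted_codeword, codebook):
    """Error-correcting output codes: each class owns a binary codeword;
    a test sample is assigned the class whose codeword is closest in
    Hamming distance to the bits produced by the binary classifiers."""
    return min(codebook,
               key=lambda label: hamming(predicted_codeword, codebook[label]))

# Hypothetical 3-class codebook; each class codeword has 5 bits,
# i.e., 5 binary classifiers each contribute one bit at test time.
codebook = {
    "A": (0, 0, 1, 1, 0),
    "B": (1, 0, 0, 1, 1),
    "C": (0, 1, 1, 0, 1),
}
# Suppose the five binary classifiers output these bits, one of them wrong.
bits = (0, 0, 1, 1, 1)
print(ecoc_decode(bits, codebook))  # -> A
```

The example shows the error-correcting behavior: even though one binary classifier flipped a bit, the decoded label is still correct because "A" remains the nearest codeword.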