We are all familiar with machine learning, right? In simplest terms, the goal of machine learning is to learn the underlying model or function that can reliably produce a desired or known outcome from given data. This can mean many things depending on one’s field of research, including:

- Learning the function that maps a given set of gene expression levels, metabolite abundance, or clinical images to a phenotype
- Learning the function that maps an audio signal to an English sentence
- Learning the function that maps an image to an image subject
- Learning a mapping from a graph’s nodes to a set of modules optimized for a modularity index
- Learning a mapping from an image to a set of regions of interest based on color or saturation gradients

# Assumptions in Machine Learning

While the true underlying model is unknown, we generally make some assumptions about the form it takes. If we don't, the set of possible solutions is effectively uncountably infinite: because model parameters take real values, between any two candidate parameter values there are always infinitely many others. We need to bound the set of possible solutions to make the problem tractable.

Here are some of the common assumptions made:

- The outcome is produced by a linear function of the features in the data (*linear regression*).
- The outcome is produced by multiplying together a set of conditional probability distributions, where the outcome is conditioned on the individual features (*Naive Bayes*).
- The outcome is produced by a series of chained decision boundaries (*decision tree*).
- The outcome is produced by a series of stacked perceptrons on the data features, with a fixed number of perceptrons at each layer and a fixed number of layers (*neural network*).
- The outcome is produced using a series of mean clustering iterations on the input space (*k-means clustering*).

When we make these assumptions, there is always the possibility that they could be wrong. Yes, this is even true of neural networks, the so-called **universal function approximators**. While a single neural network can approximate a wide range of functions after parameter tuning, it may still miss the mark; for instance, neural networks trained to classify gender from eye images have been found to model the presence or absence of eye makeup [1], and it is also possible to change the outcome of a neural network for image identification by changing a single pixel [2].

# Motivation for Ensembles

A way to mitigate this is to learn a combination of these models, i.e. an **ensemble**. This ensemble may include multiple types of models (for instance, linear regression and k-means clustering), or it may include combinations of the same type of model with different parameterizations (an example of this is a *random forest*, which is a combination of many decision trees). As an analogy, think of an orchestra: each musician's part represents a single model, and the true model is only realized when all parts are played together.

Concretely, though, you may wonder why an ensemble is an improvement over a single model. As long as the model is complex enough, shouldn’t that be sufficient? There are two references I would like to recommend that nicely describe the utility of ensemble models, by Dietterich [3] and Polikar [4], respectively. I will summarize a few reasons why ensemble methods are useful below:

- **Robustness to sample size.** With small sample sizes, it is difficult to learn all parameters of a complex model with high accuracy. Rather, given different initializations, different models of the same class will learn different parameterizations, all roughly equally accurate. Combining these models may yield a more accurate prediction.
- **Robustness to local minima.** Some models have a tendency to become caught in local minima. Ensemble methods can mitigate this by averaging across models that have settled in different local minima.
- **Ability to expand the space of possible solutions.** Again, a single model is limited to a solution within a given parameter space. If the true solution is not within this space, combining models can expand the set of possible solutions to include a parameterization closer to the optimal model.
- **Ability to combine multiple data sources.** Ensembles can be used as a way of fusing multiple data sources. In this case, a separate model is often learned for each source, and these models are then combined to obtain a final model.

# Types of Ensemble Methods

Here, I describe the methods for combining models in an ensemble and delve into some basic ensemble frameworks.

## Methods for Combining Models

Classification models are often combined using **majority voting**. In this framework, the class predicted by the majority of models in the ensemble is reported as the true class. This is reasonable if all models are equally accurate predictors, but that is not always the case. A variant of this method is **weighted majority voting**, in which the *weight* of each model's output toward the final output is determined by its accuracy on the training data. Other common methods include algebraic combination of outputs, selection of a single "best" model for each input datum, and combination using fuzzy logic [5].
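To make the voting schemes concrete, here is a minimal sketch of weighted majority voting in pure Python. The function name and the use of training accuracies as weights are illustrative choices, not a fixed API; setting all weights equal recovers plain majority voting.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Combine class predictions, weighting each model's vote.

    predictions: one predicted class label per model.
    weights: one non-negative weight per model (e.g. training accuracy).
    """
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    # Report the class with the highest total vote weight.
    return max(scores, key=scores.get)

# Two accurate models outvote one weaker model.
print(weighted_majority_vote(["cat", "dog", "cat"], [0.9, 0.5, 0.8]))  # cat
```

Note that with weights, a single highly accurate model can override several weak ones, which plain majority voting cannot do.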

## Basic Ensemble Frameworks / Techniques

### Bagging [6]

In bagging, the aim is to have the models in the ensemble learn diverse parameters by training each model on a random subset of the data. This is useful because, if each model learns different parameters, the chance that the models' errors are correlated is reduced. This allows each model to provide a useful contribution to the outcome.
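A minimal sketch of the resampling step, assuming the classic bootstrap (sampling with replacement, sample size equal to the data size); the function names and the `train_fn` callback are hypothetical placeholders for any base learner's training routine.

```python
import random

def bootstrap_sample(data, labels, rng):
    """Draw a sample of the same size as the data, with replacement."""
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]
    return [data[i] for i in idx], [labels[i] for i in idx]

def bagging_fit(data, labels, train_fn, n_models=10, seed=0):
    """Train n_models copies of a base learner, one per bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        X, y = bootstrap_sample(data, labels, rng)
        models.append(train_fn(X, y))
    return models
```

Because each bootstrap sample omits roughly a third of the data on average, the trained models end up seeing genuinely different training sets.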

### Addition of Gaussian Noise [7]

Another approach toward the same goal is the addition of Gaussian noise to each model's training data. Like bagging, the addition of noise helps to diversify the set of parameters learned by each model. It can be used in conjunction with bagging, but that need not be the case.
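A minimal sketch of the noise-injection step, assuming zero-mean Gaussian noise added to the input features; the function name and the choice of `sigma` are illustrative.

```python
import random

def add_gaussian_noise(features, sigma, rng):
    """Return a copy of the feature vectors with N(0, sigma^2) noise added."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in features]

# Build three noisy variants of a tiny training set, one per ensemble member.
rng = random.Random(0)
train = [[1.0, 2.0], [3.0, 4.0]]
noisy_sets = [add_gaussian_noise(train, 0.1, rng) for _ in range(3)]
```

Each ensemble member is then trained on its own perturbed copy, so no two members see exactly the same data.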

### Boosting [8]

Boosting can be thought of as a "smarter" variant of bagging. Rather than sampling randomly, each model's training samples are chosen using the predictions of the previously trained models. Samples on which the previous models disagree are used to train the current model. In the AdaBoost variant, samples that were mispredicted by the previous models are given more weight when training the current model. Thus, the last models to be trained handle the most difficult classification cases.
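The AdaBoost reweighting step can be sketched as follows. This is one round of the standard update, assuming the weak learner's weighted error lies strictly between 0 and 1; the function name is hypothetical.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: upweight misclassified samples.

    weights: current sample weights (non-negative, summing to 1).
    correct: booleans, True where the current weak learner was right.
    Returns the learner's vote weight alpha and the renormalized weights.
    Assumes 0 < weighted error < 1.
    """
    error = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - error) / error)  # the learner's say in the final vote
    new = [w * math.exp(alpha if not c else -alpha)
           for w, c in zip(weights, correct)]
    total = sum(new)
    return alpha, [w / total for w in new]
```

After the update, the misclassified samples carry a larger share of the total weight, so the next weak learner focuses on them.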

### Feature Manipulation [9][10]

In feature manipulation, it is not the samples which are divided when training the models, but the features. Each model is trained using a subset of the input features, after which the models are combined. Again, the goal is to diversify the set of models so that their errors are uncorrelated. Random forests often use this technique. Beyond simply subsetting the features, some techniques involve projecting the features onto novel vector spaces, such as an eigenspace.
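A minimal sketch of random feature subsetting, assuming each ensemble member simply sees a random, fixed-size subset of the feature columns; the function names are illustrative.

```python
import random

def random_feature_subsets(n_features, subset_size, n_models, seed=0):
    """Choose a random feature subset for each model in the ensemble."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_models)]

def project(row, subset):
    """Keep only the chosen features of one sample."""
    return [row[j] for j in subset]
```

At prediction time, each model is fed only its own feature subset, and the per-model outputs are then combined with any of the voting schemes above.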

### Error-Correcting Output Codes [11]

This can be useful for data with many classes, and it relies on the ability to decompose the set of output classes into uncorrelated bits (which can be a difficult task). Here, the same data are input into each model, but each model is a binary classifier focused on predicting just one output bit. Assuming that the bits are uncorrelated, the models' errors should be as well. Once each bit is predicted, a nearest-neighbor lookup over the codewords determines the class associated with that set of bits.
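The decoding step can be sketched as below. The 5-bit codebook is a made-up example chosen so that any two codewords differ in at least three bits, which lets nearest-neighbor (Hamming distance) decoding correct any single bit predicted wrongly.

```python
# Hypothetical codewords for four classes; pairwise Hamming distance >= 3.
CODEBOOK = {
    "a": (0, 0, 0, 0, 0),
    "b": (0, 1, 1, 1, 0),
    "c": (1, 0, 1, 0, 1),
    "d": (1, 1, 0, 1, 1),
}

def decode(bits):
    """Map the predicted bit vector to the class with the nearest codeword."""
    def hamming(codeword):
        return sum(b != p for b, p in zip(codeword, bits))
    return min(CODEBOOK, key=lambda cls: hamming(CODEBOOK[cls]))
```

Each of the five bit positions corresponds to one binary classifier in the ensemble; `decode` is applied to their five outputs.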

### Random Weight Initialization [12]

For models that are not robust to weight initialization, the use of random weights to initialize each model can diversify the parameter space. This is often used in combination with one of the other techniques described above.
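A minimal sketch of the idea: give each ensemble member its own random seed, and hence its own starting point in parameter space. The interval `[-0.1, 0.1]` is an arbitrary illustrative choice.

```python
import random

def init_weights(n_weights, seed):
    """Draw a small random initial weight vector for one ensemble member."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n_weights)]

# Five members, five seeds, five different starting points.
inits = [init_weights(4, seed) for seed in range(5)]
```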

### Stacking [13]

This looks like a neural network but, unless each model is a perceptron, it isn't. Here, rather than using sampling techniques to diversify the parameter space, a second layer of decision models learns directly how the outputs of the first-layer models should be combined. Note that, like a neural network, this technique increases the number of parameters to learn and may not be a good choice for small sample sizes.
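A minimal sketch of the two-layer structure, with toy callables standing in for trained models; the function names and the averaging meta-model are illustrative (in practice the meta-model is itself trained on the base models' outputs).

```python
def stack_features(models, x):
    """First layer: each base model's prediction becomes one meta-feature."""
    return [m(x) for m in models]

def stacked_predict(models, meta_model, x):
    """Second layer: the meta-model combines the base predictions."""
    return meta_model(stack_features(models, x))

# Toy base models and a meta-model that simply averages their outputs.
base = [lambda x: x + 1, lambda x: x - 1, lambda x: 2 * x]
meta = lambda feats: sum(feats) / len(feats)
print(stacked_predict(base, meta, 3))  # 4.0
```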

### Mixture of Experts [14]

A mixture of experts model is similar to a stacked ensemble, but instead of learning a layer of additional models on top of the first layer, it learns a gating network over the model outputs. It is called a mixture of experts because, for a given input, the set of "experts" that can best classify that input varies. In this way, it is similar in principle to boosting, but the mixture of experts uses only the outputs of the subset of relevant models. Notably, the experts may be, but need not be, different classes of models; the same is true of the other frameworks described above.
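A minimal sketch of the gating idea, assuming a softmax gate that turns input-dependent scores into expert weights; the toy experts and gate below are illustrative, and in practice both the experts and the gate are learned jointly.

```python
import math

def softmax(scores):
    """Turn gating scores into weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_predict(experts, gate, x):
    """Weight each expert's output by the (input-dependent) gate weight."""
    weights = softmax(gate(x))
    return sum(w * e(x) for w, e in zip(weights, experts))

# Two toy experts; the gate routes positive inputs to the first expert.
experts = [lambda x: 1.0, lambda x: -1.0]
gate = lambda x: [x, -x]
```

For strongly positive `x` the gate puts nearly all weight on the first expert, so the prediction approaches 1.0; for strongly negative `x` it approaches -1.0.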

### Composite Classifier [15]

In a composite classifier, the region of the data space that cannot be reliably classified by the first model is fed into a second model for classification. This is similar in spirit to boosting, but the unclassifiable region is estimated directly during training of the first model. This was a very early type of ensemble method, first described in 1979, consisting of a linear classifier (a type of proto-SVM) whose ambiguous region was handed off to a KNN model. In theory, other types of models could be used.
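The hand-off logic can be sketched as below, assuming a 1-D linear score with a fixed-width ambiguous band around the decision boundary; the names and the threshold rule are illustrative, not the original 1979 formulation.

```python
def composite_predict(x, linear_score, threshold, knn_predict):
    """Use the linear classifier when it is confident; otherwise fall back
    to the nearest-neighbor classifier for the ambiguous region."""
    s = linear_score(x)
    if abs(s) >= threshold:          # confidently on one side of the boundary
        return 1 if s > 0 else -1
    return knn_predict(x)            # ambiguous band: defer to KNN

# Toy example: the score is the input itself, the ambiguous band is |x| < 1,
# and a stand-in KNN always answers +1.
score = lambda x: x
knn = lambda x: 1
```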

### Dynamic Classifier Selection [16]

In this framework, a single model is used to predict each datum based on its similarity to the training data. This works as follows: each training datum is classified using each of the models, and the best model for that datum is recorded. Then, each datum in the testing set is mapped to its nearest neighbor in the training set, and the best predictor for that neighbor is used to predict the datum. Like the mixture of experts or the stacked framework, this technique assumes that not all models are useful for all data. However, it is more stringent in that it allows only a single model to be used for each datum.
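The selection step can be sketched as below, using 1-D inputs for simplicity; the names and the toy models are illustrative, and `best_model_per_point` stands for the per-training-point best-model record built during training.

```python
def nearest_index(x, train_points):
    """Index of the training point closest to x (1-D for simplicity)."""
    return min(range(len(train_points)),
               key=lambda i: abs(train_points[i] - x))

def dcs_predict(x, train_points, best_model_per_point, models):
    """Predict x with the model that was best on x's nearest training point."""
    i = nearest_index(x, train_points)
    return models[best_model_per_point[i]](x)

# Toy setup: model 0 was best near 0.0, model 1 was best near 10.0.
train_points = [0.0, 10.0]
best = [0, 1]
models = [lambda x: "low", lambda x: "high"]
```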

# Conclusion

Ensemble models provide multiple advantages over single models for predicting outcome, but a necessary condition for high performance is that the models in the ensemble not make the same mistakes at the same time. There are multiple techniques described above which work to achieve this goal in different ways. Note that the techniques described above represent basic classes of ensemble frameworks (as per the dates cited below) and are not necessarily an exhaustive list.

# References

1. Kuehlkamp, A. et al. (2017) Gender-From-Iris or Gender-From-Mascara? In IEEE Winter Conference on Applications of Computer Vision.
2. Su, J. et al. (2019) One Pixel Attack for Fooling Deep Neural Networks. IEEE Transactions on Evolutionary Computation.
3. Dietterich, T.G. (2000) Ensemble Methods in Machine Learning. Int. Work. Mult. Classif. Syst., 1857, 1–15.
4. Polikar, R. (2009) Ensemble learning. Scholarpedia, 4, 2776.
5. Scherer, R. (2010) Designing boosting ensemble of relational fuzzy systems. Int. J. Neural Syst., 20, 381–8.
6. Breiman, L. (1994) Bagging Predictors. Berkeley, California.
7. Raviv, Y. and Intrator, N. (1996) Bootstrapping with Noise: An Effective Regularization Technique. Conn. Sci., 8, 355–372.
8. Freund, Y. and Schapire, R.E. (1999) A Short Introduction to Boosting.
9. Cherkauer, K.J. (1996) Human Expert-Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks. AAAI Press.
10. Tumer, K. and Ghosh, J. (1996) Error Correlation and Error Reduction in Ensemble Classifiers. Conn. Sci., 8, 385–404.
11. Dietterich, T.G. and Bakiri, G. (1995) Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2, 263–286.
12. Kolen, J.F. and Pollack, J.B. (1990) Back Propagation is Sensitive to Initial Conditions. Complex Syst., 4.
13. Wolpert, D.H. (1992) Stacked Generalization. Los Alamos, NM.
14. Jacobs, R.A. et al. (1991) Adaptive Mixtures of Local Experts. Neural Comput., 3, 79–87.
15. Dasarathy, B.V. and Sheela, B.V. (1979) A composite classifier system design: Concepts and methodology. Proc. IEEE, 67, 708–713.
16. Giacinto, G. and Roli, F. (1999) Methods for dynamic classifier selection. In Proceedings – International Conference on Image Analysis and Processing (ICIAP 1999), pp. 659.