Title: Cost-sensitive Boosting, Margin Maximization & Information Theory
Abstract: Boosting refers to a family of learning algorithms that additively combine multiple weak predictive models to construct a powerful one. Each newly added model effectively fits the residuals of the current ensemble. In the first part of the talk, we will provide a unifying perspective on two decades of work on cost-sensitive Boosting algorithms for classification. We analyse the relevant literature using four theoretical frameworks: decision theory, functional gradient descent, margin theory, and probabilistic modelling. We find that only 3 of the more than 15 published variants are consistent with the rules of all four frameworks, and even these require their outputs to be calibrated to achieve consistency. The final recommendation, based on simplicity, flexibility and empirical performance, is to use the original AdaBoost algorithm with a shifted decision threshold and calibrated probability estimates.
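As a rough illustration of that recommendation, the sketch below fits plain AdaBoost, calibrates its probability estimates, and shifts the decision threshold according to assumed misclassification costs. It is a minimal Python/scikit-learn sketch rather than the exact setup from the talk: the synthetic dataset, the cost values c_fp and c_fn, the number of estimators, and the choice of Platt scaling ("sigmoid") for calibration are all illustrative assumptions.

    # Sketch: plain AdaBoost + calibrated probabilities + cost-shifted threshold.
    # Dataset, costs and hyper-parameters are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Original AdaBoost; calibration (Platt scaling here) maps its
    # over-confident scores to usable probability estimates.
    clf = CalibratedClassifierCV(AdaBoostClassifier(n_estimators=200),
                                 method="sigmoid", cv=5)
    clf.fit(X_tr, y_tr)

    # Decision-theoretic threshold shift: predict the positive class when
    # p(y=1|x) > c_fp / (c_fp + c_fn), i.e. when that choice has lower expected cost.
    c_fp, c_fn = 1.0, 5.0          # assumed misclassification costs
    threshold = c_fp / (c_fp + c_fn)
    y_pred = (clf.predict_proba(X_te)[:, 1] > threshold).astype(int)

With equal costs the threshold reduces to the usual 0.5; changing the costs only moves the threshold, so the same fitted model serves every cost setting.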
The poor calibration of the scores produced by Boosting models stems from Boosting's margin maximization property, which pushes the model towards overconfident predictions. In the second part of the talk, we focus on the more positive side of this property: it yields classifiers that generalize well beyond the training sample. We then discuss ongoing work on interpreting margin maximization from an information-theoretic perspective and on connecting it to a general language for expressing both the quality of fit and the effective complexity of a predictive model.
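For reference, the margin that this property refers to can be written as follows; this is the standard margin-theory definition for voted classifiers, and the notation is supplied here for illustration rather than taken from the talk:

    % Boosted combination of T weak learners h_t(x) in {-1,+1} with weights alpha_t:
    %   F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)
    % Normalized margin of a training example (x_i, y_i), with y_i in {-1,+1}:
    \operatorname{margin}(x_i, y_i) = \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\sum_{t=1}^{T} |\alpha_t|} \in [-1, 1]

Continuing to add weak learners pushes these margins towards +1 even after the training error reaches zero, which is what supports generalization and, at the same time, what makes the raw scores overconfident.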