School and Workshop on Classification and Regression Trees
(10 - 26 Mar 2014)


~ Abstracts ~

 

Decision support system for classifying emergency room patients
Hongshik Ahn, Stony Brook University, USA


Acute gastrointestinal bleeding (GIB) is a medical emergency and one of the leading causes of intensive care unit admission from emergency rooms. It necessitates a rapid medical response and urgent intervention. An artificially intelligent system built on current guidelines could aid triage and risk-stratify patients, helping to optimize care and resource utilization while standardizing care by incorporating evidence-based guidelines into classification models. Several classification methods were run and their performances compared: training was performed on a randomly selected subset of a total of 122 patients with acute GIB, and testing on the remaining patients. An online tool with a user-friendly interface has been developed for physicians and nurses. It will be used in future studies, in the hope that this tool or something similar can be adopted for routine use in caring for patients in emergency rooms.
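The evaluation workflow described here can be sketched in R. The data frame gib, its response admit, and the 70/30 split fraction below are hypothetical stand-ins for the study's 122-patient data set, with rpart standing in for one of the candidate classifiers.

    library(rpart)

    set.seed(1)
    n     <- nrow(gib)                         # e.g., 122 patients with acute GIB
    train <- sample(n, size = round(0.7 * n))  # random training subset (fraction assumed)

    fit  <- rpart(admit ~ ., data = gib[train, ], method = "class")
    pred <- predict(fit, newdata = gib[-train, ], type = "class")

    mean(pred != gib$admit[-train])            # test-set misclassification rate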


 

Semiparametric discriminant analysis for mixture populations using Mahalanobis distance
Probal Chaudhuri, Indian Statistical Institute, India


Mahalanobis distance and Fisher's linear discriminant analysis are fundamentally related to each other, and both ideas have been used extensively in the classification literature. Fisher's linear and quadratic discriminant functions are known to possess Bayes risk optimality when the class distributions are Gaussian. However, they may perform poorly when the assumption of Gaussianity is violated and the class boundaries of the different populations have complex geometric shapes. In such situations, many standard nonparametric classifiers also have high misclassification rates when the data dimension is large, owing to the curse of dimensionality that affects such methods. In this talk, we derive an expression for the Bayes classifier when the class distributions are finite mixtures of elliptic distributions. The Bayes classifier turns out to be a function of the Mahalanobis distances of a data point from the different sub-populations within a mixture population. We use this to develop and investigate a semi-parametric classification procedure, named SPARC (SemiPARametric Classification), and demonstrate its performance on simulated and real benchmark data sets.
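As a toy illustration of the central ingredient (not the SPARC procedure itself), one can classify by minimum Mahalanobis distance to each class using R's built-in mahalanobis(), here on the iris data:

    classes <- split(iris[, 1:4], iris$Species)
    params  <- lapply(classes, function(d) list(mu = colMeans(d), S = cov(d)))

    # squared Mahalanobis distance of every point to every class
    d2 <- sapply(params, function(p)
      mahalanobis(iris[, 1:4], center = p$mu, cov = p$S))

    pred <- colnames(d2)[max.col(-d2)]  # assign each point to the nearest class
    mean(pred == iris$Species)          # resubstitution accuracy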


 

Tree-structured mixed-effects regression modeling for longitudinal data
Hyungjun Cho, Korea University, Korea


Tree-structured models have been widely used because they serve as interpretable prediction models that offer easy data visualization. A number of tree algorithms have been developed for univariate response data and can be extended to analyze multivariate response data. We propose a tree algorithm that combines the merits of a tree-based model and a mixed-effects model for longitudinal data. We alleviate variable selection bias through residual analysis, which addresses the problems that exhaustive-search approaches suffer from, such as undue preference for split variables with more possible splits, expensive computational cost, and end-cut preference. Most importantly, whereas other tree algorithms predict responses, our algorithm discovers trends over time on each of the subspaces obtained from recursive partitioning. We investigate the performance of our algorithm with both simulation and real data studies. We also provide a freely available R package, melt.


 

Classification and regression soft trees and beyond
Antonio Ciampi, McGill University, Canada


When constructing a predictor from data, the analyst thinks of predictor variables as being precisely known. Therefore, when the predictor is a tree, the cut-off that decides whether a subject goes left or right at a node is also chosen to be a precise value of a particular predictor variable. Data are rarely precise, however; this is recognized explicitly only in certain approaches to data analysis that treat interval-valued data, e.g., a subject is between 50 and 55 years of age. Clearly a hard cut-off for such data does not make much sense, and approaches have been devised to send subjects right or left with a certain probability. Then, at least for subjects with imprecise data, the 'hard' decision rule at a node may be replaced by a 'soft' decision rule. From this, it is a small step to the idea of building trees with soft nodes. Even for precisely known variables, a soft threshold may actually improve predictive accuracy, if nothing else because it offers a broader class of predictors within which to choose; and this class remains interpretable, if with a little extra effort. In this presentation, I will trace the development of trees with soft nodes, starting from the simple case of a binary response and going towards Soft Classification & Regression Trees and beyond.
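A minimal sketch of a soft node, assuming a logistic gating function (one of several possible choices, not necessarily the talk's): a subject with value x goes left with a probability that changes smoothly around the cut-off c, and letting the scale s shrink to zero recovers the usual hard split.

    # probability of going left at a soft node with cut-off c and softness s
    p_left <- function(x, c, s) plogis((c - x) / s)

    age <- 48:57
    round(p_left(age, c = 52.5, s = 1.5), 2)  # smooth transition near the cut-off
    p_left(age, c = 52.5, s = 1e-8)           # effectively a hard split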


 

Clustering patients into subgroups differing in optimal treatment alternative: the method QUINT
Elise Dusseldorp, Netherlands Organization for Applied Scientific Research, The Netherlands


In the analysis of randomized controlled trials (RCTs), treatment effect heterogeneity often occurs, implying differences across (subgroups of) patients in treatment efficacy. This phenomenon is typically referred to as treatment-subgroup interaction. The identification of subgroups of patients, defined in terms of pretreatment characteristics, that are involved in a treatment-subgroup interaction is a methodologically challenging task, especially when many characteristics are available that may interact with the treatment and when no comprehensive a priori hypotheses on relevant subgroups are available. A special type of treatment-subgroup interaction occurs when the ranking of treatment alternatives in terms of efficacy differs across subgroups of patients (e.g., for one subgroup treatment A is better than B, and for another subgroup treatment B is better than A). These are called qualitative treatment-subgroup interactions and are the most important for optimal treatment assignment. The method QUINT (Qualitative INteraction Trees; Dusseldorp & Van Mechelen, 2014) was recently proposed to induce subgroups involved in such interactions from RCT data. The result of an analysis with QUINT is a binary tree from which treatment assignment criteria can be derived. The method is implemented in the R package quint. In this presentation the method will be explained, along with the use of the package. A real data set will be used as a motivating example.

Dusseldorp E, Van Mechelen I (2014). Qualitative interaction trees: A tool to identify qualitative treatment-subgroup interactions. Statistics in Medicine, 33, 219-237.
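A hedged usage sketch with the quint package: the data frame rct and its variables are placeholders, and the call assumes the package's two-part formula interface (outcome ~ treatment | candidate moderators) with a two-arm treatment.

    library(quint)

    # outcome, two-arm treatment, and candidate moderators are hypothetical
    fit <- quint(outcome ~ treat | age + severity + duration, data = rct)
    summary(fit)  # the subgroups and the better treatment in each
    plot(fit)     # the fitted qualitative interaction tree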


 

Kernel-induced classification tree and its ensemble model
Guangzhe Fan, Dalian University of Technology, China


As a useful complement to kernel learning, a recursive-partitioning procedure using kernel functions is proposed. We call it KICT (kernel-induced classification trees). Essentially, KICT uses observation-based kernel functions to construct split rules for a CART model. The resulting KICT model can classify significantly better than CART in many situations, especially when the pattern of the data is non-linear. KICT, which usually produces small trees, also performs very well in comparison to random forests and SVM in these situations. To further improve KICT, we also introduce KIRF (kernel-induced random forests). KIRF is competitive with random forests and SVM in many situations. We use simulated and real-world data to illustrate their performance. We conclude that the proposed methods are useful alternatives and complements to CART, random forests, and SVM: they combine a rich class of kernel functions with the recursive-partitioning learning procedure (KICT) and with the ensemble random forests procedure (KIRF), and they are flexible enough to handle a variety of non-traditional data types with properly defined kernels.

KEY WORDS: classification tree, feature space, kernel function, random forest, split rule, support vector machine
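The core construction can be sketched as follows: map each observation to the vector of kernel evaluations K(x, x_i) against the training points, then grow an ordinary CART tree on those features. The RBF kernel and its bandwidth sigma below are illustrative choices, not necessarily those of the paper.

    library(rpart)

    # RBF kernel matrix between row sets X and Z (bandwidth sigma assumed)
    rbf <- function(X, Z, sigma = 1) {
      d2 <- as.matrix(dist(rbind(X, Z)))^2
      exp(-d2[seq_len(nrow(X)), nrow(X) + seq_len(nrow(Z))] / (2 * sigma^2))
    }

    X <- as.matrix(iris[, 1:4])
    K <- as.data.frame(rbf(X, X))  # n kernel-induced features K(x, x_i)
    K$Species <- iris$Species

    fit <- rpart(Species ~ ., data = K, method = "class")  # CART on kernel features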


 

Regression tree methods for subgroup identification I
Xu He, Chinese Academy of Sciences, China


In the search for new drugs to treat diseases such as cancer, it is very hard to find one that has a large beneficial effect for all subjects. A focus in recent years has therefore been to identify subgroups of the subject population in which a treatment has a significant above-average effect. Because a regression tree model naturally partitions the data into subgroups, it is an ideal solution to this problem. We review several previous attempts, including the Virtual Twins method (Foster et al., 2011) and the SIDES method (Lipkovich et al., 2011). Then we introduce some new methods based on the GUIDE algorithm.
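A condensed sketch of the Virtual Twins idea mentioned above: predict each subject's outcome under both arms with a random forest, then grow a tree on the estimated individual effects. The data frame rct, with binary outcome y (0/1), treatment trt (0/1), and covariates, is a placeholder.

    library(randomForest)
    library(rpart)

    rf <- randomForest(factor(y) ~ ., data = rct)  # trt enters as a predictor

    d1 <- rct; d1$trt <- 1                         # everyone treated
    d0 <- rct; d0$trt <- 0                         # everyone control
    z  <- predict(rf, d1, type = "prob")[, "1"] -
          predict(rf, d0, type = "prob")[, "1"]    # estimated individual benefit

    subgrp <- rpart(z ~ . - y - trt, data = rct)   # tree on the effect estimates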


 

Constant partying: Introducing a generic toolkit for recursive partitioning in R
Torsten Hothorn, University of Zurich, Switzerland


Recursive partitioning methods, or simply "trees", are simple yet powerful methods for capturing regression relationships. Hence, many different algorithms have been suggested in both the statistics and machine learning communities, and many standard algorithms are available as R packages, e.g., in rpart, RWeka, party, and many others. However, no common infrastructure is available for representing trees fitted by different packages. Consequently, the capabilities for extracting information - such as predictions, printed summaries, or visualizations - vary between packages and come with somewhat different user interfaces. Similarly, implementations of new tree models may also require new infrastructure, e.g., for multi-way splits or more complex models in the leaves. To overcome these difficulties, the partykit package offers a unified representation of tree objects along with predict(), print(), and plot() methods. Trees are represented through a new flexible class "party" which can, in principle, capture all trees mentioned above. Moreover, a flexible subclass "constparty" is provided for recursive partitions with constant fits (e.g., means or proportions) in the terminal nodes. This greatly facilitates the representation of different types of trees in a common framework, e.g., trees learned by rpart, J48, evtree, or chaid, or read from an XML specification in the Predictive Model Markup Language (PMML).
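For example, a tree fitted by rpart can be coerced into the unified representation and then handled with the common methods:

    library(rpart)
    library(partykit)

    fit <- rpart(Species ~ ., data = iris)
    pt  <- as.party(fit)   # a "constparty" object

    print(pt)
    plot(pt)
    predict(pt, head(iris), type = "response")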


 

SPARC - SemiPARametric Classification
Shuddhodan Kadattur Vasudevan, Tata Institute of Fundamental Research, India


More than seventy-five years ago, R. A. Fisher and P. C. Mahalanobis published their classic papers in the Annals of Eugenics (1936) and the Proceedings of the National Institute of Sciences of India (1936), respectively. It turns out that Mahalanobis' distance and Fisher's linear discriminant analysis are fundamentally related to each other, and both ideas have been used extensively in the statistics literature. However, Fisher's linear and quadratic discriminant functions may have poor performance when the assumption of Gaussianity is violated and the class boundaries of the different populations have complex geometric shapes. In such situations, several standard nonparametric classifiers also have high misclassification rates when the data dimension is large, due to the curse of dimensionality that affects such methods. In this talk, we derive an expression for the Bayes classifier when the class distributions are finite mixtures of elliptic distributions. The class posterior probabilities satisfy a generalized additive model, and the Bayes classifier turns out to be a simple function of the Mahalanobis distances of a data point from the different sub-populations within a mixture population. We use this idea to develop a semi-parametric classification procedure, and demonstrate its performance on simulated and real benchmark data sets.

(This talk is based on joint work with Prof. Probal Chaudhuri.)


 

Weight-adjusted pruning for classification ensemble
Hyunjoong Kim, Yonsei University, Korea


The concept of combining base classifiers has been proposed to improve on the classification accuracy of individual classifiers. An ensemble classifier combines the predictions from multiple base classifiers through a majority or weighted vote to produce a final decision. We can build a better classification model by combining many weak classifiers, where a weak classifier is one that performs only slightly better than random guessing. Typically, classification trees with a few splits are used as the base classifiers.

In this study, an ensemble pruning method for classification is discussed. Ensemble pruning is beneficial because it allows faster classification, requires less memory for storage, and can improve the accuracy of the ensemble classifier. We propose an ensemble pruning method using a performance measure that is proportional to the degree of difficulty of the training instances. In experiments on 28 real and simulated data sets, we show that the proposed pruned ensembles perform statistically better than existing ensemble methods. Via bias-variance decomposition, we also confirm that the proposed method reduces the classification bias while maintaining low variance.
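A generic sketch of accuracy-ordered ensemble pruning (using plain validation accuracy, not the authors' weight-adjusted measure): grow bagged shallow trees, rank them, and keep only the best few for the majority vote.

    library(rpart)

    set.seed(1)
    idx   <- sample(nrow(iris), 100)
    train <- iris[idx, ]; valid <- iris[-idx, ]

    # 25 weak base classifiers: shallow trees on bootstrap samples
    base <- lapply(1:25, function(b) {
      boot <- train[sample(nrow(train), replace = TRUE), ]
      rpart(Species ~ ., data = boot, maxdepth = 2)
    })

    acc  <- sapply(base, function(f)
      mean(predict(f, valid, type = "class") == valid$Species))
    kept <- base[order(acc, decreasing = TRUE)[1:10]]  # the pruned ensemble

    votes <- sapply(kept, function(f)
      as.character(predict(f, valid, type = "class")))
    final <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote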


 

Mixed effects trees and forests for clustered data
Denis Larocque, HEC Montréal, Canada


In this talk, we will present extensions of tree-based and random forest methods to the case of clustered data. The proposed methods can handle unbalanced clusters, allow observations within clusters to be split, and can incorporate random effects and observation-level covariates. The basic tree-building algorithm for a continuous outcome is implemented using standard algorithms within the framework of the EM algorithm. The extension to other types of outcomes (e.g., binary, count) uses the penalized quasi-likelihood (PQL) method for estimation and the EM algorithm for computation. Simulation results show that the proposed methods provide substantial improvements over standard trees and forests when the random effects are non-negligible. The use of the methods will be illustrated with real data sets.
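A stripped-down sketch of this kind of EM-type alternation, for a continuous outcome with a random intercept only (the data frame df, covariates x1 and x2, and grouping factor cluster are placeholders, and the actual algorithm differs in its details): fit a tree to the outcome minus the current random effects, then re-estimate the random effects given the tree's node means.

    library(rpart)
    library(nlme)

    b <- setNames(rep(0, nlevels(df$cluster)), levels(df$cluster))
    for (it in 1:10) {
      df$ystar <- df$y - b[df$cluster]       # subtract current random effects
      tree     <- rpart(ystar ~ x1 + x2, data = df)
      df$node  <- factor(predict(tree, df))  # terminal-node means as the fixed part
      # assumes the tree finds at least one split, so 'node' has >= 2 levels
      lmm <- lme(y ~ node, random = ~ 1 | cluster, data = df)
      b   <- setNames(ranef(lmm)[, 1], rownames(ranef(lmm)))
    }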

Authors: Denis Larocque (1), Ahlem Hajjem (2) and François Bellavance (1)

(1) Department of Management Sciences, HEC Montréal, Montréal, Canada
(2) Department of Marketing, Université du Québec à Montréal (UQAM), Montréal, Canada.


 

Regression tree methods for subgroup identification II
Wei-Yin Loh, University of Wisconsin-Madison, USA


This is a continuation of the talk by Xu He. In this part, the subgroup identification methods are extended to censored response variables, and bootstrap confidence intervals are obtained for the treatment effects in the nodes of the tree.
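To illustrate the flavor of the interval construction, a naive percentile bootstrap for the treatment effect in one terminal node might look as follows; the talk's method additionally corrects for the selection effect of having chosen the node from the data. The objects rct, node, y, and trt are placeholders.

    node_dat <- subset(rct, node == 3)  # patients falling in one terminal node

    eff <- replicate(2000, {
      b <- node_dat[sample(nrow(node_dat), replace = TRUE), ]
      mean(b$y[b$trt == 1]) - mean(b$y[b$trt == 0])  # treatment effect
    })
    quantile(eff, c(0.025, 0.975))      # naive percentile interval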


 

Classification and regression trees
Wei-Yin Loh, University of Wisconsin-Madison, USA


Outline of lectures for school on Classification and Regression Trees

Day 1: Grand tour of examples and introduction to decision tree methods

(a) Linear regression: impact of air pollution on house prices.

(b) Poisson regression: defects in soldering circuit boards.

(c) Logistic regression: tree damage after a storm.

(d) Multi-response data: interactions of variables in production of concrete.

(e) Longitudinal data: hourly wages of high-school dropouts.

(f) Censored data and differential treatment effects: breast cancer survival.

(g) Simple classification: classification of flower species.

(h) Unequal misclassification costs: attitudes towards mammography.

(i) Unbalanced classes: characterizing dissatisfied credit card holders.

---

Day 2: Classification tree algorithms

(a) THAID (Messenger and Mandell, 1972), CART (Breiman et al., 1984), and RPART (Therneau and Atkinson, 1997).

(b) FACT (Loh and Vanichsetakul, 1988), QUEST (Loh and Shih, 1997), CRUISE (Kim and Loh, 2001), and GUIDE (Loh, 2009).

(c) C4.5 (Quinlan, 1993), CHAID (Kass, 1980), and CTREE (Hothorn et al., 2006).

(d) Selection bias, accuracy, speed, and tree complexity of algorithms.

(e) GUIDE classification tree software demonstration.

---

Day 3: Basic regression tree algorithms

(a) AID (Morgan and Sonquist, 1963), CART, RPART, and GUIDE piecewise constant models.

(b) GUIDE piecewise linear regression trees (Loh, 2002).

(c) Software demonstration.

---

Day 4: Advanced regression tree algorithms

(a) Quantile, Poisson, logistic, and proportional hazards regression for censored data.

(b) Regression trees for multivariate and longitudinal response data (Loh and Zheng, 2013).

(c) Subgroup identification for differential treatment effects.

(d) Software demonstration.

---

References:

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Chapman & Hall/CRC.

Chan, K.-Y. and Loh, W.-Y. (2004). LOTUS: An algorithm for building accurate and comprehensible logistic regression trees. Journal of Computational and Graphical Statistics, 13, 826-852.

Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15, 651-674.

Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29, 119-127.

Kim, H. and Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96, 589-604.

Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.

Loh, W.-Y. (2006). Logistic regression tree analysis. Handbook of Engineering Statistics, Springer, 537-549.

Loh, W.-Y. (2009). Improving the precision of classification trees. Annals of Applied Statistics, 3, 1710-1737.

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 14-23.

Loh, W.-Y. and Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815-840.

Loh, W.-Y. and Vanichsetakul, N. (1988). Tree-structured classification via generalized discriminant analysis (with discussion). Journal of the American Statistical Association, 83, 715-728.

Loh, W.-Y. and Zheng, W. (2013). Regression trees for longitudinal and multiresponse data. Annals of Applied Statistics, 7, 495-522.

Messenger, R. and Mandell, L. (1972). A modal search technique for predictive nominal scale multivariate analysis. Journal of the American Statistical Association, 67, 768-772.

Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58, 415-434.

Quinlan, J. R. (1992). Learning with continuous classes. In 5th Australian Joint Conference on Artificial Intelligence, 343-348.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Therneau, T. and Atkinson, E. (1997). An introduction to recursive partitioning using the RPART routines. Technical Report 61, Mayo Clinic, Division of Biostatistics.

Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, third edition.

Zeileis, A., Hothorn, T., and Hornik, K. (2008). Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17, 492-514.


 

PartDSA for deriving survival risk groups and for ensemble learning
Annette Molinaro, University of California, USA


We recently developed partDSA, a multivariate method that, like CART, utilizes loss functions to select and partition predictor variables to build a tree-like regression model for a given outcome. However, unlike CART, partDSA permits both 'and' and 'or' conjunctions of predictors, elucidating interactions between variables as well as their independent contributions. partDSA thus permits tremendous flexibility in the construction of predictive models and has been shown to outperform CART in both prediction accuracy and stability. As the resulting models continue to take the form of a decision tree, partDSA also provides an ideal foundation for developing a clinician-friendly tool for accurate risk prediction and stratification.

With right-censored outcomes, partDSA currently builds estimators via either the inverse probability of censoring weighted (IPCW) or Brier score weighting schemes; see Lostritto, Strawderman and Molinaro (2012), where numerous simulations show that both proposed adaptations of partDSA perform as well as, and often considerably better than, two competing tree-based methods. In this talk, various useful extensions of partDSA for right-censored outcomes are described, and we show the power of the partDSA algorithm in deriving survival risk groups for glioma patients based on genomic markers. Another interesting extension of partDSA is as an aggregate learner. A comparison will be made of standard partDSA to an ensemble version of partDSA as well as to alternative ensemble learners.
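A small sketch of the IPCW ingredient, assuming the usual reverse Kaplan-Meier estimate of the censoring survivor function G: uncensored subjects receive weight 1/G(T_i) and censored subjects weight 0. The data frame dat, with the usual time and status (1 = event) variables, is a placeholder.

    library(survival)

    # reverse Kaplan-Meier: treat censoring as the "event"
    cens_fit <- survfit(Surv(time, 1 - status) ~ 1, data = dat)
    G <- stepfun(cens_fit$time, c(1, cens_fit$surv))  # estimated G(t)

    w <- ifelse(dat$status == 1, 1 / G(dat$time), 0)  # IPCW case weights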


 

Tree depth in a forest
Mark R. Segal, University of California, USA


All approaches for generating individual tree-structured predictors necessarily dedicate nontrivial effort to determining appropriate tree size. For example, the CART paradigm of Breiman et al. introduced the cost-complexity pruning strategy coupled with cross-validation-based size selection. However, in advancing his ensemble methods, bagging and random forests, Breiman universally preset individual tree size so as to yield (near) maximal-depth trees. Why this disconnect? And does it matter? To address the first question we revisit the predictive performance of random forests as evaluated on benchmark datasets and reveal some interesting and problematic aspects of the attendant repository. To address the second question we showcase some theoretical results based on adaptive potential nearest neighbours and provide an illustrative data analysis pertaining to splice site recognition.
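The depth question is easy to probe empirically; for instance, with the randomForest package one can compare a default (near maximal-depth) forest against one whose trees are deliberately restricted, using the out-of-bag error as the yardstick:

    library(randomForest)

    set.seed(1)
    deep    <- randomForest(Species ~ ., data = iris)                # full-depth trees
    shallow <- randomForest(Species ~ ., data = iris, maxnodes = 4)  # restricted trees

    deep$err.rate[deep$ntree, "OOB"]        # out-of-bag error, deep forest
    shallow$err.rate[shallow$ntree, "OOB"]  # out-of-bag error, shallow forest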


 

Rank-based decision tree for paired comparison data
Yu-Shan Shih, National Chung-Cheng University, Taiwan


Paired comparison outcomes are collected by comparing objects in pairs. A common approach to analyzing such data is to find the preference patterns of the objects based on some covariates. In this study, a new decision tree method for analyzing paired comparison data is proposed. A scoring system is implemented, and the total score associated with each object is counted for each subject. The GUIDE multi-response regression tree is then applied to the resulting scores, and the average ranks of the objects give the preference ordering in each terminal node. A simulation study and a real data analysis are given to demonstrate its usefulness.
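The scoring step can be sketched in a few lines: from each subject's pairwise choices, count how often each object wins, producing the per-subject score vector that the multi-response tree then models. The tiny pc data frame is made up for illustration.

    # one row per comparison: which object the subject preferred
    pc <- data.frame(subject = c(1, 1, 1, 2, 2, 2),
                     winner  = c("A", "A", "B", "C", "C", "B"),
                     loser   = c("B", "C", "C", "A", "B", "A"))

    scores <- with(pc, table(subject, winner))
    scores  # per-subject win counts for objects A, B, C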


 

Regression trees for longitudinal and clustered data based on mixed effects models: methods, applications, and extensions
Jeffrey Simonoff, New York University, USA


Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved), represent a similar type of situation. Methodologies that take this structure into account allow for the possibility of systematic differences between objects that are not related to attributes, and of autocorrelation within objects across time periods. This talk discusses work related to tree-based methods designed for this situation. After describing methods based on tree construction for multiple response data, we focus on a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods through the use of an EM-type algorithm. The resulting estimation method, called the RE-EM tree, is less sensitive to parametric assumptions and can provide improved predictive power compared to linear models with random effects and to regression trees without random effects.

We then describe how the RE-EM tree can be used as a diagnostic check when applying the most commonly used model for longitudinal and clustered data, the linear multilevel model (which combines a linear model for the population-level fixed effects, a linear model for normally distributed individual-level random effects, and normally distributed observation-level errors with constant variance). Application of the RE-EM tree to (functions of) the residuals from a linear multilevel model fit can be used to construct goodness-of-fit tests for nonlinearity of the fixed effects or heteroscedasticity of the errors. The resultant tests have good power to identify explainable model violations (that is, ones that are related to available covariate information in the data).

The RE-EM tree described above is based on the CART tree algorithm proposed by Breiman et al. (1984) for tree building, and therefore it inherits the tendency of CART to split on variables with more possible splits. We propose a revised version of the RE-EM regression tree that corrects for this bias by using the conditional inference tree proposed by Hothorn et al. (2006) as the underlying tree algorithm instead of CART. The new version is apparently unbiased, and has several improvements over the original RE-EM regression tree in terms of prediction accuracy and the ability to recover the correct true structure.

Portions of this talk are based on Sela and Simonoff (2012, Machine Learning) and Simonoff (2013, Statistical Modelling), and on joint work with Wei Fu.
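A hedged usage sketch with the REEMtree package that accompanies Sela and Simonoff (2012); the data frame df, covariates x1 and x2, and grouping variable id are placeholders, and the call assumes the package's main interface REEMtree(formula, data, random).

    library(REEMtree)

    fit <- REEMtree(y ~ x1 + x2, data = df, random = ~ 1 | id)
    fit  # prints the fitted tree and the estimated random-effects model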


 

Interaction trees for exploring stratified and individual treatment effects
Xiaogang Su, University of Texas at El Paso, USA


Assessing heterogeneous treatment effects has attracted growing interest in many fields of application, such as alcohol abuse treatment and psychological medicine. The concepts of stratified and individual treatment effects involved in such an assessment are closely related to subgroup analysis, personalized medicine, and optimal treatment regimes. For experimental data collected from randomized trials, we propose a tree-structured method, termed 'Interaction Trees' (IT), to explore stratified and individual treatment effects. IT recursively splits the data in such a way that the treatment effect becomes more homogeneous within each resultant partition. The automated procedure yields a number of objectively defined subgroups; in some of them the treatment effect is prominent, while in others the treatment has a negligible or even negative effect. The standard CART (Breiman et al., 1984) methodology is inherited for built-in validation in order to reduce false positive findings.

Inference at different levels can be made on the basis of IT. An aggregated grouping procedure stratifies the data into refined groups within which the treatment effect is homogeneous. Ensembles of IT models can provide predictions of individual treatment effects and compare favorably to traditional 'separate regression' methods. To extract meaningful interpretations, we also provide several other features, such as variable importance measures and partial dependence plots. An empirical illustration of the proposed techniques is given via an analysis of quality of life (QOL) data among breast cancer survivors.
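A bare-bones version of the kind of splitting criterion IT relies on (the actual statistic differs in its details): evaluate a candidate split of covariate x at cut-off c by the treatment-by-split interaction in a linear model.

    # squared t-statistic for the treatment x split interaction,
    # with y the outcome, trt a 0/1 treatment, and (x, c) the candidate split
    split_stat <- function(y, trt, x, c) {
      side <- factor(x <= c)
      fit  <- lm(y ~ trt * side)
      summary(fit)$coefficients["trt:sideTRUE", "t value"]^2
    }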


 

Parties, models, mobsters: A new implementation of model-based recursive partitioning in R
Achim Zeileis, Universität Innsbruck, Austria


MOB is a generic algorithm for model-based recursive partitioning (Zeileis, Hothorn, Hornik 2008). Rather than fitting one global model to a dataset, it estimates local models on subsets of the data that are "learned" by recursive partitioning. It proceeds in the following way: (1) fit a parametric model to the data set, (2) test for parameter instability over a set of partitioning variables, (3) if there is some overall parameter instability, split the sample with respect to the variable associated with the highest instability, (4) repeat the procedure in each of the child nodes. It is discussed how these steps of the conceptual algorithm are translated into computational tools in an object-oriented manner, allowing the user to plug in various types of parametric models. For representing the resulting trees, the R package "partykit" is employed: the basic "party" class is leveraged and extended to a new class "modelparty" providing generic infrastructure for recursive partitions in which nodes are associated with statistical models. Compared to the previously available implementation in the party package, the new implementation supports more inference options, is easier to extend to new models, and provides more convenience features.
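A small example using the convenience interface lmtree() from partykit, which applies MOB to a linear model: Petal.Length is regressed on Petal.Width, while Species and Sepal.Length serve as partitioning variables.

    library(partykit)

    fit <- lmtree(Petal.Length ~ Petal.Width | Species + Sepal.Length,
                  data = iris)
    print(fit)  # local linear models in the terminal nodes
    plot(fit)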


 