## An implementation in R using JOUSBoost and XGBoost

Since its inception, AdaBoost (acronym for Adaptive Boost) by Freund and Schapire (1997) has been a very popular among data scientists. While it has evolved in many ways (Discrete AdaBoost, Real AdaBoost, etc) it’s popularity has not dwindled. In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the “best off-the-shelf classifier in the world” (see also Breiman (1998)).

Recently, my team was given a classification task. The objective was to determine if the closest match that we found for an entity was a good a match or not. It was a binary classification task. We tried many methods - support vector machines, logistic regression, lasso regression, random forest - but ultimately none of these were any match for AdaBoost. A close second was Gradient Boosting. In this notebook I will take you through an implementation and benefits of boosting and compare the performance of AdaBoost and Gradient Boosting.

For our classification task we are going to use the famous Titanic dataset (hosted by Kaggle). The task is to predict whether a passenger on board survived the sinking of the Titanic.

## Data cleaning

Since, the objective of the blog post is to practice boosting, I’m not going to explain most of the data cleaning steps. The main thing to note is that boosting algorithms require all the data to be numeric. In fact, I’m applying the same data cleaning steps that were created by Ranjeet Singh in this notebook.

library(tidyverse)
# we are going to combine the train and test data so that we apply the same transformations on both datasets
# the column to predict is called Survived
# it will be NA for all the rows that correspond to the test dataset
tt <- dplyr::bind_rows(train,test)
tt$Survived = as.factor(tt$Survived)
tt$Pclass = as.ordered(tt$Pclass)
tt$Sex = as.factor(tt$Sex)
tt$Age = as.numeric(tt$Age)
tt$Embarked = as.factor(tt$Embarked)

# impute missing data
tt$Fare[1044] = median(tt$Fare[tt$Pclass == 3 & tt$Embarked == "S"], na.rm = T)
# transform the fare
tt$Fare = log(tt$Fare)
# impute Age with median : more values populated near median
tt$Age[is.na(tt$Age) == T] = median(tt$Age, na.rm = TRUE) tt$Embarked[c(62,830)] = "C"

for(i in 1:nrow(tt)){
if(grepl(pattern = "Mr. ", x = tt$Name[i], ignore.case = TRUE) == 1){ tt$Title[i] = "Mr"
}else if(grepl(pattern = "Mrs. ", x = tt$Name[i], ignore.case = TRUE) == 1){ tt$Title[i] = "Mrs"
}else if(grepl(pattern = "Miss. ", x = tt$Name[i], ignore.case = TRUE) == 1){ tt$Title[i] = "Miss"
}else if(grepl(pattern = "Master. ", x = tt$Name[i], ignore.case = TRUE) == 1){ tt$Title[i] = "Master"
}else{
tt$Title[i] = "Rare" } } tt$Title = as.factor(tt$Title) ######################################## family surname #### tt$Surname = sapply(tt$Name, function(x) strsplit(x, "[,.]")[[1]][1]) tt$Surname = as.factor(tt$Surname) tt$Fsize = tt$SibSp + tt$Parch + 1 # including himself/herself

###################################### Discrete family size ####
tt$FsizeDiscrete[tt$Fsize == 1] = "Singleton"
tt$FsizeDiscrete[tt$Fsize <= 5 & tt$Fsize > 1] = "Small" tt$FsizeDiscrete[tt$Fsize >5] = "Large" tt$FsizeDiscrete = as.factor(tt$FsizeDiscrete) # Is solo traveller ? # if tt$SibSp == 0 && tt$Parch == 0, solo traveller : no siblings, no child nor parent onboard tt$Solo = "No"
tt$Solo[tt$SibSp == 0 & tt$Parch == 0] = "Yes" tt$Solo = as.factor(tt$Solo) ###################################### ageGroup #### # infants, children and old people are more likely to survive for(i in 1:nrow(tt)){ if(tt$Age[i] <=4){
tt$AgeGroup[i] = "Infant" }else if(tt$Age[i] > 4 & tt$Age[i] <= 10){ tt$AgeGroup[i] = "Child"
}else if(tt$Age[i] > 10 & tt$Age[i] <= 18){
tt$AgeGroup[i] = "Young" }else if(tt$Age[i] > 18 & tt$Age[i] <= 50){ tt$AgeGroup[i] = "Adults"
}else{
tt$AgeGroup[i] = "Old" } } tt$AgeGroup = as.factor(tt$AgeGroup) # mother and child : likely to survive #### tt$Mother = "Not Mother"
tt$Mother[tt$Sex == "female" & tt$Parch > 0 & tt$Age > 18 & tt$Title != "Miss"] = "Mother" tt$Mother = as.factor(tt$Mother) ############################################ fare range #### #md.pattern(tt[1:891,]) # filter already numeric data xgb_data = tt[,c("Survived","Age","SibSp","Parch","Fare","Fsize")] # One-hot ecoding xgb_data$Plass_1 = ifelse(tt$Pclass == 1, 1, 0) xgb_data$Plass_2 = ifelse(tt$Pclass == 2, 1, 0) xgb_data$Plass_3 = ifelse(tt$Pclass == 3, 1, 0) # Sex xgb_data$Sex_Male = ifelse(tt$Sex == "male", 1, 0) # Embarked xgb_data$Embarked_C = ifelse(tt$Embarked == "C", 1, 0) xgb_data$Embarked_Q = ifelse(tt$Embarked == "Q", 1, 0) xgb_data$Embarked_S = ifelse(tt$Embarked == "S", 1, 0) # Title xgb_data$Title_Mr = ifelse(tt$Title == "Mr", 1, 0) xgb_data$Title_Mrs = ifelse(tt$Title == "Mrs", 1, 0) xgb_data$Title_Miss = ifelse(tt$Title == "Miss", 1, 0) xgb_data$Title_Master = ifelse(tt$Title == "Master", 1, 0) xgb_data$Title_Rare = ifelse(tt$Title == "Rare", 1, 0) # FsizeDiscrete xgb_data$FsizeDiscrete_Singleton = ifelse(tt$FsizeDiscrete == "Singleton", 1, 0) xgb_data$FsizeDiscrete_Small = ifelse(tt$FsizeDiscrete == "Small", 1, 0) xgb_data$FsizeDiscrete_Large = ifelse(tt$FsizeDiscrete == "Large", 1, 0) # Solo xgb_data$Solo_Yes = ifelse(tt$Solo == "Yes", 1, 0) # Age Group xgb_data$AgeGroup_Infant = ifelse(tt$AgeGroup == "Infant", 1, 0) xgb_data$AgeGroup_Child = ifelse(tt$AgeGroup == "Child", 1, 0) xgb_data$AgeGroup_Young = ifelse(tt$AgeGroup == "Young", 1, 0) xgb_data$AgeGroup_Adult = ifelse(tt$AgeGroup == "Adults", 1, 0) xgb_data$AgeGroup_Old = ifelse(tt$AgeGroup == "Old", 1, 0) # mother xgb_data$Mother_Yes = ifelse(tt$Mother == "Mother", 1, 0) ############################# separate Training and testing dataset ############################### training_data = xgb_data[1:891,] testing_data = xgb_data[892:1309,] training_data$Survived =  as.factor(training_data$Survived) testing_data$Survived = NULL # removing dependent variable from testing dataset
### since we don't have access to actual test labels, we are going to create our own test data and we are going to
## call it dev data. We will get this data from our training data set

set.seed(1122)
devIndex = sample(1:891, 90, replace=FALSE)
dev_data = training_data[devIndex, ]
train_data = training_data[-devIndex,] # note that this is different from 'training_data' sorry about the confusion

############################################# Basic Boosting Model ########################################

y_train = train_data$Survived %>% as.integer() y_train = y_train -1 y_train_ada = ifelse(y_train == 1, 1, -1) #JOUSBoost library requires labels in (1,-1) train_data$Survived = NULL

x_train = as.matrix(train_data)

y_dev = dev_data$Survived %>% as.integer() y_dev = y_dev -1 y_dev_ada = ifelse(y_dev == 1, 1, -1) #JOUSBoost library requires labels in (1,-1) dev_data$Survived = NULL

x_dev = as.matrix(dev_data)

While the data transformation steps designed by Ranjeet are very clever, if you did not follow all the steps do not worry. For now all you need to know is that boosting algorithms only need numeric data and Ranjeet has done a brilliant job of not only converting all data into a numeric format but also of feature engineering. I’m particulalry impressed by they way he tried to measure if the passenger is part of a family unit or not, and also if a passenger is single or married, etc. These are very strong signals for whether someone survived the Titanic disaster or not.

## A brief note on boosting algorithms

Let’s say we have a single decision tree, a classifier that we can call $$G(X)$$. It classifies whether a passenger has survived the titatic disaster or not. Boosting is part of a family of ‘additivie models’, which basically means that when it tries to improve its prediction it does not change the coefficients of the function ($$G(X)$$) that generated the prediction. Instead it looks at those predictions that were incorrectly classified ($$y(i) \neq \hat{y}(i)$$) and tries to predict them better in the next iteration by adding another function ($$G_{new}(X)$$). It keeps on doing this until it is asked to stop. The opposite of this is a ‘discriminative model’ e.g. simple regression, wherein our model keeps trying to improve the original $$G(X)$$ by either OLS or MLE.

In the case of boosting each $$G_i(X)$$ is also called a classifier, or a tree (since we are basically using a ‘tree’ to predict). The number of trees therefore becomes an important parameter. If we have too many of them, we end up overfitting our model to the training data – we can detect this by seeing that our accuracy on training data is increasing but our accuracy on a validation (or dev data) has started to decrease (we will show that AdaBoost is resistant to overfitting but unfortunately that is not the case with gradient boosting) There are ways to prevent overfitting (look up regularisation techniques, shrinkage, cross-fold validation, early stopping). The number of trees in the Gradient Boosting algorithm is also called number of iterations or number of rounds (if you come across them they mean the same thing)

### Tree depth, when tree depth is 1 its called a stump

The other parameter of importance is the depth of the tree. This can also cause overfitting. Traditional tree based models have very deep trees and then they are pruned to a level where the analyst feels they provide the most generalizable results. In Boosting algorithms, we generally avoid deep trees. Our AdaBoost had tree depth at 1 (a tree with depth=1 is also called a stump). Gradient boosting generally has a tree depth between 2 and 8.

The main difference between AdaBoost and Gradient Boosting is how the model improves the next prediction. They use different loss functions or objective functions. (AdaBoost uses exponential function while gradient boosting can potentially use any function as long as it is continuous, and differentiable and has a global maxima).

### But what is boosting anyway?

The reason that boosting works has been referred to as the wisdom of the crowds. Basically when we have a bunch of weak classifiers, we pool them all together and take a popular vote –> that is kind of a boosting i.e. we boost the predictive power of weak classifiers by adding them all together. A weak classifier is one which has a predictive accuracy of slightly over 50% so it is only marginally useful as it is only slightly better than random guessing. However, if we have lots of these weak classifiers their additive power can make them useful.

### Building a baseline model for comparison

Before we optimize the GBM model we will build a baseline model just to check if it gives an accuracy that is in the ballpark of what we need >70%.

We are going to use the following tuning parameters and initially set them to:
- n.trees = 10
- interaction.depth = 2 (1 would mean stumps, exactly as in AdaBoost)
- n.minobsinnode = 10 (later we will do stochastic learning, but for now we will stick with 10)
- shrinkage or learning rate = 0.001 Start at 0.001 and gradually take it down further to 0.0001 (NB. learning rate and number of trees are inversely related which means we will take our n.trees gradually higher by a similar factor, so we could end up with 10,000 trees)

We will start with the gbm (Generalized Boosted Models) package because of its simplicity. Once we have established a good baseline, we will go for intense optimisation and cross validation across a grid. In that case we will move onto XGBoost (Extreme Gradient Boosting) package which is a computationally efficient library (for python users, XGBoost goes by the same name in Python and is also a sub-package within scikit-learn).

### The risk of overfitting

For most data scientists the goal is to run our models in production and therefore we are more concerned with production accuracy than training or test accuracy. We will use a two point strategy to avoid over-fiting:

• use a very low value for shrinkage ~ 0.0001
• deploy early stopping to avoid over-optimisation on training set

For comparison purposes we will do a quick deployment of a baseline model with acceptable parameters as declared above.

library(gbm) # package for efficient implementation of AdaBoost

# since gbm needs a dataframe object we will not use our x_train matrix right now
baseline <- gbm(y_train ~ ., data = train_data, distribution = "bernoulli", n.trees = 10)
baseline_preds <- predict(baseline, dev_data, type='response', n.trees=100)
baseline_preds <- baseline_preds > 0.5 # if prob is >0.5 then we call it 1 else 0
baseline_conf <- table(y_dev, baseline_preds)
baseline_precision <- baseline_conf[2,2]/sum(baseline_conf[,2])
baseline_recall = baseline_conf[2,2]/sum(baseline_conf[2,])
baseline_accuracy = (baseline_conf[1,1] + baseline_conf[2,2]) / length(y_dev)
baseline_f1Score = 2 * baseline_precision * baseline_recall / (baseline_precision + baseline_recall)
knitr::kable(
data.frame(Baseline Metrics = c('Accuracy', 'Precision', 'Recall', 'F1-Score'),
Values = scales::percent(c(baseline_accuracy, baseline_precision, baseline_recall, baseline_f1Score)))
)
Baseline.Metrics Values
Accuracy 77.8%
Precision 69.2%
Recall 60.0%
F1-Score 64.3%

We can see that AdaBoost has already done a pretty good job of this classification. On the dev dataset (which is practically a test dataset) we have acheived 78% accuracy.

## Deploying and optimising Gradient Boosting using XGBoost

Now we will perform a grid search to look for the optimal solution and move on to XGBoost library since it is more efficient.

We will use only two parameters in our grid search.

1. Tree depth: we will go from 1 to 8
2. Shrinkage: we will decrease by a factor of 10 (from 0.1 to 0.0001)

Why are we not optimizing for other parameters?
In AdaBoost we did a grid search for number of trees as well but we can avoid that in XGBoost because it has the option of early stopping. Early stopping basically means that if accuracy does not improve after N continuous trees (set at 100) then the algorithm will automatically stop.

We will not use L1 & L2 regularisation because our data set is too small for it to need that level of regularisation (how do I know? I have already tried them in the console and they didn’t do much :)

library(xgboost)

# prepare data for XGBoost
# one adavantage of XGBoost is that it stores results of subsequent iterations
# in something called a watchlist, we will take advantage of that

# we will build two matrices (one for training and one for validation) and then join them together
dtrain <- xgb.DMatrix(data = x_train,
label=y_train)
dvalidation <- xgb.DMatrix(data= x_dev,
label = y_dev)
watchlist <- list(train=dtrain, validation=dvalidation)

# the following params will vary
max_depth <- 2:8
shrinkage <- c(0.1, 0.01, 0.001, 0.0001)
xgb_models <- list()

start_time <- Sys.time()

for(d in max_depth) {
for(s in shrinkage){
param <- list(
set.seed = 586,
objective = "binary:logistic",
eta = s,
max_depth = d,
colsamplebytree = 0.8,
subsample = 0.8,
#min_child_weight = 3,
base_score = 0.50,
#lambda = 100,
#lambda_bias = 5,
#alpha = 0.1,
verbose = FALSE
)

mod = xgboost::xgb.train(
params = param,
data = dtrain,
nrounds = 1000,
watchlist = watchlist,
maximize = FALSE,
early_stopping_rounds = 100,
verbose = FALSE)

xgb_models = append(xgb_models, list(mod))
}
}

end_time = Sys.time()

print(str_c(
"Grid search completed.",
" Time elapsed: ",
(end_time - start_time),
"seconds",
sep=""
))
## [1] "Grid search completed. Time elapsed: 5.20081496238708seconds"

Creating a dataframe which nicely collates the results of all XGBoost models that we ran.

xgb_metrics <- data.frame(algo = character(),
tree_depth = integer(),
n_trees = integer(),
shrinkage = numeric(),
accuracy = numeric(),
precision = numeric(),
recall = numeric(),
f1_score = numeric(),
training_accuracy = numeric(),
training_precision = numeric(),
training_recall = numeric(),
training_f1_score = numeric(),
stringsAsFactors = FALSE)

# get sequential list of depth and shrinkage as they were deployed
d = rep(max_depth, 4)
s = rep(shrinkage, 7)

ConfusionMatrix <- function(prediction, truth) {
conf <- table(truth, prediction)
df <- data.frame(precision = conf[2,2]/sum(conf[,2]),
recall = conf[2,2]/sum(conf[2,]),
accuracy = (conf[1,1] + conf[2,2]) / length(prediction))
df$f1 <- 2/((1/df$precision) + (1/df$recall)) return(list(confusion_matrix=conf, metrics=df)) } for(i in 1:28) { mod = xgb_models[[i]] preds = predict(mod, dvalidation, type="response") validation_metrics = ConfusionMatrix(preds > 0.5, y_dev)$metrics

training_preds = predict(mod, dtrain, type="response")
training_metrics = ConfusionMatrix(training_preds > 0.5, y_train)$metrics xgb_metrics <- xgb_metrics %>% add_row(algo = "Gradient Boosting", tree_depth = d[i], n_trees = mod$niter,
shrinkage = s[i],
accuracy = validation_metrics$accuracy, precision = validation_metrics$precision,
recall = validation_metrics$recall, f1_score = validation_metrics$f1,
training_accuracy = training_metrics$accuracy, training_precision = training_metrics$precision,
training_recall = training_metrics$recall, training_f1_score = training_metrics$f1)

}

knitr::kable(
xgb_metrics %>%
select(algo, tree_depth, n_trees, shrinkage, accuracy, training_accuracy) %>%
sample_n(10))
algo tree_depth n_trees shrinkage accuracy training_accuracy
Gradient Boosting 2 231 1e-01 0.9000000 0.8651685
Gradient Boosting 7 160 1e-01 0.9000000 0.9101124
Gradient Boosting 7 106 1e-04 0.8555556 0.8776529
Gradient Boosting 4 110 1e-03 0.8222222 0.8239700
Gradient Boosting 6 105 1e-04 0.8555556 0.8476904
Gradient Boosting 6 107 1e-03 0.8666667 0.8739076
Gradient Boosting 2 107 1e-03 0.8222222 0.8501873
Gradient Boosting 6 191 1e-01 0.9000000 0.8926342
Gradient Boosting 4 103 1e-04 0.8666667 0.8714107
Gradient Boosting 8 130 1e-03 0.8444444 0.8339576

### Selecting the best XGBoost model

xgb_metrics %>%
ggplot(aes(x=tree_depth, y = accuracy)) +
geom_point(aes(colour = as.factor(shrinkage)), position= "jitter") +
scale_color_discrete("Shrinkage") +
ggtitle("Accuracy of each model that ran on XGBoost")

Surprisingly, XGBoost has hit 90% test accuracy. The model that actually did it has tree_depth = 2 and shrinkage = 0.1. Let’s pull out the top performing models.

# pull up the top 3 accuracy metrics
knitr::kable(
xgb_metrics %>%
top_n(3, accuracy) %>%
select(algo, tree_depth, n_trees, shrinkage, accuracy, training_accuracy))
algo tree_depth n_trees shrinkage accuracy training_accuracy
Gradient Boosting 2 231 0.1 0.9 0.8651685
Gradient Boosting 6 191 0.1 0.9 0.8926342
Gradient Boosting 7 160 0.1 0.9 0.9101124

Apparently, there are two models that hit 90% accuracy and suprisingly, their training accuracy was actually lower (86%)! The other contenders for top spot are models hover around 89% so it’s worth considering them. Since most of them have a higher training accuracy, it is probably overfitting.

It is tempting to go with the model with highest accuracy but given that we have an unusual case here wherein test accuracy is higher than training accuracy, it’s worth looking a bit more closely at our model. However, since the purpose of this blog post is to show how we can run a Gradient Boosting model and AdaBoost model, I’m not going into that.

We will go ahead with our final XGBoost model.

xgb_final_metrics <-
xgb_metrics %>%
top_n(1, accuracy) %>%
filter(tree_depth==2)
xgb_final <- xgb_models[[1]]

## Summary

knitr::kable(
data.frame(
Metrics    =                c('Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-Score'),
Baseline   =scales::percent(c(baseline_accuracy, baseline_precision, baseline_recall, baseline_f1Score)),
AdaBoost   =scales::percent(c(adab_final_metrics$accuracy, adab_final_metrics$precision, adab_final_metrics$recall, adab_final_metrics$f1_score)),
XGBoost    =scales::percent(c(xgb_final_metrics$accuracy, xgb_final_metrics$precision, xgb_final_metrics$recall, xgb_final_metrics$f1_score))))