Since its inception, AdaBoost (short for Adaptive Boosting) by Freund and Schapire (1997) has been very popular among data scientists. While it has evolved in many ways (Discrete AdaBoost, Real AdaBoost, etc.), its popularity has not dwindled. In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the “best off-the-shelf classifier in the world” (see also Breiman (1998)).
Recently, my team was given a classification task. The objective was to determine whether the closest match that we found for an entity was a good match or not - a binary classification task. We tried many methods - support vector machines, logistic regression, lasso regression, random forest - but ultimately none of these were any match for AdaBoost. A close second was Gradient Boosting. In this notebook I will take you through an implementation of boosting, its benefits, and a comparison of the performance of AdaBoost and Gradient Boosting.
Classification task
For our classification task we are going to use the famous Titanic dataset (hosted by Kaggle). The task is to predict whether a passenger on board survived the sinking of the Titanic.
Data cleaning
Since the objective of this blog post is to practice boosting, I’m not going to explain most of the data cleaning steps. The main thing to note is that the boosting libraries we use here (XGBoost in particular) require all the features to be numeric. In fact, I’m applying the same data cleaning steps that were created by Ranjeet Singh in this notebook.
library(tidyverse)
train <- readr::read_csv("2020-03-21-how-to-deploy-adaboost-for-a-classification-problem.en_files/input/titanic/train.csv")
test <- readr::read_csv("2020-03-21-how-to-deploy-adaboost-for-a-classification-problem.en_files/input/titanic/test.csv")
# we are going to combine the train and test data so that we apply the same transformations on both datasets
# the column to predict is called Survived
# it will be NA for all the rows that correspond to the test dataset
tt <- dplyr::bind_rows(train,test)
tt$Survived = as.factor(tt$Survived)
tt$Pclass = as.ordered(tt$Pclass)
tt$Sex = as.factor(tt$Sex)
tt$Age = as.numeric(tt$Age)
tt$Embarked = as.factor(tt$Embarked)
# impute missing data
tt$Fare[1044] = median(tt$Fare[tt$Pclass == 3 & tt$Embarked == "S"], na.rm = T)
# transform the fare
tt$Fare = log(tt$Fare)
# impute Age with median : more values populated near median
tt$Age[is.na(tt$Age) == T] = median(tt$Age, na.rm = TRUE)
tt$Embarked[c(62,830)] = "C"
for(i in 1:nrow(tt)){
if(grepl(pattern = "Mr. ", x = tt$Name[i], ignore.case = TRUE) == 1){
tt$Title[i] = "Mr"
}else if(grepl(pattern = "Mrs. ", x = tt$Name[i], ignore.case = TRUE) == 1){
tt$Title[i] = "Mrs"
}else if(grepl(pattern = "Miss. ", x = tt$Name[i], ignore.case = TRUE) == 1){
tt$Title[i] = "Miss"
}else if(grepl(pattern = "Master. ", x = tt$Name[i], ignore.case = TRUE) == 1){
tt$Title[i] = "Master"
}else{
tt$Title[i] = "Rare"
}
}
tt$Title = as.factor(tt$Title)
######################################## family surname ####
tt$Surname = sapply(tt$Name, function(x) strsplit(x, "[,.]")[[1]][1])
tt$Surname = as.factor(tt$Surname)
tt$Fsize = tt$SibSp + tt$Parch + 1 # including himself/herself
###################################### Discrete family size ####
tt$FsizeDiscrete[tt$Fsize == 1] = "Singleton"
tt$FsizeDiscrete[tt$Fsize <= 5 & tt$Fsize > 1] = "Small"
tt$FsizeDiscrete[tt$Fsize >5] = "Large"
tt$FsizeDiscrete = as.factor(tt$FsizeDiscrete)
# Is solo traveller ?
# if tt$SibSp == 0 && tt$Parch == 0, solo traveller : no siblings, no child nor parent onboard
tt$Solo = "No"
tt$Solo[tt$SibSp == 0 & tt$Parch == 0] = "Yes"
tt$Solo = as.factor(tt$Solo)
###################################### ageGroup ####
# infants, children and old people are more likely to survive
for(i in 1:nrow(tt)){
if(tt$Age[i] <=4){
tt$AgeGroup[i] = "Infant"
}else if(tt$Age[i] > 4 & tt$Age[i] <= 10){
tt$AgeGroup[i] = "Child"
}else if(tt$Age[i] > 10 & tt$Age[i] <= 18){
tt$AgeGroup[i] = "Young"
}else if(tt$Age[i] > 18 & tt$Age[i] <= 50){
tt$AgeGroup[i] = "Adults"
}else{
tt$AgeGroup[i] = "Old"
}
}
tt$AgeGroup = as.factor(tt$AgeGroup)
# mother and child : likely to survive ####
tt$Mother = "Not Mother"
tt$Mother[tt$Sex == "female" & tt$Parch > 0 & tt$Age > 18 & tt$Title != "Miss"] = "Mother"
tt$Mother = as.factor(tt$Mother)
############################################ fare range ####
#md.pattern(tt[1:891,])
# filter already numeric data
xgb_data = tt[,c("Survived","Age","SibSp","Parch","Fare","Fsize")]
# One-hot encoding
xgb_data$Pclass_1 = ifelse(tt$Pclass == 1, 1, 0)
xgb_data$Pclass_2 = ifelse(tt$Pclass == 2, 1, 0)
xgb_data$Pclass_3 = ifelse(tt$Pclass == 3, 1, 0)
# Sex
xgb_data$Sex_Male = ifelse(tt$Sex == "male", 1, 0)
# Embarked
xgb_data$Embarked_C = ifelse(tt$Embarked == "C", 1, 0)
xgb_data$Embarked_Q = ifelse(tt$Embarked == "Q", 1, 0)
xgb_data$Embarked_S = ifelse(tt$Embarked == "S", 1, 0)
# Title
xgb_data$Title_Mr = ifelse(tt$Title == "Mr", 1, 0)
xgb_data$Title_Mrs = ifelse(tt$Title == "Mrs", 1, 0)
xgb_data$Title_Miss = ifelse(tt$Title == "Miss", 1, 0)
xgb_data$Title_Master = ifelse(tt$Title == "Master", 1, 0)
xgb_data$Title_Rare = ifelse(tt$Title == "Rare", 1, 0)
# FsizeDiscrete
xgb_data$FsizeDiscrete_Singleton = ifelse(tt$FsizeDiscrete == "Singleton", 1, 0)
xgb_data$FsizeDiscrete_Small = ifelse(tt$FsizeDiscrete == "Small", 1, 0)
xgb_data$FsizeDiscrete_Large = ifelse(tt$FsizeDiscrete == "Large", 1, 0)
# Solo
xgb_data$Solo_Yes = ifelse(tt$Solo == "Yes", 1, 0)
# Age Group
xgb_data$AgeGroup_Infant = ifelse(tt$AgeGroup == "Infant", 1, 0)
xgb_data$AgeGroup_Child = ifelse(tt$AgeGroup == "Child", 1, 0)
xgb_data$AgeGroup_Young = ifelse(tt$AgeGroup == "Young", 1, 0)
xgb_data$AgeGroup_Adult = ifelse(tt$AgeGroup == "Adults", 1, 0)
xgb_data$AgeGroup_Old = ifelse(tt$AgeGroup == "Old", 1, 0)
# mother
xgb_data$Mother_Yes = ifelse(tt$Mother == "Mother", 1, 0)
############################# separate Training and testing dataset ###############################
training_data = xgb_data[1:891,]
testing_data = xgb_data[892:1309,]
training_data$Survived = as.factor(training_data$Survived)
testing_data$Survived = NULL # removing dependent variable from testing dataset
### since we don't have access to actual test labels, we are going to create our own test data and we are going to
## call it dev data. We will get this data from our training data set
set.seed(1122)
devIndex = sample(1:891, 90, replace=FALSE)
dev_data = training_data[devIndex, ]
train_data = training_data[-devIndex,] # note that this is different from 'training_data' sorry about the confusion
############################################# Basic Boosting Model ########################################
y_train = train_data$Survived %>% as.integer()
y_train = y_train -1
y_train_ada = ifelse(y_train == 1, 1, -1) #JOUSBoost library requires labels in (1,-1)
train_data$Survived = NULL
x_train = as.matrix(train_data)
y_dev = dev_data$Survived %>% as.integer()
y_dev = y_dev -1
y_dev_ada = ifelse(y_dev == 1, 1, -1) #JOUSBoost library requires labels in (1,-1)
dev_data$Survived = NULL
x_dev = as.matrix(dev_data)
The data transformation steps designed by Ranjeet are very clever, but if you did not follow all of them, do not worry. For now all you need to know is that boosting algorithms only need numeric data, and Ranjeet has done a brilliant job not only of converting all the data into a numeric format but also of feature engineering. I’m particularly impressed by the way he tried to capture whether a passenger is part of a family unit, and whether a passenger is single or married, etc. These are very strong signals for whether someone survived the Titanic disaster.
A brief note on boosting algorithms
Let’s say we have a single decision tree, a classifier that we can call \(G(X)\). It classifies whether a passenger survived the Titanic disaster or not. Boosting belongs to a family of ‘additive models’, which basically means that when it tries to improve its predictions it does not change the coefficients of the function (\(G(X)\)) that generated them. Instead it looks at the observations that were incorrectly classified (\(y_i \neq \hat{y}_i\)) and tries to predict them better in the next iteration by adding another function (\(G_{new}(X)\)). It keeps doing this until it is asked to stop. Contrast this with a model such as simple regression, where we keep trying to improve the original \(G(X)\) itself, via OLS or MLE.
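In AdaBoost this additive structure takes the familiar form of a weighted vote over the \(M\) classifiers built so far (this is the standard textbook formulation, not something specific to our code):

\[ G(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right) \]

where each weight \(\alpha_m\) reflects how well classifier \(G_m\) performed on the reweighted training data.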
In boosting, each \(G_i(X)\) is also called a classifier, or a tree (since we are basically using a ‘tree’ to predict). The number of trees therefore becomes an important parameter. If we have too many of them, we end up overfitting our model to the training data; we can detect this when our accuracy on the training data keeps increasing but our accuracy on a validation (or dev) set has started to decrease. (We will show that AdaBoost is resistant to overfitting, but unfortunately that is not the case with gradient boosting.) There are ways to prevent overfitting (look up regularisation techniques, shrinkage, cross-validation, early stopping). The number of trees in a gradient boosting algorithm is also called the number of iterations or the number of rounds; if you come across these terms, they mean the same thing.
Tree depth (when tree depth is 1, it’s called a stump)
The other parameter of importance is the depth of each tree, which can also cause overfitting. Traditional tree-based models grow very deep trees, which are then pruned back to a level where the analyst feels they provide the most generalizable results. In boosting algorithms we generally avoid deep trees. Our AdaBoost model used a tree depth of 1 (a tree with depth = 1 is also called a stump). Gradient boosting generally uses a tree depth between 2 and 8.
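To make the idea of a stump concrete, here is a minimal sketch (it assumes the rpart package, which is not otherwise used in this post) that fits a single depth-1 tree on the training data we prepared above:

library(rpart)
# a stump: a decision tree that is allowed to make exactly one split
stump <- rpart(Survived ~ ., data = training_data[-devIndex, ],
               method = "class",
               control = rpart.control(maxdepth = 1))
print(stump) # a single split; boosting combines hundreds of such weak learners

On its own a stump is a very weak classifier, which is exactly why boosting needs many of them.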
Difference between AdaBoost and Gradient Boost
The main difference between AdaBoost and Gradient Boosting is how the model improves the next prediction: they use different loss (objective) functions. AdaBoost uses the exponential loss, while gradient boosting can potentially use any loss function as long as it is continuous and differentiable and has a global minimum.
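For the record, with the labels coded as \(y \in \{-1, +1\}\) (exactly how we recoded them for JOUSBoost above), the exponential loss that AdaBoost minimises is

\[ L\big(y, f(x)\big) = e^{-y\, f(x)} \]

so a confidently wrong prediction (a large \(f(x)\) with the wrong sign) is penalised very heavily, which is why misclassified observations get so much extra weight in the next round.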
But what is boosting anyway?
The reason that boosting works has been referred to as the wisdom of the crowds. When we have a bunch of weak classifiers, we pool them all together and take a popular vote; that is essentially boosting, i.e. we boost the predictive power of weak classifiers by adding them all together. A weak classifier is one with a predictive accuracy only slightly above 50%, so on its own it is only marginally useful - barely better than random guessing. However, if we have lots of these weak classifiers, their combined vote can be very powerful.
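Here is a minimal toy simulation of that intuition (it is not part of the Titanic analysis, and it assumes the weak classifiers vote independently, which real boosted trees do not):

# toy simulation: 501 independent weak classifiers, each correct 55% of the time
set.seed(42)
n_classifiers <- 501 # odd, so a majority vote can never tie
n_cases <- 10000     # hypothetical observations to classify
votes <- matrix(rbinom(n_cases * n_classifiers, size = 1, prob = 0.55),
                nrow = n_cases) # 1 = that classifier got the case right
majority_correct <- rowMeans(votes) > 0.5
mean(majority_correct) # roughly 0.99: the ensemble is far stronger than any single member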
Building a baseline model for comparison
Before we optimize the GBM model we will build a baseline model, just to check that it gives an accuracy in the ballpark of what we need (> 70%).
We are going to use the following tuning parameters and initially set them to:
- n.trees = 10
- interaction.depth = 2 (1 would mean stumps, exactly as in AdaBoost)
- n.minobsinnode = 10 (later we will do stochastic learning, but for now we will stick with 10)
- shrinkage (learning rate) = 0.001. We start at 0.001 and will gradually take it down to 0.0001 (NB: the learning rate and the number of trees are inversely related, which means we will take n.trees gradually higher by a similar factor, so we could end up with 10,000 trees).
We will start with the gbm (Generalized Boosted Models) package because of its simplicity. Once we have established a good baseline, we will go for intense optimisation and cross-validation across a grid; at that point we will move on to the XGBoost (Extreme Gradient Boosting) package, which is a computationally efficient library (for Python users, XGBoost goes by the same name in Python and also offers a scikit-learn compatible interface). A sketch of how the starting parameters above map onto a gbm() call follows.
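This is a sketch only (it reuses the y_train and train_data objects created above; the actual baseline we fit below only overrides n.trees):

library(gbm)
# a sketch of how the starting parameters above map onto gbm() arguments
sketch_gbm <- gbm(y_train ~ ., data = train_data,
                  distribution = "bernoulli", # binary outcome
                  n.trees = 10,               # number of trees / iterations / rounds
                  interaction.depth = 2,      # depth 1 would mean stumps
                  n.minobsinnode = 10,        # minimum observations in a terminal node
                  shrinkage = 0.001)          # learning rate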
The risk of overfitting
For most data scientists the goal is to run our models in production, and therefore we are more concerned with production accuracy than with training or test accuracy. We will use a two-point strategy to avoid overfitting:
- use a very low value for shrinkage ~ 0.0001
- deploy early stopping to avoid over-optimisation on training set
Deploying AdaBoost
For comparison purposes we will do a quick deployment of a baseline model with acceptable parameters as declared above.
library(gbm) # the Generalized Boosted Models package (an efficient boosting implementation)
# since gbm needs a dataframe object we will not use our x_train matrix right now
baseline <- gbm(y_train ~ ., data = train_data, distribution = "bernoulli", n.trees = 10)
baseline_preds <- predict(baseline, dev_data, type = 'response', n.trees = 10) # predict with the same number of trees we fitted
baseline_preds <- baseline_preds > 0.5 # if prob is >0.5 then we call it 1 else 0
baseline_conf <- table(y_dev, baseline_preds) # rows = truth, columns = prediction
baseline_precision <- baseline_conf[2,2]/sum(baseline_conf[,2]) # TP / (TP + FP)
baseline_recall = baseline_conf[2,2]/sum(baseline_conf[2,]) # TP / (TP + FN)
baseline_accuracy = (baseline_conf[1,1] + baseline_conf[2,2]) / length(y_dev) # (TN + TP) / all cases
baseline_f1Score = 2 * baseline_precision * baseline_recall / (baseline_precision + baseline_recall) # harmonic mean of precision and recall
knitr::kable(
data.frame(`Baseline Metrics` = c('Accuracy', 'Precision', 'Recall', 'F1-Score'),
Values = scales::percent(c(baseline_accuracy, baseline_precision, baseline_recall, baseline_f1Score)))
)
Baseline.Metrics | Values |
---|---|
Accuracy | 77.8% |
Precision | 69.2% |
Recall | 60.0% |
F1-Score | 64.3% |
We can see that our baseline boosting model has already done a pretty good job of this classification. On the dev dataset (which is practically a test dataset) we have achieved ~78% accuracy.
Optimising AdaBoost using grid search
We will now run an extensive grid search to look for the best parameters. When we do this, we will intentionally go to an extreme, partly for educational purposes and partly because we actually want to see whether AdaBoost is resistant to overfitting. For example, we will build 1,000 trees even though we only have 801 observations in our training dataset (which is ridiculous, but we will do it anyway).
library(JOUSBoost) # package for efficient AdaBoost implementation
set.seed(583)
tree_depth = 1:15
n_trees = c(
(seq(from = 10, to = 200, by = 10)),
(seq(from = 250, to = 1000, by = 50)))
# AdaBoost will run 540 times (15 tree depths x 36 values of n_trees). This is going to take a long time!
start_time <- Sys.time()
# packages for parallel processing
library(doSNOW)
library(foreach)
numOfclusters <- 10 # yup, I can spare 10+ cores on my machine :)
cl <- makeCluster(numOfclusters)
registerDoSNOW(cl)
adaBoost_grid <-
foreach(t = tree_depth) %:%
foreach(n = n_trees) %dopar% {
mod = JOUSBoost::adaboost(X = x_train, y=y_train_ada, tree_depth = t, n_rounds =n)
# training results
tr_conf = mod$confusion_matrix
tr_precision = tr_conf[2,2]/sum(tr_conf[,2])
tr_recall = tr_conf[2,2]/sum(tr_conf[2,])
tr_accuracy = (tr_conf[1,1] + tr_conf[2,2]) / length(y_train_ada)
tr_f1_score = 2 * tr_precision * tr_recall / (tr_precision + tr_recall)
# dev_test results
preds = JOUSBoost::predict.adaboost(mod, X=x_dev, type="response")
conf = table(y_dev_ada, preds)
precision = conf[2,2]/sum(conf[,2])
recall = conf[2,2]/sum(conf[2,])
accuracy = (conf[1,1] + conf[2,2]) / length(y_dev_ada)
f1_score = 2 * precision * recall / (precision + recall)
# metrics to append
model_results = data.frame(algo = "AdaBoost",
tree_depth = t,
n_trees = n,
accuracy = accuracy,
precision = precision,
recall = recall,
f1_score = f1_score,
training_accuracy = tr_accuracy,
training_precision = tr_precision,
training_recall = tr_recall,
training_f1_score = tr_f1_score,
stringsAsFactors = FALSE)
# final list item to return
list(model = mod, metrics = model_results)
}
stopCluster(cl)
end_time = Sys.time()
print(str_c(
"Grid search completed.",
" Time elapsed: ",
(end_time - start_time),
sep=""
))
## [1] "Grid search completed. Time elapsed: 6.7166993021965"
# we have a nested list where each item corresponds to tree_depth (1 to 15)
# and inside each tree_depth we have one item for each num_trees
# we are going to collapse it all into one dataframe
adaboost_metrics <-
lapply(adaBoost_grid, function(outside_item)
{lapply(outside_item, function(inside_item)
{ inside_item[['metrics']]})})
adaboost_metrics <- as_tibble(bind_rows(lapply(adaboost_metrics, bind_rows)))
AdaBoost is resistant to overfitting
Below we can already see that even though our training accuracy has reached as high as 98.5%, our test accuracy is adamantly stuck at ~82%. This is unusual: for most ML models, as training accuracy keeps increasing the test accuracy eventually worsens because of overfitting. But we can see that AdaBoost is stubbornly resistant to overfitting.
knitr::kable(adaboost_metrics %>%
select(algo, tree_depth, n_trees, training_accuracy, accuracy) %>%
arrange(desc(training_accuracy), desc(accuracy)) %>%
sample_n(10))
algo | tree_depth | n_trees | training_accuracy | accuracy |
---|---|---|---|---|
AdaBoost | 5 | 500 | 0.9637953 | 0.8888889 |
AdaBoost | 8 | 250 | 0.9850187 | 0.8222222 |
AdaBoost | 7 | 20 | 0.9837703 | 0.8555556 |
AdaBoost | 1 | 350 | 0.8339576 | 0.8111111 |
AdaBoost | 7 | 80 | 0.9837703 | 0.8555556 |
AdaBoost | 3 | 20 | 0.8526841 | 0.8111111 |
AdaBoost | 10 | 110 | 0.9837703 | 0.8333333 |
AdaBoost | 13 | 190 | 0.9837703 | 0.8222222 |
AdaBoost | 2 | 750 | 0.8489388 | 0.8222222 |
AdaBoost | 8 | 150 | 0.9850187 | 0.8222222 |
At a broad level we can observe that, for each of our models, the highest training accuracy is reached fairly early and after that it holds steady. The test accuracy also reaches its maximum around the same point and holds steady. To look at how resistant AdaBoost is to overfitting, we will zoom in on one type of model.
In this model we can see that even though our training accuracy has been steadily climbing from ~85% to ~97%, our test accuracy has been holding steady at ~85%. Most data scientists who see AdaBoost results for the first time are pleasantly surprised by this property of AdaBoost. Incidentally, even the original creators of AdaBoost, Freund and Schapire (1997), have not been able to fully explain why this happens.
adaboost_metrics %>%
filter(tree_depth == 5, n_trees < 60) %>%
mutate(tree_depth = as.factor(tree_depth)) %>%
mutate(type_training = "Training") %>%
mutate(type_dev = "Dev") %>%
ggplot(aes(x=n_trees)) +
geom_line(aes( y=training_accuracy, lty=type_training, colour = tree_depth)) +
geom_line(aes( y=accuracy, lty=type_dev, colour = tree_depth)) +
ylab("Accuracy") +
xlab("Number of classifiers") +
ylim(0,1) +
ggtitle("Bias and variance for AdaBoost") +
scale_linetype_discrete(name="Accuracy type",
labels = c("Dev", "Training")) +
scale_color_discrete(name="Tree Depth") +
theme_minimal()
Selecting our best AdaBoost model
Occam’s razor - all other things equal, go with the simplest model :)
Since our Kaggle task is to predict who will survive and who will not, our ideal metric is accuracy. In our case, we will pick the model that has the highest test accuracy.
adaboost_metrics %>%
top_n(1, accuracy)
## # A tibble: 34 x 11
## algo tree_depth n_trees accuracy precision recall f1_score training_accura…
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AdaB… 5 10 0.889 0.833 0.833 0.833 0.893
## 2 AdaB… 5 40 0.889 0.857 0.8 0.828 0.966
## 3 AdaB… 5 50 0.889 0.833 0.833 0.833 0.964
## 4 AdaB… 5 60 0.889 0.833 0.833 0.833 0.964
## 5 AdaB… 5 70 0.889 0.833 0.833 0.833 0.964
## 6 AdaB… 5 80 0.889 0.833 0.833 0.833 0.964
## 7 AdaB… 5 90 0.889 0.833 0.833 0.833 0.964
## 8 AdaB… 5 100 0.889 0.833 0.833 0.833 0.964
## 9 AdaB… 5 110 0.889 0.833 0.833 0.833 0.964
## 10 AdaB… 5 120 0.889 0.833 0.833 0.833 0.964
## # … with 24 more rows, and 3 more variables: training_precision <dbl>,
## # training_recall <dbl>, training_f1_score <dbl>
We can see that out of our 540 models 34 have achieved the same max accuracy. In this case, we will go for the model that is simplest among the lot. We can define complexity as directly proportional to:
- tree depth
- number of trees
# final model specs
adab_final_metrics <-
adaboost_metrics %>%
top_n(1, accuracy) %>%
top_n(1, -tree_depth) %>%
top_n(1, -n_trees)
knitr::kable(adab_final_metrics %>%
select(algo, tree_depth, accuracy, precision, recall, f1_score))
algo | tree_depth | accuracy | precision | recall | f1_score |
---|---|---|---|---|---|
AdaBoost | 5 | 0.8888889 | 0.8333333 | 0.8333333 | 0.8333333 |
# extracting our final AdaBoost model from our grid search object
# (outer index 5 = tree_depth of 5, inner index 1 = the first n_trees value, i.e. 10 trees)
adaBoost_final <- adaBoost_grid[[5]][[1]][['model']]
Deploying and optimising Gradient Boosting using XGBoost
Now we will perform a grid search to look for the optimal solution, and we will move on to the XGBoost library since it is more efficient.
We will use only two parameters in our grid search.
- Tree depth: we will go from 2 to 8
- Shrinkage: we will decrease by a factor of 10 (from 0.1 to 0.0001)
Why are we not optimizing for other parameters?
In AdaBoost we did a grid search over the number of trees as well, but we can avoid that in XGBoost because it has the option of early stopping. Early stopping basically means that if the validation metric does not improve for N consecutive rounds (we set N to 100) then the algorithm will automatically stop.
We will not use L1 and L2 regularisation because our dataset is too small to need that level of regularisation (how do I know? I have already tried them in the console and they didn’t do much :))
library(xgboost)
# prepare data for XGBoost
# one advantage of XGBoost is that it evaluates every iteration on the datasets
# supplied in a "watchlist" and stores those results; we will take advantage of that
# we will build two matrices (one for training and one for validation) and pass both in the watchlist
dtrain <- xgb.DMatrix(data = x_train,
label=y_train)
dvalidation <- xgb.DMatrix(data= x_dev,
label = y_dev)
watchlist <- list(train=dtrain, validation=dvalidation)
# the following params will vary
max_depth <- 2:8
shrinkage <- c(0.1, 0.01, 0.001, 0.0001)
xgb_models <- list()
start_time <- Sys.time()
for(d in max_depth) {
for(s in shrinkage){
set.seed(586) # set the seed before training (set.seed is not an xgb.train parameter)
param <- list(
objective = "binary:logistic",
eta = s, # shrinkage / learning rate
max_depth = d,
colsample_bytree = 0.8, # column subsampling per tree
subsample = 0.8, # row subsampling
#min_child_weight = 3,
#lambda = 100,
#lambda_bias = 5,
#alpha = 0.1,
base_score = 0.50
)
mod = xgboost::xgb.train(
params = param,
data = dtrain,
nrounds = 1000,
watchlist = watchlist,
nthread = 10,
maximize = FALSE,
early_stopping_rounds = 100,
verbose = FALSE)
xgb_models = append(xgb_models, list(mod))
}
}
end_time = Sys.time()
print(str_c(
"Grid search completed.",
" Time elapsed: ",
(end_time - start_time),
"seconds",
sep=""
))
## [1] "Grid search completed. Time elapsed: 5.20081496238708seconds"
Creating a dataframe which nicely collates the results of all XGBoost models that we ran.
xgb_metrics <- data.frame(algo = character(),
tree_depth = integer(),
n_trees = integer(),
shrinkage = numeric(),
accuracy = numeric(),
precision = numeric(),
recall = numeric(),
f1_score = numeric(),
training_accuracy = numeric(),
training_precision = numeric(),
training_recall = numeric(),
training_f1_score = numeric(),
stringsAsFactors = FALSE)
# get the depth and shrinkage values in the same order the models were deployed
# (depth varies in the outer loop, shrinkage cycles in the inner loop)
d = rep(max_depth, each = 4)
s = rep(shrinkage, times = 7)
ConfusionMatrix <- function(prediction, truth) {
conf <- table(truth, prediction)
df <- data.frame(precision = conf[2,2]/sum(conf[,2]),
recall = conf[2,2]/sum(conf[2,]),
accuracy = (conf[1,1] + conf[2,2]) / length(prediction))
df$f1 <- 2/((1/df$precision) + (1/df$recall))
return(list(confusion_matrix=conf, metrics=df))
}
for(i in 1:28) {
mod = xgb_models[[i]]
preds = predict(mod, dvalidation, type="response")
validation_metrics = ConfusionMatrix(preds > 0.5, y_dev)$metrics
training_preds = predict(mod, dtrain, type="response")
training_metrics = ConfusionMatrix(training_preds > 0.5, y_train)$metrics
xgb_metrics <- xgb_metrics %>%
add_row(algo = "Gradient Boosting",
tree_depth = d[i],
n_trees = mod$niter,
shrinkage = s[i],
accuracy = validation_metrics$accuracy,
precision = validation_metrics$precision,
recall = validation_metrics$recall,
f1_score = validation_metrics$f1,
training_accuracy = training_metrics$accuracy,
training_precision = training_metrics$precision,
training_recall = training_metrics$recall,
training_f1_score = training_metrics$f1)
}
knitr::kable(
xgb_metrics %>%
select(algo, tree_depth, n_trees, shrinkage, accuracy, training_accuracy) %>%
sample_n(10))
algo | tree_depth | n_trees | shrinkage | accuracy | training_accuracy |
---|---|---|---|---|---|
Gradient Boosting | 2 | 231 | 1e-01 | 0.9000000 | 0.8651685 |
Gradient Boosting | 7 | 160 | 1e-01 | 0.9000000 | 0.9101124 |
Gradient Boosting | 7 | 106 | 1e-04 | 0.8555556 | 0.8776529 |
Gradient Boosting | 4 | 110 | 1e-03 | 0.8222222 | 0.8239700 |
Gradient Boosting | 6 | 105 | 1e-04 | 0.8555556 | 0.8476904 |
Gradient Boosting | 6 | 107 | 1e-03 | 0.8666667 | 0.8739076 |
Gradient Boosting | 2 | 107 | 1e-03 | 0.8222222 | 0.8501873 |
Gradient Boosting | 6 | 191 | 1e-01 | 0.9000000 | 0.8926342 |
Gradient Boosting | 4 | 103 | 1e-04 | 0.8666667 | 0.8714107 |
Gradient Boosting | 8 | 130 | 1e-03 | 0.8444444 | 0.8339576 |
Selecting the best XGBoost model
xgb_metrics %>%
ggplot(aes(x=tree_depth, y = accuracy)) +
geom_point(aes(colour = as.factor(shrinkage)), position= "jitter") +
scale_color_discrete("Shrinkage") +
ggtitle("Accuracy of each model that ran on XGBoost")
Surprisingly, XGBoost has hit 90% test accuracy. The simplest model that did it has tree_depth = 2 and shrinkage = 0.1. Let’s pull out the top-performing models.
# pull up the top 3 accuracy metrics
knitr::kable(
xgb_metrics %>%
top_n(3, accuracy) %>%
select(algo, tree_depth, n_trees, shrinkage, accuracy, training_accuracy))
algo | tree_depth | n_trees | shrinkage | accuracy | training_accuracy |
---|---|---|---|---|---|
Gradient Boosting | 2 | 231 | 0.1 | 0.9 | 0.8651685 |
Gradient Boosting | 6 | 191 | 0.1 | 0.9 | 0.8926342 |
Gradient Boosting | 7 | 160 | 0.1 | 0.9 | 0.9101124 |
There are actually three models that hit 90% test accuracy and, surprisingly, for two of them the training accuracy is lower than the test accuracy! The other contenders for the top spot hover around 89% test accuracy, so they are worth considering; but since most of them have a higher training accuracy, they are probably overfitting.
It is tempting to go with the model with highest accuracy but given that we have an unusual case here wherein test accuracy is higher than training accuracy, it’s worth looking a bit more closely at our model. However, since the purpose of this blog post is to show how we can run a Gradient Boosting model and AdaBoost model, I’m not going into that.
We will go ahead with our final XGBoost model.
xgb_final_metrics <-
xgb_metrics %>%
top_n(1, accuracy) %>%
filter(tree_depth==2)
xgb_final <- xgb_models[[1]] # the first model in the grid: tree_depth = 2, shrinkage = 0.1
Summary
knitr::kable(
data.frame(
Metrics = c('Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-Score'),
Baseline =scales::percent(c(baseline_accuracy, baseline_precision, baseline_recall, baseline_f1Score)),
AdaBoost =scales::percent(c(adab_final_metrics$accuracy, adab_final_metrics$precision, adab_final_metrics$recall, adab_final_metrics$f1_score)),
XGBoost =scales::percent(c(xgb_final_metrics$accuracy, xgb_final_metrics$precision, xgb_final_metrics$recall, xgb_final_metrics$f1_score))))
Metrics | Baseline | AdaBoost | XGBoost |
---|---|---|---|
Test Accuracy | 77.8% | 89% | 90.0% |
Test Precision | 69.2% | 83% | 88.9% |
Test Recall | 60.0% | 83% | 80.0% |
Test F1-Score | 64.3% | 83% | 84.2% |
As we can see, AdaBoost and XGBoost have both come pretty close to each other, with very similar F1-Scores. If we had a larger dataset we could have done a more comprehensive test, but for now we will have to settle for a tie.