mlr3 is a wonderful object-oriented machine learning package for R; I wanted to write a notebook that included a high-level tutorial for mlr3 pipelines. Section 3.1 contains a walkthrough of how to train a decision tree using mlr3. I cover defining "tasks" for train and test, resampling methods for cross-validation, a "learner" for the decision tree, and how to combine them all together. This provides a foundation for the mlr3 Graph Learner I apply in Section 3.2 to run a hyperparameter grid search over several models.
Code link: github.com/ZackBarry/gRowing/blob/master/league_of_legends_classification/eda_and_model.Rmd
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
library(DataExplorer)
library(ggpubr)
library(patchwork)
library(caret)
library(knitr)
library(DT)
library(summarytools)
library(purrr)
library(car) # for vif()
library(mlr3)
library(mlr3learners)
library(mlr3measures)
library(mlr3tuning)
library(mlr3pipelines)
library(paradox) # ParamSet specifications for mlr3 tuning
library(kableExtra)
library(ddpcr)
lgr::get_logger("mlr3")$set_threshold("warn") # suppress verbose output

dataset <- read_csv("Data/high_diamond_ranked_10min.csv")
League of Legends is one of the most popular online multiplayer games. Two teams of 5 players compete to battle their way to their opponents' base. From game to game, players can assume different characters and roles on their team.
The goal of this notebook is to predict the outcome of a game using data from the first 10 minutes. Games typically last 35-45 minutes, so it will be interesting to see how telling the first 10 minutes are. The dataset contains 19 different KPIs per team across nearly 10,000 games. As e-sports betting is a growing industry, we will use classification precision for model selection. Precision measures the percent of positive predictions that were true positives. By using precision as the target metric, we will pick the model that is most "confident" in predicting wins. To predict losses, specificity could be used instead.
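Since precision drives model selection throughout, here is a small illustrative calculation on made-up vectors (the `truth` and `response` names are just for this toy example):

```r
# Precision = TP / (TP + FP): of the games predicted as blue wins,
# what fraction did blue actually win? (toy vectors for illustration)
truth    <- factor(c(1, 1, 0, 0, 1, 0), levels = c("1", "0"))
response <- factor(c(1, 0, 1, 0, 1, 0), levels = c("1", "0"))
tp <- sum(response == "1" & truth == "1")  # true positives
fp <- sum(response == "1" & truth == "0")  # false positives
tp / (tp + fp)                             # 2 / 3 here
```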
We can see that the same variables are available for both the "red" and "blue" teams, except that `blueWins` records the outcome (there is no `redWins`).
## tibble [9,879 × 40] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gameId : num [1:9879] 4.52e+09 4.52e+09 4.52e+09 4.52e+09 4.44e+09 ...
## $ blueWins : num [1:9879] 0 0 0 0 0 1 1 0 0 1 ...
## $ blueWardsPlaced : num [1:9879] 28 12 15 43 75 18 18 16 16 13 ...
## $ blueWardsDestroyed : num [1:9879] 2 1 0 1 4 0 3 2 3 1 ...
## $ blueFirstBlood : num [1:9879] 1 0 0 0 0 0 1 0 0 1 ...
## $ blueKills : num [1:9879] 9 5 7 4 6 5 7 5 7 4 ...
## $ blueDeaths : num [1:9879] 6 5 11 5 6 3 6 13 7 5 ...
## $ blueAssists : num [1:9879] 11 5 4 5 6 6 7 3 8 5 ...
## $ blueEliteMonsters : num [1:9879] 0 0 1 1 0 1 1 0 0 1 ...
## $ blueDragons : num [1:9879] 0 0 1 0 0 1 1 0 0 1 ...
## $ blueHeralds : num [1:9879] 0 0 0 1 0 0 0 0 0 0 ...
## $ blueTowersDestroyed : num [1:9879] 0 0 0 0 0 0 0 0 0 0 ...
## $ blueTotalGold : num [1:9879] 17210 14712 16113 15157 16400 ...
## $ blueAvgLevel : num [1:9879] 6.6 6.6 6.4 7 7 7 6.8 6.4 7.2 6.8 ...
## $ blueTotalExperience : num [1:9879] 17039 16265 16221 17954 18543 ...
## $ blueTotalMinionsKilled : num [1:9879] 195 174 186 201 210 225 225 209 189 220 ...
## $ blueTotalJungleMinionsKilled: num [1:9879] 36 43 46 55 57 42 53 48 61 39 ...
## $ blueGoldDiff : num [1:9879] 643 -2908 -1172 -1321 -1004 ...
## $ blueExperienceDiff : num [1:9879] -8 -1173 -1033 -7 230 ...
## $ blueCSPerMin : num [1:9879] 19.5 17.4 18.6 20.1 21 22.5 22.5 20.9 18.9 22 ...
## $ blueGoldPerMin : num [1:9879] 1721 1471 1611 1516 1640 ...
## $ redWardsPlaced : num [1:9879] 15 12 15 15 17 36 57 15 15 16 ...
## $ redWardsDestroyed : num [1:9879] 6 1 3 2 2 5 1 0 2 2 ...
## $ redFirstBlood : num [1:9879] 0 1 1 1 1 1 0 1 1 0 ...
## $ redKills : num [1:9879] 6 5 11 5 6 3 6 13 7 5 ...
## $ redDeaths : num [1:9879] 9 5 7 4 6 5 7 5 7 4 ...
## $ redAssists : num [1:9879] 8 2 14 10 7 2 9 11 5 4 ...
## $ redEliteMonsters : num [1:9879] 0 2 0 0 1 0 0 1 2 0 ...
## $ redDragons : num [1:9879] 0 1 0 0 1 0 0 1 1 0 ...
## $ redHeralds : num [1:9879] 0 1 0 0 0 0 0 0 1 0 ...
## $ redTowersDestroyed : num [1:9879] 0 1 0 0 0 0 0 0 0 0 ...
## $ redTotalGold : num [1:9879] 16567 17620 17285 16478 17404 ...
## $ redAvgLevel : num [1:9879] 6.8 6.8 6.8 7 7 7 6.4 6.6 7.2 6.8 ...
## $ redTotalExperience : num [1:9879] 17047 17438 17254 17961 18313 ...
## $ redTotalMinionsKilled : num [1:9879] 197 240 203 235 225 221 164 157 240 247 ...
## $ redTotalJungleMinionsKilled : num [1:9879] 55 52 28 47 67 59 35 54 53 43 ...
## $ redGoldDiff : num [1:9879] -643 2908 1172 1321 1004 ...
## $ redExperienceDiff : num [1:9879] 8 1173 1033 7 -230 ...
## $ redCSPerMin : num [1:9879] 19.7 24 20.3 23.5 22.5 22.1 16.4 15.7 24 24.7 ...
## $ redGoldPerMin : num [1:9879] 1657 1762 1728 1648 1740 ...
## - attr(*, "spec")=
## .. cols(
## .. gameId = col_double(),
## .. blueWins = col_double(),
## .. blueWardsPlaced = col_double(),
## .. blueWardsDestroyed = col_double(),
## .. blueFirstBlood = col_double(),
## .. blueKills = col_double(),
## .. blueDeaths = col_double(),
## .. blueAssists = col_double(),
## .. blueEliteMonsters = col_double(),
## .. blueDragons = col_double(),
## .. blueHeralds = col_double(),
## .. blueTowersDestroyed = col_double(),
## .. blueTotalGold = col_double(),
## .. blueAvgLevel = col_double(),
## .. blueTotalExperience = col_double(),
## .. blueTotalMinionsKilled = col_double(),
## .. blueTotalJungleMinionsKilled = col_double(),
## .. blueGoldDiff = col_double(),
## .. blueExperienceDiff = col_double(),
## .. blueCSPerMin = col_double(),
## .. blueGoldPerMin = col_double(),
## .. redWardsPlaced = col_double(),
## .. redWardsDestroyed = col_double(),
## .. redFirstBlood = col_double(),
## .. redKills = col_double(),
## .. redDeaths = col_double(),
## .. redAssists = col_double(),
## .. redEliteMonsters = col_double(),
## .. redDragons = col_double(),
## .. redHeralds = col_double(),
## .. redTowersDestroyed = col_double(),
## .. redTotalGold = col_double(),
## .. redAvgLevel = col_double(),
## .. redTotalExperience = col_double(),
## .. redTotalMinionsKilled = col_double(),
## .. redTotalJungleMinionsKilled = col_double(),
## .. redGoldDiff = col_double(),
## .. redExperienceDiff = col_double(),
## .. redCSPerMin = col_double(),
## .. redGoldPerMin = col_double()
## .. )
No columns have missing data.
gameId | blueWins | blueWardsPlaced | blueWardsDestroyed | blueFirstBlood | blueKills | blueDeaths | blueAssists | blueEliteMonsters | blueDragons | blueHeralds | blueTowersDestroyed | blueTotalGold | blueAvgLevel | blueTotalExperience | blueTotalMinionsKilled | blueTotalJungleMinionsKilled | blueGoldDiff | blueExperienceDiff | blueCSPerMin | blueGoldPerMin | redWardsPlaced | redWardsDestroyed | redFirstBlood | redKills | redDeaths | redAssists | redEliteMonsters | redDragons | redHeralds | redTowersDestroyed | redTotalGold | redAvgLevel | redTotalExperience | redTotalMinionsKilled | redTotalJungleMinionsKilled | redGoldDiff | redExperienceDiff | redCSPerMin | redGoldPerMin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We’d like to be able to see distributions of the winning teams’ KPIs alongside the losing teams’ KPIs. Currently, the losing and winning team for each map occupy the same row. We modify the data set so that each row is one team’s performance in a given game.
Note that there are no values that we need to impute in this dataset.
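The reshaping itself happens in a code chunk not shown here; a minimal sketch of one way to do it with tidyverse verbs (the `blue_rows`, `red_rows`, and `team_level` names are introduced purely for illustration) could look like:

```r
# Sketch: stack blue-team and red-team KPI columns so each row is one
# team's performance in a game; Wins is 1 when that team won.
blue_rows <- dataset %>%
  select(gameId, starts_with("blue")) %>%
  rename_with(~ str_remove(., "^blue"))        # blueWins -> Wins, blueKills -> Kills, ...

red_rows <- dataset %>%
  select(gameId, blueWins, starts_with("red")) %>%
  rename_with(~ str_remove(., "^red")) %>%     # redKills -> Kills, ...
  mutate(Wins = 1 - blueWins) %>%              # red wins exactly when blue does not
  select(-blueWins)

team_level <- bind_rows(blue_rows, red_rows)   # one row per team per game
```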
Next, we look at a correlation heat map with all variables. There are too many variables to get a clear sense of what’s going on, so we’ll break it down in the next couple visualizations.
We consider which variables are highly correlated (cor > 0.5). We see that `TotalGold` and `TotalExperience` are the variables with the most highly correlated pairs. This makes sense because they are high-level metrics that are likely influenced by lower-level metrics such as `AvgLevel` and `GoldDiff`.

`GoldPerMin` is 100% correlated with `TotalGold`, and `CSPerMin` with `TotalMinionsKilled`. As such, we will drop `TotalGold` and `TotalMinionsKilled` when we prepare for modeling.
Var1 | Var2 | PearsonCorrelation |
---|---|---|
GoldPerMin | TotalGold | 1 |
CSPerMin | TotalMinionsKilled | 1 |
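For reference, a hedged sketch of how a table of highly correlated pairs like the one above could be produced (reusing the hypothetical `team_level` frame from the earlier sketch):

```r
# Pairwise Pearson correlations, keeping each distinct pair once and
# filtering to |cor| > 0.5.
team_level %>%
  select(where(is.numeric)) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column("Var1") %>%
  pivot_longer(-Var1, names_to = "Var2", values_to = "PearsonCorrelation") %>%
  filter(Var1 < Var2, abs(PearsonCorrelation) > 0.5) %>%
  arrange(desc(abs(PearsonCorrelation)))
```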
Next, we rank each variable by its correlation with `Wins`. `GoldDiff`, `ExperienceDiff`, `TotalGold`, and `GoldPerMin` top the list. The experience KPIs `TotalExperience` and `AvgLevel` are slightly less correlated, as are the player-vs-player aggression KPIs `Kills`, `Deaths`, and `Assists`. Monster, minion, and ward KPIs are the least correlated. It will be interesting to see if our Principal Component Analysis pulls these groups out.
comparedVar | WinsCorrelation |
---|---|
Wins | 1.0000000 |
GoldDiff | 0.5110980 |
ExperienceDiff | 0.4895157 |
TotalGold | 0.4142877 |
GoldPerMin | 0.4142877 |
TotalExperience | 0.3918552 |
AvgLevel | 0.3549599 |
Kills | 0.3382602 |
Deaths | -0.3382602 |
Assists | 0.2738702 |
EliteMonsters | 0.2217446 |
CSPerMin | 0.2185366 |
TotalMinionsKilled | 0.2185366 |
Dragons | 0.2114095 |
FirstBlood | 0.2017411 |
TotalJungleMinionsKilled | 0.1211291 |
TowersDestroyed | 0.1097367 |
Heralds | 0.0945192 |
WardsDestroyed | 0.0497151 |
WardsPlaced | 0.0120241 |
gameId | 0.0000000 |
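A similar sketch for the `Wins` correlation ranking above (again assuming the hypothetical `team_level` frame and a recent dplyr):

```r
# Correlation of every numeric column with the Wins indicator,
# sorted by absolute correlation.
team_level %>%
  summarise(across(where(is.numeric), ~ cor(.x, Wins))) %>%
  pivot_longer(everything(), names_to = "comparedVar", values_to = "WinsCorrelation") %>%
  arrange(desc(abs(WinsCorrelation)))
```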
Taking a look at the distributions of the top 5 variables most correlated with `Wins`, we see that each distribution is approximately normal. Additionally, a variable's distribution when `Wins` is 0 is generally shifted to the left of its distribution when `Wins` is 1. This is in line with our expectations - it makes sense that the losing team should have less gold and experience than the winning team.
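The density plots themselves are produced in the notebook; a sketch of one such win/loss comparison (GoldDiff shown here, with the other top variables following the same pattern) might look like:

```r
# Compare the GoldDiff distribution for losing vs. winning teams.
team_level %>%
  mutate(Wins = factor(Wins, levels = c(0, 1), labels = c("Loss", "Win"))) %>%
  ggplot(aes(x = GoldDiff, fill = Wins)) +
  geom_density(alpha = 0.4) +
  labs(title = "GoldDiff by game outcome", y = "density")
```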
Next, we seek to reduce the dimension of our predictor space by applying Principal Component Analysis (PCA). PCA uses linear combinations of possibly correlated input variables to form a new set of variables (principal components) that are uncorrelated with one another. The components are constructed sequentially, each explaining as much of the remaining variance in the dataset as possible. A variance threshold can be used for dimensionality reduction (e.g. keep only those components that explain more than 5% of the variance). Note: we center and scale (standardize) the data before applying PCA since variables with larger means or standard deviations would otherwise dominate the variance the components try to explain.
In the scree plot below, the magnitude of the eigenvalue indicates the amount of variation that each principal component captures. The proportion of variance for a given component is the component's eigenvalue divided by the sum of all eigenvalues. We see that the first and second components explain ~35% and ~13% of the variance respectively; later components are similar to one another in the amount of variance they explain. The bottom plot shows the cumulative variance explained by the first N components. One rule of thumb is to keep enough components so that this cumulative variance exceeds 80%. In this case, 7 components appear to be sufficient.
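A minimal sketch of the calculations behind these two plots, assuming the numeric predictors sit in a hypothetical data frame called `predictors`:

```r
# PCA on centered and scaled predictors; the squared sdev values are the eigenvalues.
pca_fit <- prcomp(predictors, center = TRUE, scale. = TRUE)
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)  # proportion of variance per component
round(var_explained[1:7], 3)                           # heights in the scree plot
cumsum(var_explained)[1:7]                             # cumulative variance; stop once past ~0.80
```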
Next, we consider a loading plot, which shows how strongly each input variable influences a principal component. The length of a vector along the PCX axis (i.e. the length of the vector projected onto the PCX axis) indicates how much weight that variable has on PCX. The angles between vectors tell us how the variables are correlated with one another. If two vectors are close, the variables they represent are positively correlated; if the angle is close to 90 degrees, they are likely uncorrelated; if the angle is close to 180 degrees, they are negatively correlated.
In this example, the variables that contribute most to `PC1` are `GoldDiff`, `ExperienceDiff`, `AvgLevel`, and `TotalExperience`. Not many variables contribute more to `PC2` than to `PC1`, but `Kills`, `Assists`, and `Deaths` contribute significantly. `PC1` appears to represent high-level team characteristics, while `PC2` represents the biggest KPIs within a game.
We see that `Kills` and `Assists` are strongly correlated with one another. `Deaths` is negatively correlated with `AvgLevel` and `TotalExperience`.
Finally, we visualize the contribution of each variable to each component with a heat map. To make it easier to identify contributing variables, those with loading values less than 0.2 are colored as white.
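Both the loading plot and the heat map are drawn from the rotation matrix of the PCA fit; a short sketch (reusing the hypothetical `pca_fit` from above):

```r
# Loadings: each variable's weight on each principal component.
loadings <- as.data.frame(pca_fit$rotation[, 1:2])
loadings[order(abs(loadings$PC1), decreasing = TRUE), ][1:5, ]  # strongest PC1 contributors
loadings[order(abs(loadings$PC2), decreasing = TRUE), ][1:5, ]  # strongest PC2 contributors
```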
We'll compare 3 classification models - random forests, support vector machines, and XGBoost - using `mlr3`. We'll also test the performance improvements granted by applying PCA to the blue and red teams' data. `mlr3pipelines` provides a concise way to tune parameters across these models, including whether or not PCA is used as a preprocessing step. A short introduction to pipelines is included in the "mlr3 Walkthrough" section below.
To determine a baseline for model performance, we first estimate the classification accuracy of a decision tree using cross-validation. This also gives us the opportunity to walk through the mlr3 modeling process in detail. In mlr3, applying cross-validation requires three objects: (1) a `learner` containing the model to be trained; (2) a `task` containing the data to be resampled from; (3) a `resampling` that defines the sampling method.
# mlr3 classification task needs target to be a factor
dataset <- mutate(dataset, blueWins = as.factor(blueWins))

set.seed(1)
train <- sample_frac(dataset, 0.9)
test <- anti_join(dataset, train)
A Task wraps a DataBackend, which provides a layer of abstraction for various data storage systems (e.g. DataFrames). mlr3 comes with a data.table implementation for backends; the conversion from DataFrame to data.table is done automatically. Instead of working with DataBackends directly, we work with them indirectly through the Task they are associated with. Tasks also store information about the role of individual columns of the DataBackend (e.g. target vs. feature).
task = TaskClassif$new("wins_classifier", backend = train, target = "blueWins", positive = "1")
task
## <TaskClassif:wins_classifier> (8891 x 40)
## * Target: blueWins
## * Properties: twoclass
## * Features (39):
## - dbl (39): blueAssists, blueAvgLevel, blueCSPerMin, blueDeaths,
## blueDragons, blueEliteMonsters, blueExperienceDiff,
## blueFirstBlood, blueGoldDiff, blueGoldPerMin, blueHeralds,
## blueKills, blueTotalExperience, blueTotalGold,
## blueTotalJungleMinionsKilled, blueTotalMinionsKilled,
## blueTowersDestroyed, blueWardsDestroyed, blueWardsPlaced,
## gameId, redAssists, redAvgLevel, redCSPerMin, redDeaths,
## redDragons, redEliteMonsters, redExperienceDiff,
## redFirstBlood, redGoldDiff, redGoldPerMin, redHeralds,
## redKills, redTotalExperience, redTotalGold,
## redTotalJungleMinionsKilled, redTotalMinionsKilled,
## redTowersDestroyed, redWardsDestroyed, redWardsPlaced
Deselect linearly dependent variables determined by EDA above - this is done in-place and “provides a different ‘view’ on the data without altering the data itself”.
task$select(cols = setdiff(task$feature_names, c("blueTotalGold", "blueTotalMinionsKilled", "redKills", "redTotalGold", "redTotalMinionsKilled")))
See the different resampling implementations available to choose from:
as.data.table(mlr_resamplings) %>%
unnest_wider(params, names_sep = "")
## # A tibble: 7 x 4
## key params...1 params...2 iters
## <chr> <chr> <chr> <int>
## 1 bootstrap repeats ratio 30
## 2 custom <NA> <NA> 0
## 3 cv folds <NA> 10
## 4 holdout ratio <NA> 1
## 5 insample <NA> <NA> 1
## 6 repeated_cv repeats folds 100
## 7 subsampling repeats ratio 30
The verbose way to do this is to retrieve the specific Resampling object, set the hyperparameters, and attach it to a task in 3 separate steps:
my_cv = mlr_resamplings$get("cv")
my_cv$param_set$values <- list(folds = 5)
my_cv$instantiate(task)
my_cv
## <ResamplingCV> with 5 iterations
## * Instantiated: TRUE
## * Parameters: folds=5
An easier way is to use `rsmp` to retrieve the object and set hyperparameters in one go:
my_cv <- rsmp("cv", folds = 5)$instantiate(task)
my_cv
## <ResamplingCV> with 5 iterations
## * Instantiated: TRUE
## * Parameters: folds=5
See the different built-in learners:
as.data.table(mlr_learners) %>%
filter(str_detect(key, "classif")) %>%
as.data.frame() %>%
select(key, predict_types) %>%
kable() %>%
kable_styling()
key | predict_types |
---|---|
classif.cv_glmnet | c(“response”, “prob”) |
classif.debug | c(“response”, “prob”) |
classif.featureless | c(“response”, “prob”) |
classif.glmnet | c(“response”, “prob”) |
classif.kknn | c(“response”, “prob”) |
classif.lda | c(“response”, “prob”) |
classif.log_reg | c(“response”, “prob”) |
classif.multinom | c(“response”, “prob”) |
classif.naive_bayes | c(“response”, “prob”) |
classif.qda | c(“response”, “prob”) |
classif.ranger | c(“response”, “prob”) |
classif.rpart | c(“response”, “prob”) |
classif.svm | c(“response”, “prob”) |
classif.xgboost | c(“response”, “prob”) |
As with resampling methods, there is a verbose way and a concise way to get a learner:
my_learner = mlr_learners$get("classif.rpart")
my_learner
## <LearnerClassifRpart:classif.rpart>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
## twoclass, weights
The concise way:
my_learner = lrn("classif.rpart")
my_learner
## <LearnerClassifRpart:classif.rpart>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
## twoclass, weights
A call to `resample` returns a `ResampleResult` object that can be used to access different models and metrics from the CV runs.
my_resample <- resample(task = task, learner = my_learner, resampling = my_cv, store_models = TRUE)
my_resample
## <ResampleResult> of 5 iterations
## * Task: wins_classifier
## * Learner: classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
We can aggregate model-specific metrics using `Measure` objects. Here is a list of the built-in measures:
as.data.table(mlr_measures) %>%
filter(task_type == "classif") %>%
as.data.frame() %>%
select(key, predict_type, task_properties) %>%
kable() %>%
kable_styling()
key | predict_type | task_properties |
---|---|---|
classif.acc | response | character(0) |
classif.auc | prob | twoclass |
classif.bacc | response | character(0) |
classif.bbrier | prob | twoclass |
classif.ce | response | character(0) |
classif.costs | response | character(0) |
classif.dor | response | twoclass |
classif.fbeta | response | twoclass |
classif.fdr | response | twoclass |
classif.fn | response | twoclass |
classif.fnr | response | twoclass |
classif.fomr | response | twoclass |
classif.fp | response | twoclass |
classif.fpr | response | twoclass |
classif.logloss | prob | character(0) |
classif.mbrier | prob | character(0) |
classif.mcc | response | twoclass |
classif.npv | response | twoclass |
classif.ppv | response | twoclass |
classif.precision | response | twoclass |
classif.recall | response | twoclass |
classif.sensitivity | response | twoclass |
classif.specificity | response | twoclass |
classif.tn | response | twoclass |
classif.tnr | response | twoclass |
classif.tp | response | twoclass |
classif.tpr | response | twoclass |
To calculate a given measure for each CV model, we first create a measure:
my_measure <- mlr_measures$get("classif.acc")
my_measure
## <MeasureClassifSimple:classif.acc>
## * Packages: mlr3measures
## * Range: [0, 1]
## * Minimize: FALSE
## * Properties: -
## * Predict type: response
Next, we pass the measure to the `score()` method attached to our resampling object:
my_resample$score(my_measure) %>%
  select(iteration, classif.acc)
## iteration classif.acc
## 1: 1 0.7048904
## 2: 2 0.7311586
## 3: 3 0.7322835
## 4: 4 0.7114736
## 5: 5 0.7272216
We can also do this more concisely using the `msr()` function:
my_resample$score(msr("classif.acc")) %>%
  select(iteration, classif.acc)
## iteration classif.acc
## 1: 1 0.7048904
## 2: 2 0.7311586
## 3: 3 0.7322835
## 4: 4 0.7114736
## 5: 5 0.7272216
Instead of using `score()`, we can use `aggregate()` to get a summary statistic across all models:
my_resample$aggregate(msr("classif.acc"))
## classif.acc
## 0.7214055
And we can also pass multiple statistics:
my_resample$aggregate(msrs(c("classif.acc", "classif.precision")))
## classif.acc classif.precision
## 0.7214055 0.7446703
The plan for this model is to apply PCA to the red and blue columns separately, join the results, and fit a model. We'll try several models: random forest, support vector machine, and XGBoost. This discussion in the mlr3 GitHub repo served as a great resource for creating the pipeline below.

Note: a custom function is needed to rejoin the split PCA results since there is no operator for renaming variables within a pipeline. This section of the mlr3 book provides examples of custom operators.
We set up two different tasks for training and testing, deselecting linearly dependent columns:
taskTrain = TaskClassif$new("train", backend = train, target = "blueWins", positive = "1")
taskTrain$select(cols = setdiff(taskTrain$feature_names, c("blueTotalGold", "blueTotalMinionsKilled", "redKills", "redTotalGold", "redTotalMinionsKilled")))

taskTest = TaskClassif$new("test", backend = test, target = "blueWins", positive = "1")
taskTest$select(cols = setdiff(taskTest$feature_names, c("blueTotalGold", "blueTotalMinionsKilled", "redKills", "redTotalGold", "redTotalMinionsKilled")))
A custom operator is created for renaming PCA results so that two different PCA routines can be joined.
PipeOpPrepend = R6::R6Class("PipeOpPrepend",
  inherit = mlr3pipelines::PipeOpTaskPreprocSimple,
  public = list(
    initialize = function(id = "prepend", param_vals = list()) {
      ps = ParamSet$new(params = list(ParamUty$new("prefix", default = "", tags = "prefix")))
      super$initialize(id, param_set = ps, param_vals = param_vals)
    },
    get_state = function(task) {
      old_names = task$feature_names
      new_names = paste0(self$param_set$get_values(tags = "prefix"), old_names)
      list(old_names = old_names, new_names = new_names)
    },
    transform = function(task) {
      task$rename(self$state$old_names, self$state$new_names)
    }
  )
)
Here's where the magic starts to happen. mlr3 pipelines can be expressed as graphs of PipeOperators. In the code below, we create two sequences of PipeOperators: the first applies PCA to the blue team's columns; the second applies PCA to the red team's columns. Finally, the features created by these two sequences are unioned together by the graph union operator `gunion`.
pca_blue <- po("select", id = "blue_cols", param_vals = list(selector = selector_grep("blue"))) %>>%
  po("pca", id = "blue_pca", param_vals = list(center = TRUE, scale. = TRUE)) %>>%
  PipeOpPrepend$new(id = "blue_pca_rename", param_vals = list(prefix = "blue_"))

pca_red <- po("select", id = "red_cols", param_vals = list(selector = selector_grep("red"))) %>>%
  po("pca", id = "red_pca", param_vals = list(center = TRUE, scale. = TRUE)) %>>%
  PipeOpPrepend$new(id = "red_pca_rename", param_vals = list(prefix = "red_"))

graph <- gunion(list(pca_blue, pca_red)) %>>%
  po("featureunion")

graph$keep_results <- TRUE

graph$plot(html = TRUE)
Similar to the parallel sequences of operators created above for the blue and red team PCAs, we create parallel operators for 3 different models. Since we are not stacking the resulting models, only one model will be fit in each run of the pipeline. This is different from the parallel PCA sequences, which will both be fit each time. This is why the `branch` and `unbranch` PipeOperators appear before and after the learners and why they didn't appear around the PCAs. A hyperparameter is used to choose which "branch" to follow for a given training of the graph.
rf_lrn <- mlr_pipeops$get("learner", learner = mlr_learners$get("classif.ranger"), id = "rf_lrn")
svm_lrn <- mlr_pipeops$get("learner", learner = mlr_learners$get("classif.svm"), id = "svm_lrn")
xgb_lrn <- mlr_pipeops$get("learner", learner = mlr_learners$get("classif.xgboost"), id = "xgb_lrn")

model_ids <- c("rf_lrn", "svm_lrn", "xgb_lrn")
models <- gunion(list(rf_lrn, svm_lrn, xgb_lrn))

many_graph <- graph %>>%
  mlr_pipeops$get("branch", options = model_ids, id = "model_branch") %>>%
  models %>>%
  mlr_pipeops$get("unbranch", options = model_ids, id = "model_unbranch")

many_graph$plot(html = TRUE)
This is a more concise way to represent the code above:
rf_lrn <- po("learner", lrn("classif.ranger"), id = "rf_lrn")
svm_lrn <- po("learner", lrn("classif.svm"), id = "svm_lrn")
xgb_lrn <- po("learner", lrn("classif.xgboost"), id = "xgb_lrn")

model_ids <- c("rf_lrn", "svm_lrn", "xgb_lrn")
models <- gunion(list(rf_lrn, svm_lrn, xgb_lrn))

many_graph <- graph %>>%
  po("branch", options = model_ids, id = "model_branch") %>>%
  models %>>%
  po("unbranch", options = model_ids, id = "model_unbranch")

many_graph$plot(html = TRUE)
Now we're ready to fit each model branch of the graph to get a baseline performance measure before hyperparameter tuning. This is done by creating a hyperparameter set whose only parameter is which branch to choose. 5-fold CV will be used to estimate the untuned accuracy and precision. We find that the Random Forest learner outperforms SVM and XGBoost with regard to both metrics.
many_learner = GraphLearner$new(many_graph)
many_learner$predict_type <- "prob"

ps <- ParamSet$new(list(
  ParamFct$new("model_branch.selection", levels = model_ids)
))

num_models <- length(model_ids)

cv5 <- rsmp("cv", folds = 5)$instantiate(taskTrain)

many_instance = TuningInstance$new(
  task = taskTrain,
  learner = many_learner,
  resampling = cv5,
  measures = msrs(c("classif.acc", "classif.precision")),
  param_set = ps,
  terminator = term("evals", n_evals = num_models)
)

# Verbose:
# tuner <- TunerGridSearch$new()
# tuner$param_set$values <- list(batch_size = num_models,
#                                resolution = num_models,
#                                param_resolutions = list(model_branch.selection = num_models))
tuner <- tnr(
  "grid_search",
  batch_size = num_models,
  resolution = num_models,
  param_resolutions = list(model_branch.selection = num_models)
)

quiet(tuner$tune(many_instance))

many_instance$archive(unnest = "params")[, c("model_branch.selection", "classif.acc", "classif.precision")]
## model_branch.selection classif.acc classif.precision
## 1: rf_lrn 0.7238768 0.7257330
## 2: svm_lrn 0.7113923 0.7126994
## 3: xgb_lrn 0.7023957 0.7062151
For each model, we'll tune the number of PCA components to keep by searching over different values of `red_pca.rank.` and `blue_pca.rank.`. We'll also look for an optimal regularization term for the SVM by searching over `svm_lrn.cost`. For XGBoost we'll search over the max depth of a tree, `xgb_lrn.max_depth`, and the learning rate, `xgb_lrn.eta`. Finally, for Random Forest we'll tune the max depth, `rf_lrn.max.depth`, and the number of variables available for splitting at each node, `rf_lrn.mtry`.
tune_ps = ParamSet$new(list(
  ParamInt$new("blue_pca.rank.", lower = 2, upper = 7),
  ParamInt$new("red_pca.rank.", lower = 2, upper = 7),
  ParamFct$new("model_branch.selection", levels = c("svm_lrn", "xgb_lrn", "rf_lrn")),
  ParamFct$new("svm_lrn.type", levels = "C-classification"),
  ParamDbl$new("svm_lrn.cost", lower = 0.001, upper = 1),
  ParamDbl$new("xgb_lrn.eta", lower = 0.01, upper = 0.4),
  ParamInt$new("xgb_lrn.max_depth", lower = 3, upper = 10),
  ParamInt$new("rf_lrn.max.depth", lower = 3, upper = 10),
  ParamInt$new("rf_lrn.mtry", lower = 2, upper = 4)
))
Since all three models are included in the same graph learner, we need to make sure that the SVM parameters are applied only when the SVM branch is selected, and likewise for XGBoost and Random Forest. This is done by adding dependencies to the parameters in the parameter set.
tune_ps$add_dep("svm_lrn.type", "model_branch.selection", CondEqual$new("svm_lrn"))
tune_ps$add_dep("svm_lrn.cost", "model_branch.selection", CondEqual$new("svm_lrn"))
tune_ps$add_dep("xgb_lrn.eta", "model_branch.selection", CondEqual$new("xgb_lrn"))
tune_ps$add_dep("xgb_lrn.max_depth", "model_branch.selection", CondEqual$new("xgb_lrn"))
tune_ps$add_dep("rf_lrn.max.depth", "model_branch.selection", CondEqual$new("rf_lrn"))
tune_ps$add_dep("rf_lrn.mtry", "model_branch.selection", CondEqual$new("rf_lrn"))
Finally, we run a random search over this parameter set for 1 hour and report the results in order of decreasing precision.
term_spec = term("model_time", secs = 3600) # 1 hour
tuner = tnr("random_search")

instance = TuningInstance$new(
  task = taskTrain,
  learner = many_learner,
  resampling = cv5,
  measures = msrs(c("classif.acc", "classif.precision")),
  param_set = tune_ps,
  terminator = term_spec
)

set.seed(42)
quiet(tuner$tune(instance))

instance$archive(unnest = "tune_x") %>%
  select(model_branch.selection, classif.acc, classif.precision,
         blue_pca.rank., red_pca.rank.,
         svm_lrn.cost, xgb_lrn.eta, xgb_lrn.max_depth,
         rf_lrn.mtry, rf_lrn.max.depth) %>%
  arrange(desc(classif.precision)) %>%
  kable() %>%
  kable_styling()
model_branch.selection | classif.acc | classif.precision | blue_pca.rank. | red_pca.rank. | svm_lrn.cost | xgb_lrn.eta | xgb_lrn.max_depth | rf_lrn.mtry | rf_lrn.max.depth |
---|---|---|---|---|---|---|---|---|---|
svm_lrn | 0.7216275 | 0.7397342 | 3 | 4 | 0. | NA | NA | NA | NA |
svm_lrn | 0.7207271 | 0.7388769 | 2 | 7 | 0.0126493 | NA | NA | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 6 | 5 | NA | 0.2828803 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 4 | 2 | NA | 0.1431047 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 4 | 2 | NA | 0.2209044 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 2 | 7 | NA | 0.0510133 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 2 | 7 | NA | 0.1817170 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 3 | 6 | NA | 0.3113689 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 6 | 6 | NA | 0.1716529 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 7 | 7 | NA | 0.3586599 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 6 | 6 | NA | 0.3135614 | 3 | NA | NA |
The model with the highest precision was the Support Vector Machine at 73.97%. For this model, we had cost parameter equal to 0.00888, 3 blue team principal components, and 4 red team principal components. The cost parameter C serves as a regularization parameter – a small value for C increases the number of training errors while encouraging a smoother decision boundary. The best XGBoost model scored 73.87%. The best performing Random Forest model was not in the top 10 highest precision results.
We will now evaluate the test error for the top performing SVM model from our grid search above. We find that the test precision is 70.15%.
final_learner = many_learner

final_learner$param_set$values$model_branch.selection <- 'svm_lrn'
final_learner$param_set$values$blue_pca.rank. <- 3
final_learner$param_set$values$red_pca.rank. <- 4
final_learner$param_set$values$svm_lrn.type <- 'C-classification'
final_learner$param_set$values$svm_lrn.cost <- 0.00888

final_learner$train(taskTrain)

prediction = final_learner$predict(taskTest)

prediction$score(msrs(c('classif.acc', 'classif.precision')))
## classif.acc classif.precision
## 0.7064777 0.7014614
We set out to predict the winning League of Legends team based on the first 10 minutes of game play. Principal Component Analysis was applied to each team's metrics in turn in order to obtain an uncorrelated set of predictors; 7 components were sufficient to explain > 80% of the variance in the data. We pitted Support Vector Machine, Random Forest, and XGBoost against one another using mlr3 Pipelines and found that a Support Vector Machine with cost parameter 0.00888, 3 blue team components, and 4 red team components provided the highest precision on the training set at 73.97%. The test precision using this model was 70.15%. Using such a small cost parameter guards against overfitting; this is reflected in the test precision being only about 5% lower than the cross-validated training precision.