mlr3 is a wonderful object-oriented machine learning package for R; I wanted to write a notebook that included a high-level tutorial for mlr3 pipelines. Section 3.1 contains a walkthrough of how to train a decision tree using mlr3. I cover defining "tasks" for train and test, resampling methods for cross-validation, a "learner" for the decision tree, and how to combine them all together. This provides a foundation for the mlr3 Graph Learner I apply in Section 3.2 to run a hyperparameter grid search over several models.
Code link: github.com/ZackBarry/gRowing/blob/master/league_of_legends_classification/eda_and_model.Rmd
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
library(DataExplorer)
library(ggpubr)
library(patchwork)
library(caret)
library(knitr)
library(DT)
library(summarytools)
library(purrr)
library(car) # for vif()
library(mlr3)
library(mlr3learners)
library(mlr3measures)
library(mlr3tuning)
library(mlr3pipelines)
library(paradox) # ParamSet specifications for mlr3 tuning
library(kableExtra)
library(ddpcr)
lgr::get_logger("mlr3")$set_threshold("warn") # suppress verbose output

dataset <- read_csv("Data/high_diamond_ranked_10min.csv")
League of Legends is one of the most popular online multiplayer games. Two teams of 5 players compete to battle their way to their opponents' base. From game to game, players can assume different characters and roles on their team.
The goal of this notebook is to predict the outcome of a game using data from the first 10 minutes. Games typically last 35-45 minutes, so it will be interesting to see how telling the first 10 minutes are. The dataset contains 19 different KPIs per team across nearly 10,000 games. As e-sports betting is a growing industry, we will use classification precision for model selection. Precision measures the percent of positive predictions that were true positives. By using precision as the target metric, we will pick the model that is most "confident" in predicting wins. To predict losses, specificity could be used instead.
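Since precision drives model selection throughout, here is a small illustrative calculation on made-up vectors (the `truth` and `response` names are just for this toy example):

```r
# Precision = TP / (TP + FP): of the games predicted as blue wins,
# what fraction did blue actually win? (toy vectors for illustration)
truth    <- factor(c(1, 1, 0, 0, 1, 0), levels = c("1", "0"))
response <- factor(c(1, 0, 1, 0, 1, 0), levels = c("1", "0"))
tp <- sum(response == "1" & truth == "1")  # true positives
fp <- sum(response == "1" & truth == "0")  # false positives
tp / (tp + fp)                             # 2 / 3 here
```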
We can see that the same variables are available for both the "red" and "blue" teams, except that `blueWins` records the outcome (there is no `redWins`).
## tibble [9,879 × 40] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gameId : num [1:9879] 4.52e+09 4.52e+09 4.52e+09 4.52e+09 4.44e+09 ...
## $ blueWins : num [1:9879] 0 0 0 0 0 1 1 0 0 1 ...
## $ blueWardsPlaced : num [1:9879] 28 12 15 43 75 18 18 16 16 13 ...
## $ blueWardsDestroyed : num [1:9879] 2 1 0 1 4 0 3 2 3 1 ...
## $ blueFirstBlood : num [1:9879] 1 0 0 0 0 0 1 0 0 1 ...
## $ blueKills : num [1:9879] 9 5 7 4 6 5 7 5 7 4 ...
## $ blueDeaths : num [1:9879] 6 5 11 5 6 3 6 13 7 5 ...
## $ blueAssists : num [1:9879] 11 5 4 5 6 6 7 3 8 5 ...
## $ blueEliteMonsters : num [1:9879] 0 0 1 1 0 1 1 0 0 1 ...
## $ blueDragons : num [1:9879] 0 0 1 0 0 1 1 0 0 1 ...
## $ blueHeralds : num [1:9879] 0 0 0 1 0 0 0 0 0 0 ...
## $ blueTowersDestroyed : num [1:9879] 0 0 0 0 0 0 0 0 0 0 ...
## $ blueTotalGold : num [1:9879] 17210 14712 16113 15157 16400 ...
## $ blueAvgLevel : num [1:9879] 6.6 6.6 6.4 7 7 7 6.8 6.4 7.2 6.8 ...
## $ blueTotalExperience : num [1:9879] 17039 16265 16221 17954 18543 ...
## $ blueTotalMinionsKilled : num [1:9879] 195 174 186 201 210 225 225 209 189 220 ...
## $ blueTotalJungleMinionsKilled: num [1:9879] 36 43 46 55 57 42 53 48 61 39 ...
## $ blueGoldDiff : num [1:9879] 643 -2908 -1172 -1321 -1004 ...
## $ blueExperienceDiff : num [1:9879] -8 -1173 -1033 -7 230 ...
## $ blueCSPerMin : num [1:9879] 19.5 17.4 18.6 20.1 21 22.5 22.5 20.9 18.9 22 ...
## $ blueGoldPerMin : num [1:9879] 1721 1471 1611 1516 1640 ...
## $ redWardsPlaced : num [1:9879] 15 12 15 15 17 36 57 15 15 16 ...
## $ redWardsDestroyed : num [1:9879] 6 1 3 2 2 5 1 0 2 2 ...
## $ redFirstBlood : num [1:9879] 0 1 1 1 1 1 0 1 1 0 ...
## $ redKills : num [1:9879] 6 5 11 5 6 3 6 13 7 5 ...
## $ redDeaths : num [1:9879] 9 5 7 4 6 5 7 5 7 4 ...
## $ redAssists : num [1:9879] 8 2 14 10 7 2 9 11 5 4 ...
## $ redEliteMonsters : num [1:9879] 0 2 0 0 1 0 0 1 2 0 ...
## $ redDragons : num [1:9879] 0 1 0 0 1 0 0 1 1 0 ...
## $ redHeralds : num [1:9879] 0 1 0 0 0 0 0 0 1 0 ...
## $ redTowersDestroyed : num [1:9879] 0 1 0 0 0 0 0 0 0 0 ...
## $ redTotalGold : num [1:9879] 16567 17620 17285 16478 17404 ...
## $ redAvgLevel : num [1:9879] 6.8 6.8 6.8 7 7 7 6.4 6.6 7.2 6.8 ...
## $ redTotalExperience : num [1:9879] 17047 17438 17254 17961 18313 ...
## $ redTotalMinionsKilled : num [1:9879] 197 240 203 235 225 221 164 157 240 247 ...
## $ redTotalJungleMinionsKilled : num [1:9879] 55 52 28 47 67 59 35 54 53 43 ...
## $ redGoldDiff : num [1:9879] -643 2908 1172 1321 1004 ...
## $ redExperienceDiff : num [1:9879] 8 1173 1033 7 -230 ...
## $ redCSPerMin : num [1:9879] 19.7 24 20.3 23.5 22.5 22.1 16.4 15.7 24 24.7 ...
## $ redGoldPerMin : num [1:9879] 1657 1762 1728 1648 1740 ...
## - attr(*, "spec")=
## .. cols(
## .. gameId = col_double(),
## .. blueWins = col_double(),
## .. blueWardsPlaced = col_double(),
## .. blueWardsDestroyed = col_double(),
## .. blueFirstBlood = col_double(),
## .. blueKills = col_double(),
## .. blueDeaths = col_double(),
## .. blueAssists = col_double(),
## .. blueEliteMonsters = col_double(),
## .. blueDragons = col_double(),
## .. blueHeralds = col_double(),
## .. blueTowersDestroyed = col_double(),
## .. blueTotalGold = col_double(),
## .. blueAvgLevel = col_double(),
## .. blueTotalExperience = col_double(),
## .. blueTotalMinionsKilled = col_double(),
## .. blueTotalJungleMinionsKilled = col_double(),
## .. blueGoldDiff = col_double(),
## .. blueExperienceDiff = col_double(),
## .. blueCSPerMin = col_double(),
## .. blueGoldPerMin = col_double(),
## .. redWardsPlaced = col_double(),
## .. redWardsDestroyed = col_double(),
## .. redFirstBlood = col_double(),
## .. redKills = col_double(),
## .. redDeaths = col_double(),
## .. redAssists = col_double(),
## .. redEliteMonsters = col_double(),
## .. redDragons = col_double(),
## .. redHeralds = col_double(),
## .. redTowersDestroyed = col_double(),
## .. redTotalGold = col_double(),
## .. redAvgLevel = col_double(),
## .. redTotalExperience = col_double(),
## .. redTotalMinionsKilled = col_double(),
## .. redTotalJungleMinionsKilled = col_double(),
## .. redGoldDiff = col_double(),
## .. redExperienceDiff = col_double(),
## .. redCSPerMin = col_double(),
## .. redGoldPerMin = col_double()
## .. )
No columns have missing data.
gameId | blueWins | blueWardsPlaced | blueWardsDestroyed | blueFirstBlood | blueKills | blueDeaths | blueAssists | blueEliteMonsters | blueDragons | blueHeralds | blueTowersDestroyed | blueTotalGold | blueAvgLevel | blueTotalExperience | blueTotalMinionsKilled | blueTotalJungleMinionsKilled | blueGoldDiff | blueExperienceDiff | blueCSPerMin | blueGoldPerMin | redWardsPlaced | redWardsDestroyed | redFirstBlood | redKills | redDeaths | redAssists | redEliteMonsters | redDragons | redHeralds | redTowersDestroyed | redTotalGold | redAvgLevel | redTotalExperience | redTotalMinionsKilled | redTotalJungleMinionsKilled | redGoldDiff | redExperienceDiff | redCSPerMin | redGoldPerMin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We’d like to be able to see distributions of the winning teams’ KPIs alongside the losing teams’ KPIs. Currently, the losing and winning team for each map occupy the same row. We modify the data set so that each row is one team’s performance in a given game.
Note that there are no values that we need to impute in this dataset.
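The reshaping itself happens in a code chunk not shown here; a minimal sketch of one way to do it with tidyverse verbs (the `blue_rows`, `red_rows`, and `team_level` names are introduced purely for illustration) could look like:

```r
# Sketch: stack blue-team and red-team KPI columns so each row is one
# team's performance in a game; Wins is 1 when that team won.
blue_rows <- dataset %>%
  select(gameId, starts_with("blue")) %>%
  rename_with(~ str_remove(., "^blue"))        # blueWins -> Wins, blueKills -> Kills, ...

red_rows <- dataset %>%
  select(gameId, blueWins, starts_with("red")) %>%
  rename_with(~ str_remove(., "^red")) %>%     # redKills -> Kills, ...
  mutate(Wins = 1 - blueWins) %>%              # red wins exactly when blue does not
  select(-blueWins)

team_level <- bind_rows(blue_rows, red_rows)   # one row per team per game
```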
Next, we look at a correlation heat map with all variables. There are too many variables to get a clear sense of what’s going on, so we’ll break it down in the next couple visualizations.
We consider which variables are highly correlated (cor > 0.5). We see that `TotalGold` and `TotalExperience` are the variables with the most highly correlated pairs. This makes sense because they are high-level metrics that are likely influenced by lower-level metrics such as `AvgLevel` and `GoldDiff`.

`GoldPerMin` is 100% correlated with `TotalGold`, and `CSPerMin` with `TotalMinionsKilled`. As such, we will drop `TotalGold` and `TotalMinionsKilled` when we prepare for modeling.
Var1 | Var2 | PearsonCorrelation |
---|---|---|
GoldPerMin | TotalGold | 1 |
CSPerMin | TotalMinionsKilled | 1 |
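For reference, a hedged sketch of how a table of highly correlated pairs like the one above could be produced (reusing the hypothetical `team_level` frame from the earlier sketch):

```r
# Pairwise Pearson correlations, keeping each distinct pair once and
# filtering to |cor| > 0.5.
team_level %>%
  select(where(is.numeric)) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column("Var1") %>%
  pivot_longer(-Var1, names_to = "Var2", values_to = "PearsonCorrelation") %>%
  filter(Var1 < Var2, abs(PearsonCorrelation) > 0.5) %>%
  arrange(desc(abs(PearsonCorrelation)))
```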
Next, we rank each variable by its correlation with `Wins`. `GoldDiff`, `ExperienceDiff`, `TotalGold`, and `GoldPerMin` top the list. The experience KPIs `TotalExperience` and `AvgLevel` are slightly less correlated, as are the player-vs-player aggression KPIs `Kills`, `Deaths`, and `Assists`. Monster, minion, and ward KPIs are the least correlated. It will be interesting to see if our Principal Component Analysis pulls these groups out.
comparedVar | WinsCorrelation |
---|---|
Wins | 1.0000000 |
GoldDiff | 0.5110980 |
ExperienceDiff | 0.4895157 |
TotalGold | 0.4142877 |
GoldPerMin | 0.4142877 |
TotalExperience | 0.3918552 |
AvgLevel | 0.3549599 |
Kills | 0.3382602 |
Deaths | -0.3382602 |
Assists | 0.2738702 |
EliteMonsters | 0.2217446 |
CSPerMin | 0.2185366 |
TotalMinionsKilled | 0.2185366 |
Dragons | 0.2114095 |
FirstBlood | 0.2017411 |
TotalJungleMinionsKilled | 0.1211291 |
TowersDestroyed | 0.1097367 |
Heralds | 0.0945192 |
WardsDestroyed | 0.0497151 |
WardsPlaced | 0.0120241 |
gameId | 0.0000000 |
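A similar sketch for the `Wins` correlation ranking above (again assuming the hypothetical `team_level` frame and a recent dplyr):

```r
# Correlation of every numeric column with the Wins indicator,
# sorted by absolute correlation.
team_level %>%
  summarise(across(where(is.numeric), ~ cor(.x, Wins))) %>%
  pivot_longer(everything(), names_to = "comparedVar", values_to = "WinsCorrelation") %>%
  arrange(desc(abs(WinsCorrelation)))
```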
Taking a look at the distributions of the top 5 variables most correlated with `Wins`, we see that each distribution is approximately normal. Additionally, a variable's distribution when `Wins` is 0 is generally shifted to the left of its distribution when `Wins` is 1. This is in line with our expectations - it makes sense that the losing team should have less gold and experience than the winning team.
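The density plots themselves are produced in the notebook; a sketch of one such win/loss comparison (GoldDiff shown here, with the other top variables following the same pattern) might look like:

```r
# Compare the GoldDiff distribution for losing vs. winning teams.
team_level %>%
  mutate(Wins = factor(Wins, levels = c(0, 1), labels = c("Loss", "Win"))) %>%
  ggplot(aes(x = GoldDiff, fill = Wins)) +
  geom_density(alpha = 0.4) +
  labs(title = "GoldDiff by game outcome", y = "density")
```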
Next, we seek to reduce the dimension of our predictor space by applying Principal Component Analysis (PCA). PCA uses linear combinations of possibly correlated input variables to form a new set of variables (principal components) that are uncorrelated with one another. The components are constructed sequentially, each explaining as much of the remaining variance in the dataset as possible. A variance threshold can be used for dimensionality reduction (e.g. keep only those components that explain more than 5% of the variance). Note: we center and scale (standardize) the data before applying PCA since variables with larger means or standard deviations would otherwise dominate the variance the components try to explain.
In the scree plot below, the magnitude of the eigenvalue indicates the amount of variation that each principal component captures. The proportion of variance for a given component is the component's eigenvalue divided by the sum of all eigenvalues. We see that the first and second components explain ~35% and ~13% of the variance respectively; later components are similar to one another in the amount of variance they explain. The bottom plot shows the cumulative variance explained by the first N components. One rule of thumb is to keep enough components so that this cumulative variance exceeds 80%. In this case, 7 components appear to be sufficient.
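A minimal sketch of the calculations behind these two plots, assuming the numeric predictors sit in a hypothetical data frame called `predictors`:

```r
# PCA on centered and scaled predictors; the squared sdev values are the eigenvalues.
pca_fit <- prcomp(predictors, center = TRUE, scale. = TRUE)
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)  # proportion of variance per component
round(var_explained[1:7], 3)                           # heights in the scree plot
cumsum(var_explained)[1:7]                             # cumulative variance; stop once past ~0.80
```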
Next, we consider a loading plot, which shows how strongly each input variable influences a principal component. The length of a vector along the PCX axis (i.e. the length of the vector projected onto the PCX axis) indicates how much weight that variable has on PCX. The angles between vectors tell us how the variables are correlated with one another. If two vectors are close, the variables they represent are positively correlated; if the angle is close to 90 degrees, they are likely uncorrelated; if the angle is close to 180 degrees, they are negatively correlated.
In this example, the variables that contribute most to `PC1` are `GoldDiff`, `ExperienceDiff`, `AvgLevel`, and `TotalExperience`. Not many variables contribute more to `PC2` than to `PC1`, but `Kills`, `Assists`, and `Deaths` contribute significantly. `PC1` appears to represent high-level team characteristics, while `PC2` represents the biggest KPIs within a game.
We see that `Kills` and `Assists` are strongly correlated with one another. `Deaths` is negatively correlated with `AvgLevel` and `TotalExperience`.
Finally, we visualize the contribution of each variable to each component with a heat map. To make it easier to identify contributing variables, those with loading values less than 0.2 are colored as white.
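Both the loading plot and the heat map are drawn from the rotation matrix of the PCA fit; a short sketch (reusing the hypothetical `pca_fit` from above):

```r
# Loadings: each variable's weight on each principal component.
loadings <- as.data.frame(pca_fit$rotation[, 1:2])
loadings[order(abs(loadings$PC1), decreasing = TRUE), ][1:5, ]  # strongest PC1 contributors
loadings[order(abs(loadings$PC2), decreasing = TRUE), ][1:5, ]  # strongest PC2 contributors
```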
We'll compare 3 classification models - random forests, support vector machines, and XGBoost - using `mlr3`. We'll also test the performance improvements granted by applying PCA to the blue and red teams' data. `mlr3pipelines` provides a concise way to tune parameters across these models, including whether or not PCA is used as a preprocessing step. A short introduction to pipelines is included in the "mlr3 Walkthrough" section below.
To determine a baseline for model performance, we first estimate the classification accuracy of a decision tree using cross-validation. This also gives us the opportunity to walk through the mlr3 modeling process in detail. In mlr3, applying cross-validation requires three objects: (1) a `learner` containing the model to be trained; (2) a `task` containing the data to be resampled from; (3) a `resampling` that defines the sampling method.
# mlr3 classification task needs target to be a factor
dataset <- mutate(dataset, blueWins = as.factor(blueWins))

set.seed(1)
train <- sample_frac(dataset, 0.9)
test <- anti_join(dataset, train)
A Task wraps a DataBackend, which provides a layer of abstraction for various data storage systems (e.g. DataFrames). mlr3 comes with a data.table implementation for backends; the conversion from DataFrame to data.table is done automatically. Instead of working with DataBackends directly, we work with them indirectly through the Task they are associated with. Tasks also store information about the role of individual columns of the DataBackend (e.g. target vs. feature).
task = TaskClassif$new("wins_classifier", backend = train, target = "blueWins", positive = "1")
task
## <TaskClassif:wins_classifier> (8891 x 40)
## * Target: blueWins
## * Properties: twoclass
## * Features (39):
## - dbl (39): blueAssists, blueAvgLevel, blueCSPerMin, blueDeaths,
## blueDragons, blueEliteMonsters, blueExperienceDiff,
## blueFirstBlood, blueGoldDiff, blueGoldPerMin, blueHeralds,
## blueKills, blueTotalExperience, blueTotalGold,
## blueTotalJungleMinionsKilled, blueTotalMinionsKilled,
## blueTowersDestroyed, blueWardsDestroyed, blueWardsPlaced,
## gameId, redAssists, redAvgLevel, redCSPerMin, redDeaths,
## redDragons, redEliteMonsters, redExperienceDiff,
## redFirstBlood, redGoldDiff, redGoldPerMin, redHeralds,
## redKills, redTotalExperience, redTotalGold,
## redTotalJungleMinionsKilled, redTotalMinionsKilled,
## redTowersDestroyed, redWardsDestroyed, redWardsPlaced
Deselect linearly dependent variables determined by EDA above - this is done in-place and “provides a different ‘view’ on the data without altering the data itself”.
task$select(cols = setdiff(task$feature_names, c("blueTotalGold", "blueTotalMinionsKilled", "redKills", "redTotalGold", "redTotalMinionsKilled")))
See the different resampling implementations available to choose from:
as.data.table(mlr_resamplings) %>%
unnest_wider(params, names_sep = "")
## # A tibble: 7 x 4
## key params...1 params...2 iters
## <chr> <chr> <chr> <int>
## 1 bootstrap repeats ratio 30
## 2 custom <NA> <NA> 0
## 3 cv folds <NA> 10
## 4 holdout ratio <NA> 1
## 5 insample <NA> <NA> 1
## 6 repeated_cv repeats folds 100
## 7 subsampling repeats ratio 30
The verbose way to do this is to retrieve the specific Resampling object, set the hyperparameters, and attach it to a task in 3 separate steps:
my_cv = mlr_resamplings$get("cv")
my_cv$param_set$values <- list(folds = 5)
my_cv$instantiate(task)
my_cv
## <ResamplingCV> with 5 iterations
## * Instantiated: TRUE
## * Parameters: folds=5
An easier way is to use `rsmp` to retrieve the object and set hyperparameters in one go:
my_cv <- rsmp("cv", folds = 5)$instantiate(task)
my_cv
## <ResamplingCV> with 5 iterations
## * Instantiated: TRUE
## * Parameters: folds=5
See the different built-in learners:
as.data.table(mlr_learners) %>%
filter(str_detect(key, "classif")) %>%
as.data.frame() %>%
select(key, predict_types) %>%
kable() %>%
kable_styling()
key | predict_types |
---|---|
classif.cv_glmnet | c(“response”, “prob”) |
classif.debug | c(“response”, “prob”) |
classif.featureless | c(“response”, “prob”) |
classif.glmnet | c(“response”, “prob”) |
classif.kknn | c(“response”, “prob”) |
classif.lda | c(“response”, “prob”) |
classif.log_reg | c(“response”, “prob”) |
classif.multinom | c(“response”, “prob”) |
classif.naive_bayes | c(“response”, “prob”) |
classif.qda | c(“response”, “prob”) |
classif.ranger | c(“response”, “prob”) |
classif.rpart | c(“response”, “prob”) |
classif.svm | c(“response”, “prob”) |
classif.xgboost | c(“response”, “prob”) |
As with resampling methods, there is a verbose way and a concise way to get a learner:
my_learner = mlr_learners$get("classif.rpart")
my_learner
## <LearnerClassifRpart:classif.rpart>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
## twoclass, weights
The concise way:
my_learner = lrn("classif.rpart")
my_learner
## <LearnerClassifRpart:classif.rpart>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
## twoclass, weights
A call to `resample` returns a `ResampleResult` object that can be used to access different models and metrics from the CV runs.
my_resample <- resample(task = task, learner = my_learner, resampling = my_cv, store_models = TRUE)
my_resample
## <ResampleResult> of 5 iterations
## * Task: wins_classifier
## * Learner: classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
We can aggregate model-specific metrics using `Measure` objects. Here is a list of the built-in measures:
as.data.table(mlr_measures) %>%
filter(task_type == "classif") %>%
as.data.frame() %>%
select(key, predict_type, task_properties) %>%
kable() %>%
kable_styling()
key | predict_type | task_properties |
---|---|---|
classif.acc | response | character(0) |
classif.auc | prob | twoclass |
classif.bacc | response | character(0) |
classif.bbrier | prob | twoclass |
classif.ce | response | character(0) |
classif.costs | response | character(0) |
classif.dor | response | twoclass |
classif.fbeta | response | twoclass |
classif.fdr | response | twoclass |
classif.fn | response | twoclass |
classif.fnr | response | twoclass |
classif.fomr | response | twoclass |
classif.fp | response | twoclass |
classif.fpr | response | twoclass |
classif.logloss | prob | character(0) |
classif.mbrier | prob | character(0) |
classif.mcc | response | twoclass |
classif.npv | response | twoclass |
classif.ppv | response | twoclass |
classif.precision | response | twoclass |
classif.recall | response | twoclass |
classif.sensitivity | response | twoclass |
classif.specificity | response | twoclass |
classif.tn | response | twoclass |
classif.tnr | response | twoclass |
classif.tp | response | twoclass |
classif.tpr | response | twoclass |
To calculate a given measure for each CV model, we first create a measure:
my_measure <- mlr_measures$get("classif.acc")
my_measure
## <MeasureClassifSimple:classif.acc>
## * Packages: mlr3measures
## * Range: [0, 1]
## * Minimize: FALSE
## * Properties: -
## * Predict type: response
Next, we pass the measure to the `score()` method attached to our resampling object:
my_resample$score(my_measure) %>%
  select(iteration, classif.acc)
## iteration classif.acc
## 1: 1 0.7048904
## 2: 2 0.7311586
## 3: 3 0.7322835
## 4: 4 0.7114736
## 5: 5 0.7272216
We can also do this more concisely using the `msr()` function:
my_resample$score(msr("classif.acc")) %>%
  select(iteration, classif.acc)
## iteration classif.acc
## 1: 1 0.7048904
## 2: 2 0.7311586
## 3: 3 0.7322835
## 4: 4 0.7114736
## 5: 5 0.7272216
Instead of using `score()`, we can use `aggregate()` to get a summary statistic across all models:
my_resample$aggregate(msr("classif.acc"))
## classif.acc
## 0.7214055
And we can also pass multiple statistics:
my_resample$aggregate(msrs(c("classif.acc", "classif.precision")))
## classif.acc classif.precision
## 0.7214055 0.7446703
The plan for this model is to apply PCA to the red and blue columns separately, join the results, and fit a model. We'll try several models: random forest, support vector machine, and XGBoost. This discussion in the mlr3 GitHub repo served as a great resource for creating the pipeline below.

Note: a custom function is needed to rejoin the split PCA results since there is no operator for renaming variables within a pipeline. This section of the mlr3 book provides examples of custom operators.
We set up two different tasks for training and testing, deselecting linearly dependent columns:
taskTrain = TaskClassif$new("train", backend = train, target = "blueWins", positive = "1")
taskTrain$select(cols = setdiff(taskTrain$feature_names, c("blueTotalGold", "blueTotalMinionsKilled", "redKills", "redTotalGold", "redTotalMinionsKilled")))

taskTest = TaskClassif$new("test", backend = test, target = "blueWins", positive = "1")
taskTest$select(cols = setdiff(taskTest$feature_names, c("blueTotalGold", "blueTotalMinionsKilled", "redKills", "redTotalGold", "redTotalMinionsKilled")))
A custom operator is created for renaming PCA results so that two different PCA routines can be joined.
PipeOpPrepend = R6::R6Class("PipeOpPrepend",
  inherit = mlr3pipelines::PipeOpTaskPreprocSimple,
  public = list(
    initialize = function(id = "prepend", param_vals = list()) {
      ps = ParamSet$new(params = list(ParamUty$new("prefix", default = "", tags = "prefix")))
      super$initialize(id, param_set = ps, param_vals = param_vals)
    },
    get_state = function(task) {
      old_names = task$feature_names
      new_names = paste0(self$param_set$get_values(tags = "prefix"), old_names)
      list(old_names = old_names, new_names = new_names)
    },
    transform = function(task) {
      task$rename(self$state$old_names, self$state$new_names)
    }
  )
)
Here's where the magic starts to happen. mlr3 pipelines can be expressed as graphs of PipeOperators. In the code below, we create two sequences of PipeOperators: the first applies PCA to the blue team's columns; the second applies PCA to the red team's columns. Finally, the features created by these two sequences are unioned together by the graph union operator `gunion`.
pca_blue <- po("select", id = "blue_cols", param_vals = list(selector = selector_grep("blue"))) %>>%
  po("pca", id = "blue_pca", param_vals = list(center = TRUE, scale. = TRUE)) %>>%
  PipeOpPrepend$new(id = "blue_pca_rename", param_vals = list(prefix = "blue_"))

pca_red <- po("select", id = "red_cols", param_vals = list(selector = selector_grep("red"))) %>>%
  po("pca", id = "red_pca", param_vals = list(center = TRUE, scale. = TRUE)) %>>%
  PipeOpPrepend$new(id = "red_pca_rename", param_vals = list(prefix = "red_"))

graph <- gunion(list(pca_blue, pca_red)) %>>%
  po("featureunion")

graph$keep_results <- TRUE

graph$plot(html = TRUE)
Similar to the parallel sequences of operators created above for the blue and red team PCAs, we create parallel operators for 3 different models. Since we are not stacking the resulting models, only one model will be fit in each run of the pipeline. This is different from the parallel PCA sequences, which will both be fit each time. This is why the `branch` and `unbranch` PipeOperators appear before and after the learners and why they didn't appear around the PCAs. A hyperparameter is used to choose which "branch" to follow for a given training of the graph.
rf_lrn <- mlr_pipeops$get("learner", learner = mlr_learners$get("classif.ranger"), id = "rf_lrn")
svm_lrn <- mlr_pipeops$get("learner", learner = mlr_learners$get("classif.svm"), id = "svm_lrn")
xgb_lrn <- mlr_pipeops$get("learner", learner = mlr_learners$get("classif.xgboost"), id = "xgb_lrn")

model_ids <- c("rf_lrn", "svm_lrn", "xgb_lrn")
models <- gunion(list(rf_lrn, svm_lrn, xgb_lrn))

many_graph <- graph %>>%
  mlr_pipeops$get("branch", options = model_ids, id = "model_branch") %>>%
  models %>>%
  mlr_pipeops$get("unbranch", options = model_ids, id = "model_unbranch")

many_graph$plot(html = TRUE)
This is a more concise way to represent the code above:
rf_lrn <- po("learner", lrn("classif.ranger"), id = "rf_lrn")
svm_lrn <- po("learner", lrn("classif.svm"), id = "svm_lrn")
xgb_lrn <- po("learner", lrn("classif.xgboost"), id = "xgb_lrn")

model_ids <- c("rf_lrn", "svm_lrn", "xgb_lrn")
models <- gunion(list(rf_lrn, svm_lrn, xgb_lrn))

many_graph <- graph %>>%
  po("branch", options = model_ids, id = "model_branch") %>>%
  models %>>%
  po("unbranch", options = model_ids, id = "model_unbranch")

many_graph$plot(html = TRUE)
Now we're ready to fit each model branch of the graph to get a baseline performance measure before hyperparameter tuning. This is done by creating a hyperparameter set whose only parameter is which branch to choose. 5-fold CV will be used to estimate the untuned accuracy and precision. We find that the Random Forest learner outperforms SVM and XGBoost with regard to both metrics.
many_learner = GraphLearner$new(many_graph)
many_learner$predict_type <- "prob"

ps <- ParamSet$new(list(
  ParamFct$new("model_branch.selection", levels = model_ids)
))

num_models <- length(model_ids)

cv5 <- rsmp("cv", folds = 5)$instantiate(taskTrain)

many_instance = TuningInstance$new(
  task = taskTrain,
  learner = many_learner,
  resampling = cv5,
  measures = msrs(c("classif.acc", "classif.precision")),
  param_set = ps,
  terminator = term("evals", n_evals = num_models)
)

# Verbose:
# tuner <- TunerGridSearch$new()
# tuner$param_set$values <- list(batch_size = num_models,
#                                resolution = num_models,
#                                param_resolutions = list(model_branch.selection = num_models))
tuner <- tnr(
  "grid_search",
  batch_size = num_models,
  resolution = num_models,
  param_resolutions = list(model_branch.selection = num_models)
)

quiet(tuner$tune(many_instance))

many_instance$archive(unnest = "params")[, c("model_branch.selection", "classif.acc", "classif.precision")]
## model_branch.selection classif.acc classif.precision
## 1: rf_lrn 0.7238768 0.7257330
## 2: svm_lrn 0.7113923 0.7126994
## 3: xgb_lrn 0.7023957 0.7062151
For each model, we'll tune the number of PCA components to keep by searching over different values of `red_pca.rank.` and `blue_pca.rank.`. We'll also look for an optimal regularization term for the SVM by searching over `svm_lrn.cost`. For XGBoost we'll search over the max depth of a tree, `xgb_lrn.max_depth`, and the learning rate, `xgb_lrn.eta`. Finally, for Random Forest we'll tune the max depth, `rf_lrn.max.depth`, and the number of variables available for splitting at each node, `rf_lrn.mtry`.
tune_ps = ParamSet$new(list(
  ParamInt$new("blue_pca.rank.", lower = 2, upper = 7),
  ParamInt$new("red_pca.rank.", lower = 2, upper = 7),
  ParamFct$new("model_branch.selection", levels = c("svm_lrn", "xgb_lrn", "rf_lrn")),
  ParamFct$new("svm_lrn.type", levels = "C-classification"),
  ParamDbl$new("svm_lrn.cost", lower = 0.001, upper = 1),
  ParamDbl$new("xgb_lrn.eta", lower = 0.01, upper = 0.4),
  ParamInt$new("xgb_lrn.max_depth", lower = 3, upper = 10),
  ParamInt$new("rf_lrn.max.depth", lower = 3, upper = 10),
  ParamInt$new("rf_lrn.mtry", lower = 2, upper = 4)
))
Since all three models are included in the same graph learner, we need to make sure that the SVM parameters are applied only when the SVM branch is selected, and likewise for XGBoost and Random Forest. This is done by adding dependencies to the parameters in the parameter set.
tune_ps$add_dep("svm_lrn.type", "model_branch.selection", CondEqual$new("svm_lrn"))
tune_ps$add_dep("svm_lrn.cost", "model_branch.selection", CondEqual$new("svm_lrn"))
tune_ps$add_dep("xgb_lrn.eta", "model_branch.selection", CondEqual$new("xgb_lrn"))
tune_ps$add_dep("xgb_lrn.max_depth", "model_branch.selection", CondEqual$new("xgb_lrn"))
tune_ps$add_dep("rf_lrn.max.depth", "model_branch.selection", CondEqual$new("rf_lrn"))
tune_ps$add_dep("rf_lrn.mtry", "model_branch.selection", CondEqual$new("rf_lrn"))
Finally, we run a random search over this parameter set for 1 hour and report the results in order of decreasing precision.
term_spec = term("model_time", secs = 3600) # 1 hour
tuner = tnr("random_search")

instance = TuningInstance$new(
  task = taskTrain,
  learner = many_learner,
  resampling = cv5,
  measures = msrs(c("classif.acc", "classif.precision")),
  param_set = tune_ps,
  terminator = term_spec
)

set.seed(42)
quiet(tuner$tune(instance))

instance$archive(unnest = "tune_x") %>%
  select(model_branch.selection, classif.acc, classif.precision,
         blue_pca.rank., red_pca.rank.,
         svm_lrn.cost, xgb_lrn.eta, xgb_lrn.max_depth,
         rf_lrn.mtry, rf_lrn.max.depth) %>%
  arrange(desc(classif.precision)) %>%
  kable() %>%
  kable_styling()
model_branch.selection | classif.acc | classif.precision | blue_pca.rank. | red_pca.rank. | svm_lrn.cost | xgb_lrn.eta | xgb_lrn.max_depth | rf_lrn.mtry | rf_lrn.max.depth |
---|---|---|---|---|---|---|---|---|---|
svm_lrn | 0.7216275 | 0.7397342 | 3 | 4 | 0. | NA | NA | NA | NA |
svm_lrn | 0.7207271 | 0.7388769 | 2 | 7 | 0.0126493 | NA | NA | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 6 | 5 | NA | 0.2828803 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 4 | 2 | NA | 0.1431047 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 4 | 2 | NA | 0.2209044 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 2 | 7 | NA | 0.0510133 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 2 | 7 | NA | 0.1817170 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 3 | 6 | NA | 0.3113689 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 6 | 6 | NA | 0.1716529 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 7 | 7 | NA | 0.3586599 | 3 | NA | NA |
xgb_lrn | 0.7173539 | 0.7386708 | 6 | 6 | NA | 0.3135614 | 3 | NA | NA |
The model with the highest precision was the Support Vector Machine at 73.97%. For this model, we had cost parameter equal to 0.00888, 3 blue team principal components, and 4 red team principal components. The cost parameter C serves as a regularization parameter – a small value for C increases the number of training errors while encouraging a smoother decision boundary. The best XGBoost model scored 73.87%. The best performing Random Forest model was not in the top 10 highest precision results.
We will now evaluate the test error for the top performing SVM model from our grid search above. We find that the test precision is 70.15%.
final_learner = many_learner

final_learner$param_set$values$model_branch.selection <- 'svm_lrn'
final_learner$param_set$values$blue_pca.rank. <- 3
final_learner$param_set$values$red_pca.rank. <- 4
final_learner$param_set$values$svm_lrn.type <- 'C-classification'
final_learner$param_set$values$svm_lrn.cost <- 0.00888

final_learner$train(taskTrain)

prediction = final_learner$predict(taskTest)

prediction$score(msrs(c('classif.acc', 'classif.precision')))
## classif.acc classif.precision
## 0.7064777 0.7014614
We set out to predict the winning League of Legends team based on the first 10 minutes of game play. Principal Component Analysis was applied to each team's metrics in turn in order to obtain an uncorrelated set of predictors; 7 components were sufficient to explain > 80% of the variance in the data. We pitted Support Vector Machine, Random Forest, and XGBoost against one another using mlr3 Pipelines and found that a Support Vector Machine with cost parameter 0.00888, 3 blue team components, and 4 red team components provided the highest precision on the training set at 73.97%. The test precision using this model was 70.15%. Using such a small cost parameter guards against overfitting; this is reflected in the test precision being only about 5% lower than the cross-validated training precision.