My primary goal with this EDA and model fitting exercise was to practice working with dimensionality reduction techniques, specifically Principal Component Analysis (PCA). For the dataset I worked with, interpretability of results was a high priority. Thus, I had to make a judgement call as to whether the performance improvement from using PCA was worth the reduction in interpretability.
A secondary goal of this document was to practice building recipes for pre-modeling data preparation. In combination with caret, recipes provide an easy way to re-apply data prep stages at each fold in a cross-validation routine. This is important for a method like PCA where the components may be constructed differently for each fold.
Code link: github.com/ZackBarry/gRowing/blob/master/graduate_admissions/eda_and_model.Rmd
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(DataExplorer)
library(ggridges)
library(reshape2)
library(ggfortify) # for plotting PCA
library(patchwork)
library(caret)
library(recipes)
library(knitr)
library(kableExtra)
library(ggthemes)
dataset <- read_csv("Data/Admission_Predict_Ver1.1.csv")
In this document we consider a set of variables that are potentially related to a student’s chances of getting into graduate school. The population of students is limited to India. Model interpretability is a top priority as the result should be able to guide students towards the most effective application preparation.
This data set has 500 observations of 7 explanatory variables, 6 of which are continuous and 1 of which is binary.
The variable “SOP” stands for “Statement of Purpose Strength”; “LOR” stands for “Letter of Recommendation Strength”; “CGPA” is the student’s undergraduate GPA; “Research” is a boolean - 1 for did research, 0 for did not. “Chance of Admit” is the response variable, ranging from 0 to 1.
## Observations: 500
## Variables: 9
## $ `Serial No.` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ `GRE Score` <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302,…
## $ `TOEFL Score` <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102,…
## $ `University Rating` <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3,…
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0,…
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5,…
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.…
## $ Research <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `Chance of Admit` <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.…
The field Serial No. appears to be an ID, so we’ll drop it. The columns are renamed to be easier to work with.
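The drop-and-rename chunk is not shown; a minimal sketch of one way it might look, assuming the raw column names printed in the glimpse output above:
dataset <- dataset %>%
  # drop the ID column, then convert names like "GRE Score" to snake_case (gre_score)
  select(-`Serial No.`) %>%
  rename_all(~ str_replace_all(str_trim(tolower(.)), " ", "_"))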
None of the columns contains any missing data:
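The chunk producing the per-column counts below is not shown; a sketch of an equivalent check:
dataset %>%
  summarise_all(~ sum(is.na(.)))   # number of missing values in each column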
## # A tibble: 1 x 8
## gre_score toefl_score university_rati… sop lor cgpa research
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0
## # … with 1 more variable: chance_of_admit <int>
The allowed range for each variable is as follows: GRE score out of 340, TOEFL score out of 120, university rating, SOP, and LOR on 1–5 scales, CGPA out of 10, research in {0, 1}, and chance of admit between 0 and 1. The observed minimum and maximum of each variable (below) fall within these ranges, so there is no apparent need to clean the data.
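The chunk computing these extremes is not shown; a sketch of an equivalent min/max summary:
bind_rows(
  summarise_all(dataset, min),   # row 1: minimum of each column
  summarise_all(dataset, max)    # row 2: maximum of each column
)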
## # A tibble: 2 x 8
## gre_score toefl_score university_rati… sop lor cgpa research
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 290 92 1 1 1 6.8 0
## 2 340 120 5 5 5 9.92 1
## # … with 1 more variable: chance_of_admit <dbl>
In this section we check which explanatory variables have enough variance to be useful in explaining the variance in the response variable.
First we’ll check the relationship between each of the continuous explanatory variables and admission chances. Since each of these variables is a score where higher is better, we’d expect that as a student’s scores go up, so does their admission chance. We’ll leave the non-student variable university_rating for later.
We see that as GRE Score, CGPA, and TOEFL Score increase, the chance of admission also tends to increase. The relationship between CGPA and admission chance is especially clear. It will be interesting to see how correlated these scores are.
continuous_plots <- lapply(
c("gre_score", "toefl_score", "cgpa"),
function(x) {
wrapr::let(
alias = list(X_VAR = x, Y_VAR = "chance_of_admit"),
expr = {
ggplot(dataset, aes(x = X_VAR, y = Y_VAR)) +
geom_point() +
theme_bw() +
labs(title = paste(x, "vs. admit"), x = x, y = "chance_of_admit")
}
)
}
)
wrap_plots(continuous_plots, ncol = 2)
Next, consider the distribution of chance_of_admit for each level of the letter of recommendation (lor) and statement of purpose (sop) scores. The plots are very similar – admission chances tend to rise alongside lor and sop. The distributions at each level of sop seem to have lower variance than those for lor, indicating that there might be a stronger relationship between sop and chance_of_admit than between lor and chance_of_admit. This is interesting because the statement of purpose is entirely student-driven, whereas letters of recommendation are instructor-driven.
p1 <- ggplot(dataset, aes(x = chance_of_admit, y = factor(lor))) +
geom_density_ridges(aes(fill = factor(lor))) +
theme_bw()
p2 <- ggplot(dataset, aes(x = chance_of_admit, y = factor(sop))) +
geom_density_ridges(aes(fill = factor(sop))) +
theme_bw()
p1 / p2
## Picking joint bandwidth of 0.0386
## Picking joint bandwidth of 0.0332
The relationship between university rating and admission chances is unsurprising – the higher-ranked the university, the lower the chances of admission.
ggplot(dataset, aes(x = chance_of_admit, y = factor(university_rating))) +
geom_density_ridges(aes(fill = factor(university_rating))) +
theme_bw()
## Picking joint bandwidth of 0.0312
Now looking at the number of observations with and without research at different levels of chance_of_admit, we see that students who participated in research have a higher chance of admission. In fact, there are no observations of students with higher than a 90% chance of admission who did not do research, and there are very few such observations above an 80% chance. However, we should recall from the university rating ridge plot above that most observations where the chance of admission is higher than 80% have university ratings of 4 or 5.
ggplot(dataset, aes(x = chance_of_admit, fill = factor(research), color = factor(research))) +
geom_histogram(position="identity", alpha=0.6) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We see that each variable is moderately correlated with every other variable. In particular, cgpa, gre_score, and toefl_score are all highly correlated with one another, and cgpa is the most highly correlated with chance_of_admit. High correlation among explanatory variables can decrease model accuracy due to multicollinearity. In the next section we’ll explore a dimensionality reduction technique called Principal Component Analysis that builds a new set of uncorrelated explanatory variables.
get_cor_matrix_plot <- function(df, round_to = 2) {
cor_matrix <- round(cor(df), round_to)
melted_cor_matrix <- melt(cor_matrix)
ggplot(melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(
low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1)
) +
coord_fixed() +
geom_text(aes(Var1, Var2, label = value), color = "black", size = 4)
}
get_cor_matrix_plot(select(dataset, -research))
Each of the continuous features is approximately normal, as evidenced by the plots below. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed predictors, but the well-behaved, roughly symmetric distributions below are still encouraging for a multiple (or single) linear regression model.
get_hist_and_normal_plot <- function(dataset, var_name, na.rm = T) {
wrapr::let(
alias = list(VAR = var_name),
expr = {
mean <- mean(dataset$VAR, na.rm = na.rm)
sd <- sd(dataset$VAR, na.rm = na.rm)
ggplot(dataset, aes(x = VAR)) +
stat_function(fun = dnorm,
n = 100,
args = list(mean = mean, sd = sd), color = "red") +
labs(title = sprintf("%s vs. Normal Distribution", var_name), y = "") +
geom_density() +
theme_tufte()
}
)
}
plots <- lapply(c("gre_score", "toefl_score", "sop", "lor", "cgpa"), get_hist_and_normal_plot, dataset = dataset)
wrap_plots(plots, ncol = 2, nrow = 3)
Principal Component Analysis (PCA) is a linear algebra technique for combining several correlated variables into a chosen number of uncorrelated variables. The uncorrelated variables are constructed as linear combinations of the original variables, and they are created successively so that each explains as much of the remaining variance as possible. Since our dataset has so many highly correlated variables, this feature engineering step could lead to better test accuracy. However, an important goal of the end model is interpretability, since students will want to know which factors are most important for getting into grad school. We’ll need to keep this in mind when comparing the final PCA-based model with the non-PCA model.
We’ll use the stats function prcomp to apply PCA to our dataset. Before applying PCA we need to normalize the variables to “level the playing field”, so that no variable dominates the linear combinations simply because of its scale.
to_pca <- dataset[, !(names(dataset) %in% c("research", "chance_of_admit"))]
to_pca <- map_df(to_pca, function(x) { (x - mean(x)) / sd(x) })
pca_set <- prcomp(to_pca)
pca_values <- cbind(pca_set$x, dataset[, names(dataset) %in% c("research", "chance_of_admit")])
As stated above, each successive variable that comes out of PCA is created to explain as much variance as possible. To get an idea of exactly how much variance each variable explains, consider the plot below. The first dimension explains ~42.5% of the variance, and each of the dimensions explains at least 8%. PCA is sometimes used as a method to reduce the number of explanatory variables; in that case we’d look to make a cut in the number of components after the explained variance drops below, say, 2%. In our case we are primarily interested in avoiding multicollinearity in our feature set so we will not be dropping any dimensions.
var_exp <- data.frame(
var_explained = 100 * pca_set$sdev / sum(pca_set$sdev),
dim_number = 1:length(pca_set$sdev)
)
ggplot(var_exp, aes(x = dim_number, y = var_explained)) +
geom_col() +
theme_bw()
To get a sense for how the new variables created by PCA are related to chance_of_admit, consider the scatter plots below. PC1 and chance_of_admit are very clearly negatively correlated, while none of the other components appear to have much of a direct relationship with chance_of_admit. This observation is supported by the correlation matrix below. A naive model choice would be to simply regress chance_of_admit onto PC1; it will be interesting to see whether multiple linear regression or a tree-based method offers much improvement.
get_scatter_plots <- function(x_name, y_name, dataset) {
wrapr::let(
alias = list(X_VAR = x_name, Y_VAR = y_name),
expr = {
ggplot(dataset, aes(x = X_VAR, y = Y_VAR)) +
geom_point()
}
)
}
scatter_plots <- lapply(paste0("PC", 1:6),
get_scatter_plots,
y_name = "chance_of_admit",
dataset = pca_values)
wrap_plots(scatter_plots, ncol = 2, nrow = 3)
PC1 is highly correlated with chance_of_admit while none of the other components are.
Just as the original continuous features were approximately normal, so is each of the features extracted by PCA. As noted earlier, linear regression formally assumes normally distributed residuals rather than predictors, but these well-behaved distributions again encourage us to consider a multiple (or single) linear regression model.
plots <- lapply(paste0("PC", 1:6), get_hist_and_normal_plot, dataset = pca_values)
wrap_plots(plots, ncol = 2, nrow = 3)
Split the dataset 80/20 into train and test sets. Note that some combinations of university_rating and research have relatively few observations, so we group by those variables when splitting the dataset to ensure that each combination shows up in both sets.
set.seed(541)
dataset_with_index <- dataset %>%
group_by(university_rating, research) %>%
mutate(set = sample(c("train", "test"), size = n(), replace = TRUE, prob = c(0.8, 0.2)))
train <- filter(dataset_with_index, set == "train") %>%
select(-set)
test <- filter(dataset_with_index, set == "test") %>%
select(-set)
Create recipe for feature preparation - center and scale (normalize) the variables:
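The recipe-building chunk is not shown; a minimal sketch of how feature_recipe might be defined with the recipes package (only “center and scale” is stated above, so the exact steps are an assumption):
feature_recipe <- recipe(chance_of_admit ~ ., data = train) %>%
  step_center(all_predictors()) %>%   # subtract each predictor's training-set mean
  step_scale(all_predictors())        # divide by each predictor's training-set standard deviation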
For variable selection we’ll use backwards feature selection, also known as recursive feature elimination (RFE). RFE selects features by recursively considering smaller sets of features until only 1 feature remains. That is, RFE fits a model using \(N\) explanatory variables, removes the least significant variable, and then repeats the process with the \(N-1\) remaining variables. The set of variables used to create the model with the lowest estimated test error is the set used for the final model. In our case, we’ll estimate the test error using cross-validation.
Based on the discussion above about the normality of the explanatory variables, multiple linear regression is the model we’ll be fitting. RMSE will be used as the metric by which to identify the best performing subset of variables.
set.seed(503)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
lmProfile <- rfe(feature_recipe,
data = train,
metric = "RMSE",
rfeControl = ctrl,
sizes = 1:7)
plot(lmProfile, type = c("g", "o"))
The models with 5 and 6 variables had the lowest estimated test RMSE, both at roughly 0.060, with nearly identical performance.
lmProfile$results %>%
filter(Variables %in% c(5, 6)) %>%
select(Variables, RMSE, Rsquared) %>%
kable(format = "html") %>%
kable_styling()
| Variables | RMSE | Rsquared |
|---|---|---|
| 5 | 0.0599415 | 0.8256671 |
| 6 | 0.0601058 | 0.8247154 |
To avoid overfitting, we’ll choose the model with 5 variables. Those top 5 variables are:
lmProfile$variables %>%
filter(Variables == 5) %>%
distinct(var) %>%
kable(format = "html") %>%
kable_styling()
| var |
|---|
| cgpa |
| research |
| toefl_score |
| lor |
| gre_score |
For model fitting, we need to be careful about when we apply PCA. In the EDA section above, we extracted principal components by analyzing the entire data set. However, this leaks information about the test set into the training process. Therefore we need to define our PCA step on the training set alone and apply it to the test set only when we are ready to test the model. The same goes for the normalization of the explanatory variables – to normalize the test set, we subtract the mean of the training set and divide by the standard deviation of the training set.
Create recipe for feature preparation - center and scale (normalize) the variables used for PCA in the EDA section before applying PCA. Note that we ask the step_pca function to keep all components rather than using an explained variance threshold. This allows us to do our own variable selection later on.
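Again, the chunk defining the recipe is not shown; a sketch of how feature_recipe_pca might be built, normalizing only the six variables that entered PCA in the EDA section and keeping all six components (the exact call is an assumption):
feature_recipe_pca <- recipe(chance_of_admit ~ ., data = train) %>%
  step_center(gre_score, toefl_score, university_rating, sop, lor, cgpa) %>%
  step_scale(gre_score, toefl_score, university_rating, sop, lor, cgpa) %>%
  # num_comp = 6 keeps every component instead of cutting at a variance threshold
  step_pca(gre_score, toefl_score, university_rating, sop, lor, cgpa, num_comp = 6)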
For variable selection we’ll be using the same method as in Model Fitting Part 1 – recursive feature elimination with cross-validation.
set.seed(503)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
lmProfile_pca <- rfe(feature_recipe_pca,
data = train,
metric = "RMSE",
rfeControl = ctrl,
sizes = 1:7)
plot(lmProfile_pca, type = c("g", "o"))
Among the PCA-based models, the best approximate test RMSE came from the model with 6 variables (~0.0601). However, this is essentially identical to the performance of the best model from the non-PCA section.
lmProfile_pca$results %>%
filter(Variables %in% c(5, 6)) %>%
select(Variables, RMSE, Rsquared) %>%
kable(format = "html") %>%
kable_styling()
| Variables | RMSE | Rsquared |
|---|---|---|
| 5 | 0.0619364 | 0.8144378 |
| 6 | 0.0600513 | 0.8251570 |
The 6 variables used for the best PCA model were:
## [1] "cgpa" "research" "toefl_score" "gre_score" "lor"
The best linear regression model based on a subset of the untransformed explanatory variables had a CV RMSE of roughly 0.060. The best model trained on the PCA-derived features performed essentially no better. Since the PCA model offers far less interpretability, we’ll go with the non-PCA model as our final model.
First, apply the recipe to the train/test data:
fit_feature_recipe <- prep(feature_recipe, training = train)
train_baked <- bake(fit_feature_recipe, train)
test_baked <- bake(fit_feature_recipe, test)
Next, fit a linear model using the 5 variables selected by RFE:
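The fitting chunk is not shown, but based on the Call in the summary output further below it is equivalent to:
model <- lm(
  chance_of_admit ~ cgpa + research + gre_score + lor + toefl_score,  # the 5 variables selected by RFE
  data = train_baked
)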
Lastly, predict chance_of_admit on the test set and consider the results:
predicted <- predict(model, test_baked)
results <- test_baked %>%
mutate(predicted = predicted) %>%
mutate(residual = predicted - chance_of_admit)
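The chunks reporting the test RMSE and R^2 below are not shown; both can be computed directly from the residuals (the original may instead have used caret’s RMSE() and R2() helpers):
sqrt(mean(results$residual^2))   # test RMSE
1 - sum(results$residual^2) /
  sum((results$chance_of_admit - mean(results$chance_of_admit))^2)   # test R^2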
The RMSE value for our result is quite low (~0.060):
## [1] 0.06003708
And the R^2 value, which measures the percent of variance in the response variable captured by the model, is quite high (82.4%):
## [1] 0.8238991
Additionally, each of the predictor variables is statistically significant at the 0.05 level, as evidenced by the coefficient p-values:
##
## Call:
## lm(formula = chance_of_admit ~ cgpa + research + gre_score +
## lor + toefl_score, data = train_baked)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.261506 -0.024478 0.007325 0.035199 0.159850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.707380 0.005236 135.093 < 2e-16 ***
## cgpa 0.076177 0.006333 12.028 < 2e-16 ***
## research 0.029027 0.007633 3.803 0.000166 ***
## gre_score 0.018578 0.006496 2.860 0.004467 **
## lor 0.017369 0.003931 4.419 1.29e-05 ***
## toefl_score 0.020005 0.005771 3.466 0.000587 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06016 on 386 degrees of freedom
## Multiple R-squared: 0.8239, Adjusted R-squared: 0.8216
## F-statistic: 361.2 on 5 and 386 DF, p-value: < 2.2e-16
Linear regression models should have normally distributed residuals. The residuals should also have mean zero and be independent and randomly distributed, with no systematic pattern against the predictors. These assumptions are satisfied reasonably well by our model:
p1 <- get_hist_and_normal_plot(results, "residual")
p2 <- ggplot(results, aes(x = cgpa, y = residual)) +
geom_point() +
geom_abline(slope = 0, intercept = mean(results$residual)) +
theme_tufte() +
labs(title = "residuals vs. most important variable (CGPA)")
p1 + p2
The most important variable in our final model was undergraduate GPA (CGPA); it was much more important than any of the other predictors. Letter of Recommendation strength was second most important, followed by Research, TOEFL Score, and GRE Score. This result should be encouraging to students eager to apply to grad school – the most important variable is entirely in their control!
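The chunk producing the importance table below is not shown; for an lm fit, caret’s varImp() reports the absolute value of each coefficient’s t-statistic, which matches the values here:
varImp(model) %>%          # absolute t-statistics for an lm object
  kable(format = "html") %>%
  kable_styling()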
| | Overall |
|---|---|
| cgpa | 12.028427 |
| research | 3.802771 |
| gre_score | 2.859966 |
| lor | 4.418691 |
| toefl_score | 3.466362 |
The coefficients of the model are all positive. This is very much in line with our expectations – improving a student’s GPA, GRE score, letter of recommendation strength, or TOEFL score, or doing research, is associated with a higher chance of admission.
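The printout below is simply the fitted coefficient vector; note the coefficients are on the centered-and-scaled predictor scale produced by the recipe:
coef(model)   # fitted coefficients of the final linear model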
## (Intercept) cgpa research gre_score lor toefl_score
## 0.70738036 0.07617744 0.02902694 0.01857793 0.01736857 0.02000539
We set out to predict students’ chance of admission to graduate school based on their academic history and application materials. Many of the explanatory variables turned out to be highly correlated with one another, so principal component analysis was used to create an alternative set of uncorrelated features. After applying recursive feature elimination to a linear regression model with the original and then the uncorrelated features, it became clear that the multicollinearity was not much of a problem. In order to maintain the interpretability of the final model, the original predictors were used.
The final multiple linear regression model performed quite well, with a test RMSE of about 0.060, an R^2 of about 82%, and approximately i.i.d. normal residuals. Also, each predictor was statistically significant at a 0.05 alpha level. The most important predictor turned out to be CGPA, followed by Letter of Recommendation strength, Research, TOEFL Score, and GRE Score. Students should be advised to focus on their coursework above all else. They should, however, also attempt to secure a strong letter of recommendation and do well on standardized tests. It was surprising to see that University Rating and Statement of Purpose were eliminated by RFE – many people would probably rank University Rating as one of the most important factors.
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.15.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggthemes_4.2.0 kableExtra_1.1.0 knitr_1.22
## [4] recipes_0.1.9 caret_6.0-84 lattice_0.20-38
## [7] patchwork_1.0.0 ggfortify_0.4.8 reshape2_1.4.3
## [10] ggridges_0.5.2 DataExplorer_0.8.0 forcats_0.4.0
## [13] stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
## [16] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3
## [19] ggplot2_3.2.1 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 jsonlite_1.6.1 viridisLite_0.3.0
## [4] splines_3.5.3 foreach_1.4.4 prodlim_2018.04.18
## [7] modelr_0.1.5 assertthat_0.2.1 highr_0.8
## [10] stats4_3.5.3 wrapr_1.9.3 cellranger_1.1.0
## [13] yaml_2.2.0 ipred_0.9-9 pillar_1.4.2
## [16] backports_1.1.5 glue_1.3.1 digest_0.6.23
## [19] rvest_0.3.3 colorspace_1.4-1 htmltools_0.4.0
## [22] Matrix_1.2-15 plyr_1.8.4 timeDate_3043.102
## [25] pkgconfig_2.0.3 broom_0.5.2 haven_2.1.0
## [28] webshot_0.5.1 scales_1.0.0 gower_0.2.1
## [31] lava_1.6.6 generics_0.0.2 withr_2.1.2
## [34] nnet_7.3-12 lazyeval_0.2.2 cli_1.1.0
## [37] survival_2.43-3 magrittr_1.5 crayon_1.3.4
## [40] readxl_1.3.1 evaluate_0.13 fansi_0.4.0
## [43] nlme_3.1-137 MASS_7.3-51.4 xml2_1.2.1
## [46] class_7.3-15 tools_3.5.3 data.table_1.12.2
## [49] hms_0.4.2 lifecycle_0.1.0 munsell_0.5.0
## [52] networkD3_0.4 compiler_3.5.3 rlang_0.4.2
## [55] grid_3.5.3 iterators_1.0.10 rstudioapi_0.10
## [58] htmlwidgets_1.5.1 igraph_1.2.4.1 labeling_0.3
## [61] rmarkdown_1.12 gtable_0.3.0 ModelMetrics_1.2.2
## [64] codetools_0.2-16 R6_2.4.1 gridExtra_2.3
## [67] lubridate_1.7.4 utf8_1.1.4 zeallot_0.1.0
## [70] stringi_1.4.3 parallel_3.5.3 Rcpp_1.0.3
## [73] vctrs_0.2.1 rpart_4.1-13 tidyselect_0.2.5
## [76] xfun_0.9