knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(DataExplorer)
library(ggridges)
library(reshape2)
library(ggfortify) # for plotting PCA
library(patchwork)
library(caret)
library(recipes)
library(knitr)
library(kableExtra)
library(ggthemes)

dataset <- read_csv("Data/Admission_Predict_Ver1.1.csv")

1 Introduction

In this document we consider a set of variables that are potentially related to a student’s chances of getting into graduate school. The population of students is limited to applicants from India. Model interpretability is a top priority, as the results should be able to guide students toward the most effective application preparation.

1.1 Data source

This data set has 500 observations of 7 explanatory variables, 6 of which are continuous and 1 of which is binary, along with a continuous response variable (chance of admission).

2 Pre-Modeling Stages

2.1 Acquiring/Loading Data

The variable “SOP” stands for “Statement of Purpose Strength”; “LOR” stands for “Letter of Recommendation Strength”; “CGPA” is the student’s undergraduate GPA (out of 10); “Research” is a binary indicator – 1 if the student did research, 0 if not. “Chance of Admit” is the response variable, ranging from 0 to 1.

## Observations: 500
## Variables: 9
## $ `Serial No.`        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ `GRE Score`         <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302,…
## $ `TOEFL Score`       <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102,…
## $ `University Rating` <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3,…
## $ SOP                 <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0,…
## $ LOR                 <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5,…
## $ CGPA                <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.…
## $ Research            <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `Chance of Admit`   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.…

The field Serial No. appears to be an ID, so we’ll drop it. The columns are renamed to be easier to work with.
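A minimal dplyr sketch of this step (the exact chunk isn’t echoed above; the snake_case names below match those used in the rest of the analysis):

dataset <- dataset %>%
  select(-`Serial No.`) %>%                 # drop the ID column
  rename(
    gre_score         = `GRE Score`,
    toefl_score       = `TOEFL Score`,
    university_rating = `University Rating`,
    sop               = SOP,
    lor               = LOR,
    cgpa              = CGPA,
    research          = Research,
    chance_of_admit   = `Chance of Admit`
  )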

2.2 Check for missing/out-of-range values

None of the columns contains any missing data.
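A quick way to verify this is to count the NA values in every column; a sketch using dplyr:

# Number of missing values per column
dataset %>%
  summarise_all(~ sum(is.na(.)))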

## # A tibble: 1 x 8
##   gre_score toefl_score university_rati…   sop   lor  cgpa research
##       <int>       <int>            <int> <int> <int> <int>    <int>
## 1         0           0                0     0     0     0        0
## # … with 1 more variable: chance_of_admit <int>

The allowed range for each variable is as follows:

  • GRE Score (out of 340)
  • TOEFL Score (out of 120)
  • University Rating (out of 5)
  • Statement of Purpose and Letter of Recommendation Strength (out of 5)
  • Undergraduate GPA (out of 10)
  • Research Experience (either 0 or 1)
  • Chance of Admit (ranging from 0 to 1)

Each variable is within the expected range, so there is no apparent need to clean the data.
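The minimum and maximum of each column can be pulled with a short dplyr sketch like the following (the original chunk may differ):

# Minimum (first row) and maximum (second row) of every column
bind_rows(
  summarise_all(dataset, min),
  summarise_all(dataset, max)
)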

## # A tibble: 2 x 8
##   gre_score toefl_score university_rati…   sop   lor  cgpa research
##       <dbl>       <dbl>            <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1       290          92                1     1     1  6.8         0
## 2       340         120                5     5     5  9.92        1
## # … with 1 more variable: chance_of_admit <dbl>

2.3 Distribution of explanatory variables

In this section we check which explanatory variables have enough variance to be useful in explaining the variance in the response variable.

First we’ll check on the relationship between each of the continuous explanatory variables and admission chances. Since each of these variables is a score where higher is better, we’d expect that as the students’ scores go up so do their admission chances. We’ll leave the non-student variable university_rating for later.
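A sketch of these scatter plots, reshaping the three score variables into long format so they can share one faceted plot (the original figure may have been produced differently):

dataset %>%
  select(gre_score, toefl_score, cgpa, chance_of_admit) %>%
  pivot_longer(-chance_of_admit, names_to = "variable", values_to = "score") %>%
  ggplot(aes(x = score, y = chance_of_admit)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +   # linear trend per variable
  facet_wrap(~ variable, scales = "free_x")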

We see that as GRE Score, CGPA, and TOEFL Score increase, the chance of admission also tends to increase. The relationship between CGPA and admission chance is especially clear. It will be interesting to see how correlated these scores are.

Next, consider the distribution of chance_of_admit for each level of the letter of recommendation (lor) and statement of purpose (sop) scores. The plots are very similar – admission chances tend to rise alongside lor and sop. The distributions at each level of sop seem to have lower variance than for lor, indicating that there might be a stronger relationship between sop and chance_of_admit than between lor and chance_of_admit. This is interesting because the statement of purpose is entirely student-driven whereas letters of recommendation are instructor-driven.
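A sketch of these ridge plots using ggridges, with the two panels combined via patchwork (the ridge geom is what produces the “Picking joint bandwidth” messages below):

p_lor <- ggplot(dataset, aes(x = chance_of_admit, y = factor(lor))) +
  geom_density_ridges() +
  labs(y = "lor")
p_sop <- ggplot(dataset, aes(x = chance_of_admit, y = factor(sop))) +
  geom_density_ridges() +
  labs(y = "sop")
p_lor + p_sop   # combined side by side with patchwork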

## Picking joint bandwidth of 0.0386
## Picking joint bandwidth of 0.0332

The relationship between university rating and admission chances is unsurprising – the higher ranked the university, the lower the chances of admission.

## Picking joint bandwidth of 0.0312

Now looking at the number of observations of research vs. no research for different levels of chance_of_admit, we see that students who participated in research have a higher chance of admission. In fact, there are no observations of students with higher than a 90% chance of admission who did not do research, and there are very few such observations above an 80% chance. However, we should recall from the university rating ridge plot above that most observations where chance of admit is higher than 80% have university ratings of 4 or 5.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
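Next we look at the pairwise correlations. A sketch of the correlation heatmap using cor, reshape2::melt, and geom_tile (the original figure may have been produced differently):

# Correlation heatmap of all variables
cor_mat <- round(cor(dataset), 2)
melt(cor_mat, varnames = c("var1", "var2"), value.name = "correlation") %>%
  ggplot(aes(x = var1, y = var2, fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = correlation), size = 3) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))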

We see that each variable is moderately correlated with each other variable. In particular, we see that cgpa, gre_score, and toefl_score are all highly correlated with one another and that cgpa is the most highly correlated with chance_of_admit. High correlation among explanatory variables can decrease model accuracy due to multicollinearity. In the next section we’ll explore a dimensionality reduction technique called Principal Component Analysis that builds a new set of uncorrelated explanatory variables.

Each of the continuous features is approximately normal, as evidenced by the plots below. Although the normality assumption in linear regression applies to the residuals rather than the explanatory variables, well-behaved, roughly symmetric features like these encourage us to consider a multiple (or single) linear regression model.
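DataExplorer (loaded above) can draw these histograms in one call; a minimal sketch:

plot_histogram(dataset)   # one histogram per continuous column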

2.4 Dimensionality Reduction

Principal Component Analysis (PCA) is a linear algebra technique for combining several correlated variables into a chosen number of uncorrelated variables. The uncorrelated variables are constructed as linear combinations of the original variables, and they are created successively so that each explains as much of the remaining variance as possible. Since our dataset has so many highly correlated variables, this feature engineering process could lead to better test accuracy. However, an important goal of the end model is interpretability, since students will want to know what factors are most important for getting into grad school. We’ll need to keep this in mind when comparing the final PCA-based model with the non-PCA model.

We’ll use the stats function prcomp to apply PCA to our dataset. Before applying PCA we normalize the variables to “level the playing field” when the algorithm creates linear combinations of the input variables.
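A sketch of that step: drop the response, then let prcomp center and scale the predictors before extracting the components (pca and predictors are assumed object names):

predictors <- dataset %>% select(-chance_of_admit)
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)   # normalize, then PCA
summary(pca)   # standard deviation and variance explained per component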

As stated above, each successive variable that comes out of PCA is created to explain as much variance as possible. To get an idea of exactly how much variance each variable explains, consider the plot below. The first dimension explains ~42.5% of the variance, and each of the dimensions explains at least 8%. PCA is sometimes used as a method to reduce the number of explanatory variables; in that case we’d look to make a cut in the number of components after the explained variance drops below, say, 2%. In our case we are primarily interested in avoiding multicollinearity in our feature set so we will not be dropping any dimensions.
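A sketch of how such a plot can be produced from the prcomp output:

# Proportion of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
tibble(component     = seq_along(var_explained),
       prop_variance = var_explained) %>%
  ggplot(aes(x = factor(component), y = prop_variance)) +
  geom_col() +
  labs(x = "Principal component", y = "Proportion of variance explained")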

To get a sense for how the new variables created by PCA are related to chance_of_admit, consider the scatter plots below. PC1 and chance_of_admit are very clearly negatively correlated. None of the other components appear to have much of a direct relationship with chance_of_admit. This observation is supported by the correlation matrix below. Our naive model choice is then to simply regress chance_of_admit onto PC1. It will be interesting to see if multiple linear regression or a tree-based method will offer much improvement.

PC1 is highly correlated with chance_of_admit while none of the other components are.
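A sketch of how that correlation can be checked, joining the component scores back to the response:

# Correlation of each principal component with chance_of_admit
pca_scores <- as_tibble(pca$x) %>%
  mutate(chance_of_admit = dataset$chance_of_admit)
cor(pca_scores)[, "chance_of_admit"]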

Just as the original continuous features were approximately normal, so are the features extracted by PCA. As noted above, the normality assumption in linear regression concerns the residuals rather than the explanatory variables, but these well-behaved distributions again point us toward a multiple (or single) linear regression model.

3 Model Building/Evaluation

We split the dataset 80/20 into training and test sets. Note that some combinations of university_rating and research have relatively few observations. We stratify on those variables when splitting the dataset to ensure observations of each combination show up in both sets.
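One way to do such a grouped split, sketched with caret’s createDataPartition stratified on the interaction of university_rating and research (the seed and the train/test object names are assumptions):

set.seed(42)  # arbitrary seed for reproducibility
strata    <- interaction(dataset$university_rating, dataset$research)
train_idx <- createDataPartition(strata, p = 0.8, list = FALSE)[, 1]
train <- dataset[train_idx, ]
test  <- dataset[-train_idx, ]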

3.1 Model Fitting Part 1: No PCA

3.1.1 Data Prep Recipe

Create a recipe for feature preparation that centers and scales (normalizes) the variables.
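A sketch of such a recipe (base_recipe is an assumed name):

base_recipe <- recipe(chance_of_admit ~ ., data = train) %>%
  step_center(all_predictors()) %>%   # subtract training means
  step_scale(all_predictors())        # divide by training standard deviations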

3.1.2 Variable Selection

For variable selection we’ll be using backwards feature selection, also known as recursive feature elimination (RFE). RFE selects features by recursively considering smaller sets of features until only 1 feature remains. That is, RFE fits a model using \(N\) explanatory variables, removes the least significant variable, and then repeats the process with the \(N-1\) remaining variables. The set of variables used to create the model with the lowest estimated test error is the set used for the final model. In our case, we’ll be estimating the test error using cross-validation.

Based on the discussion above about the normality of the explanatory variables, multiple linear regression is the model we’ll be fitting. RMSE will be used as the metric by which to identify the best performing subset of variables.
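A sketch of the RFE setup with caret, using its linear-model helpers (lmFuncs) and 10-fold cross-validation; the object names, seed, and fold count are assumptions:

# Bake the training data with the recipe defined above
rfe_data <- bake(prep(base_recipe, training = train), new_data = train)

set.seed(42)  # arbitrary seed
rfe_fit <- rfe(
  x = rfe_data %>% select(-chance_of_admit),
  y = rfe_data$chance_of_admit,
  sizes = 1:7,                                     # candidate subset sizes
  rfeControl = rfeControl(functions = lmFuncs, method = "cv", number = 10),
  metric = "RMSE"
)
rfe_fit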

The model selected by RFE as having the best cross-validated RMSE used 6 variables (~0.060). However, the model with 5 variables had nearly identical performance.

Variables RMSE Rsquared
5 0.0599415 0.8256671
6 0.0601058 0.8247154

To avoid overfitting, we’ll choose the simpler model with 5 variables. Those top 5 variables are:

var
cgpa
research
toefl_score
lor
gre_score

3.2 Model Fitting Part 2: PCA

For model fitting, we need to be careful about when we apply PCA. In the EDA section above, we extracted principal components by analyzing the entire data set. However, this leaks information about the test set into the training process. Therefore we need to define our PCA process on the training set alone and later apply it to the test set when we are ready to test the model. This includes the normalization of the explanatory variables – to normalize the test set, we will subtract the mean of the training set and divide by the standard deviation of the training set.

3.2.1 Data Prep Recipe

Create recipe for feature preparation - center and scale (normalize) the variables used for PCA in the EDA section before applying PCA. Note that we ask the step_pca function to keep all components rather than using an explained variance threshold. This allows us to do our own variable selection later on.
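A sketch of that recipe; num_comp = 7 keeps every component since there are 7 predictors (pca_recipe is an assumed name):

pca_recipe <- recipe(chance_of_admit ~ ., data = train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 7)   # keep all 7 components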

3.2.2 Variable Selection

For variable selection we’ll be using the same method as in Model Fitting Part 1 – recursive feature elimination with cross-validation.

The model with the best cross-validated RMSE was the model with 6 variables (~0.060). However, this is only a 0.16% improvement over the best model from the non-PCA section.

Variables RMSE Rsquared
5 0.0619364 0.8144378
6 0.0600513 0.8251570

The 6 variables used for the best PCA model were:

## [1] "cgpa"        "research"    "toefl_score" "gre_score"   "lor"

3.3 Fitting/Testing Final Model

The best linear regression model based on a subset of the untransformed explanatory variables had a CV RMSE of ~0.060. The best model trained on a subset of the PCA variables offered an improvement of only 0.16%. Since the PCA model offers far less interpretability, we’ll go with the non-PCA model for our final model.

First, apply the recipe to the train/test data.
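A sketch, where the recipe is prepped on the training set only and then applied to both sets (train_baked matches the name in the model output below; prepped and test_baked are assumed names):

prepped     <- prep(base_recipe, training = train)
train_baked <- bake(prepped, new_data = train)
test_baked  <- bake(prepped, new_data = test)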

Next, fit a linear model using the 5 variables selected by RFE.
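A sketch of the fit, using the formula that appears in the model summary further below (final_model is an assumed name):

final_model <- lm(
  chance_of_admit ~ cgpa + research + gre_score + lor + toefl_score,
  data = train_baked
)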

Lastly, predict chance_of_admit on the test set and consider the results.
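A sketch of the evaluation step, using caret’s RMSE and R2 helpers on the held-out test set:

test_pred <- predict(final_model, newdata = test_baked)
RMSE(test_pred, test_baked$chance_of_admit)   # root mean squared error
R2(test_pred, test_baked$chance_of_admit)     # proportion of variance explained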

The RMSE value for our result is quite low (0.060):

## [1] 0.06003708

And the R^2 value, which measures the percent of variance in the response variable that is captured by the selected model, is quite high (82.4%):

## [1] 0.8238991

Additionally, each of the predictor variables was statistically significant at the 0.05 significance level, as evidenced by the coefficient p-values:

## 
## Call:
## lm(formula = chance_of_admit ~ cgpa + research + gre_score + 
##     lor + toefl_score, data = train_baked)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.261506 -0.024478  0.007325  0.035199  0.159850 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.707380   0.005236 135.093  < 2e-16 ***
## cgpa        0.076177   0.006333  12.028  < 2e-16 ***
## research    0.029027   0.007633   3.803 0.000166 ***
## gre_score   0.018578   0.006496   2.860 0.004467 ** 
## lor         0.017369   0.003931   4.419 1.29e-05 ***
## toefl_score 0.020005   0.005771   3.466 0.000587 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06016 on 386 degrees of freedom
## Multiple R-squared:  0.8239, Adjusted R-squared:  0.8216 
## F-statistic: 361.2 on 5 and 386 DF,  p-value: < 2.2e-16

Linear regression models should have normally distributed residuals. The residuals should also have mean zero and be independent and identically distributed. These assumptions are satisfied reasonably well by our model:
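The usual lm diagnostic plots make these checks easy; ggfortify (loaded above) provides an autoplot method for lm objects, so a sketch is simply:

autoplot(final_model)   # standard diagnostics: residuals vs fitted, normal Q-Q, etc.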

3.4 Interpreting Final Model

The most important variable in our final model was undergraduate GPA (cgpa). It was much more important than any of the other predictors. Letter of Recommendation was second most important, followed by Research, TOEFL Score, and GRE Score. This result should be encouraging to students eager to apply to grad school - the most important variable is entirely in their control!

Variable      Overall
cgpa          12.028427
research       3.802771
gre_score      2.859966
lor            4.418691
toefl_score    3.466362

The coefficients of the model are all positive. This is very much in line with our expectations - increasing a student’s GPA, GRE score, letter of recommendation score, or TOEFL score increases their predicted chance of admission.

## (Intercept)        cgpa    research   gre_score         lor toefl_score 
##  0.70738036  0.07617744  0.02902694  0.01857793  0.01736857  0.02000539

4 Conclusion

We set out to predict students’ chance of admission to graduate school based on their academic history and application materials. Many of these explanatory variables turned out to be highly correlated with one another, so principal component analysis was used to create an alternative set of uncorrelated features. After applying Recursive Feature Elimination to a linear regression model with the original and then the uncorrelated features, it became clear that the multicollinearity was not much of a problem. In order to maintain the interpretability of the final model, the original predictors were used.

The final multiple linear regression model performed quite well, with a test RMSE of ~0.06, an R^2 of ~82%, and approximately i.i.d. normal residuals. Also, each predictor was statistically significant at a 0.05 alpha level. The most important predictor turned out to be CGPA, followed by Letter of Rec, Research, TOEFL Score, and GRE Score. Students should be advised to focus on their coursework above all else. They should, however, attempt to receive a strong letter of recommendation and do well on standardized tests. It was surprising to see that University Rating and Statement of Purpose were eliminated as part of RFE – people would probably tend to place University Rating as one of the most important factors.

## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.15.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggthemes_4.2.0     kableExtra_1.1.0   knitr_1.22        
##  [4] recipes_0.1.9      caret_6.0-84       lattice_0.20-38   
##  [7] patchwork_1.0.0    ggfortify_0.4.8    reshape2_1.4.3    
## [10] ggridges_0.5.2     DataExplorer_0.8.0 forcats_0.4.0     
## [13] stringr_1.4.0      dplyr_0.8.3        purrr_0.3.2       
## [16] readr_1.3.1        tidyr_1.0.0        tibble_2.1.3      
## [19] ggplot2_3.2.1      tidyverse_1.2.1   
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.1         jsonlite_1.6.1     viridisLite_0.3.0 
##  [4] splines_3.5.3      foreach_1.4.4      prodlim_2018.04.18
##  [7] modelr_0.1.5       assertthat_0.2.1   highr_0.8         
## [10] stats4_3.5.3       wrapr_1.9.3        cellranger_1.1.0  
## [13] yaml_2.2.0         ipred_0.9-9        pillar_1.4.2      
## [16] backports_1.1.5    glue_1.3.1         digest_0.6.23     
## [19] rvest_0.3.3        colorspace_1.4-1   htmltools_0.4.0   
## [22] Matrix_1.2-15      plyr_1.8.4         timeDate_3043.102 
## [25] pkgconfig_2.0.3    broom_0.5.2        haven_2.1.0       
## [28] webshot_0.5.1      scales_1.0.0       gower_0.2.1       
## [31] lava_1.6.6         generics_0.0.2     withr_2.1.2       
## [34] nnet_7.3-12        lazyeval_0.2.2     cli_1.1.0         
## [37] survival_2.43-3    magrittr_1.5       crayon_1.3.4      
## [40] readxl_1.3.1       evaluate_0.13      fansi_0.4.0       
## [43] nlme_3.1-137       MASS_7.3-51.4      xml2_1.2.1        
## [46] class_7.3-15       tools_3.5.3        data.table_1.12.2 
## [49] hms_0.4.2          lifecycle_0.1.0    munsell_0.5.0     
## [52] networkD3_0.4      compiler_3.5.3     rlang_0.4.2       
## [55] grid_3.5.3         iterators_1.0.10   rstudioapi_0.10   
## [58] htmlwidgets_1.5.1  igraph_1.2.4.1     labeling_0.3      
## [61] rmarkdown_1.12     gtable_0.3.0       ModelMetrics_1.2.2
## [64] codetools_0.2-16   R6_2.4.1           gridExtra_2.3     
## [67] lubridate_1.7.4    utf8_1.1.4         zeallot_0.1.0     
## [70] stringi_1.4.3      parallel_3.5.3     Rcpp_1.0.3        
## [73] vctrs_0.2.1        rpart_4.1-13       tidyselect_0.2.5  
## [76] xfun_0.9