Pre-Introduction

My goal with this EDA and model fitting exercise was to practice working with tree-based methods. I thought I'd do some basic EDA, apply some grid searches to XGBoost parameters, and select a final model. The process turned out to be a little more involved than that, and I'm glad it did! My continuous response variable y was log-normal and I chose to transform it so that the models were predicting $\log(y+1)$. Because of the transformation, the cross-validation mean squared error value provided by the modeling package (H2O) was the MSE of the log results. However, I wanted to use the CV MSE of y itself. Since H2O's R package doesn't allow for custom error functions, I had to work with the CV datasets directly to get the desired metric.

Code link: github.com/ZackBarry/gRowing/blob/master/airbnb_nyc/exploratory_analysis.Rmd

Airbnb NYC - EDA and XGBoost

ZackBarry

1/5/2020

1 Introduction
- 1.1 Data source
2 Pre-Modeling Stages
3 Modeling Stages
4 Conclusion
- 4.1 Ideas for improvement

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(DataExplorer)
library(ggpubr)
library(leaflet)
library(leaflet.extras)
library(sqldf)
library(patchwork)
library(caret)
library(knitr)
library(DT)

dataset <- read_csv("Data/AB_NYC_2019.csv")

1 Introduction

Airbnb is one of the most prominent online marketplaces for reserving travel lodgings. It hosts vacation rentals all over the globe, hosting over two million people each night. However, Airbnb does not actually own any of the properties it lists. Rather, it facilates connections between owners and renters. At its core, Airbnb is truly a data company. Each rental has a myriad of labels and restrictions: which days they are available, cost, location, ameneties, host descriptions… the list goes on.

The data set we’ll be exploring in this document is a collection of property data from Airbnb in New York City.

1.1 Data source

This data set has nearly 50,000 observations of 16 different variables, 6 of which are discrete and 10 of which are continuous.

2 Pre-Modeling Stages

2.1 Acquiring/Loading Data

A high-level overview of the types of variables / amount of missing data shows that we have 10 continuous and 6 discrete variables. Most rows have no missing values, and the total number of missing observations is quite low. Since there are 50,000 observations to draw from, we should be able to generate some great insights!

datatable(head(dataset, 5), rownames = F, options = list(pageLength = 5, scrollX = T))

plot_intro(dataset)

2.2 Data Wrangling

Missing observations are mostly limited to reviews_per_month and last_review. Upon further investigation, these missing observations occur when number_of_reviews is equal to 0. This makes sense because if a property has never received any reviews there is no concept of “most recent review” nor can a rate of reviews be calculated. However, it makes sense to set reviews_per_month to 0 when it is missing. For last_review, we’ll replace it with a new variable days_since_review. Each non-missing observation is formatted YYYY-MM-DD so this will be easy to do. Furthermore we can replace the missing values with -1 so that the new variable is prepped for modeling.

plot_missing(dataset)

Replace missing reviews_per_month with 0 and replace last_review with days_since_last_review where -1 stands in for missing values.

dataset <- dataset %>%
  mutate(reviews_per_month = replace_na(reviews_per_month, 0))

dataset <- dataset %>%
  mutate(
    last_review = as.Date(last_review, "%Y-%m-%d"),
    latest_review_day = max(last_review, na.rm = TRUE)
  ) %>%
  mutate(days_since_review = as.numeric(difftime(latest_review_day, last_review, units = "days")) + 1) %>%
  mutate(days_since_review = replace_na(days_since_review, -1)) %>%
  select(-c(last_review, latest_review_day))

The other missing observations are in the host_name and name variables. Since there is already a host_id field, these are not needed for identifying distinct hosts. If we were to use these variables for modeling purposes, we could replace their missing values with host_id. However, using name opens up privacy and discrimination concerns that we should avoid. We’ll drop both fields for this analysis.

dataset <- select(dataset, -c(name, host_name))

2.3 Exploring and visualizating data.

The bar plots below give further insight into our discrete (categorical) variables. Our neighbourhood_group observations are limited to 5 neighborhood groups, with Manhattan and Brooklyn comprising the majority. Room type is mostly evenly split between “entire home/apt” and “private room”, with “shared room” rarely appearing. Unsurprisingly, days_since_review has too many distinct categories to plot - it should probably be treated as a continuous variable. Finally, there are 221 distinct neighborhoods within our neighborhood groups.

DataExplorer::plot_bar(dataset)

# neighborhood-specific room_type occurences
dataset %>%
  count(neighbourhood_group, room_type) %>%
  arrange(room_type, desc(n)) %>%
  kable(caption = "neighborhood-specific room type counts")

neighborhood-specific room type counts
neighbourhood_group	room_type	n
Manhattan	Entire home/apt	13199
Brooklyn	Entire home/apt	9559
Queens	Entire home/apt	2096
Bronx	Entire home/apt	379
Staten Island	Entire home/apt	176
Brooklyn	Private room	10132
Manhattan	Private room	7982
Queens	Private room	3372
Bronx	Private room	652
Staten Island	Private room	188
Manhattan	Shared room	480
Brooklyn	Shared room	413
Queens	Shared room	198
Bronx	Shared room	60
Staten Island	Shared room	9

# count of distinct neighbourhoods per group
dataset %>%
  distinct(neighbourhood_group, neighbourhood) %>%
  count(neighbourhood_group) %>%
  kable(caption = "distinct neighborhood per neighborhood group")

distinct neighborhood per neighborhood group
neighbourhood_group	n
Bronx	48
Brooklyn	47
Manhattan	32
Queens	51
Staten Island	43

Let’s further investigate our continuous variables via histograms:

plot_histogram(dataset)

We see that id is fairly evenly distributed, while most other metrics are right-skewed. latidue and longitude indicate that the properties are distributed across an oval with the center having a higher concentration of properties. availability_365 is 0 in ~17,500 cases (~37% of observations). As shown in the violin plot below, Brooklyn and Manhattan have the largest number of 0 availability properties and are skewed heavily towards having less nights available. Bronx and Staten Island are more evenly distributed. We might infer that there are more home/apartment owners in Brooklyn and Manhattan who are renting out their property for a small portion of the year.

dataset %>%
  ggviolin(x = "neighbourhood_group", y = "availability_365",
           trim = TRUE, xlab = "Neighbourhood Group", ylab = "Days Available")

Looking at the density of observations of price vs. the normal distribution drawn from the same mean and standard deviation, it is obvious they are quite different. The bottom left hand blot below shows the quantile-quantile (QQ) plot for price vs. normal distribution; if price was normally distributed, the line would be close to straight. If we apply a log transformation to price and compare the result to the normal distribution, they are much more closely aligned. Indeed, the QQ plot in the bottom right is much more linear. Many statistical models assume that the response variable follows a normal distribution. When we attempt to predict price further down, we will apply a log transformation so that our model predicts log(price). The true price prediction can then be extracted.

Note: price is sometimes 0, and $\log(0)$ is undefined. To avoid this problem, we apply the transformation $\log(price + 1)$.

mean <- mean(dataset$price)
sd <- sd(dataset$price)
normal_sample <- rnorm(1000, mean = mean, sd = sd)

p1 <- ggplot(dataset, aes(x = price)) +
  stat_function(fun = dnorm, n = 100, args = list(mean = mean, sd = sd), color = "red") +
  labs(title = "Price Density vs. Normal Distribution", ylab = "") +
  geom_density() +
  ggthemes::theme_tufte()

p2 <- ggplot(dataset, aes(sample = price)) +
    stat_qq() +
    stat_qq_line() +
    labs(title = "QQ Plot: Price vs. Theoretical Normal", ylab = "") +
    ggthemes::theme_tufte()

mean_log <- mean(log(dataset$price + 1))
sd_log <- sd(log(dataset$price + 1))
normal_sample_log <- rnorm(1000, mean = mean_log, sd = sd_log)

p3 <- ggplot(dataset, aes(x = log(price + 1))) +
  stat_function(fun = dnorm, n = 100, args = list(mean = mean_log, sd = sd_log), color = "red") +
  labs(title = "Log Price Density vs. Normal Distribution", ylab = "") +
  geom_density() +
  ggthemes::theme_tufte()

p4 <- ggplot(dataset, aes(sample = log(price + 1))) +
    stat_qq() +
    stat_qq_line() +
    labs(title = "QQ Plot: Log Price vs. Theoretical Normal", ylab = "") +
    ggthemes::theme_tufte()

(p1 + p3) / (p2 + p4)

Is there a relationship between the number of nights a property is available and its cost? Properties that are available for a majority of the year are likely taken more seriously as business ventures (indeed they might be purely business ventures rather than, say, an individual renting out their converted basement to make some extra money). The scatter plots below show that for most of the neighborhood groups, price and availability do not appear to be related. However, Manhattan apartments tend to increase in price with their availability. We also see that Manhattan is the most expensive neighbourhood group to rent from, while Queens, Bronx, and Staten Island all look similarly inexpensive.

dataset %>%
  group_by(availability_365, neighbourhood_group) %>%
  summarise(average_price = mean(price, na.rm = TRUE)) %>%
  filter(average_price < 1000) %>%
  ggplot(aes(x = availability_365, y = average_price)) +
  geom_point() +
  facet_wrap(~neighbourhood_group)

To view the distribution of price by neighbourhood group, we’ll plot another violin plot (below). This plot has its y-axis (price) on a log10 scale because while there are a wide range of prices, most properties are priced less than 200. Without the log scale, it would be hard to see the distribution among properties in this price range. The plot confirms that Manhattan is the most expensive area - only 25% of its properties are priced at less than $100/night, but Bronx, Queens, and Staten Island all have close to 75% of their properties for less than $100/night.

ggviolin(dataset, x = "neighbourhood_group", y = "price",
         trim = TRUE, yscale = "log10", draw_quantiles = c(0.25, 0.5, 0.75))

To view the prices on a map, we’ll use the leaflet package. We’ll bin the locations into a rectangular grid and calculate the average price within each grid. A continuous color palette will allow us to visualize the magnitude of the average price. Note that we could also plot the individual locations, coloring each point based on the price. However, since there are 50,000 locations it would be difficult to get a sense for the distribution.

We can see that the most expensive area to rent is Manhattan (confirming what we already saw in the violin plots above). However, we can now also see that the most expensive regions to rent are located along the waterfronts!

# create a data frame where each row is the (long1, long2, lat1, lat2)
# coordinates in a 50 x 50 grid covering the longitude and latitude range
# present in the data set.
long_part = 50
lat_part = 50

long_delt = (max(dataset$longitude) - min(dataset$longitude)) / long_part
lat_delt = (max(dataset$latitude) - min(dataset$latitude)) / lat_part

longs = seq(min(dataset$longitude), max(dataset$longitude), by = long_delt)
lats  = seq(min(dataset$latitude), max(dataset$latitude), by = lat_delt)

grid <- expand.grid(longs, lats)

names(grid) <- c("long1", "lat1")

grid <- grid %>%
  mutate(long2 = long1 + long_delt, lat2 = lat1 + lat_delt)

to_plot <- filter(dataset, !is.na(longitude) & !is.na(latitude))

# conditionally join the grid to the original data set so that each location
# is identified with a square in the grid.  we could also do a cartesian join
# and then filter based on containment using dplyr, but this could potentially
# cause OOM issues on some machines --- there are 2,500 grid points and 50,000
# locations.
merged <- sqldf('SELECT
         long1, long2, lat1, lat2, longitude, latitude, id, price
       FROM
         grid
         LEFT JOIN dataset
         ON longitude >= long1
            AND longitude < long2
            AND latitude >= lat1
            AND latitude < lat2
       WHERE
         longitude IS NOT NULL
         AND latitude IS NOT NULL')

# get the average price per square and create a palette based on the (natural)
# log of those values
squares <- merged %>%
  group_by(long1, long2, lat1, lat2) %>%
  summarise(price = mean(price, na.rm = T), obs = n())

pal <- colorNumeric(
  palette = "YlGnBu",
  domain = range(log(squares$price))
)

myf <- function(x) { round(exp(1) ^ x, -2) }

leaflet(squares) %>%
  addTiles() %>%
  addRectangles(lng1 = ~long1, lat1 = ~lat1, lng2 = ~long2, lat2 = ~lat2,
                fillColor = ~pal(log(squares$price)),
                fillOpacity = 0.7,
                popup = ~price,
                weight = 1,
                smoothFactor = 0.2) %>%
  addLegend(pal = pal, values = ~log(price), title = "Price", labFormat = labelFormat(transform = myf))

3 Modeling Stages

We’ll be leveraging H2O’s implementation of XGBoost to predict a rental’s price, using grid search to approximate the optimal model parameters. The naive model of $price = avg(price)$ will be used as a baseline from which to compare the MSE of our model. As discussed above, the response variable will be log-transformed to get us closer to a normal distribution. This transformation introduces extra complexity in our model selection process - if we select parameters based on the ability of the model to predict log(price), we may not be selecting the best parameters for predicting the resulting price. Some custom helper functions are written below to transform predicted log(price) values back into price values before calculating MSE and R^2 values.

3.1 Prepare features for modeling.

We’ll be predicting price as a function of most of the features in the original dataset. In particular, id, host_id, and host_name will dropped since they are mostly distinct across the observations. Categorical variables are one-hot encoded with full rank; that is, a categorical variable with N levels will be encoded into N-1 columns so that no linear dependences are induced between columns. Finally, price will be log-scaled to make its distribution closer to the uniform distribution. As discussed above, the transformation will be $price\rightarrow \log(price + 1)$ instead of $price\rightarrow\log(price)$ since price is sometimes 0 and $\log(0)$ is undefined.

Select relevant columns:

to_model <- dataset %>%
  select(price, neighbourhood, latitude, longitude, room_type, neighbourhood_group,
         minimum_nights, number_of_reviews, reviews_per_month,
         availability_365, calculated_host_listings_count, days_since_review)

Log-transform price:

to_model <- to_model %>%
  mutate(log_price = log(price + 1)) %>%
  select(-price)

One-hot encoding:

dmy <- dummyVars("~ .", data = to_model, fullRank = T)

to_model <- data.frame(predict(dmy, newdata = to_model))

Split into train and test:

set.seed(1234)

to_model <- to_model %>%
  mutate(id = row_number())

train <- to_model %>%
  sample_frac(0.8)

test <- anti_join(to_model, train, by = "id") %>%
  select(-id)

train <- train %>%
  select(-id)

to_model <- to_model %>%
  select(-id)

Set response and features

response <- "log_price"
features <- setdiff(colnames(to_model), response)

3.2 Helper functions for reporting.

Since price has been log-transformed for prediction purposes, some helper functions have been defined below to extract price from $log(price)$ and calculate the resulting $R^2$ and MSE values. These functions will be important when we tune our model’s parameters for optimal performance. If we optimize towards lower values of the MSE of the log result, we may not actually be improving the MSE of the unlogged result.

Remove log transformations from predictions and true results before calculating MSE and R^2:

mse_unlog <- function(pred_log, true_log) {
  pred <- exp(pred_log) - 1
  true <- exp(true_log) - 1

  mean((pred - true) ^ 2)
}

rsquared_unlog <- function(pred_log, true_log) {
  1 - mse_unlog(pred_log, true_log) / mse_unlog(rep(mean(true_log), length(true_log)), true_log)
}

We’ll be estimating the test MSE for each model fitted on the training data using cross validation. However, since our response variable has been log-transformed, the CV MSE returned by H2O will be the CV MSE for the logged data. The functions below takes the CV predictions from an H2O model for which keep_cross_validation_predictions = TRUE and keep_cross_validation_fold_assignment = TRUE have been set and removes the log transformation before calculating the CV MSE. A function for calculating the baseline MSE is also included. Note: the H2O package (in R) does not accept custom error functions.

get_cv_predictions_as_df <- function(model) {

    # Combine CV predictions into a data.frame. Note: The cross validation predictions have the same number of rows as the entire input training frame with 0s filled in for all rows that are not in the hold out.
  cv_predictions <- h2o.cross_validation_predictions(model)
  cv_predictions <- lapply(cv_predictions, as.data.frame)
  cv_df <- do.call("cbind", cv_predictions)

    # Get the fold assignment of each CV set so that the rows that are not in the hold out can be set to NA.  If the nth row of the fold_assignment vector is equal to i, then the i+1th row of column i is not set to NA, otherwise it is set to NA.
  fold_assignment_vec <- as.vector(h2o.cross_validation_fold_assignment(model))
  for (i in 1:ncol(cv_df)) {
    cv_df[, i][fold_assignment_vec != i - 1] <- NA
  }

  cv_df

}


get_unlogged_cv_mse <- function(model, true_result_vec) {

  cv_df <- get_cv_predictions_as_df(model)

  # Calculate the difference between the hold out predictions and the true training result, removing the log transormation beforehand.
  cv_df <- apply(cv_df, 2, function(x) exp(x) - exp(true_result_vec))

  # Calculate the MSE for each hold out set and then average the results to get the final CV MSE
  mses <- apply(cv_df, 2, function(x) sum(x^2, na.rm = T) / length(x[!is.na(x)]))
  mean(as.vector(mses))

}

get_unlogged_baseline_mse <- function(model, true_result_vec) {

  cv_df <- get_cv_predictions_as_df(model)

  for (i in 1:ncol(cv_df)) {
    cv_true_result_vec <- true_result_vec
    cv_true_result_vec[is.na(cv_df)[, i]] <- NA
    cv_baseline_estimate <- mean(exp(cv_true_result_vec), na.rm = T)
    cv_df[, i][!is.na(cv_df[, i])] <- cv_baseline_estimate
  }

  # Calculate the difference between the hold out predictions and the true training result, removing the log transormation beforehand.
  cv_df <- apply(cv_df, 2, function(x) x - exp(true_result_vec))

  # Calculate the MSE for each hold out set and then average the results to get the final CV MSE
  mses <- apply(cv_df, 2, function(x) sum(x^2, na.rm = T) / length(x[!is.na(x)]))
  mean(as.vector(mses))

}

We also have a helper function to retrieve parameters of interest from a set of grid search models alongside measures of fit:

get_results <- function(h2o_model, true_result) {
  data.frame(
    mse_logged = h2o.mse(h2o_model),
    mse        = get_unlogged_cv_mse(h2o_model, true_result),
    mse_baseline = get_unlogged_baseline_mse(h2o_model, true_result),
    learn_rate = h2o_model@allparameters$learn_rate,
    max_depth  = h2o_model@allparameters$max_depth,
    reg_alpha  = h2o_model@allparameters$reg_alpha,
    ntrees     = h2o_model@allparameters$ntrees
  )
}

get_grid_results <- function(h2o_grid, true_result) {
  do.call(
    rbind,
    lapply(h2o_grid@model_ids, function(x) {
      model <- h2o.getModel(x)
      get_results(model, true_result)
    })
  )
}

3.3 Training XGBoost.

Start-up H2O and initialize H2O data frames:

library(h2o)

# start and connect to H2O instance; defaults to using all available threads
h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         11 days 4 hours 
##     H2O cluster timezone:       America/Los_Angeles 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.2 
##     H2O cluster version age:    1 month  
##     H2O cluster name:           H2O_started_from_R_zackbarry_ggu835 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.42 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.5.3 (2019-03-11)

# don't show progress bar during h2o processes
h2o.no_progress()

train_h2o <- as.h2o(train)
test_h2o  <- as.h2o(test)

3.3.1 Hyperparameter search

We’ll focus on tuning learning rate, number of trees, number of terminal nodes, and the L1 regularization parameter. To start with, ntrees is set to 200 and the L1 term (reg_alpha) is set to 0 (the default value); telescopic search is applied to learning rate alongside evenly spaced values for max_depth.

The plots below show that learn rates of 0.01 and 0.001 never perform better than the baseline model. However, learn rates of 0.1 and 1 have lower CV MSE than the baseline for each value of max depth. Learn rate 0.1 performs better than learn rate 1 for most values of max depth, but not for small values of max depth (<= 3). We need to be careful about using higher values for max depth since (in general), bias increases with max depth. With that in mind, we’ll next fine tune max depth for a fixed learning rate of 0.1 before testing different values of the L1 regularization parameter.

ntrees <- 200
reg_alpha <- 0

hyper_parameters <- list(
  learn_rate = c(0.001, 0.01, 0.1, 1),
  max_depth = c(2, 5, 8, 11)
)

grid <- h2o.grid("xgboost",
                 hyper_params = hyper_parameters,
                 x = features,
                 y = response,
                 ntrees = ntrees,
                 reg_alpha = reg_alpha,
                 nfolds = 5,
                 training_frame = train_h2o,
                 distribution = "gaussian",
                 keep_cross_validation_predictions = TRUE,
                 keep_cross_validation_fold_assignment = TRUE
)

results <- get_grid_results(grid, train$log_price)

ymin <- 0.975 * min(c(results$mse, results$mse_baseline))
ymax <- 1.025 * max(c(results$mse, results$mse_baseline))

p1 <- ggplot(results, aes(x = max_depth, y = mse, color = factor(learn_rate))) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "MSE vs. Max Depth", x = "Max Depth", y = "MSE")

p2 <- ggplot(results, aes(x = max_depth, y = mse_baseline, color = factor(learn_rate))) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "Baseline MSE", x = "MSE", y = "Max Depth")

p1 + p2

Below, the cross-validated MSE is plotted against max depth values in the set (5, 8, 11, 14, 17). Learn rate is fixed at 0.1 based on the results above. We see that the model outperforms the baseline estimate for each max depth and that larger values of max depth tend to outperform smaller values. However, we are also cautious about choosing too large a value for max depth and overfitting the data. There appears to be diminishing returns in MSE improvements for max depth values higher than 11.

ntrees <- 200
reg_alpha <- 0
learn_rate <- 0.1

hyper_parameters <- list(
  max_depth = c(5, 8, 11, 14, 17)
)

grid_2 <- h2o.grid("xgboost",
                 hyper_params = hyper_parameters,
                 x = features,
                 y = response,
                 distribution = "gaussian",
                 keep_cross_validation_predictions = TRUE,
                 keep_cross_validation_fold_assignment = TRUE,
                 training_frame = train_h2o,
                 nfolds = 5,
                 ntrees = ntrees,
                 learn_rate = learn_rate,
                 reg_alpha = 0
)

results_2 <- get_grid_results(grid_2, train$log_price)

ymin <- 0.975 * min(c(results_2$mse, results_2$mse_baseline))
ymax <- 1.025 * max(c(results_2$mse, results_2$mse_baseline))

p1 <- ggplot(results_2, aes(x = max_depth, y = mse)) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "MSE vs. Max Depth \n Learn Rate = 0.1", x = "Max Depth", y = "MSE")

p2 <- ggplot(results_2, aes(x = max_depth, y = mse_baseline)) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "Baseline MSE", x = "Max Depth", y = "MSE")

p1 + p2

Based on the results below, we see lower CV MSE for an L1 regularization parameter of 0.6. Learn rate was fixed at 0.1 and max depth fixed at 11 for this result. Our final step will be to fine tune the learning rate.

ntrees <- 200
learn_rate <- 0.1
max_depth <- 11

hyper_parameters <- list(
  reg_alpha = c(0, 0.2, 0.4, 0.6, 0.8, 1.0)
)

grid_3 <- h2o.grid("xgboost",
                 hyper_params = hyper_parameters,
                 x = features,
                 y = response,
                 ntrees = ntrees,
                 learn_rate = learn_rate,
                 max_depth = max_depth,
                 nfolds = 5,
                 training_frame = train_h2o,                 
                 distribution = "gaussian",
                 keep_cross_validation_predictions = TRUE,
                 keep_cross_validation_fold_assignment = TRUE
)

results_3 <- get_grid_results(grid_3, train$log_price)

ymin <- 0.975 * min(c(results_3$mse, results_3$mse_baseline))
ymax <- 1.025 * max(c(results_3$mse, results_3$mse_baseline))

p1 <- ggplot(results_3, aes(x = reg_alpha, y = mse)) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "MSE vs. Max Depth \n Learn Rate = 0.1 \n Max Depth = 11", x = "L1 Coefficient", y = "MSE")

p2 <- ggplot(results_3, aes(x = reg_alpha, y = mse_baseline)) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "Baseline MSE", x = "L1 Coefficient", y = "MSE")

p1 + p2

With max depth set tuned to 11 and the L1 regularization term tuned to 0.6, the optimal learn rate value appears to be near 0.22. We’re now ready to apply our final parameter set to the test set!

ntrees <- 200
max_depth <- 11
reg_alpha <- 0.6

hyper_parameters <- list(
  learn_rate = c(0.06, 0.1, 0.14, 0.18, 0.22, 0.26)
)

grid_4 <- h2o.grid("xgboost",
                 hyper_params = hyper_parameters,
                 x = features,
                 y = response,
                 ntrees = ntrees,
                 max_depth = max_depth,
                 reg_alpha = reg_alpha,
                 nfolds = 5,
                 training_frame = train_h2o,
                 distribution = "gaussian",
                 keep_cross_validation_predictions = TRUE,
                 keep_cross_validation_fold_assignment = TRUE
)

results_4 <- get_grid_results(grid_4, train$log_price)

ymin <- 0.975 * min(c(results_4$mse, results_4$mse_baseline))
ymax <- 1.025 * max(c(results_4$mse, results_4$mse_baseline))

p1 <- ggplot(results_4, aes(x = learn_rate, y = mse)) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "MSE vs. Max Depth \n Learn Rate = 0.1 \n Max Depth = 11", x = "Learn Rate", y = "MSE")

p2 <- ggplot(results_4, aes(x = learn_rate, y = mse_baseline)) +
  ylim(ymin, ymax) +
  geom_point() +
  geom_line() +
  ggthemes::theme_tufte() +
  labs(title = "Baseline MSE", x = "Learn Rate", y = "MSE")

p1 + p2

3.4 Final model performance

Our final tuned model has an MSE of 51500 (and RMSE of $227), which is an improvement over the baseline model which has MSE 64000 (RMSE of $252). The R^2 value tells us that the tuned model explains 20% of the variance in the test data.

final_model <- h2o.xgboost(
                 x = features,
                 y = response,
                 ntrees = 200,
                 max_depth = 11,
                 reg_alpha = 0.6,
                 learn_rate = 0.22,
                 training_frame = train_h2o,
                 distribution = "gaussian",
                 keep_cross_validation_predictions = TRUE,
                 keep_cross_validation_fold_assignment = TRUE
)

final_prediction <- h2o.predict(final_model, test_h2o) %>%
  as.vector()

baseline_pred_log_price <- rep(mean(train$log_price), nrow(test))

sprintf("Baseline MSE: %f", mse_unlog(baseline_pred_log_price, test$log_price))

## [1] "Baseline MSE: 64082.639529"

sprintf("Tuned XGBoost MSE: %f", mse_unlog(final_prediction, test$log_price))

## [1] "Tuned XGBoost MSE: 51003.870218"

sprintf("Tuned XGBoost R2: %f", rsquared_unlog(final_prediction, test$log_price))

## [1] "Tuned XGBoost R2: 0.204177"

h2o.varimp_plot(final_model)

4 Conclusion

We were able to use XGBoost to predict rental prices much more effectively than the baseline “naive” model. The baseline model had an MSE of 64000 (RSME of $252). Tuning the learning rate, depth, and L1 regularization parameters of the XGBoost model led to a final MSE value of 51500 (RMSE of $\227) and a final R^2 value of 0.2. These results are as good as we might have hoped, especially for a feature as important to the hospitality business as price. XGBoost is, however, one of the most popular non-parametric modeling frameworks so we trust that its results are at least similar to what other packages could provide. The “ideas for improvement” section below suggests some next steps one could take, using this document as a starting point.

Two of the four most important predictors turned out to be whether the room was private or shared. Due to the one-hot encoding, implictly included is whether the room was the entire home or apartment. This makes sense because a shared room is likely to command a significantly lower price than a private room or an entire home/apartment. Longitude and latitude were also important; as we saw in the map in the exploratory analysis section there are regions of NY that are more expensive than others. This geospatial relationship is reflect in the combination of latitude and longitude.

4.1 Ideas for improvement

Create features using the names of the listings to see if name has a noticeable impact on the predictive accuracy of the model.
Reduce the complexity of the feature space by combining some of the one-hot encoded neighborhoods. Lasso fusion for categorical variables would be interesting: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels.
Use the catboost framework to see if its handling of categorical data outperforms XGBoost.

sessionInfo()

## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.15.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] h2o_3.28.0.2         DT_0.8               knitr_1.22          
##  [4] caret_6.0-84         lattice_0.20-38      patchwork_1.0.0     
##  [7] sqldf_0.4-11         RSQLite_2.2.0        gsubfn_0.7          
## [10] proto_1.0.0          leaflet.extras_1.0.0 leaflet_2.0.2       
## [13] ggpubr_0.2.4         magrittr_1.5         DataExplorer_0.8.0  
## [16] forcats_0.4.0        stringr_1.4.0        dplyr_0.8.3         
## [19] purrr_0.3.2          readr_1.3.1          tidyr_1.0.0         
## [22] tibble_2.1.3         ggplot2_3.2.1        tidyverse_1.2.1     
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1   ggsignif_0.5.0     ellipsis_0.3.0    
##  [4] class_7.3-15       rstudioapi_0.10    bit64_0.9-7       
##  [7] prodlim_2018.04.18 lubridate_1.7.4    xml2_1.2.1        
## [10] codetools_0.2-16   splines_3.5.3      zeallot_0.1.0     
## [13] jsonlite_1.6.1     broom_0.5.2        shiny_1.4.0       
## [16] compiler_3.5.3     httr_1.4.1         backports_1.1.5   
## [19] assertthat_0.2.1   Matrix_1.2-15      fastmap_1.0.1     
## [22] lazyeval_0.2.2     cli_1.1.0          later_1.0.0       
## [25] htmltools_0.4.0    tools_3.5.3        igraph_1.2.4.1    
## [28] gtable_0.3.0       glue_1.3.1         reshape2_1.4.3    
## [31] ggthemes_4.2.0     Rcpp_1.0.3         cellranger_1.1.0  
## [34] vctrs_0.2.1        nlme_3.1-137       iterators_1.0.10  
## [37] crosstalk_1.0.0    timeDate_3043.102  gower_0.2.1       
## [40] xfun_0.9           networkD3_0.4      rvest_0.3.3       
## [43] mime_0.7           lifecycle_0.1.0    MASS_7.3-51.4     
## [46] scales_1.0.0       ipred_0.9-9        hms_0.4.2         
## [49] promises_1.1.0     parallel_3.5.3     RColorBrewer_1.1-2
## [52] yaml_2.2.0         memoise_1.1.0      gridExtra_2.3     
## [55] rpart_4.1-13       stringi_1.4.3      highr_0.8         
## [58] foreach_1.4.4      lava_1.6.6         chron_2.3-54      
## [61] rlang_0.4.2        pkgconfig_2.0.3    bitops_1.0-6      
## [64] evaluate_0.13      recipes_0.1.6      htmlwidgets_1.5.1 
## [67] labeling_0.3       bit_1.1-14         tidyselect_0.2.5  
## [70] plyr_1.8.4         R6_2.4.1           generics_0.0.2    
## [73] DBI_1.1.0          pillar_1.4.2       haven_2.1.0       
## [76] withr_2.1.2        survival_2.43-3    RCurl_1.95-4.12   
## [79] nnet_7.3-12        modelr_0.1.5       crayon_1.3.4      
## [82] rmarkdown_1.12     grid_3.5.3         readxl_1.3.1      
## [85] data.table_1.12.2  blob_1.2.1         ModelMetrics_1.2.2
## [88] digest_0.6.23      xtable_1.8-4       httpuv_1.5.2      
## [91] stats4_3.5.3       munsell_0.5.0      tcltk_3.5.3