knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(DataExplorer)
library(ggpubr)
library(leaflet)
library(leaflet.extras)
library(sqldf)
library(patchwork)
library(caret)
library(knitr)
library(DT)

dataset <- read_csv("Data/AB_NYC_2019.csv")

1 Introduction

Airbnb is one of the most prominent online marketplaces for reserving travel lodgings. It hosts vacation rentals all over the globe, hosting over two million people each night. However, Airbnb does not actually own any of the properties it lists. Rather, it facilates connections between owners and renters. At its core, Airbnb is truly a data company. Each rental has a myriad of labels and restrictions: which days they are available, cost, location, ameneties, host descriptions… the list goes on.

The data set we’ll be exploring in this document is a collection of property data from Airbnb in New York City.

1.1 Data source

This data set has nearly 50,000 observations of 16 different variables, 6 of which are discrete and 10 of which are continuous.


2 Pre-Modeling Stages

2.1 Acquiring/Loading Data

A high-level overview of the types of variables / amount of missing data shows that we have 10 continuous and 6 discrete variables. Most rows have no missing values, and the total number of missing observations is quite low. Since there are 50,000 observations to draw from, we should be able to generate some great insights!

2.2 Data Wrangling

Missing observations are mostly limited to reviews_per_month and last_review. Upon further investigation, these missing observations occur when number_of_reviews is equal to 0. This makes sense because if a property has never received any reviews there is no concept of “most recent review” nor can a rate of reviews be calculated. However, it makes sense to set reviews_per_month to 0 when it is missing. For last_review, we’ll replace it with a new variable days_since_review. Each non-missing observation is formatted YYYY-MM-DD so this will be easy to do. Furthermore we can replace the missing values with -1 so that the new variable is prepped for modeling.

Replace missing reviews_per_month with 0 and replace last_review with days_since_last_review where -1 stands in for missing values.

The other missing observations are in the host_name and name variables. Since there is already a host_id field, these are not needed for identifying distinct hosts. If we were to use these variables for modeling purposes, we could replace their missing values with host_id. However, using name opens up privacy and discrimination concerns that we should avoid. We’ll drop both fields for this analysis.

2.3 Exploring and visualizating data.

The bar plots below give further insight into our discrete (categorical) variables. Our neighbourhood_group observations are limited to 5 neighborhood groups, with Manhattan and Brooklyn comprising the majority. Room type is mostly evenly split between “entire home/apt” and “private room”, with “shared room” rarely appearing. Unsurprisingly, days_since_review has too many distinct categories to plot - it should probably be treated as a continuous variable. Finally, there are 221 distinct neighborhoods within our neighborhood groups.

neighborhood-specific room type counts
neighbourhood_group room_type n
Manhattan Entire home/apt 13199
Brooklyn Entire home/apt 9559
Queens Entire home/apt 2096
Bronx Entire home/apt 379
Staten Island Entire home/apt 176
Brooklyn Private room 10132
Manhattan Private room 7982
Queens Private room 3372
Bronx Private room 652
Staten Island Private room 188
Manhattan Shared room 480
Brooklyn Shared room 413
Queens Shared room 198
Bronx Shared room 60
Staten Island Shared room 9
distinct neighborhood per neighborhood group
neighbourhood_group n
Bronx 48
Brooklyn 47
Manhattan 32
Queens 51
Staten Island 43

Let’s further investigate our continuous variables via histograms:

We see that id is fairly evenly distributed, while most other metrics are right-skewed. latidue and longitude indicate that the properties are distributed across an oval with the center having a higher concentration of properties. availability_365 is 0 in ~17,500 cases (~37% of observations). As shown in the violin plot below, Brooklyn and Manhattan have the largest number of 0 availability properties and are skewed heavily towards having less nights available. Bronx and Staten Island are more evenly distributed. We might infer that there are more home/apartment owners in Brooklyn and Manhattan who are renting out their property for a small portion of the year.

Looking at the density of observations of price vs. the normal distribution drawn from the same mean and standard deviation, it is obvious they are quite different. The bottom left hand blot below shows the quantile-quantile (QQ) plot for price vs. normal distribution; if price was normally distributed, the line would be close to straight. If we apply a log transformation to price and compare the result to the normal distribution, they are much more closely aligned. Indeed, the QQ plot in the bottom right is much more linear. Many statistical models assume that the response variable follows a normal distribution. When we attempt to predict price further down, we will apply a log transformation so that our model predicts log(price). The true price prediction can then be extracted.

Note: price is sometimes 0, and \(\log(0)\) is undefined. To avoid this problem, we apply the transformation \(\log(price + 1)\).

Is there a relationship between the number of nights a property is available and its cost? Properties that are available for a majority of the year are likely taken more seriously as business ventures (indeed they might be purely business ventures rather than, say, an individual renting out their converted basement to make some extra money). The scatter plots below show that for most of the neighborhood groups, price and availability do not appear to be related. However, Manhattan apartments tend to increase in price with their availability. We also see that Manhattan is the most expensive neighbourhood group to rent from, while Queens, Bronx, and Staten Island all look similarly inexpensive.

To view the distribution of price by neighbourhood group, we’ll plot another violin plot (below). This plot has its y-axis (price) on a log10 scale because while there are a wide range of prices, most properties are priced less than 200. Without the log scale, it would be hard to see the distribution among properties in this price range. The plot confirms that Manhattan is the most expensive area - only 25% of its properties are priced at less than $100/night, but Bronx, Queens, and Staten Island all have close to 75% of their properties for less than $100/night.

To view the prices on a map, we’ll use the leaflet package. We’ll bin the locations into a rectangular grid and calculate the average price within each grid. A continuous color palette will allow us to visualize the magnitude of the average price. Note that we could also plot the individual locations, coloring each point based on the price. However, since there are 50,000 locations it would be difficult to get a sense for the distribution.

We can see that the most expensive area to rent is Manhattan (confirming what we already saw in the violin plots above). However, we can now also see that the most expensive regions to rent are located along the waterfronts!

# create a data frame where each row is the (long1, long2, lat1, lat2)
# coordinates in a 50 x 50 grid covering the longitude and latitude range
# present in the data set.
long_part = 50
lat_part = 50

long_delt = (max(dataset$longitude) - min(dataset$longitude)) / long_part
lat_delt = (max(dataset$latitude) - min(dataset$latitude)) / lat_part

longs = seq(min(dataset$longitude), max(dataset$longitude), by = long_delt)
lats  = seq(min(dataset$latitude), max(dataset$latitude), by = lat_delt)

grid <- expand.grid(longs, lats)

names(grid) <- c("long1", "lat1")

grid <- grid %>%
  mutate(long2 = long1 + long_delt, lat2 = lat1 + lat_delt)

to_plot <- filter(dataset, !is.na(longitude) & !is.na(latitude))

# conditionally join the grid to the original data set so that each location
# is identified with a square in the grid.  we could also do a cartesian join
# and then filter based on containment using dplyr, but this could potentially
# cause OOM issues on some machines --- there are 2,500 grid points and 50,000
# locations.
merged <- sqldf('SELECT
         long1, long2, lat1, lat2, longitude, latitude, id, price
       FROM
         grid
         LEFT JOIN dataset
         ON longitude >= long1
            AND longitude < long2
            AND latitude >= lat1
            AND latitude < lat2
       WHERE
         longitude IS NOT NULL
         AND latitude IS NOT NULL')

# get the average price per square and create a palette based on the (natural)
# log of those values
squares <- merged %>%
  group_by(long1, long2, lat1, lat2) %>%
  summarise(price = mean(price, na.rm = T), obs = n())

pal <- colorNumeric(
  palette = "YlGnBu",
  domain = range(log(squares$price))
)

myf <- function(x) { round(exp(1) ^ x, -2) }

leaflet(squares) %>%
  addTiles() %>%
  addRectangles(lng1 = ~long1, lat1 = ~lat1, lng2 = ~long2, lat2 = ~lat2,
                fillColor = ~pal(log(squares$price)),
                fillOpacity = 0.7,
                popup = ~price,
                weight = 1,
                smoothFactor = 0.2) %>%
  addLegend(pal = pal, values = ~log(price), title = "Price", labFormat = labelFormat(transform = myf))

3 Modeling Stages

We’ll be leveraging H2O’s implementation of XGBoost to predict a rental’s price, using grid search to approximate the optimal model parameters. The naive model of \(price = avg(price)\) will be used as a baseline from which to compare the MSE of our model. As discussed above, the response variable will be log-transformed to get us closer to a normal distribution. This transformation introduces extra complexity in our model selection process - if we select parameters based on the ability of the model to predict log(price), we may not be selecting the best parameters for predicting the resulting price. Some custom helper functions are written below to transform predicted log(price) values back into price values before calculating MSE and R^2 values.

3.1 Prepare features for modeling.

We’ll be predicting price as a function of most of the features in the original dataset. In particular, id, host_id, and host_name will dropped since they are mostly distinct across the observations. Categorical variables are one-hot encoded with full rank; that is, a categorical variable with N levels will be encoded into N-1 columns so that no linear dependences are induced between columns. Finally, price will be log-scaled to make its distribution closer to the uniform distribution. As discussed above, the transformation will be \(price\rightarrow \log(price + 1)\) instead of \(price\rightarrow\log(price)\) since price is sometimes 0 and \(\log(0)\) is undefined.

Select relevant columns:

Log-transform price:

One-hot encoding:

Split into train and test:

Set response and features

3.2 Helper functions for reporting.

Since price has been log-transformed for prediction purposes, some helper functions have been defined below to extract price from \(log(price)\) and calculate the resulting \(R^2\) and MSE values. These functions will be important when we tune our model’s parameters for optimal performance. If we optimize towards lower values of the MSE of the log result, we may not actually be improving the MSE of the unlogged result.

Remove log transformations from predictions and true results before calculating MSE and R^2:

We’ll be estimating the test MSE for each model fitted on the training data using cross validation. However, since our response variable has been log-transformed, the CV MSE returned by H2O will be the CV MSE for the logged data. The functions below takes the CV predictions from an H2O model for which keep_cross_validation_predictions = TRUE and keep_cross_validation_fold_assignment = TRUE have been set and removes the log transformation before calculating the CV MSE. A function for calculating the baseline MSE is also included. Note: the H2O package (in R) does not accept custom error functions.

get_cv_predictions_as_df <- function(model) {

    # Combine CV predictions into a data.frame. Note: The cross validation predictions have the same number of rows as the entire input training frame with 0s filled in for all rows that are not in the hold out.
  cv_predictions <- h2o.cross_validation_predictions(model)
  cv_predictions <- lapply(cv_predictions, as.data.frame)
  cv_df <- do.call("cbind", cv_predictions)

    # Get the fold assignment of each CV set so that the rows that are not in the hold out can be set to NA.  If the nth row of the fold_assignment vector is equal to i, then the i+1th row of column i is not set to NA, otherwise it is set to NA.
  fold_assignment_vec <- as.vector(h2o.cross_validation_fold_assignment(model))
  for (i in 1:ncol(cv_df)) {
    cv_df[, i][fold_assignment_vec != i - 1] <- NA
  }

  cv_df

}


get_unlogged_cv_mse <- function(model, true_result_vec) {

  cv_df <- get_cv_predictions_as_df(model)

  # Calculate the difference between the hold out predictions and the true training result, removing the log transormation beforehand.
  cv_df <- apply(cv_df, 2, function(x) exp(x) - exp(true_result_vec))

  # Calculate the MSE for each hold out set and then average the results to get the final CV MSE
  mses <- apply(cv_df, 2, function(x) sum(x^2, na.rm = T) / length(x[!is.na(x)]))
  mean(as.vector(mses))

}

get_unlogged_baseline_mse <- function(model, true_result_vec) {

  cv_df <- get_cv_predictions_as_df(model)

  for (i in 1:ncol(cv_df)) {
    cv_true_result_vec <- true_result_vec
    cv_true_result_vec[is.na(cv_df)[, i]] <- NA
    cv_baseline_estimate <- mean(exp(cv_true_result_vec), na.rm = T)
    cv_df[, i][!is.na(cv_df[, i])] <- cv_baseline_estimate
  }

  # Calculate the difference between the hold out predictions and the true training result, removing the log transormation beforehand.
  cv_df <- apply(cv_df, 2, function(x) x - exp(true_result_vec))

  # Calculate the MSE for each hold out set and then average the results to get the final CV MSE
  mses <- apply(cv_df, 2, function(x) sum(x^2, na.rm = T) / length(x[!is.na(x)]))
  mean(as.vector(mses))

}

We also have a helper function to retrieve parameters of interest from a set of grid search models alongside measures of fit:

3.3 Training XGBoost.

Start-up H2O and initialize H2O data frames:

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         11 days 4 hours 
##     H2O cluster timezone:       America/Los_Angeles 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.2 
##     H2O cluster version age:    1 month  
##     H2O cluster name:           H2O_started_from_R_zackbarry_ggu835 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.42 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.5.3 (2019-03-11)

4 Conclusion

We were able to use XGBoost to predict rental prices much more effectively than the baseline “naive” model. The baseline model had an MSE of 64000 (RSME of $252). Tuning the learning rate, depth, and L1 regularization parameters of the XGBoost model led to a final MSE value of 51500 (RMSE of $\227) and a final R^2 value of 0.2. These results are as good as we might have hoped, especially for a feature as important to the hospitality business as price. XGBoost is, however, one of the most popular non-parametric modeling frameworks so we trust that its results are at least similar to what other packages could provide. The “ideas for improvement” section below suggests some next steps one could take, using this document as a starting point.

Two of the four most important predictors turned out to be whether the room was private or shared. Due to the one-hot encoding, implictly included is whether the room was the entire home or apartment. This makes sense because a shared room is likely to command a significantly lower price than a private room or an entire home/apartment. Longitude and latitude were also important; as we saw in the map in the exploratory analysis section there are regions of NY that are more expensive than others. This geospatial relationship is reflect in the combination of latitude and longitude.

4.1 Ideas for improvement

  1. Create features using the names of the listings to see if name has a noticeable impact on the predictive accuracy of the model.
  2. Reduce the complexity of the feature space by combining some of the one-hot encoded neighborhoods. Lasso fusion for categorical variables would be interesting: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels.
  3. Use the catboost framework to see if its handling of categorical data outperforms XGBoost.
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.15.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] h2o_3.28.0.2         DT_0.8               knitr_1.22          
##  [4] caret_6.0-84         lattice_0.20-38      patchwork_1.0.0     
##  [7] sqldf_0.4-11         RSQLite_2.2.0        gsubfn_0.7          
## [10] proto_1.0.0          leaflet.extras_1.0.0 leaflet_2.0.2       
## [13] ggpubr_0.2.4         magrittr_1.5         DataExplorer_0.8.0  
## [16] forcats_0.4.0        stringr_1.4.0        dplyr_0.8.3         
## [19] purrr_0.3.2          readr_1.3.1          tidyr_1.0.0         
## [22] tibble_2.1.3         ggplot2_3.2.1        tidyverse_1.2.1     
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1   ggsignif_0.5.0     ellipsis_0.3.0    
##  [4] class_7.3-15       rstudioapi_0.10    bit64_0.9-7       
##  [7] prodlim_2018.04.18 lubridate_1.7.4    xml2_1.2.1        
## [10] codetools_0.2-16   splines_3.5.3      zeallot_0.1.0     
## [13] jsonlite_1.6.1     broom_0.5.2        shiny_1.4.0       
## [16] compiler_3.5.3     httr_1.4.1         backports_1.1.5   
## [19] assertthat_0.2.1   Matrix_1.2-15      fastmap_1.0.1     
## [22] lazyeval_0.2.2     cli_1.1.0          later_1.0.0       
## [25] htmltools_0.4.0    tools_3.5.3        igraph_1.2.4.1    
## [28] gtable_0.3.0       glue_1.3.1         reshape2_1.4.3    
## [31] ggthemes_4.2.0     Rcpp_1.0.3         cellranger_1.1.0  
## [34] vctrs_0.2.1        nlme_3.1-137       iterators_1.0.10  
## [37] crosstalk_1.0.0    timeDate_3043.102  gower_0.2.1       
## [40] xfun_0.9           networkD3_0.4      rvest_0.3.3       
## [43] mime_0.7           lifecycle_0.1.0    MASS_7.3-51.4     
## [46] scales_1.0.0       ipred_0.9-9        hms_0.4.2         
## [49] promises_1.1.0     parallel_3.5.3     RColorBrewer_1.1-2
## [52] yaml_2.2.0         memoise_1.1.0      gridExtra_2.3     
## [55] rpart_4.1-13       stringi_1.4.3      highr_0.8         
## [58] foreach_1.4.4      lava_1.6.6         chron_2.3-54      
## [61] rlang_0.4.2        pkgconfig_2.0.3    bitops_1.0-6      
## [64] evaluate_0.13      recipes_0.1.6      htmlwidgets_1.5.1 
## [67] labeling_0.3       bit_1.1-14         tidyselect_0.2.5  
## [70] plyr_1.8.4         R6_2.4.1           generics_0.0.2    
## [73] DBI_1.1.0          pillar_1.4.2       haven_2.1.0       
## [76] withr_2.1.2        survival_2.43-3    RCurl_1.95-4.12   
## [79] nnet_7.3-12        modelr_0.1.5       crayon_1.3.4      
## [82] rmarkdown_1.12     grid_3.5.3         readxl_1.3.1      
## [85] data.table_1.12.2  blob_1.2.1         ModelMetrics_1.2.2
## [88] digest_0.6.23      xtable_1.8-4       httpuv_1.5.2      
## [91] stats4_3.5.3       munsell_0.5.0      tcltk_3.5.3