The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
In a model with geographic components, we want to express a functional \(T\) (usually the expectation or a quantile) of a response \(Y\) as a function \(f\) of a set of geographic features (latitude/longitude and/or postal code and/or other features varying with location), and other features:
\[ T(Y \mid X^\textrm{geo}, X^\textrm{other}) \approx f(X^\textrm{geo}, X^\textrm{other}) \] Like any feature, the effect of a single geographic feature \(X^{\textrm{geo}, j}\) can be described using SHAP dependence plots. However, studying the effect of latitude (or any other location dependent feature) alone is often not very illuminating - simply due to strong interaction effects and correlations with other geographic features.
That’s where the additivity of SHAP values comes into play: The sum of SHAP values of all geographic components represent the total effect of \(X^\textrm{geo}\), and this sum can be visualized as a heatmap or 3D scatterplot against latitude/longitude (or any other geographic representation).
For illustration, we will use a beautiful house price dataset containing information on about 14’000 houses sold in 2016 in Miami-Dade County. Some of the columns are as follows:
(Italic features are geographic components.) For more background on this dataset, see Mayer et al. (2022).
We will fit an XGBoost model to explain the expected log(price) as a function of lat/long, size, and quality/age.
library(xgboost)
library(ggplot2)
library(shapviz)
miami <- miami |>
transform(
log_living = log(TOT_LVG_AREA),
log_land = log(LND_SQFOOT),
log_price = log(SALE_PRC)
)
x_coord <- c("LATITUDE", "LONGITUDE")
x_nongeo <- c("log_living", "log_land", "structure_quality", "age")
xvars <- c(x_coord, x_nongeo)
# Select training data
set.seed(1)
ix <- sample(nrow(miami), 0.8 * nrow(miami))
train <- miami[ix, ]
X_train <- train[xvars]
y_train <- train$log_price
# Fit XGBoost model
params <- list(learning_rate = 0.2, nthread = 1)
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train, nthread = 1)
fit <- xgb.train(params, dtrain, nrounds = 200)
Let’s first study selected SHAP dependence plots for an explanation dataset of size 2000.
X_explain <- X_train[1:2000, ]
sv <- shapviz(fit, X_pred = data.matrix(X_explain))
sv_dependence(
sv,
v = c("log_living", "structure_quality", "LONGITUDE", "LATITUDE"),
alpha = 0.2
)
# And now the two-dimensional plot of the sum of SHAP values
sv_dependence2D(sv, x = "LONGITUDE", y = "LATITUDE") +
coord_equal()
The last plot gives a good impression on price levels.
Notes:
We will now change above model in two ways, not unlike the model in Mayer et al. (2022):
The second step leads to a model that is additive in each non-geographic component and also additive in the combined location effect. According to the technical report Mayer (2022), SHAP dependence plots of additive components in a boosted trees model are shifted versions of corresponding partial dependence plots (evaluated at observed values). This allows a “Ceteris Paribus” interpretation of SHAP dependence plots of corresponding components.
# Extend the feature set
more_geo <- c("CNTR_DIST", "OCEAN_DIST", "RAIL_DIST", "HWY_DIST")
xvars <- c(xvars, more_geo)
X_train <- train[xvars]
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train, nthread = 1)
# Build interaction constraint vector and add it to params
ic <- c(
list(which(xvars %in% c(x_coord, more_geo)) - 1),
as.list(which(xvars %in% x_nongeo) - 1)
)
params$interaction_constraints <- ic
# Fit XGBoost model
fit <- xgb.train(params, dtrain, nrounds = 200)
# SHAP analysis
X_explain <- X_train[2:2000, ]
sv <- shapviz(fit, X_pred = data.matrix(X_explain))
# Two selected features: Thanks to additivity, structure_quality can be read as
# Ceteris Paribus
sv_dependence(sv, v = c("structure_quality", "LONGITUDE"), alpha = 0.2)
# Total geographic effect (Ceteris Paribus thanks to additivity)
sv_dependence2D(sv, x = "LONGITUDE", y = "LATITUDE", add_vars = more_geo) +
coord_equal()
Again, the resulting total geographic effect looks reasonable. Note that, unlike in the first example, there are no interactions to non-geographic components, leading to a Ceteris Paribus interpretation. Furthermore, it contains the effect of the other regional features.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.