XGBoost + SHAP: Building an Explainable House Price Predictor (R²=0.88)
Predicting house prices is easy. Predicting them accurately and explaining exactly why the model arrived at that number — that's the harder problem. That's what I built.
Why Explainability Matters in ML
A model that says "this house is worth ₹45 lakhs" is useful. A model that says "this house is worth ₹45 lakhs — primarily because it's 3BHK (+₹8L), in a high-demand neighborhood (+₹12L), built post-2015 (+₹5L), but loses value since it has no covered parking (-₹3L)" is actually actionable.
That's the difference SHAP makes. And that's exactly what users see when they use this application.
Feature Engineering: Where the R² Lives
Raw data is never model-ready. Here's what I engineered beyond the baseline features:
- price_per_sqft — normalized area relative price signal
- age_of_property — derived from year_built, accounts for depreciation curve
- location_tier — locality encoded into demand tiers (1-5) based on median historical prices
- amenity_score — weighted sum of amenities (gym, parking, swimming pool, etc.)
- distance_to_hub — Euclidean distance to nearest commercial/IT hub
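A minimal sketch of how these derived features can be computed with pandas. The column names, amenity weights, reference year, and hub coordinates below are illustrative assumptions, not the exact training pipeline:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, tier_map: dict, hub=(12.98, 77.59)) -> pd.DataFrame:
    """Add the derived features on top of assumed raw columns:
    price, area_sqft, year_built, locality, latitude, longitude,
    plus binary amenity flags."""
    out = df.copy()
    out["price_per_sqft"] = out["price"] / out["area_sqft"]
    out["age_of_property"] = 2024 - out["year_built"]  # illustrative reference year
    # Unknown localities fall back to the middle tier
    out["location_tier"] = out["locality"].map(tier_map).fillna(3).astype(int)
    weights = {"has_gym": 1.0, "has_parking": 2.0, "has_pool": 1.5}  # assumed weights
    out["amenity_score"] = sum(out[a] * w for a, w in weights.items())
    # Straight-line (Euclidean) distance in coordinate space to the hub
    out["distance_to_hub"] = np.sqrt(
        (out["latitude"] - hub[0]) ** 2 + (out["longitude"] - hub[1]) ** 2
    )
    return out
```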
Feature engineering alone moved the baseline R² from 0.72 to 0.84. The remaining 0.04 came from hyperparameter tuning.
XGBoost: The Model
```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=800,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,             # L1 regularization
    reg_lambda=1.0,            # L2 regularization
    early_stopping_rounds=50,  # xgboost >= 2.0: set on the estimator, not fit()
    random_state=42,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
```
SHAP: The Explainability Layer
SHAP (SHapley Additive exPlanations) computes the contribution of each feature to each individual prediction. It's mathematically grounded in game theory — each feature's contribution is its "Shapley value."
```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_input)

# For each prediction, you get a per-feature contribution score:
#   positive SHAP = the feature increased the price
#   negative SHAP = the feature decreased the price
```
The Flask API returns both the prediction and the top 5 SHAP contributors, formatted as human-readable strings: "Location quality adds ₹8.2L", "Age of building reduces by ₹2.1L".
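The formatting step might look like the sketch below. The display names and the lakh conversion are illustrative assumptions; SHAP values are taken to be in rupees:

```python
def top_contributors(shap_row: dict, k: int = 5) -> list:
    """Turn per-feature SHAP values (in rupees) into the top-k readable strings."""
    display = {  # assumed display names
        "location_tier": "Location quality",
        "age_of_property": "Age of building",
        "amenity_score": "Amenities",
        "price_per_sqft": "Area price level",
        "distance_to_hub": "Distance to hub",
    }
    # Rank by absolute contribution, keep the k largest
    ranked = sorted(shap_row.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    lines = []
    for feat, value in ranked:
        name = display.get(feat, feat)
        lakhs = abs(value) / 100_000  # 1 lakh = 100,000 rupees
        verb = "adds" if value > 0 else "reduces by"
        lines.append(f"{name} {verb} ₹{lakhs:.1f}L")
    return lines

print(top_contributors({"location_tier": 820_000, "age_of_property": -210_000}))
# → ['Location quality adds ₹8.2L', 'Age of building reduces by ₹2.1L']
```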
Serving via Flask REST API
The model and explainer are serialized with joblib and loaded once at app startup. Each prediction request is:
- Validated (required fields, correct ranges)
- Feature-engineered (same pipeline as training)
- Passed through XGBoost → price prediction
- Passed through SHAP → explanation values
- Formatted and returned as JSON
Average prediction latency: ~120ms including SHAP computation.
Need a machine learning model integrated into your product with real explanations for users? Let's talk.
Get In Touch
