XGBoost + SHAP: Building an Explainable House Price Predictor
A model that predicts house prices with an R² of 0.88 sounds impressive. But when a client asks "why does your model say my house is worth ₹45 lakhs and not ₹60 lakhs?" — accuracy alone doesn't answer that. Explainability does. That's why XGBoost + SHAP became my go-to combination for tabular ML.
Why XGBoost for Tabular Data
When I started this project, I considered several models: linear regression, random forest, neural networks, XGBoost. For tabular real estate data, XGBoost won on three criteria:
- Performance: XGBoost consistently ranks at or near the top on structured/tabular data, and it has powered many winning Kaggle solutions on tabular datasets.
- Speed: Trains fast even on large datasets. Boosting builds trees sequentially, but XGBoost parallelizes the split search within each tree across CPU cores.
- Interpretability: Unlike neural networks, tree-based models have natural feature importance metrics. SHAP works especially well with them.
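Something like the following is enough to reproduce that kind of comparison: a minimal sketch using 5-fold cross-validated R², assuming X and y are the housing feature matrix and price target built in the next section (not my exact benchmarking code).
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import xgboost as xgb
candidates = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=300, random_state=42),
    'xgboost': xgb.XGBRegressor(n_estimators=300, random_state=42),
}
# 5-fold cross-validated R² for each candidate on the same data
for name, candidate in candidates.items():
    scores = cross_val_score(candidate, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.3f}")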
The Dataset and Features
I used a housing dataset with features including area (sq ft), number of bedrooms/bathrooms, location tier, age of property, floor level, parking availability, and proximity to amenities.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
# df: the housing DataFrame loaded earlier (one row per property)
# Feature engineering
df['price_per_sqft'] = df['price'] / df['area_sqft']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['age_category'] = pd.cut(df['age_years'],
                            bins=[0, 5, 15, 30, 100],
                            labels=['new', 'recent', 'mature', 'old'])
features = ['area_sqft', 'bedrooms', 'bathrooms',
            'location_tier', 'age_years', 'floor',
            'has_parking', 'distance_to_metro_km']
X = df[features]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train XGBoost
model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}") # 0.882
print(f"MAE: ₹{mean_absolute_error(y_test, y_pred):,.0f}")
Why SHAP for Explainability
Standard feature importance from tree models tells you which features are important overall — "location_tier is the most important feature across all predictions." That's useful but incomplete.
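That global view is a one-liner with XGBoost's built-in plotting. A quick sketch, assuming matplotlib is available and model is the regressor trained above:
import matplotlib.pyplot as plt
import xgboost as xgb
# Global, gain-based feature importance for the trained model
xgb.plot_importance(model, importance_type='gain', max_num_features=10)
plt.tight_layout()
plt.show()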
SHAP (SHapley Additive exPlanations) tells you why the model made a specific prediction for a specific instance. "This house is predicted at ₹45 lakhs instead of the average ₹52 lakhs because: its distance from metro (-₹8L) and old age (-₹4L) dragged it down, while its large area (+₹5L) pushed it up."
import shap
# Create SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Explain a single prediction
instance_idx = 0
shap.force_plot(
    explainer.expected_value,
    shap_values[instance_idx],
    X_test.iloc[instance_idx],
    feature_names=features
)
# Summary plot — feature importance across all predictions
shap.summary_plot(shap_values, X_test, feature_names=features)
The SHAP Output: Making It Understandable
The force plot for a single prediction looks like this conceptually:
# For a specific house:
# Base price (average): ₹52,00,000
# + Large area (2,400 sqft): +₹ 5,20,000
# + Prime location tier 1: +₹ 8,40,000
# - Old property (35 years): -₹ 4,80,000
# - Far from metro (8.5 km): -₹ 7,20,000
# - Only 1 bathroom: -₹ 2,40,000
# = Predicted price: ₹51,20,000
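Under the hood, that breakdown is just the house's SHAP values added on top of the base value. A minimal sketch of how to print it, reusing explainer, shap_values, X_test and features from above (expected_value can be a length-1 array in some SHAP versions, in which case take its first element):
# Turn one house's SHAP values into a readable price breakdown
instance_idx = 0
base_price = float(explainer.expected_value)   # model's average prediction
contributions = sorted(
    zip(features, shap_values[instance_idx], X_test.iloc[instance_idx]),
    key=lambda item: abs(item[1]),
    reverse=True,
)
print(f"Base price (average): ₹{base_price:,.0f}")
for name, contribution, value in contributions:
    sign = "+" if contribution >= 0 else "-"
    print(f"{sign} {name} = {value}: ₹{abs(contribution):,.0f}")
predicted = base_price + shap_values[instance_idx].sum()
print(f"= Predicted price: ₹{predicted:,.0f}")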
This is what explainability means in practice. The client can see exactly what's driving their house's valuation — and can make informed decisions. "If I renovate the bathroom, the model would likely predict higher" is a conversation you can have. "The black box says ₹45L" is not.
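That "what if I renovate" conversation can even be sketched directly against the model, with the caveat that it reflects learned correlations, not the causal effect of a renovation. A rough illustration, reusing model and X_test from above:
# A simple what-if: re-predict the same house with one extra bathroom
house = X_test.iloc[[0]].copy()            # one house, as a one-row DataFrame
current_price = model.predict(house)[0]
renovated = house.copy()
renovated['bathrooms'] = renovated['bathrooms'] + 1
renovated_price = model.predict(renovated)[0]
print(f"Current prediction:      ₹{current_price:,.0f}")
print(f"With an extra bathroom:  ₹{renovated_price:,.0f}")
print(f"Estimated difference:    ₹{renovated_price - current_price:,.0f}")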
Accuracy vs. Trustworthiness
R²=0.88 is good. But I've seen models scoring 0.95 that clients wouldn't trust — because they couldn't understand them.
Trustworthiness comes from explainability. A client who understands why the model values their property as it does will trust the prediction, even if they disagree with specific factors. A client who gets a number from a black box has no reason to trust it.
This lesson generalizes: in any domain where humans act on ML predictions — real estate, credit scoring, medical diagnosis, hiring — explainability is not a nice-to-have. It's the difference between a tool people use and a tool people ignore.
Key Learnings
- XGBoost is the right default for tabular regression/classification — don't start with neural networks unless you have a specific reason
- SHAP is computationally expensive but worth it for any model that humans will act on
- Feature engineering matters more than model tuning — my biggest accuracy jump came from price_per_sqft and total_rooms, not from hyperparameter optimization
- Always visualize SHAP values on test examples before declaring success — global feature importance can hide local anomalies (see the sketch after this list)
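On that last point, one simple way to spot-check for local anomalies is to rank test houses by the total SHAP force acting on their predictions and inspect the most extreme ones. A sketch, reusing the variables from the SHAP section:
import numpy as np
# Houses whose predictions are driven by unusually large feature contributions
total_impact = np.abs(shap_values).sum(axis=1)
extreme_idx = np.argsort(total_impact)[-5:]    # five most extreme explanations
for i in extreme_idx:
    shap.force_plot(
        explainer.expected_value,
        shap_values[i],
        X_test.iloc[i],
        feature_names=features,
        matplotlib=True,   # render as a static plot outside a notebook
    )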
Want to discuss ML models, explainability, or data science approaches? Let's connect.
