XGBoost + SHAP: Building an Explainable House Price Predictor
A model that predicts house prices with an R² of 0.88 sounds impressive. But when a client asks "why does your model say my house is worth ₹45 lakhs and not ₹60 lakhs?" — accuracy alone doesn't answer that. Explainability does. That's why XGBoost + SHAP became my go-to combination for tabular ML.
Why XGBoost for Tabular Data
When I started this project, I considered several models: linear regression, random forest, neural networks, XGBoost. For tabular real estate data, XGBoost won on three criteria:
- Performance: XGBoost consistently ranks at or near the top on structured/tabular data, and it has powered many winning Kaggle solutions on tabular datasets.
- Speed: Trains fast even on large datasets. Boosting builds trees sequentially, but XGBoost parallelizes the split search within each tree across CPU cores.
- Interpretability: Unlike neural networks, tree-based models have natural feature importance metrics. SHAP works especially well with them.
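Something like the following is enough to reproduce that kind of comparison: a minimal sketch using 5-fold cross-validated R², assuming X and y are the housing feature matrix and price target built in the next section (not my exact benchmarking code).
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import xgboost as xgb
candidates = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=300, random_state=42),
    'xgboost': xgb.XGBRegressor(n_estimators=300, random_state=42),
}
# 5-fold cross-validated R² for each candidate on the same data
for name, candidate in candidates.items():
    scores = cross_val_score(candidate, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.3f}")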
The Dataset and Features
I used a housing dataset with features including area (sq ft), number of bedrooms/bathrooms, location tier, age of property, floor level, parking availability, and proximity to amenities.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
# df: the housing DataFrame loaded earlier (one row per property)
# Feature engineering
df['price_per_sqft'] = df['price'] / df['area_sqft']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['age_category'] = pd.cut(df['age_years'],
                            bins=[0, 5, 15, 30, 100],
                            labels=['new', 'recent', 'mature', 'old'])
features = ['area_sqft', 'bedrooms', 'bathrooms',
            'location_tier', 'age_years', 'floor',
            'has_parking', 'distance_to_metro_km']
X = df[features]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train XGBoost
model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}") # 0.882
print(f"MAE: ₹{mean_absolute_error(y_test, y_pred):,.0f}")
Why SHAP for Explainability
Standard feature importance from tree models tells you which features are important overall — "location_tier is the most important feature across all predictions." That's useful but incomplete.
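That global view is a one-liner with XGBoost's built-in plotting. A quick sketch, assuming matplotlib is available and model is the regressor trained above:
import matplotlib.pyplot as plt
import xgboost as xgb
# Global, gain-based feature importance for the trained model
xgb.plot_importance(model, importance_type='gain', max_num_features=10)
plt.tight_layout()
plt.show()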
SHAP (SHapley Additive exPlanations) tells you why the model made a specific prediction for a specific instance. "This house is predicted at ₹45 lakhs instead of the average ₹52 lakhs because: its distance from metro (-₹8L) and old age (-₹4L) dragged it down, while its large area (+₹5L) pushed it up."
import shap
# Create SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Explain a single prediction
instance_idx = 0
shap.force_plot(
    explainer.expected_value,
    shap_values[instance_idx],
    X_test.iloc[instance_idx],
    feature_names=features
)
# Summary plot — feature importance across all predictions
shap.summary_plot(shap_values, X_test, feature_names=features)
The SHAP Output: Making It Understandable
The force plot for a single prediction looks like this conceptually:
# For a specific house:
# Base price (average): ₹52,00,000
# + Large area (2,400 sqft): +₹ 5,20,000
# + Prime location tier 1: +₹ 8,40,000
# - Old property (35 years): -₹ 4,80,000
# - Far from metro (8.5 km): -₹ 7,20,000
# - Only 1 bathroom: -₹ 2,40,000
# = Predicted price: ₹51,20,000
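Under the hood, that breakdown is just the house's SHAP values added on top of the base value. A minimal sketch of how to print it, reusing explainer, shap_values, X_test and features from above (expected_value can be a length-1 array in some SHAP versions, in which case take its first element):
# Turn one house's SHAP values into a readable price breakdown
instance_idx = 0
base_price = float(explainer.expected_value)   # model's average prediction
contributions = sorted(
    zip(features, shap_values[instance_idx], X_test.iloc[instance_idx]),
    key=lambda item: abs(item[1]),
    reverse=True,
)
print(f"Base price (average): ₹{base_price:,.0f}")
for name, contribution, value in contributions:
    sign = "+" if contribution >= 0 else "-"
    print(f"{sign} {name} = {value}: ₹{abs(contribution):,.0f}")
predicted = base_price + shap_values[instance_idx].sum()
print(f"= Predicted price: ₹{predicted:,.0f}")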
This is what explainability means in practice. The client can see exactly what's driving their house's valuation — and can make informed decisions. "If I renovate the bathroom, the model would likely predict higher" is a conversation you can have. "The black box says ₹45L" is not.
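That "what if I renovate" conversation can even be sketched directly against the model, with the caveat that it reflects learned correlations, not the causal effect of a renovation. A rough illustration, reusing model and X_test from above:
# A simple what-if: re-predict the same house with one extra bathroom
house = X_test.iloc[[0]].copy()            # one house, as a one-row DataFrame
current_price = model.predict(house)[0]
renovated = house.copy()
renovated['bathrooms'] = renovated['bathrooms'] + 1
renovated_price = model.predict(renovated)[0]
print(f"Current prediction:      ₹{current_price:,.0f}")
print(f"With an extra bathroom:  ₹{renovated_price:,.0f}")
print(f"Estimated difference:    ₹{renovated_price - current_price:,.0f}")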
Accuracy vs. Trustworthiness
R²=0.88 is good. But I've seen models scoring 0.95 that clients wouldn't trust — because they couldn't understand them.
Trustworthiness comes from explainability. A client who understands why the model values their property as it does will trust the prediction, even if they disagree with specific factors. A client who gets a number from a black box has no reason to trust it.
This lesson generalizes: in any domain where humans act on ML predictions — real estate, credit scoring, medical diagnosis, hiring — explainability is not a nice-to-have. It's the difference between a tool people use and a tool people ignore.
Key Learnings
- XGBoost is the right default for tabular regression/classification — don't start with neural networks unless you have a specific reason
- SHAP is computationally expensive but worth it for any model that humans will act on
- Feature engineering matters more than model tuning — my biggest accuracy jump came from price_per_sqft and total_rooms, not from hyperparameter optimization
- Always visualize SHAP values on test examples before declaring success — global feature importance can hide local anomalies (see the sketch after this list)
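On that last point, one simple way to spot-check for local anomalies is to rank test houses by the total SHAP force acting on their predictions and inspect the most extreme ones. A sketch, reusing the variables from the SHAP section:
import numpy as np
# Houses whose predictions are driven by unusually large feature contributions
total_impact = np.abs(shap_values).sum(axis=1)
extreme_idx = np.argsort(total_impact)[-5:]    # five most extreme explanations
for i in extreme_idx:
    shap.force_plot(
        explainer.expected_value,
        shap_values[i],
        X_test.iloc[i],
        feature_names=features,
        matplotlib=True,   # render as a static plot outside a notebook
    )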
Want to discuss ML models, explainability, or data science approaches? Let's connect.
