One Prompt Mistake Cost Me Hours: The Empty List Debugging Story
Technical
February 10, 2026 · 7 min read · By Rugved Chandekar


Debugging · LLMs · AI Engineering · Validation


The bug took me four hours to find. The fix took four lines of code. The lesson cost nothing but time — and it's now the first thing I check in every AI pipeline I touch: validate inputs before the LLM ever sees them. Here's the story.

The Symptom

At Idyllic Services, I was working on an agentic pipeline that processed customer support tickets. The pipeline pulled relevant context documents from our knowledge base, then passed them to an LLM to generate a response.

One day, QA flagged something alarming: the pipeline was generating confident, professional-sounding responses to certain queries — but the responses were completely wrong. Not slightly off. Fabricated. The LLM was citing policies that didn't exist, referencing processes we'd never described.

My first assumption: model issue. Bad model version, temperature too high, something in the LLM itself. I checked model configuration. Everything looked right.

The Investigation

I started logging everything — prompts sent to the LLM, responses received, the retrieved documents. Four hours of adding logging, re-running tests, comparing outputs.

Then I looked at one specific failing case and noticed something in the logged prompt:

# What was being sent to the LLM
prompt = f"""You are a customer support agent. 
Answer the customer's question using the following context documents:

Context Documents:
[]   # <-- This. Right here.

Customer Question: {customer_query}

Provide a helpful, accurate response."""

The context documents list was empty. We were sending [] — literally an empty Python list, serialized to string — as the "relevant context" for the LLM to draw from.
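That's easy to reproduce. Python's f-strings render a list through its repr, so an empty list interpolates as the literal characters []:

context_docs = []
prompt = f"Context Documents:\n{context_docs}"
print(prompt)
# Context Documents:
# []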

"The LLM didn't have any real context, so it did what LLMs do when given nothing to work with: it made something up that sounded reasonable."

Why the List Was Empty

Tracing back further: the semantic search step was returning zero results for certain edge-case queries. When the query was very short, very technical, or used terminology that wasn't in our knowledge base, the similarity threshold wasn't met and we retrieved nothing.
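A minimal sketch of the shape of that retrieval step (the names and the 0.75 threshold here are illustrative, not our actual code):

# Hypothetical retrieval step: hits below the similarity threshold
# are filtered out, so an edge-case query can return an empty list.
SIMILARITY_THRESHOLD = 0.75  # illustrative value

def retrieve_context(query: str, index) -> list:
    # index.search is assumed to return (document, score) pairs,
    # ordered by descending similarity
    hits = index.search(query, top_k=5)
    return [doc for doc, score in hits if score >= SIMILARITY_THRESHOLD]

# A very short or out-of-vocabulary query can score below the
# threshold on every document, and the caller silently receives [].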

Instead of handling the empty case gracefully, the pipeline just... passed the empty list forward. The LLM received no context and confidently invented an answer.

This is the failure mode unique to AI systems: they fail silently and confidently. A traditional system with no data to return would throw an error or return nothing. An LLM with no data returns something that sounds authoritative.

The Fix: Guard Clauses Before Every LLM Call

The fix was four lines:

def generate_response(customer_query: str, context_docs: list) -> str:
    # GUARD CLAUSE: Never call LLM with empty context
    if not context_docs:
        return "I don't have enough information to answer this question accurately. Please contact our support team directly."
    
    # Only reach here if we have real context
    prompt = build_prompt(customer_query, context_docs)
    return llm_client.complete(prompt)

Four lines. Four hours to find the need for them.

What This Revealed About AI System Design

This debugging story revealed a fundamental principle of AI system engineering that's different from traditional software:

  • Traditional software fails loudly: null pointer exception, 500 error, empty response. Obvious failures.
  • AI systems fail silently and confidently: the system produces output that looks correct. Nothing crashes. No error is logged. The failure is only visible when you check the output against ground truth.

This means AI systems need validation layers that traditional software doesn't require. Not just "did the call succeed?" but "does the result make sense?" Not just "is the list non-null?" but "does the list contain meaningful data?"

The Validation Layer I Built After

After this incident, I added validation at every boundary in the pipeline:

class PipelineValidator:
    @staticmethod
    def validate_retrieval_results(results: list, min_results: int = 1) -> bool:
        if not results:
            return False
        if len(results) < min_results:
            return False
        # Check each result has actual content
        for result in results:
            if not result.get('text') or len(result['text'].strip()) < 10:
                return False
        return True
    
    @staticmethod
    def validate_llm_response(response: str, context_provided: bool = True) -> bool:
        if not response or not response.strip():
            return False
        # Phrases that cite sources the prompt may never have supplied
        hallucination_signals = [
            "as mentioned in the document",  # suspicious if we didn't give documents
            "according to our records"       # suspicious if we're not sure we have records
        ]
        if not context_provided:
            lowered = response.lower()
            if any(signal in lowered for signal in hallucination_signals):
                return False
        # ... domain-specific checks
        return True
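Wired into the pipeline, the validator gates both boundaries. A sketch of the wiring, assuming the retrieve_context, build_prompt, and llm_client names from earlier, plus a kb_index standing in for the knowledge-base search index:

FALLBACK = ("I don't have enough information to answer this question "
            "accurately. Please contact our support team directly.")

def handle_ticket(customer_query: str) -> str:
    docs = retrieve_context(customer_query, kb_index)
    # Gate 1: never let empty or junk retrieval results reach the LLM
    if not PipelineValidator.validate_retrieval_results(docs):
        return FALLBACK
    response = llm_client.complete(build_prompt(customer_query, docs))
    # Gate 2: never return an empty or suspicious LLM response
    if not PipelineValidator.validate_llm_response(response):
        return FALLBACK
    return response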

The Three Rules I Now Follow

Every AI pipeline I build now follows three invariants:

Rule 1: Never send empty inputs to an LLM. Always check before sending. Define explicit fallback behavior for every empty case.

Rule 2: Validate LLM outputs, not just inputs. Post-process responses to check for format compliance, impossible values, and domain-specific integrity rules.
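To make Rule 2 concrete, suppose the LLM is asked to return JSON with an answer and a confidence score (a made-up schema, purely for illustration). The post-processing check rejects both malformed output and impossible values:

import json

def validate_structured_output(raw: str) -> dict | None:
    # Format compliance: must be valid JSON with the required fields
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if "answer" not in data or "confidence" not in data:
        return None
    # Impossible values: a confidence outside [0, 1] means the model
    # broke the contract, so the output is treated as invalid
    confidence = data["confidence"]
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        return None
    return data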

Rule 3: Log everything at every boundary. The only way to debug AI system failures is to have complete visibility into what the system received at each step. Silent pipelines create invisible bugs.
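One lightweight way to get Rule 3's visibility is a decorator applied at every pipeline boundary (a sketch, not our production logging setup):

import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def log_boundary(step_name: str):
    """Log what each pipeline step received and what it produced."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("%s received: args=%r kwargs=%r", step_name, args, kwargs)
            result = func(*args, **kwargs)
            logger.info("%s produced: %r", step_name, result)
            return result
        return wrapper
    return decorator

# An empty retrieval now shows up in the logs immediately,
# instead of four hours into an investigation.
@log_boundary("retrieval")
def search_knowledge_base(query: str) -> list:
    return []  # stand-in for the real semantic search call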

Building an AI pipeline and want a review of your validation strategy? I've learned the hard way what can go wrong.

Get In Touch
Rugved Chandekar, AI Systems Engineer @ Idyllic Services · AI Pipeline Architecture · IEEE Author