How We Cut LLM Token Usage by 99% — 4 Engineering Techniques That Actually Work

Tags: LLM Optimization, Token Efficiency, RAG, Production Engineering, Python

At Idyllic Services, I was working on a production agentic AI pipeline: an automated candidate sourcing system that ran LLM calls at scale. One iteration consumed 200,000+ tokens. That's ?15-20 per run, multiple times a day. I was asked to fix it. Here's exactly what I did: four engineering techniques that combined to reduce token usage by ~99%, bringing each run down to around 5,000 tokens.

The Starting Point: 200K Tokens Per Run

The pipeline processed job descriptions, matched them against candidate profiles, scored candidates, and generated outreach drafts. Each stage called the LLM individually. Each call carried a fat JSON payload: full candidate profiles, full job descriptions, complete conversation history. Every call was redundant. Every payload was bloated.

The root problem: we were treating the LLM like a database that needed all context at all times. It doesn't. It needs relevant context. That one insight unlocked everything.

Technique 1: TOON Format (30-40% Baseline Cut)

TOON stands for Token-Oriented Object Notation. It's a lightweight serialization format built specifically for LLM prompts. JSON has enormous syntactic overhead: braces, brackets, repeated key names, quotes everywhere. For structured data sent to LLMs, that overhead is pure waste.

TOON strips that noise. For uniform arrays of objects (like candidate profiles), it declares field names once as a header row and lists values in subsequent rows, like CSV but with explicit structure the LLM can parse reliably.

# JSON (before), 847 tokens for 5 candidates:
[
  {"name": "Rahul Sharma", "skills": ["Python", "AWS"], "yoe": 4, "location": "Pune"},
  {"name": "Priya Patil",  "skills": ["React", "Node"],  "yoe": 2, "location": "Mumbai"},
  ...
]

# TOON (after), 290 tokens for the same 5 candidates:
{name | skills | yoe | location}
Rahul Sharma | Python, AWS | 4 | Pune
Priya Patil  | React, Node | 2 | Mumbai
...

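Result: a 30-40% token reduction across our structured payloads, with no loss of information. Uniform arrays like the one above compress best (847 to 290 tokens is about a 66% cut); nested or irregular data saves less. The LLM parses it correctly, because you're just removing JSON's syntactic fat. Keep JSON internally; convert to TOON only at the "LLM boundary."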
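To make the "convert at the LLM boundary" step concrete, here is a minimal sketch of an encoder for the pipe-delimited layout shown above. The function name to_toon is hypothetical, it mirrors this article's example layout rather than any official TOON library, and it assumes every row has the same fields and that no value contains a "|" character.

```python
def to_toon(rows: list[dict]) -> str:
    """Encode a uniform list of dicts as one header row plus pipe-delimited value rows."""
    if not rows:
        return ""
    fields = list(rows[0].keys())
    lines = ["{" + " | ".join(fields) + "}"]  # field names are declared once
    for row in rows:
        if list(row.keys()) != fields:
            raise ValueError("to_toon expects every row to have the same fields")
        cells = []
        for field in fields:
            value = row[field]
            # Flatten list values (e.g. skills) into a comma-separated cell.
            if isinstance(value, (list, tuple)):
                value = ", ".join(str(v) for v in value)
            cells.append(str(value))
        lines.append(" | ".join(cells))
    return "\n".join(lines)


candidates = [
    {"name": "Rahul Sharma", "skills": ["Python", "AWS"], "yoe": 4, "location": "Pune"},
    {"name": "Priya Patil", "skills": ["React", "Node"], "yoe": 2, "location": "Mumbai"},
]
candidates_toon = to_toon(candidates)
print(candidates_toon)
```

The rest of the pipeline keeps working with ordinary dicts; only the string that goes into the prompt changes.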

Technique 2: Removing LLM from Loops (Batch API Calls)

This was the single biggest fix. The original pipeline called the LLM once per candidate, inside a for loop. 50 candidates = 50 separate API calls, each carrying the full job description plus full instructions. Multiply that across pipeline stages and the costs explode.

import json

# BEFORE: LLM inside a loop (catastrophically expensive)
scores = []
for candidate in candidates:  # 50 iterations = 50 LLM calls
    score = llm.invoke(
        f"Score this candidate:\nJD: {full_jd}\nCandidate: {json.dumps(candidate)}"
    )
    scores.append(score)

# AFTER: Single batched call
batch_prompt = build_batch_prompt(jd_toon, candidates_toon)
all_scores = llm.invoke(batch_prompt)  # 1 LLM call, all 50 candidates

The key insight: LLMs are excellent at processing structured batches. You don't need 50 calls to score 50 candidates; you need one call with a well-structured batch prompt. The JD context is sent once. The model processes all candidates in a single pass and returns structured results.
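The snippet above calls build_batch_prompt without showing it. One way it could look is sketched below; the prompt wording, the JSON reply shape, and the parse_batch_scores helper are all assumptions made for illustration, not the production implementation.

```python
import json


def build_batch_prompt(jd_toon: str, candidates_toon: str) -> str:
    """Assemble one prompt that scores every candidate against a single JD."""
    return (
        "Score each candidate from 0 to 100 for fit against the job description.\n"
        'Return only a JSON array of objects like {"name": "...", "score": 0}, '
        "one per candidate, in the same order as the input rows.\n\n"
        f"Job description:\n{jd_toon}\n\n"   # the JD appears exactly once
        f"Candidates:\n{candidates_toon}\n"
    )


def parse_batch_scores(raw: str, expected: int) -> list[dict]:
    """Parse the model's JSON reply and check that no candidate was dropped."""
    scores = json.loads(raw)
    if not isinstance(scores, list) or len(scores) != expected:
        raise ValueError(f"expected {expected} scores, got {scores!r}")
    return scores


prompt = build_batch_prompt(
    "{title | must_have}\nBackend Engineer | Python, AWS",
    "{name | skills | yoe | location}\nRahul Sharma | Python, AWS | 4 | Pune",
)
# A stand-in for the model's reply, used here only to exercise the parser.
fake_reply = '[{"name": "Rahul Sharma", "score": 82}]'
scores = parse_batch_scores(fake_reply, expected=1)
```

Checking the count of returned scores matters more in a batch than in a loop: a single call that silently skips a row would otherwise go unnoticed.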

Impact: 50 calls × 4,000 tokens = 200,000 tokens → 1 call × ~6,000 tokens = ~6,000 tokens. That is the single largest structural fix in the pipeline, achieved by changing how work is organised, not by tuning parameters.
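The arithmetic is easy to check. This small sketch only restates the figures quoted above; the variable names are illustrative.

```python
calls_before, tokens_per_call = 50, 4_000
tokens_before = calls_before * tokens_per_call  # per-candidate loop
tokens_after = 1 * 6_000                        # one batched call
reduction = 1 - tokens_after / tokens_before

print(f"{tokens_before:,} -> {tokens_after:,} tokens ({reduction:.0%} fewer)")
# prints: 200,000 -> 6,000 tokens (97% fewer)
```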

Technique 3: RAG for Selective Memory

Agentic pipelines need memory. The naive approach: dump everything into the context window. Full conversation history, all previous decisions, all notes. The context sent with each call then grows linearly with pipeline runtime, so total token usage across the run grows even faster.

The fix: treat memory like a database, and use RAG to fetch only what's relevant to the current step. I embedded all memory chunks into a vector store (Qdrant). At each step, I embedded the current task, ran a nearest-neighbor search, and pulled only the top 2-3 most relevant memories.

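import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

class PipelineMemory:
    def __init__(self, embedder, dim: int, collection: str = "pipeline_memory"):
        # embedder: any model object exposing embed(text) -> list[float] of length dim
        self.embedder = embedder
        self.collection = collection
        self.vector_store = QdrantClient(":memory:")  # pass url=... for a real server
        self.vector_store.create_collection(
            collection_name=collection,
            vectors_config=VectorParams(size=dim, distance=Distance.COSINE))

    def store(self, key: str, content: str):
        # Qdrant point IDs must be unsigned ints or UUIDs, so derive a UUID from the key
        point = PointStruct(id=str(uuid.uuid5(uuid.NAMESPACE_URL, key)),
                            vector=self.embedder.embed(content),
                            payload={"key": key, "content": content})
        self.vector_store.upsert(collection_name=self.collection, points=[point])

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        # query_points requires qdrant-client 1.10 or newer
        hits = self.vector_store.query_points(
            collection_name=self.collection,
            query=self.embedder.embed(query), limit=top_k).points
        return [hit.payload["content"] for hit in hits]

# BEFORE: include everything (15,000 tokens of context)
prompt = f"Full history: {all_previous_context}\nTask: {current_task}"

# AFTER: retrieve only what's relevant (~400 tokens)
relevant = memory.retrieve(current_task, top_k=3)
prompt = "Relevant context:\n" + "\n".join(relevant) + f"\nTask: {current_task}"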

This replaced 15,000-token context dumps with 300-500 token targeted retrievals. The LLM gets exactly what it needs. Nothing else.
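To show the retrieval step without any external services, here is a toy, dependency-free sketch of top-k nearest-neighbor lookup. The word-count "embedding" is a deliberate stand-in for a learned embedding model, the plain list is a stand-in for Qdrant, and the memory strings are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(memories: list[str], query: str, top_k: int = 3) -> list[str]:
    """Return the top_k memories most similar to the query."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(embed(m), q), reverse=True)
    return ranked[:top_k]


memories = [
    "JD requires Python and AWS experience",
    "Recruiter prefers candidates based in Pune",
    "Outreach tone should be friendly and short",
    "Previous run rejected candidates with under 2 years experience",
]
relevant = retrieve(memories, "score candidates on Python and AWS experience", top_k=2)
print(relevant)
```

Only the two retrieved strings go into the prompt; the other memories stay in the store until a step actually needs them.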

Technique 4: Result Caching (Zero Tokens for Repeat Work)

Many LLM calls in a pipeline are identical across runs: same JD, same candidate, same task, and therefore the same prompt. Without caching, every run recomputes from scratch. With content-addressed caching, the second call costs nothing.

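import hashlib
import time
from functools import wraps

def llm_cache(ttl_seconds=86400):
    """Cache LLM results by prompt content hash, expiring entries after ttl_seconds"""
    def decorator(fn):
        _cache = {}  # prompt hash -> (stored_at, result); one cache per decorated function
        @wraps(fn)
        def wrapper(prompt, *args, **kwargs):
            # Only the prompt is hashed, so extra arguments must not change the result
            key = hashlib.sha256(prompt.encode()).hexdigest()
            entry = _cache.get(key)
            if entry is not None and time.monotonic() - entry[0] < ttl_seconds:
                return entry[1]  # 0 tokens, free from cache
            result = fn(prompt, *args, **kwargs)
            _cache[key] = (time.monotonic(), result)
            return result
        return wrapper
    return decorator

@llm_cache(ttl_seconds=86400)
def score_candidate(prompt):
    return llm.invoke(prompt)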

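Cache key = hash of the exact prompt. Same input = same cached output, served instantly. On pipeline reruns and parameter sweeps, this alone eliminates 70-100% of the LLM calls for already-processed candidates. The dictionary above lives in process memory, so to carry hits across separate runs the same keys need to be stored somewhere persistent, such as a file or a key-value store.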
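A sketch of how the same idea might be made persistent, using only the standard library. The class name SqliteLLMCache, the table layout, and the choice to fold the model name and temperature into the key are assumptions for illustration, not the production code; fake_llm stands in for a real client.

```python
import hashlib
import json
import sqlite3


def cache_key(prompt: str, model: str, temperature: float = 0.0) -> str:
    """Hash everything that can change the output, not just the prompt."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SqliteLLMCache:
    """Minimal persistent cache; pass a file path so hits survive across runs."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache (key TEXT PRIMARY KEY, result TEXT)"
        )

    def get_or_compute(self, prompt: str, model: str, compute):
        key = cache_key(prompt, model)
        row = self.db.execute(
            "SELECT result FROM llm_cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no LLM call made
        result = compute(prompt)
        self.db.execute("INSERT INTO llm_cache (key, result) VALUES (?, ?)", (key, result))
        self.db.commit()
        return result


calls = []

def fake_llm(prompt: str) -> str:
    """Stand-in for llm.invoke that records how often it is really called."""
    calls.append(prompt)
    return "score: 82"


cache = SqliteLLMCache()  # in-memory here; use a path like "llm_cache.db" in practice
first = cache.get_or_compute("Score Rahul against JD-1", "model-a", fake_llm)
second = cache.get_or_compute("Score Rahul against JD-1", "model-a", fake_llm)
print(len(calls))  # the second request never reaches the model
```

Including the model name in the key avoids serving a stale answer after a model switch, which a prompt-only hash cannot detect.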

The Combined Result

Technique	Token Impact
TOON format (compact payload)	-30 to -40%
Remove LLM from loops → batch calls	largest single reduction
RAG selective memory vs full context	-90 to -97% on memory tokens
Result caching (repeat-run elimination)	-70 to -100% on cached steps

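Final numbers: 200,000 tokens per iteration → ~5,000 tokens. That is a 97.5% reduction on those figures, and closer to the headline ~99% on reruns where cached steps cost zero tokens. Runtime dropped from 6 minutes to 90 seconds because batch calls replaced the sequential per-candidate loop.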

The Core Principle

Every LLM call should answer one question: what is the minimum context this model needs to do this specific job well? If your answer is "all the context we have", that's the bug.

TOON gives it compact structure. Batch calls give it efficient work units. RAG gives it relevant memory. Caching gives it free answers when you've already paid for the result. None of these are magic. They're engineering.

Running an LLM pipeline with runaway token costs? I've solved this at production scale. Let's talk.


Rugved Chandekar AI/ML Engineer & Backend Developer · Agentic AI & RAG Pipeline Architect · IEEE AIC 2026 Author · Idyllic Services
