3-Tier Prompt Routing: How to Cut LLM API Costs Without Sacrificing Output Quality

Published on June 10, 2026 · 8 min read

The default approach to prompt optimization: send every prompt to a frontier LLM, ask it to improve the text, return the result. It works. It also consumes API budget on prompts that didn't need an LLM call in the first place, adds 1-3 seconds of latency to simple requests, and frequently makes straightforward prompts worse by introducing unnecessary complexity.

When we mapped our incoming prompt distribution, 40% of prompts were basic — clear intent, simple structure, no real ambiguity. Sending those to GPT-4o or Claude Sonnet wasn't optimization. It was overhead.

The solution is intelligent routing: classify each prompt before optimizing, then send it to the cheapest tier that can handle it correctly.

The Three Tiers

Tier 1: Rules-Based (<10ms)

Deterministic pattern-matching optimization with no LLM involvement. The system applies known transformations for the detected prompt context. A Terraform prompt gets IaC-specific structure enforced. A JSON conversion prompt gets strict field preservation rules applied. A code generation prompt for a specific language gets language-idiomatic formatting enforced.

Routes here when the composite routing score is ≤ 0.40. Latency under 10ms. Zero API cost. For 40% of all prompts, this is the correct tier.
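To illustrate what a rules tier can look like, here is a minimal sketch of a pattern-matched transformation table. The patterns, the appended instructions, and the `apply_rules` name are all hypothetical; the actual rule set is not shown in this article.

```python
import re
from typing import Callable, List, Tuple

Rule = Tuple["re.Pattern[str]", Callable[[str], str]]


def _terraform_structure(prompt: str) -> str:
    # Hypothetical IaC rule: ask for a predictable block layout.
    return prompt + "\nReturn valid HCL, with provider, variable, and resource blocks kept separate."


def _json_field_preservation(prompt: str) -> str:
    # Hypothetical structured-output rule: forbid renaming or dropping fields.
    return prompt + "\nPreserve every field name and value type from the input exactly."


# Assumed rule table: (pattern that detects the context, transformation to apply).
RULES: List[Rule] = [
    (re.compile(r"\bterraform\b", re.IGNORECASE), _terraform_structure),
    (re.compile(r"\bjson\b", re.IGNORECASE), _json_field_preservation),
]


def apply_rules(prompt: str) -> str:
    """Apply every rule whose pattern matches the ORIGINAL prompt, in table order."""
    matched = [transform for pattern, transform in RULES if pattern.search(prompt)]
    for transform in matched:
        prompt = transform(prompt)
    return prompt
```

Matching against the original prompt, rather than the progressively rewritten one, keeps the output independent of what earlier rules appended, which is what makes the tier deterministic.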

Tier 2: Hybrid (Rules + Targeted LLM Call)

Rules run first and handle the deterministic improvements. A focused LLM call then addresses the parts that genuinely need intelligence — ambiguous phrasing, missing context, structural gaps that rules can't resolve. Lighter than full LLM optimization because the rules absorb the mechanical work.

Routes here when the composite score is above 0.40 and below 0.85. Covers approximately 35% of incoming prompts.
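The rules-then-LLM order can be sketched as follows. This is an illustration under assumptions, not the product's implementation: `find_gaps` is a hypothetical detector, the two gap checks are invented for the example, and `apply_rules` and `llm` are injected callables so that no specific provider API is implied. Skipping the LLM call when no gaps remain is also an assumption.

```python
from typing import Callable, List


def find_gaps(prompt: str) -> List[str]:
    """Hypothetical gap detector: flags issues that fixed rules cannot repair."""
    gaps = []
    lowered = prompt.lower()
    if "format" not in lowered and "output" not in lowered:
        gaps.append("no output format is specified")
    if len(prompt.split()) < 8:
        gaps.append("the task has very little context")
    return gaps


def hybrid_optimize(
    prompt: str,
    apply_rules: Callable[[str], str],
    llm: Callable[[str], str],
) -> str:
    ruled = apply_rules(prompt)  # deterministic pass runs first
    gaps = find_gaps(ruled)
    if not gaps:
        return ruled  # assumption: nothing left that needs an LLM
    # Focused request: the LLM is asked to fix only the listed gaps.
    request = (
        "Revise the prompt below. Fix only these issues: "
        + "; ".join(gaps)
        + ".\n\n"
        + ruled
    )
    return llm(request)
```

Because the LLM request names the specific gaps, the call stays narrower than a full rewrite, which is the sense in which the rules "absorb the mechanical work".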

Tier 3: Full LLM

Complete LLM optimization with context-aware system prompting. Reserved for complex, expert-level prompts where a full LLM rewrite is genuinely justified — multi-step technical workflows, nuanced meta-prompts, high-stakes content where optimization quality directly affects outcomes.

Routes here when composite score ≥ 0.85. Covers approximately 25% of incoming prompts.
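The three thresholds translate into a small selection function. The sketch below assumes scores are normalized to the range 0 to 1 and resolves the boundaries the way the tier descriptions state them: exactly 0.40 goes to rules, exactly 0.85 goes to full LLM. The `Tier` and `select_tier` names are illustrative.

```python
from enum import Enum


class Tier(Enum):
    RULES = 1
    HYBRID = 2
    FULL_LLM = 3


RULES_MAX = 0.40      # composite <= 0.40 routes to Tier 1
FULL_LLM_MIN = 0.85   # composite >= 0.85 routes to Tier 3


def select_tier(composite: float) -> Tier:
    """Map a composite routing score to the cheapest tier that should handle it."""
    if not 0.0 <= composite <= 1.0:
        raise ValueError(f"composite score out of range: {composite}")
    if composite <= RULES_MAX:
        return Tier.RULES
    if composite >= FULL_LLM_MIN:
        return Tier.FULL_LLM
    return Tier.HYBRID
```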

The Routing Score Formula

Every prompt gets a composite routing score before optimization runs:

composite = (context_weight × 0.5)
           + (sophistication × 0.3)
           + (load_factor × 0.2)
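As a worked example of the formula, here is a minimal sketch that computes the composite from three signals. It assumes each signal is already normalized to the range 0 to 1 (the weights sum to 1.0, so the composite then stays in that range too); the `Signals` container and the range check are additions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    context_weight: float
    sophistication: float
    load_factor: float


def composite_score(s: Signals) -> float:
    """Weighted sum from the article: 0.5 context, 0.3 sophistication, 0.2 load."""
    for name, value in vars(s).items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        s.context_weight * 0.5
        + s.sophistication * 0.3
        + s.load_factor * 0.2
    )


# 0.8 * 0.5 + 0.5 * 0.3 + 0.5 * 0.2 = 0.40 + 0.15 + 0.10, roughly 0.65
example = composite_score(Signals(context_weight=0.8, sophistication=0.5, load_factor=0.5))
```

A score of roughly 0.65 falls between the two thresholds, so this prompt would route to the hybrid tier.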

Context Weight (50% of score)

The dominant factor, derived from context detection confidence. High-confidence image generation prompts score higher, which pushes them toward the LLM tiers, because creative enhancement benefits from LLM reasoning. High-confidence structured output prompts score lower, because rules are sufficient and safer. If context detection confidence falls below 0.60, the router falls back to Tier 1 regardless of other signals. Don't apply sophisticated optimization to a prompt you can't confidently categorize.
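One way to picture this behavior is sketched below. Only the 0.60 cutoff comes from the description above; the per-context bias values, the neutral midpoint of 0.5, and the linear interpolation are assumptions chosen to reproduce the stated direction (image generation higher, structured output lower).

```python
from typing import Optional

MIN_CONFIDENCE = 0.60  # below this, the router falls back to Tier 1

# Hypothetical biases: where a fully confident detection would place the weight.
CONTEXT_BIAS = {
    "image_generation": 0.9,   # creative enhancement benefits from LLM reasoning
    "structured_output": 0.2,  # rules are sufficient and safer
}
NEUTRAL = 0.5


def context_weight(context: str, confidence: float) -> Optional[float]:
    """Return a weight in [0, 1], or None to signal the Tier 1 fallback."""
    if confidence < MIN_CONFIDENCE:
        return None
    bias = CONTEXT_BIAS.get(context, NEUTRAL)
    # Assumed shape: move from neutral toward the context's bias as confidence grows.
    return NEUTRAL + (bias - NEUTRAL) * confidence
```

Returning `None` instead of a low number keeps the fallback explicit: the caller has to route to Tier 1 directly rather than hoping the other two signals stay small.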

Sophistication Score (30% of score)

Prompt complexity analysis. "Generate a hello world function" is basic. "Design a multi-region failover architecture with RPO constraints and runbook procedures" is expert-level. The sophistication detector measures lexical density, structural complexity, technical vocabulary depth, and instruction nesting.

Load Factor (20% of score)

Dynamic routing pressure from system load. Under heavy traffic, the router shifts borderline prompts toward lower tiers to maintain response time guarantees. Because the term is added to the score, this implies load_factor shrinks as load rises. A prompt that would normally route to Hybrid might route to Rules under peak load.
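The effect on a borderline prompt can be shown with a short sketch. It assumes `load_factor` is computed as one minus utilization, so that it is 1.0 on an idle system and 0.0 on a saturated one; that mapping is not stated in the article, but some inverse relationship is needed for heavy load to lower the score. The sample signal values are made up.

```python
def load_factor(utilization: float) -> float:
    """Assumed mapping: 1.0 when idle, 0.0 when saturated (input clamped to [0, 1])."""
    clamped = min(max(utilization, 0.0), 1.0)
    return 1.0 - clamped


def composite(context_weight: float, sophistication: float, utilization: float) -> float:
    return (
        context_weight * 0.5
        + sophistication * 0.3
        + load_factor(utilization) * 0.2
    )


def tier(score: float) -> str:
    if score <= 0.40:
        return "rules"
    if score >= 0.85:
        return "full_llm"
    return "hybrid"


# Same prompt signals, different system load.
normal = composite(0.5, 0.4, utilization=0.5)  # about 0.47
peak = composite(0.5, 0.4, utilization=0.9)    # about 0.39
```

Under these assumptions the prompt routes to hybrid at 50% utilization and drops to rules at 90%. Since the load term is capped at 0.2 of the score, only prompts within 0.2 of a threshold can ever be moved by load alone.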

Value Hierarchy Routing Floors

User-defined value hierarchies can override the routing formula. When a NON-NEGOTIABLE priority is set (e.g., "output must always include security considerations"), the router floors that prompt at a minimum routing score — ensuring it reaches a tier capable of enforcing the constraint. A HIGH-priority label floors at 0.45. NON-NEGOTIABLE floors at 0.72.
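A floor is a `max` applied after the composite is computed. The sketch below uses the two floor values given above; the label strings and the `apply_floor` name are illustrative, and rejecting unknown labels is a design choice made here so that a typo cannot silently disable a floor.

```python
from typing import Optional

# Floor values from the article; the label spellings are assumed.
PRIORITY_FLOORS = {
    "HIGH": 0.45,
    "NON-NEGOTIABLE": 0.72,
}


def apply_floor(composite: float, priority: Optional[str] = None) -> float:
    """Raise the score to the priority's floor; never lower it."""
    if priority is None:
        return composite
    if priority not in PRIORITY_FLOORS:
        raise ValueError(f"unknown priority label: {priority!r}")
    return max(composite, PRIORITY_FLOORS[priority])
```

Both floors sit above 0.40 and below 0.85, so a floored prompt is guaranteed at least the hybrid tier, but it only reaches the full LLM tier if its own score gets it there.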

This prevents important prompts from being under-optimized by the cost-saving logic.

Practical Results

For a typical workload distribution (40% basic, 35% moderate, 25% complex), only the complex quarter reaches the full LLM tier, so the routing system makes approximately 75% fewer full LLM calls than sending everything to a frontier model. The 35% in the hybrid tier still trigger a targeted LLM call; the 40% in the rules tier trigger none and return in under 10ms, compared to 1-3 seconds for the LLM tiers.
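The arithmetic behind these percentages is easy to check per 1,000 prompts. The sketch below uses only the 40/35/25 split stated above; the function and key names are illustrative, and no per-call prices are assumed.

```python
from typing import Dict

# Prompts per 1,000 in each tier, from the 40% / 35% / 25% split.
TIER_COUNTS = {"rules": 400, "hybrid": 350, "full_llm": 250}


def summarize(counts: Dict[str, int]) -> Dict[str, int]:
    total = sum(counts.values())
    return {
        # Baseline sends all prompts to a full LLM call; routing sends only full_llm.
        "full_llm_reduction_pct": 100 * (total - counts["full_llm"]) // total,
        # Prompts that never touch an LLM.
        "no_llm_call_pct": 100 * counts["rules"] // total,
        # Prompts that still incur some LLM call (targeted or full).
        "some_llm_call_pct": 100 * (counts["hybrid"] + counts["full_llm"]) // total,
    }


summary = summarize(TIER_COUNTS)
# summary == {"full_llm_reduction_pct": 75, "no_llm_call_pct": 40, "some_llm_call_pct": 60}
```

The distinction matters when estimating spend: the 75% figure counts avoided full rewrites, while 60% of prompts still incur some LLM cost.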

The quality tradeoff is minimal: rules-based optimization is deterministic and domain-specific. For prompts where rules are sufficient, rules produce more consistent output than LLM calls that may vary between requests.

Model-Agnostic Architecture

The routing system is independent of which LLM handles the optimization step. You can configure Claude 4.6 for the full LLM tier, GPT-4.1 for hybrid, and rules-only for the base tier — or use any other combination. Switching from one provider to another doesn't require changes to the routing logic.
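One way to keep routing independent of the provider is to treat each tier's backend as a plain callable. The sketch below is an assumption about how such a seam could look, not the product's architecture; the backends are placeholder lambdas standing in for real provider clients, and no vendor API is called.

```python
from typing import Callable, Dict

Optimizer = Callable[[str], str]
TIER_NAMES = {"rules", "hybrid", "full_llm"}


def pick_tier(score: float) -> str:
    if score <= 0.40:
        return "rules"
    if score >= 0.85:
        return "full_llm"
    return "hybrid"


def make_router(backends: Dict[str, Optimizer]) -> Callable[[str, float], str]:
    """Bind one optimizer per tier; the routing logic never inspects the backend."""
    missing = TIER_NAMES - set(backends)
    if missing:
        raise ValueError(f"missing backends for tiers: {sorted(missing)}")

    def route(prompt: str, score: float) -> str:
        return backends[pick_tier(score)](prompt)

    return route


# Placeholder backends; a real deployment would wrap provider SDK calls here.
router_a = make_router({
    "rules": lambda p: p.strip(),
    "hybrid": lambda p: f"[provider-A hybrid] {p}",
    "full_llm": lambda p: f"[provider-A full] {p}",
})
```

Swapping providers then means passing a different dictionary to `make_router`; `pick_tier` and the thresholds stay untouched.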

This matters when LLM pricing changes (which it does frequently). The share of requests that reach an LLM at all is determined by tier distribution, not by which specific model is on the other end.

Building Routing Into Your Own Pipeline

You don't need to use Prompt Optimizer to apply this pattern. The core insight is: classify before you spend. Any LLM pipeline that sends every request to the same endpoint, regardless of complexity, is leaving optimization headroom on the table.

Start with two tiers: a fast path for prompts that match known simple patterns, and a slow path for everything else. Measure how many requests go through each path. You'll find the distribution skews toward simple more than you expect.
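A two-tier starting point with built-in measurement might look like the sketch below. The verb list, the 20-word cutoff, and all names are assumptions for illustration; the fast and slow handlers are injected so the example does not depend on any particular LLM client.

```python
import re
from collections import Counter
from typing import Callable

# Hypothetical "known simple pattern": a short prompt that opens with a common verb.
SIMPLE_OPENING = re.compile(r"^(write|generate|create|convert|translate)\b", re.IGNORECASE)
MAX_SIMPLE_WORDS = 20

path_counts: Counter = Counter()


def is_simple(prompt: str) -> bool:
    stripped = prompt.strip()
    return bool(SIMPLE_OPENING.match(stripped)) and len(stripped.split()) <= MAX_SIMPLE_WORDS


def handle(prompt: str, fast: Callable[[str], str], slow: Callable[[str], str]) -> str:
    """Route to the fast path when the prompt looks simple, and count both paths."""
    if is_simple(prompt):
        path_counts["fast"] += 1
        return fast(prompt)
    path_counts["slow"] += 1
    return slow(prompt)


def fast_share() -> float:
    total = sum(path_counts.values())
    return path_counts["fast"] / total if total else 0.0
```

Logging `fast_share()` over a day of real traffic gives the distribution needed to decide whether a third tier is worth building.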
