Smartflow stacks three independent, compounding cost-reduction layers into every request pipeline. Most AI proxies stop at layer one; Smartflow compounds all three.
Every request is checked against four local lookup phases before any token is sent to a provider. Phases execute in order; the first hit wins and the remaining phases are skipped.
Phase 4 (VectorLite semantic search): query embeddings are computed with sentence-transformers/all-MiniLM-L6-v2 (384-dim vectors, local inference, no external API call). A k-nearest-neighbours search then runs across all stored response embeddings in the Redis Stack vector index using cosine similarity, with a default threshold of 0.90. This phase catches paraphrased questions that differ entirely in wording but carry the same meaning.

Separately, full conversation and response history is written to MongoDB in parallel with every response, supporting long-range cost attribution, compliance retrieval, cache warming on restart, and audit replay. This is not a lookup phase: it runs asynchronously and adds no latency.
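The Phase 4 threshold check can be sketched in a few lines. This is a minimal, pure-Python stand-in: a list of `(embedding, response)` pairs replaces the Redis Stack vector index, the MiniLM embedding step is assumed to have already produced the vectors, and the function name `semantic_lookup` is illustrative rather than Smartflow's actual API.

```python
import math

SIMILARITY_THRESHOLD = 0.90  # default from the section above

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_lookup(query_vec, cache):
    """cache: list of (embedding, cached_response) pairs, an in-memory
    stand-in for the vector index. Returns (response, score) for the
    nearest neighbour at or above the threshold, else (None, best_score)."""
    best_score, best_response = -1.0, None
    for emb, response in cache:
        score = cosine_similarity(query_vec, emb)
        if score > best_score:
            best_score, best_response = score, response
    if best_score >= SIMILARITY_THRESHOLD:
        return best_response, best_score
    return None, best_score
```

A production index does this with an approximate-nearest-neighbour search rather than a linear scan, but the accept/reject decision against the 0.90 threshold is the same.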
On every cache miss, Smartflow compresses the outgoing request body before forwarding it to the provider. Compression runs transparently — clients send and receive standard API payloads.
39 common verbose patterns are replaced with concise equivalents automatically. “In order to” → “To”. “Due to the fact that” → “Because”. Average 20–30% token reduction on typical prose-heavy system prompts.
Repeated concepts across the message history are detected via embedding similarity and replaced with compact references. On decompression, references are resolved before the response reaches the client.
Compression aggressiveness is tuned per provider — Anthropic models tolerate tighter compression than GPT-3.5-class models. Token savings are tracked per request in VAS logs.
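The pattern-substitution pass described above is essentially a table of regex rewrites. A minimal sketch follows; the four patterns shown are examples from this section plus common equivalents, not Smartflow's actual 39-entry table.

```python
import re

# Illustrative subset of the verbose→concise rewrite table.
VERBOSE_PATTERNS = [
    (r"\bin order to\b", "to"),
    (r"\bdue to the fact that\b", "because"),
    (r"\bat this point in time\b", "now"),
    (r"\bhas the ability to\b", "can"),
]

def compress_prompt(text):
    """Apply each rewrite across the prompt; meaning is preserved
    while token count drops."""
    for pattern, replacement in VERBOSE_PATTERNS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
```

On prose-heavy system prompts, stacking many such rewrites is where the quoted 20–30% reduction comes from; a real implementation would also handle sentence-initial capitalization, which this sketch does not.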
Providers like Anthropic (Claude 3+) and OpenAI offer server-side prompt caching — repeated input prefixes stored on their infrastructure and billed at a fraction of the full input rate. Anthropic cached token rate ~$3/MTok vs ~$15/MTok. OpenAI caches automatically for prompts over 1,024 tokens.
The problem: clients must opt in by attaching cache markers. Most do not. Smartflow’s prompt cache injector does this transparently for every proxied request.
cache_control: {type: ephemeral} injected on every request, immediately priming the provider cache. Every subsequent call within the 5-minute window hits cached tokens.
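The injection itself amounts to rewriting the system prompt into block form with a marker attached. A hedged sketch, assuming an Anthropic Messages API request body; the helper name is illustrative:

```python
def inject_cache_marker(body):
    """Attach an ephemeral cache_control marker to the system prompt,
    mirroring what the injector does transparently on each request."""
    system = body.get("system")
    if isinstance(system, str):
        # Convert the plain string to block form so a marker can attach.
        body["system"] = [{
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }]
    elif isinstance(system, list) and system:
        # Mark the last block: everything up to and including it is cached.
        system[-1].setdefault("cache_control", {"type": "ephemeral"})
    return body
```

Because the client's payload is rewritten in flight, the client never sees the marker; it only benefits from the cheaper cached-token billing on subsequent calls.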
Cache markers injected automatically once the repetition threshold is reached — covering applications with shorter but heavily reused system prompts.
System message hashes stored in Redis with a 15-minute TTL. The injector knows whether a system message is new (priming) or warm (cache likely live).
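The priming/warm decision reduces to a hash lookup with a TTL. The sketch below substitutes an in-memory dict for the Redis store (a real deployment would use SETEX/EXISTS); names and the injected `now` parameter are illustrative.

```python
import hashlib
import time

TTL_SECONDS = 15 * 60  # 15-minute TTL from the docs
_seen = {}             # hash -> expiry time; stand-in for Redis

def cache_state(system_message, now=None):
    """Return 'priming' for a new system message, 'warm' if its hash
    was seen within the TTL. Refreshes the TTL on every request."""
    now = time.monotonic() if now is None else now
    key = hashlib.sha256(system_message.encode()).hexdigest()
    expires = _seen.get(key)
    state = "warm" if expires is not None and expires > now else "priming"
    _seen[key] = now + TTL_SECONDS
    return state
```

Hashing means the store holds no prompt content, only fingerprints, so the warm/priming check stays cheap and privacy-neutral.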
Cache hit tokens read from every provider response (cache_read_input_tokens for Anthropic, prompt_tokens_details.cached_tokens for OpenAI) and logged with estimated dollar savings.
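Extracting those counts is a small per-provider lookup over the response's usage block. A minimal sketch using the field names named above; the function name is illustrative:

```python
def cached_tokens(provider, response):
    """Pull the cached-token count from a provider response dict.
    Returns 0 when the field is absent (no cache hit reported)."""
    usage = response.get("usage", {}) or {}
    if provider == "anthropic":
        return usage.get("cache_read_input_tokens", 0) or 0
    if provider == "openai":
        details = usage.get("prompt_tokens_details", {}) or {}
        return details.get("cached_tokens", 0) or 0
    return 0
```

Multiplying the returned count by the gap between the cached and uncached per-token rates gives the estimated dollar savings that land in the VAS log.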
Every request produces a VAS audit log entry, retrievable via GET /api/vas/logs. Cache-related fields:
| Field | Cache Hit | Cache Miss |
|---|---|---|
| metacache.hit | true | false |
| metacache.query | The matched query string | null |
| metacache.tokens_saved | ≈ response length ÷ 4 | 0 |
| latency_ms | 5 (no provider call) | Actual RTT |
| routing_strategy | "cache" | "direct" |
| routing_reason | "cache_hit:tier=L1" | "provider:openai" |
| model | From request body | From provider response |
Response headers on every cache hit:
```
X-Cache-Hit: true
x-cache-similarity: 1.0
x-tokens-saved: 209
```
| Request Type | Layer 1 (Local) | Layer 2 (Compression) | Layer 3 (Provider) | Est. Saving |
|---|---|---|---|---|
| Repeated query (exact) | Phase 1 hit | — | — | ~100% cost |
| Same intent, different phrasing | Phase 2 hit | — | — | ~100% cost |
| Structured / multimodal exact | Phase 3 hit | — | — | ~100% cost |
| Paraphrased (≥ 0.90 similarity) | Phase 4 VectorLite | — | — | ~100% cost |
| New query, large system prompt | Miss | 20–30% saved | 60–90% on prefix | 70–95% cost |
| New query, novel content | Miss | 20–30% saved | — | 20–30% cost |
Basic AI proxies offer Phase 1 exact-match caching only. LiteLLM adds a Redis semantic cache backed by an external vector store (Qdrant). Neither applies in-flight compression, neither injects provider-side cache markers, and neither tracks what was saved, or why, across all three layers.
Smartflow’s Phase 4 VectorLite search runs entirely inside the proxy process using a native BERT model — no external vector database dependency, no network hop, no per-embedding API cost.