LangSmart Smartflow · Cost & Performance

The Cache Ecosystem

Three independent, compounding cost-reduction layers in every request pipeline. Most AI proxies stop at layer one. Smartflow compounds all three.

- 4 cache phases
- 65% average compression
- 1.25M+ tokens saved
- ~5 ms cache-hit latency
- 0.90 semantic similarity threshold
Why single-layer caching misses almost everything
Identical prompts from identical users are rare. System messages evolve. User questions vary in phrasing. Conversation context grows with each turn. A naive exact-match cache therefore catches only a sliver of traffic — and every request it misses is forwarded to the provider in full, with no compression, no provider-side caching signal, and no visibility into what was saved or why.
🔒 Layer 1 — Four-Phase Local Cache (MetaCache)

Every request is checked against four local lookup phases before any token is sent to a provider. Phases execute in order; the first hit wins and the remaining phases are skipped.

1. **Intent Signature Match.** Every request is normalised and its semantic intent is fingerprinted. If the same intent fingerprint has been seen before, the stored response is returned immediately — sub-millisecond. Covers repeated questions regardless of minor phrasing variation.
2. **Intent Signature Near-Miss.** When the exact fingerprint is not present, the intent model checks for close-enough signatures using lightweight token overlap. Catches reformulations that share the same core question.
3. **Exact Key Lookup (SHA-256).** SHA-256 key lookup against the Redis store. Reliable fallback for multimodal and structured requests where intent fingerprinting is less applicable — image inputs, JSON-heavy payloads, code prompts.
4. **VectorLite BERT Semantic Search.** Request text is embedded using sentence-transformers/all-MiniLM-L6-v2 (384-dim vectors, local inference — no external API call). A K-nearest-neighbours search runs across all stored response embeddings in the Redis Stack vector index using cosine similarity. Threshold default 0.90. Catches paraphrased questions that differ entirely in wording but carry the same meaning.
A cache hit at any phase = zero provider tokens consumed
Embeddings are generated locally using native BERT inference — no external vector database (Qdrant, Weaviate, Pinecone) required. No network hop. No per-embedding API cost.
Persistent Archive

Full conversation and response history is written to MongoDB in parallel with every response. Supports long-range cost attribution, compliance retrieval, cache warming on restart, and audit replay. Not a lookup phase — runs asynchronously and does not add latency.

☁ Layer 2 — Prompt Compression (In-Flight Token Reduction)

On every cache miss, Smartflow compresses the outgoing request body before forwarding it to the provider. Compression runs transparently — clients send and receive standard API payloads.

Verbose Phrase Reduction

39 common verbose patterns are replaced with concise equivalents automatically. “In order to” → “To”. “Due to the fact that” → “Because”. Average 20–30% token reduction on typical prose-heavy system prompts.
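A minimal sketch of this pass, assuming a small illustrative subset of the pattern table (the full set of 39 phrasings is internal to Smartflow):

```python
import re

# Illustrative subset of the verbose-phrase table (assumed entries)
VERBOSE_PATTERNS = {
    r"\bin order to\b": "to",
    r"\bdue to the fact that\b": "because",
    r"\bat this point in time\b": "now",
}

def compress_prose(text: str) -> str:
    # Apply each replacement case-insensitively; a production pass
    # would also preserve the casing of sentence-initial matches.
    for pattern, short in VERBOSE_PATTERNS.items():
        text = re.sub(pattern, short, text, flags=re.IGNORECASE)
    return text
```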

Semantic Deduplication

Repeated concepts across the message history are detected via embedding similarity and replaced with compact references. On decompression, references are resolved before the response reaches the client.
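A rough sketch of the dedup/resolve round trip. Smartflow detects repeats via embedding similarity; plain token overlap (Jaccard, assumed threshold 0.8) stands in for it here, and the `[ref:N]` marker format is illustrative, not Smartflow's wire format.

```python
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dedup_history(messages, threshold=0.8):
    # Compress: near-duplicate messages collapse into "[ref:i]" pointers
    # to the first (canonical) occurrence.
    kept, out = [], []
    for msg in messages:
        match = next((i for i, k in enumerate(kept)
                      if jaccard(msg, k) >= threshold), None)
        if match is None:
            out.append(msg)
            kept.append(msg)
        else:
            out.append(f"[ref:{match}]")
    return out, kept

def resolve_refs(compressed, kept):
    # Decompress: references resolve back to the canonical stored text.
    return [kept[int(m[5:-1])] if m.startswith("[ref:") and m.endswith("]")
            else m for m in compressed]
```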

Model-Specific Optimisation

Compression aggressiveness is tuned per provider — Anthropic models tolerate tighter compression than GPT-3.5-class models. Token savings are tracked per request in VAS logs.

- 65% average compression ratio
- 1.25M+ tokens saved to date
- 39 verbose patterns
☀ Layer 3 — Transparent LLM-Side Prompt Caching

Providers like Anthropic (Claude 3+) and OpenAI offer server-side prompt caching — repeated input prefixes stored on their infrastructure and billed at a fraction of the full input rate. Anthropic bills cache reads at roughly $3/MTok against ~$15/MTok for uncached input; OpenAI applies caching automatically to prompts over 1,024 tokens.
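The arithmetic behind "a fraction of the full input rate" is straightforward. A hedged sketch using the example figures above — actual rates vary by model and provider:

```python
FULL_RATE = 15.0    # $/MTok, uncached input (example figure above)
CACHED_RATE = 3.0   # $/MTok, cache-read rate (example figure above)

def input_cost_usd(total_tokens: int, cached_tokens: int) -> float:
    # Cached prefix tokens bill at the discounted rate;
    # the remainder bills at the full input rate.
    fresh = total_tokens - cached_tokens
    return (fresh * FULL_RATE + cached_tokens * CACHED_RATE) / 1_000_000
```

At these rates, a 10,000-token prompt with an 8,000-token cached prefix costs $0.054 instead of $0.150 — a 64% reduction on input alone.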

The problem: clients must opt in by attaching cache markers. Most do not. Smartflow’s prompt cache injector does this transparently for every proxied request.

Large System Messages (≥ 4,000 chars)

cache_control: {type: ephemeral} injected on every request, immediately priming the provider cache. Every subsequent call within the 5-minute window hits cached tokens.

Repetitive Medium Messages (≥ 2,000 chars, ≥ 3×)

Cache markers injected automatically once the repetition threshold is reached — covering applications with shorter but heavily reused system prompts.

SHA-256 Hash Tracking

System message hashes stored in Redis with a 15-minute TTL. The injector knows whether a system message is new (priming) or warm (cache likely live).

Response Parsing

Cache hit tokens read from every provider response (cache_read_input_tokens for Anthropic, prompt_tokens_details.cached_tokens for OpenAI) and logged with estimated dollar savings.
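A sketch of that parsing step, assuming the usage objects follow the field names above (the function name is illustrative):

```python
def cached_tokens(provider: str, response: dict) -> int:
    # Read the cache-hit token count from a provider response body;
    # absent fields count as zero cached tokens.
    usage = response.get("usage", {})
    if provider == "anthropic":
        return usage.get("cache_read_input_tokens", 0)
    if provider == "openai":
        return usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return 0
```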

📈 Observability — Cache Data in VAS Logs

Every request produces a VAS audit log entry at GET /api/vas/logs. Cache-related fields:

| Field | Cache Hit | Cache Miss |
| --- | --- | --- |
| `metacache.hit` | `true` | `false` |
| `metacache.query` | The matched query string | `null` |
| `metacache.tokens_saved` | ≈ response length ÷ 40 | 0 |
| `latency_ms` | ~5 (no provider call) | Actual RTT |
| `routing_strategy` | `"cache"` | `"direct"` |
| `routing_reason` | `"cache_hit:tier=L1"` | `"provider:openai"` |
| `model` | From request body | From provider response |

Response headers on every cache hit:

X-Cache-Hit: true
x-cache-similarity: 1.0
x-tokens-saved: 209
📊 Combined Savings by Request Type
| Request Type | Layer 1 (Local) | Layer 2 (Compression) | Layer 3 (Provider) | Est. Saving |
| --- | --- | --- | --- | --- |
| Repeated query (exact) | Phase 1 hit | – | – | ~100% cost |
| Same intent, different phrasing | Phase 2 hit | – | – | ~100% cost |
| Structured / multimodal exact | Phase 3 hit | – | – | ~100% cost |
| Paraphrased (≥ 0.90 similarity) | Phase 4 VectorLite | – | – | ~100% cost |
| New query, large system prompt | Miss | 20–30% saved | 60–90% on prefix | 70–95% cost |
| New query, novel content | Miss | 20–30% saved | – | 20–30% cost |
📊 What Other Solutions Do

Basic AI proxies offer Phase 1 exact-match caching only. LiteLLM adds a Redis semantic cache backed by an external vector store (Qdrant). Neither applies in-flight compression. Neither injects provider-side cache markers. Neither tracks what was saved or why across all surfaces.

Smartflow’s Phase 4 VectorLite search runs entirely inside the proxy process using a native BERT model — no external vector database dependency, no network hop, no per-embedding API cost.