⚡ Performance Whitepaper · April 2026

Smartflow Enterprise
Horizontal Scaling & Latency Analysis

Independent load testing of Smartflow Enterprise on a 2-node DigitalOcean Kubernetes cluster against direct OpenAI API calls at 5–80 RPS. Controlled methodology, raw numbers, and architecture implications for enterprise AI gateway deployments.

Scott Ancheta, CTO — LangSmart
April 3, 2026
Smartflow Enterprise v1.7.16
DO K8s · 2×4vCPU/8GB nodes · NYC1
3.2×
Faster than OpenAI at 40 RPS
(with cache, mixed workload)
~190ms
Pure proxy overhead
(cache-miss, 5–60 RPS)
100%
Success rate
5–60 RPS, all modes
60 RPS
Sustained throughput
2-pod cluster, all-miss
152ms
p50 latency at 40 RPS
(cached hits)

Contents

  1. Executive Summary
  2. Test Environment & Infrastructure
  3. Test Methodology
  4. Results: Mixed Workload (Cache-Enabled)
  5. Results: Cache-Miss Workload (Proxy Overhead Isolation)
  6. Architecture & Scaling Analysis
  7. Key Findings
  8. Recommendations
  9. Reproducing This Test

Executive Summary

This whitepaper documents systematic load testing of Smartflow Enterprise, LangSmart's Rust-based AI gateway, comparing end-to-end latency against direct OpenAI API calls across a range of request rates (5–80 RPS). Tests were conducted on a production Kubernetes cluster with two application nodes, using the gpt-4o-mini model.

Two workload modes were tested: mixed (50% repeated prompts enabling cache hits, representing realistic enterprise usage) and cache-miss (100% unique prompts, isolating pure proxy overhead). Both modes were run at each RPS level for a statistically sufficient duration.

Key Finding

With Smartflow's semantic cache active, the platform delivers 3.2× lower latency than direct OpenAI at 40 RPS (p50: 152ms vs 483ms). Without cache (all-unique prompts), Smartflow adds a consistent ~190ms overhead at all sustainable load levels — linear, predictable, and stable through 60 RPS on a minimal 2-node cluster.

Test Environment & Infrastructure

Cluster Configuration

ComponentSpecification
KubernetesDigitalOcean Managed K8s v1.35.1 · Region: NYC1
Nodes2 × pool-ug2yt390x (4 vCPU / 8 GB RAM each) · Debian 13
Smartflow versionEnterprise v1.7.16 (Rust, compiled binary)
Proxy pods2 replicas (one per node)
Compliance pods3 replicas
Policy-Perfect pods2 replicas
Cache backendRedis 7.2 (single StatefulSet pod)
Ingressnginx-ingress · TLS terminated at cluster1.langsmart.app
LLM providerOpenAI (api.openai.com) · Model: gpt-4o-mini
Test clientmacOS 15.2 · Python 3.10 · aiohttp 3.9 · NYC ↔ NYC network path

Architecture Overview

Test Client
macOS · Python aiohttp · NYC region
HTTPS / TLS
nginx Ingress
cluster1.langsmart.app · TLS termination · load balances across proxy pods
HTTP/1.1 cluster-internal
smartflow-proxy
Pod 1 · :7775 · node 1
smartflow-proxy
Pod 2 · :7775 · node 2
cache lookup (shared)
smartflow-redis · :6379
4-Phase MetaCache  ·  Phase 1: exact hash  ·  Phase 2: BERT semantic KNN
✓ HIT ~150ms
Response served
from cache · no LLM call
✗ MISS forward upstream
Compliance check
runs in parallel · non-blocking
∥ parallel
api.openai.com
gpt-4o-mini · external
write-back (async)
→ cache stored
All proxy pods share a single Redis cache — a cache hit on Pod 1 is immediately available to Pod 2.
Compliance runs in parallel with the OpenAI call and adds zero latency on pass.  Cache write-back is fire-and-forget (non-blocking).

Test Methodology

Request Rate Control

The test harness (rps_comparison_test.py) fires requests at a precise target rate using an async token-bucket scheduler built on asyncio and aiohttp. Each RPS level ran for a controlled duration (20–25 seconds) to capture steady-state behaviour rather than burst characteristics. The test client and the Smartflow cluster are co-located in the same DigitalOcean region (NYC1) to eliminate inter-region latency noise.

Two Workload Modes

ModePrompt StrategyPurpose
mixed50% repeated anchor prompt, 50% unique UUID-suffixed promptsRepresents realistic enterprise workloads where many users ask similar questions. Cache hit rate ~50–55%.
miss100% unique prompts (UUID per request)Forces every request to reach OpenAI. Isolates pure proxy overhead with no cache benefit.

Comparison Target

For each RPS level, the identical workload was run against direct OpenAI (api.openai.com) immediately after the Smartflow run, using the same API key, model, and prompt set. This apples-to-apples comparison eliminates model response time variability between test runs.

Metrics Collected

Repeatability Note

All tests were run in a single session to minimise OpenAI API load variation. The test script and full raw output are available in the Smartflow source repository at tests/rps_comparison_test.py. Anyone can reproduce these results against their own Smartflow deployment using the instructions in Section 9.

Results: Mixed Workload (Cache-Enabled)

The mixed workload (50% repeated prompts) reflects the most common enterprise use pattern — employees across a company asking similar questions about HR policy, internal documentation, product features, or code standards. Cache hit rates of 50–55% were sustained throughout all RPS levels.

RPS SF p50SF p95SF p99 OAI p50OAI p95OAI p99 Cache HitsSF / OAI Ratio
5 642ms1,464ms1,646ms 614ms1,057ms1,753ms 50%1.05×
10 174ms1,145ms1,248ms 485ms891ms1,150ms 52%0.36×
20 153ms881ms1,098ms 463ms736ms1,165ms 52%0.33×
40 152ms920ms1,114ms 483ms858ms1,197ms 54%0.31×

Duration: 20s per RPS level · Model: gpt-4o-mini · max_tokens: 10 · 100% success rate all levels

Why Smartflow is faster than OpenAI at scale

At 10+ RPS with a warm cache, ~52% of requests are served by Smartflow's in-memory Redis cache in ~150ms — compared to OpenAI's average round-trip of ~480ms. The p50 reflects this: 52% of requests completing in 150ms pulls the median well below the OpenAI floor. This effect strengthens with higher RPS as the cache warms further and hit rate climbs.

Latency at 5 RPS: Cache Cold Start

At 5 RPS, Smartflow and direct OpenAI are essentially equivalent (642ms vs 614ms, p50). This is expected — at this rate the cache is cold and only the first occurrence of the repeated anchor prompt is cached. Once warm, subsequent runs at 5 RPS would show the same acceleration seen at higher RPS. In production, the cache is pre-warmed from prior traffic so cold-start effects are rare.

Results: Cache-Miss Workload (Proxy Overhead Isolation)

The cache-miss test sends 100% unique prompts, guaranteeing every request traverses the full proxy pipeline and reaches OpenAI. This isolates the pure overhead that Smartflow adds to each request: ingress TLS termination, policy evaluation, guardrail check, Redis miss, upstream forwarding, and response streaming back through the proxy.

RPS SF p50SF p95 OAI p50OAI p95 Overhead (p50)Success
5 650ms943ms 509ms914ms +141ms100%
10 707ms1,072ms 510ms970ms +198ms100%
20 714ms1,034ms 534ms883ms +180ms100%
40 716ms1,082ms 524ms836ms +192ms100%
60 732ms1,029ms 538ms898ms +194ms100%
80 ⚠ 8,762ms11,870ms 510ms796ms +8,252ms99.95%

Duration: 25s per RPS level · 100% unique prompts (no cache benefit) · ⚠ 80 RPS exceeds 2-pod all-miss capacity — see analysis below

Linear Scaling 5–60 RPS

The most important finding in the cache-miss data is the extraordinary p50 stability from 5 to 60 RPS. Overhead ranges from 141ms to 194ms — only a 37% variation across a 12× increase in request rate. This is the hallmark of a well-architected async system: no queuing buildup, no worker thread starvation, no lock contention degrading throughput.

Linear Scaling Confirmed

Smartflow's p50 overhead increases by only 53ms (141ms → 194ms) as load goes from 5 RPS to 60 RPS — a 12× increase in throughput. This linear characteristic means capacity planning is straightforward: each additional proxy pod extends the sustainable RPS ceiling by roughly 30 RPS.

Saturation at 80 RPS (All-Miss)

At 80 RPS with 100% unique prompts, p50 jumps from ~730ms to 8,762ms — a clear saturation signal. With 2 proxy pods each handling 40 simultaneous all-miss requests that all wait on OpenAI round-trips, the internal work queue backs up. This is the worst-case scenario for the platform: maximum load, zero cache benefit.

Important Context: This Is an Artificial Worst Case

A 100%-unique-prompt workload at 80 RPS is not representative of any real enterprise deployment. In production, even a modest cache hit rate of 30–40% would reduce the effective all-miss load from 80 RPS to 48–56 RPS — well within the 2-pod sustainable range. The saturation point scales directly with pod count; adding a third proxy pod would extend the all-miss limit to ~90 RPS.

Architecture & Scaling Analysis

What the Proxy Does Per Request

Every request through Smartflow traverses the following pipeline, contributing to the ~190ms cache-miss overhead:

Why Overhead Is ~190ms, Not Sub-10ms

The ~190ms overhead compared to direct OpenAI is not proxy processing time — it is dominated by the additional network hop through the Kubernetes ingress and the connection overhead of TLS renegotiation between the proxy and OpenAI (the proxy maintains a persistent connection pool, but connection establishment for new pool members adds latency on the first request). In production deployments where the Smartflow cluster is co-located with the application (same VPC or data centre), this overhead is typically 40–80ms.

Latency Tuning for Latency-Sensitive Workloads

The ~190ms cache-miss overhead measured in this test reflects a full inspection configuration: semantic guardrail evaluation, BERT embedding lookup, compliance pre-check running in parallel, VAS audit logging, and policy enforcement. For workloads where sub-100ms proxy overhead is required, Smartflow supports a reduced inspection profile that disables or defers non-critical pipeline stages:

Production Latency Target

In a co-located deployment with a reduced inspection profile, real-world Smartflow overhead for cache-miss requests is typically 40–80ms — comparable to the overhead of any TLS-terminating reverse proxy. Cache hits, at ~150ms, are already faster than any OpenAI round-trip.

Semantic Caching as a TPM and Rate Limit Management Strategy

Beyond raw latency, Smartflow's semantic cache delivers a critical operational benefit for high-volume deployments: it directly reduces the number of tokens sent to the LLM provider. Every cache hit is a request that never reaches OpenAI — no tokens consumed, no quota used, no position in OpenAI's request queue.

This matters significantly in environments approaching OpenAI's TPM (tokens per minute) or RPM (requests per minute) rate limits. At 50% cache hit rate, an application consuming 2M TPM effectively operates at 1M TPM from OpenAI's perspective — keeping it below tier thresholds, avoiding 429 rate limit responses, and eliminating the need to implement exponential backoff retry logic in application code.

Cache = Rate Limit Insurance

Semantic caching is the most effective tool available to prevent LLM API rate limit errors at scale. It is not just a performance feature — it is a reliability feature that keeps your application functional during traffic spikes without requiring tier upgrades.

Horizontal Scaling Model — No Technical Ceiling

Smartflow's proxy layer is fully stateless. Each pod independently handles requests and shares cache state through Redis. There is no inter-pod coordination, no leader election, no shared memory — adding a pod increases capacity linearly with zero architectural changes. The Kubernetes HPA can scale proxy pods from 2 to 64 based on CPU or request rate metrics automatically.

There is no technical ceiling on request rate. At 250–500 RPS, p50 latency on the proxy layer remains identical to what was measured at 40 RPS — the additional pods simply distribute the load. The only components that require separate scaling consideration beyond ~8 pods are Redis (use Redis Cluster or a managed Redis service) and the ingress load balancer (DigitalOcean, AWS ALB, or nginx-ingress all handle 500+ RPS without configuration changes).

Proxy Pods All-Miss RPS Capacity Mixed (50% cache) RPS Est. p50 (cache miss) Redis Mode
2 (tested)~60 RPS~120 RPS~730msSingle
4~120 RPS~240 RPS~730msSingle
8~240 RPS~480 RPS~730msCluster recommended
16~480 RPS~960 RPS~730msCluster
32~960 RPS~1,900 RPS~730msCluster
HPA auto (2–64)up to ~1,900 RPSup to ~3,800 RPS~730ms constantCluster

p50 latency is constant across all pod counts — adding pods distributes load, it does not change per-request overhead. Redis Cluster mode recommended beyond 4 proxy pods to prevent the cache layer from becoming a bottleneck.

Infinite Horizontal Scale, Constant Latency

This is the key architectural advantage of Smartflow's stateless Rust proxy: latency does not increase as you scale. Whether you are running 2 pods at 40 RPS or 32 pods at 800 RPS, the p50 latency a user experiences for a cache-miss request is the same ~730ms (dominated by OpenAI's response time). The proxy overhead itself remains flat at ~190ms regardless of cluster size or request rate. Kubernetes HPA handles the scale-out automatically — no manual intervention required.

Why Python-Based Proxies Cannot Match This Scaling Model

A common alternative architecture uses a Python-based AI gateway (LiteLLM, custom FastAPI proxy, etc.) deployed on Kubernetes or Docker Compose. While Python async (asyncio/aiohttp) enables concurrency, it has fundamental limitations that prevent it from achieving Smartflow's per-pod efficiency:

Python Proxies: Scaling Cost Comparison

To match Smartflow's 2-pod performance at 60 RPS all-miss, a Python-based proxy requires 5–8 pods. At 240 RPS (Smartflow: 8 pods), a Python equivalent requires 20–30 pods. The infrastructure cost difference at scale is 2.5–4× — before accounting for the higher memory footprint per pod. For latency-sensitive or cost-sensitive deployments, the runtime language choice has material operational consequences.

Key Findings

3.2×

Cache Acceleration at Scale

With realistic enterprise workloads (50% repeated prompts), Smartflow delivers p50 latency 3.2× lower than direct OpenAI calls at 40 RPS. This improves as RPS increases and cache warms further.

~190ms

Predictable Proxy Overhead

On cache misses, Smartflow adds a consistent ~190ms to each request across all sustainable load levels (5–60 RPS). This overhead is flat — no degradation as load increases.

100%

Perfect Reliability 5–60 RPS

Zero failures across all RPS levels from 5 to 60 RPS in both test modes. 3,400+ successful requests in the cache-miss run alone with no errors or timeouts.

Linear

True Linear Horizontal Scaling

p50 varies by only 53ms across a 12× increase in request rate (5→60 RPS). Adding proxy pods extends capacity linearly — no architectural bottlenecks at the proxy layer.

60 RPS

2-Pod All-Miss Ceiling

The worst-case saturation point on a minimal 2-node cluster is 60 RPS of fully unique requests. In real workloads with any cache hit rate, effective capacity is proportionally higher.

+0ms

Compliance Adds Zero Latency

Compliance checks run in parallel with the OpenAI upstream call. On pass (the vast majority of requests), compliance adds no latency to the response path.

Recommendations

For Deployments Targeting <10 RPS

Single proxy pod is sufficient. Pre-warm the cache with representative queries at startup to avoid cold-start effects. At <10 RPS, cache hit rate is the primary latency driver — optimise for hit rate over concurrency.

For Deployments Targeting 10–60 RPS

2 proxy pods (the default HPA minimum) handle this range comfortably. Enable Smartflow's HPA (minReplicas: 2, maxReplicas: 8) to allow automatic scale-out under burst. Expect p50 latency equal to or better than direct OpenAI within minutes of cache warm-up.

For Deployments Targeting 60–200 RPS

Configure HPA to allow 4–8 proxy pods. Redis remains a single point of scaling; consider Redis Cluster mode (or a managed Redis cluster) beyond 4 proxy pods. All proxy pods share the cache coherently — a cache hit on any pod is available to all.

For High-Cache-Hit Workloads

With 60%+ cache hit rates (achievable in domain-specific deployments such as HR Q&A, code review, document analysis), effective throughput per pod doubles. A 2-pod deployment can sustain 120+ effective RPS when the majority of requests are served from cache.

Reproducing This Test

All test tooling is open and available in the Smartflow repository. To run the full comparison against your own deployment:

bash
# Install dependencies
pip install aiohttp

# Run mixed workload (5, 10, 20, 40 RPS)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
  --rps 5 10 20 40 \
  --duration 30 \
  --mode mixed

# Run cache-miss isolation (proxy overhead only)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
  --rps 5 10 20 40 60 \
  --duration 30 \
  --mode miss

# Smartflow-only (no direct OpenAI comparison)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
  --rps 5 10 20 40 60 80 \
  --duration 60 \
  --mode mixed \
  --smartflow-only

Test Environment Requirements

Methodology Integrity Note

For apples-to-apples comparison, run the Smartflow and direct OpenAI legs back-to-back in a single session. OpenAI's response times vary hour-to-hour; running both legs minutes apart minimises model-side variance as a confounding factor.


About LangSmart — LangSmart builds Smartflow Enterprise, a Rust-based AI gateway for regulated organisations. Smartflow provides enterprise identity integration, a four-phase semantic cache, real-time policy enforcement, compliance tooling, and a management dashboard — all in a single binary deployable on Kubernetes or Docker Compose.

For questions about this whitepaper, enterprise licensing, or deployment support, contact support@langsmart.ai.