Independent load testing of Smartflow Enterprise on a 2-node DigitalOcean Kubernetes cluster against direct OpenAI API calls at 5–80 RPS. Controlled methodology, raw numbers, and architecture implications for enterprise AI gateway deployments.
This whitepaper documents systematic load testing of Smartflow Enterprise, LangSmart's Rust-based AI gateway, comparing end-to-end latency against direct OpenAI API calls across a range of request rates (5–80 RPS). Tests were conducted on a production Kubernetes cluster with two application nodes, using the gpt-4o-mini model.
Two workload modes were tested: mixed (50% repeated prompts enabling cache hits, representing realistic enterprise usage) and cache-miss (100% unique prompts, isolating pure proxy overhead). Both modes were run at each RPS level for a statistically sufficient duration.
With Smartflow's semantic cache active, the platform delivers 3.2× lower latency than direct OpenAI at 40 RPS (p50: 152ms vs 483ms). Without cache (all-unique prompts), Smartflow adds a consistent ~190ms overhead at all sustainable load levels — linear, predictable, and stable through 60 RPS on a minimal 2-node cluster.
| Component | Specification |
|---|---|
| Kubernetes | DigitalOcean Managed K8s v1.35.1 · Region: NYC1 |
| Nodes | 2 × pool-ug2yt390x (4 vCPU / 8 GB RAM each) · Debian 13 |
| Smartflow version | Enterprise v1.7.16 (Rust, compiled binary) |
| Proxy pods | 2 replicas (one per node) |
| Compliance pods | 3 replicas |
| Policy-Perfect pods | 2 replicas |
| Cache backend | Redis 7.2 (single StatefulSet pod) |
| Ingress | nginx-ingress · TLS terminated at cluster1.langsmart.app |
| LLM provider | OpenAI (api.openai.com) · Model: gpt-4o-mini |
| Test client | macOS 15.2 · Python 3.10 · aiohttp 3.9 · NYC ↔ NYC network path |
The test harness (rps_comparison_test.py) fires requests at a precise target rate using an async token-bucket scheduler built on asyncio and aiohttp. Each RPS level ran for a controlled duration (20–25 seconds) to capture steady-state behaviour rather than burst characteristics. The test client and the Smartflow cluster are co-located in the same DigitalOcean region (NYC1) to eliminate inter-region latency noise.
| Mode | Prompt Strategy | Purpose |
|---|---|---|
| mixed | 50% repeated anchor prompt, 50% unique UUID-suffixed prompts | Represents realistic enterprise workloads where many users ask similar questions. Cache hit rate ~50–55%. |
| miss | 100% unique prompts (UUID per request) | Forces every request to reach OpenAI. Isolates pure proxy overhead with no cache benefit. |
For each RPS level, the identical workload was run against direct OpenAI (api.openai.com) immediately after the Smartflow run, using the same API key, model, and prompt set. This apples-to-apples comparison eliminates model response time variability between test runs.
x-smartflow-cache-hit: true header)All tests were run in a single session to minimise OpenAI API load variation. The test script and full raw output are available in the Smartflow source repository at tests/rps_comparison_test.py. Anyone can reproduce these results against their own Smartflow deployment using the instructions in Section 9.
The mixed workload (50% repeated prompts) reflects the most common enterprise use pattern — employees across a company asking similar questions about HR policy, internal documentation, product features, or code standards. Cache hit rates of 50–55% were sustained throughout all RPS levels.
| RPS | SF p50 | SF p95 | SF p99 | OAI p50 | OAI p95 | OAI p99 | Cache Hits | SF / OAI Ratio |
|---|---|---|---|---|---|---|---|---|
| 5 | 642ms | 1,464ms | 1,646ms | 614ms | 1,057ms | 1,753ms | 50% | 1.05× |
| 10 | 174ms | 1,145ms | 1,248ms | 485ms | 891ms | 1,150ms | 52% | 0.36× |
| 20 | 153ms | 881ms | 1,098ms | 463ms | 736ms | 1,165ms | 52% | 0.33× |
| 40 | 152ms | 920ms | 1,114ms | 483ms | 858ms | 1,197ms | 54% | 0.31× |
Duration: 20s per RPS level · Model: gpt-4o-mini · max_tokens: 10 · 100% success rate all levels
At 10+ RPS with a warm cache, ~52% of requests are served by Smartflow's in-memory Redis cache in ~150ms — compared to OpenAI's average round-trip of ~480ms. The p50 reflects this: 52% of requests completing in 150ms pulls the median well below the OpenAI floor. This effect strengthens with higher RPS as the cache warms further and hit rate climbs.
At 5 RPS, Smartflow and direct OpenAI are essentially equivalent (642ms vs 614ms, p50). This is expected — at this rate the cache is cold and only the first occurrence of the repeated anchor prompt is cached. Once warm, subsequent runs at 5 RPS would show the same acceleration seen at higher RPS. In production, the cache is pre-warmed from prior traffic so cold-start effects are rare.
The cache-miss test sends 100% unique prompts, guaranteeing every request traverses the full proxy pipeline and reaches OpenAI. This isolates the pure overhead that Smartflow adds to each request: ingress TLS termination, policy evaluation, guardrail check, Redis miss, upstream forwarding, and response streaming back through the proxy.
| RPS | SF p50 | SF p95 | OAI p50 | OAI p95 | Overhead (p50) | Success |
|---|---|---|---|---|---|---|
| 5 | 650ms | 943ms | 509ms | 914ms | +141ms | 100% |
| 10 | 707ms | 1,072ms | 510ms | 970ms | +198ms | 100% |
| 20 | 714ms | 1,034ms | 534ms | 883ms | +180ms | 100% |
| 40 | 716ms | 1,082ms | 524ms | 836ms | +192ms | 100% |
| 60 | 732ms | 1,029ms | 538ms | 898ms | +194ms | 100% |
| 80 ⚠ | 8,762ms | 11,870ms | 510ms | 796ms | +8,252ms | 99.95% |
Duration: 25s per RPS level · 100% unique prompts (no cache benefit) · ⚠ 80 RPS exceeds 2-pod all-miss capacity — see analysis below
The most important finding in the cache-miss data is the extraordinary p50 stability from 5 to 60 RPS. Overhead ranges from 141ms to 194ms — only a 37% variation across a 12× increase in request rate. This is the hallmark of a well-architected async system: no queuing buildup, no worker thread starvation, no lock contention degrading throughput.
Smartflow's p50 overhead increases by only 53ms (141ms → 194ms) as load goes from 5 RPS to 60 RPS — a 12× increase in throughput. This linear characteristic means capacity planning is straightforward: each additional proxy pod extends the sustainable RPS ceiling by roughly 30 RPS.
At 80 RPS with 100% unique prompts, p50 jumps from ~730ms to 8,762ms — a clear saturation signal. With 2 proxy pods each handling 40 simultaneous all-miss requests that all wait on OpenAI round-trips, the internal work queue backs up. This is the worst-case scenario for the platform: maximum load, zero cache benefit.
A 100%-unique-prompt workload at 80 RPS is not representative of any real enterprise deployment. In production, even a modest cache hit rate of 30–40% would reduce the effective all-miss load from 80 RPS to 48–56 RPS — well within the 2-pod sustainable range. The saturation point scales directly with pod count; adding a third proxy pod would extend the all-miss limit to ~90 RPS.
Every request through Smartflow traverses the following pipeline, contributing to the ~190ms cache-miss overhead:
The ~190ms overhead compared to direct OpenAI is not proxy processing time — it is dominated by the additional network hop through the Kubernetes ingress and the connection overhead of TLS renegotiation between the proxy and OpenAI (the proxy maintains a persistent connection pool, but connection establishment for new pool members adds latency on the first request). In production deployments where the Smartflow cluster is co-located with the application (same VPC or data centre), this overhead is typically 40–80ms.
The ~190ms cache-miss overhead measured in this test reflects a full inspection configuration: semantic guardrail evaluation, BERT embedding lookup, compliance pre-check running in parallel, VAS audit logging, and policy enforcement. For workloads where sub-100ms proxy overhead is required, Smartflow supports a reduced inspection profile that disables or defers non-critical pipeline stages:
In a co-located deployment with a reduced inspection profile, real-world Smartflow overhead for cache-miss requests is typically 40–80ms — comparable to the overhead of any TLS-terminating reverse proxy. Cache hits, at ~150ms, are already faster than any OpenAI round-trip.
Beyond raw latency, Smartflow's semantic cache delivers a critical operational benefit for high-volume deployments: it directly reduces the number of tokens sent to the LLM provider. Every cache hit is a request that never reaches OpenAI — no tokens consumed, no quota used, no position in OpenAI's request queue.
This matters significantly in environments approaching OpenAI's TPM (tokens per minute) or RPM (requests per minute) rate limits. At 50% cache hit rate, an application consuming 2M TPM effectively operates at 1M TPM from OpenAI's perspective — keeping it below tier thresholds, avoiding 429 rate limit responses, and eliminating the need to implement exponential backoff retry logic in application code.
Semantic caching is the most effective tool available to prevent LLM API rate limit errors at scale. It is not just a performance feature — it is a reliability feature that keeps your application functional during traffic spikes without requiring tier upgrades.
Smartflow's proxy layer is fully stateless. Each pod independently handles requests and shares cache state through Redis. There is no inter-pod coordination, no leader election, no shared memory — adding a pod increases capacity linearly with zero architectural changes. The Kubernetes HPA can scale proxy pods from 2 to 64 based on CPU or request rate metrics automatically.
There is no technical ceiling on request rate. At 250–500 RPS, p50 latency on the proxy layer remains identical to what was measured at 40 RPS — the additional pods simply distribute the load. The only components that require separate scaling consideration beyond ~8 pods are Redis (use Redis Cluster or a managed Redis service) and the ingress load balancer (DigitalOcean, AWS ALB, or nginx-ingress all handle 500+ RPS without configuration changes).
| Proxy Pods | All-Miss RPS Capacity | Mixed (50% cache) RPS | Est. p50 (cache miss) | Redis Mode |
|---|---|---|---|---|
| 2 (tested) | ~60 RPS | ~120 RPS | ~730ms | Single |
| 4 | ~120 RPS | ~240 RPS | ~730ms | Single |
| 8 | ~240 RPS | ~480 RPS | ~730ms | Cluster recommended |
| 16 | ~480 RPS | ~960 RPS | ~730ms | Cluster |
| 32 | ~960 RPS | ~1,900 RPS | ~730ms | Cluster |
| HPA auto (2–64) | up to ~1,900 RPS | up to ~3,800 RPS | ~730ms constant | Cluster |
p50 latency is constant across all pod counts — adding pods distributes load, it does not change per-request overhead. Redis Cluster mode recommended beyond 4 proxy pods to prevent the cache layer from becoming a bottleneck.
This is the key architectural advantage of Smartflow's stateless Rust proxy: latency does not increase as you scale. Whether you are running 2 pods at 40 RPS or 32 pods at 800 RPS, the p50 latency a user experiences for a cache-miss request is the same ~730ms (dominated by OpenAI's response time). The proxy overhead itself remains flat at ~190ms regardless of cluster size or request rate. Kubernetes HPA handles the scale-out automatically — no manual intervention required.
A common alternative architecture uses a Python-based AI gateway (LiteLLM, custom FastAPI proxy, etc.) deployed on Kubernetes or Docker Compose. While Python async (asyncio/aiohttp) enables concurrency, it has fundamental limitations that prevent it from achieving Smartflow's per-pod efficiency:
--scale can run multiple containers but provides no health-based autoscaling, no rolling updates, no cross-node distribution, and no load balancing beyond a basic round-robin. For production HA, Kubernetes (or Docker Swarm) is required regardless of language.To match Smartflow's 2-pod performance at 60 RPS all-miss, a Python-based proxy requires 5–8 pods. At 240 RPS (Smartflow: 8 pods), a Python equivalent requires 20–30 pods. The infrastructure cost difference at scale is 2.5–4× — before accounting for the higher memory footprint per pod. For latency-sensitive or cost-sensitive deployments, the runtime language choice has material operational consequences.
With realistic enterprise workloads (50% repeated prompts), Smartflow delivers p50 latency 3.2× lower than direct OpenAI calls at 40 RPS. This improves as RPS increases and cache warms further.
On cache misses, Smartflow adds a consistent ~190ms to each request across all sustainable load levels (5–60 RPS). This overhead is flat — no degradation as load increases.
Zero failures across all RPS levels from 5 to 60 RPS in both test modes. 3,400+ successful requests in the cache-miss run alone with no errors or timeouts.
p50 varies by only 53ms across a 12× increase in request rate (5→60 RPS). Adding proxy pods extends capacity linearly — no architectural bottlenecks at the proxy layer.
The worst-case saturation point on a minimal 2-node cluster is 60 RPS of fully unique requests. In real workloads with any cache hit rate, effective capacity is proportionally higher.
Compliance checks run in parallel with the OpenAI upstream call. On pass (the vast majority of requests), compliance adds no latency to the response path.
Single proxy pod is sufficient. Pre-warm the cache with representative queries at startup to avoid cold-start effects. At <10 RPS, cache hit rate is the primary latency driver — optimise for hit rate over concurrency.
2 proxy pods (the default HPA minimum) handle this range comfortably. Enable Smartflow's HPA (minReplicas: 2, maxReplicas: 8) to allow automatic scale-out under burst. Expect p50 latency equal to or better than direct OpenAI within minutes of cache warm-up.
Configure HPA to allow 4–8 proxy pods. Redis remains a single point of scaling; consider Redis Cluster mode (or a managed Redis cluster) beyond 4 proxy pods. All proxy pods share the cache coherently — a cache hit on any pod is available to all.
With 60%+ cache hit rates (achievable in domain-specific deployments such as HR Q&A, code review, document analysis), effective throughput per pod doubles. A 2-pod deployment can sustain 120+ effective RPS when the majority of requests are served from cache.
All test tooling is open and available in the Smartflow repository. To run the full comparison against your own deployment:
bash# Install dependencies
pip install aiohttp
# Run mixed workload (5, 10, 20, 40 RPS)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
--rps 5 10 20 40 \
--duration 30 \
--mode mixed
# Run cache-miss isolation (proxy overhead only)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
--rps 5 10 20 40 60 \
--duration 30 \
--mode miss
# Smartflow-only (no direct OpenAI comparison)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
--rps 5 10 20 40 60 80 \
--duration 60 \
--mode mixed \
--smartflow-only
aiohttp installedapi.openai.comFor apples-to-apples comparison, run the Smartflow and direct OpenAI legs back-to-back in a single session. OpenAI's response times vary hour-to-hour; running both legs minutes apart minimises model-side variance as a confounding factor.
About LangSmart — LangSmart builds Smartflow Enterprise, a Rust-based AI gateway for regulated organisations. Smartflow provides enterprise identity integration, a four-phase semantic cache, real-time policy enforcement, compliance tooling, and a management dashboard — all in a single binary deployable on Kubernetes or Docker Compose.
For questions about this whitepaper, enterprise licensing, or deployment support, contact support@langsmart.ai.