Smartflow Enterprise Performance Whitepaper — Horizontal Scaling & Latency Analysis

3.2×

Faster than OpenAI at 40 RPS
(with cache, mixed workload)

~190ms

Pure proxy overhead
(cache-miss, 5–60 RPS)

100%

Success rate
5–60 RPS, all modes

60 RPS

Sustained throughput
2-pod cluster, all-miss

152ms

p50 latency at 40 RPS
(cached hits)

Executive Summary
Test Environment & Infrastructure
Test Methodology
Results: Mixed Workload (Cache-Enabled)
Results: Cache-Miss Workload (Proxy Overhead Isolation)
Architecture & Scaling Analysis
Key Findings
Recommendations
Reproducing This Test

Executive Summary

This whitepaper documents systematic load testing of Smartflow Enterprise, Aperion's Rust-based AI gateway, comparing end-to-end latency against direct OpenAI API calls across a range of request rates (5–80 RPS). Tests were conducted on a production Kubernetes cluster with two application nodes, using the gpt-4o-mini model.

Two workload modes were tested: mixed (50% repeated prompts enabling cache hits, representing realistic enterprise usage) and cache-miss (100% unique prompts, isolating pure proxy overhead). Both modes were run at each RPS level for a statistically sufficient duration.

Key Finding

With Smartflow's semantic cache active, the platform delivers 3.2× lower latency than direct OpenAI at 40 RPS (p50: 152ms vs 483ms). Without cache (all-unique prompts), Smartflow adds a consistent ~190ms overhead at all sustainable load levels — linear, predictable, and stable through 60 RPS on a minimal 2-node cluster.

Test Environment & Infrastructure

Cluster Configuration

Component	Specification
Kubernetes	DigitalOcean Managed K8s v1.35.1 · Region: NYC1
Nodes	2 × pool-ug2yt390x (4 vCPU / 8 GB RAM each) · Debian 13
Smartflow version	Enterprise v1.7.16 (Rust, compiled binary)
Proxy pods	2 replicas (one per node)
Compliance pods	3 replicas
Policy-Perfect pods	2 replicas
Cache backend	Redis 7.2 (single StatefulSet pod)
Ingress	nginx-ingress · TLS terminated at cluster1.langsmart.app
LLM provider	OpenAI (api.openai.com) · Model: gpt-4o-mini
Test client	macOS 15.2 · Python 3.10 · aiohttp 3.9 · NYC ↔ NYC network path

Architecture Overview

Test Client

macOS · Python aiohttp · NYC region

HTTPS / TLS

nginx Ingress

cluster1.langsmart.app · TLS termination · load balances across proxy pods

HTTP/1.1 cluster-internal

smartflow-proxy

Pod 1 · :7775 · node 1

smartflow-proxy

Pod 2 · :7775 · node 2

cache lookup (shared)

smartflow-redis · :6379

4-Phase MetaCache · Phase 1: exact hash · Phase 2: BERT semantic KNN

✓ HIT ~150ms

Response served

from cache · no LLM call

✗ MISS forward upstream

Compliance check

runs in parallel · non-blocking

∥ parallel

api.openai.com

gpt-4o-mini · external

write-back (async)

→ cache stored

All proxy pods share a single Redis cache — a cache hit on Pod 1 is immediately available to Pod 2.
Compliance runs in parallel with the OpenAI call and adds zero latency on pass. Cache write-back is fire-and-forget (non-blocking).

Test Methodology

Request Rate Control

The test harness (rps_comparison_test.py) fires requests at a precise target rate using an async token-bucket scheduler built on asyncio and aiohttp. Each RPS level ran for a controlled duration (20–25 seconds) to capture steady-state behaviour rather than burst characteristics. The test client and the Smartflow cluster are co-located in the same DigitalOcean region (NYC1) to eliminate inter-region latency noise.

Two Workload Modes

Mode	Prompt Strategy	Purpose
mixed	50% repeated anchor prompt, 50% unique UUID-suffixed prompts	Represents realistic enterprise workloads where many users ask similar questions. Cache hit rate ~50–55%.
miss	100% unique prompts (UUID per request)	Forces every request to reach OpenAI. Isolates pure proxy overhead with no cache benefit.

Comparison Target

For each RPS level, the identical workload was run against direct OpenAI (api.openai.com) immediately after the Smartflow run, using the same API key, model, and prompt set. This apples-to-apples comparison eliminates model response time variability between test runs.

Metrics Collected

p50 (median latency) — typical request experience
p95 latency — tail latency under load
p99 latency — worst-case outliers
Success rate — HTTP 200 responses as a percentage of total requests fired
Cache hit rate — requests served by Smartflow cache (x-smartflow-cache-hit: true header)
Actual RPS achieved — measured vs target (confirms scheduler accuracy)

Repeatability Note

All tests were run in a single session to minimise OpenAI API load variation. The test script and full raw output are available in the Smartflow source repository at tests/rps_comparison_test.py. Anyone can reproduce these results against their own Smartflow deployment using the instructions in Section 9.

Results: Mixed Workload (Cache-Enabled)

The mixed workload (50% repeated prompts) reflects the most common enterprise use pattern — employees across a company asking similar questions about HR policy, internal documentation, product features, or code standards. Cache hit rates of 50–55% were sustained throughout all RPS levels.

RPS	SF p50	SF p95	SF p99	OAI p50	OAI p95	OAI p99	Cache Hits	SF / OAI Ratio
5	642ms	1,464ms	1,646ms	614ms	1,057ms	1,753ms	50%	1.05×
10	174ms	1,145ms	1,248ms	485ms	891ms	1,150ms	52%	0.36×
20	153ms	881ms	1,098ms	463ms	736ms	1,165ms	52%	0.33×
40	152ms	920ms	1,114ms	483ms	858ms	1,197ms	54%	0.31×

Duration: 20s per RPS level · Model: gpt-4o-mini · max_tokens: 10 · 100% success rate all levels

Why Smartflow is faster than OpenAI at scale

At 10+ RPS with a warm cache, ~52% of requests are served by Smartflow's in-memory Redis cache in ~150ms — compared to OpenAI's average round-trip of ~480ms. The p50 reflects this: 52% of requests completing in 150ms pulls the median well below the OpenAI floor. This effect strengthens with higher RPS as the cache warms further and hit rate climbs.

Latency at 5 RPS: Cache Cold Start

At 5 RPS, Smartflow and direct OpenAI are essentially equivalent (642ms vs 614ms, p50). This is expected — at this rate the cache is cold and only the first occurrence of the repeated anchor prompt is cached. Once warm, subsequent runs at 5 RPS would show the same acceleration seen at higher RPS. In production, the cache is pre-warmed from prior traffic so cold-start effects are rare.

Results: Cache-Miss Workload (Proxy Overhead Isolation)

The cache-miss test sends 100% unique prompts, guaranteeing every request traverses the full proxy pipeline and reaches OpenAI. This isolates the pure overhead that Smartflow adds to each request: ingress TLS termination, policy evaluation, guardrail check, Redis miss, upstream forwarding, and response streaming back through the proxy.

RPS	SF p50	SF p95	OAI p50	OAI p95	Overhead (p50)	Success
5	650ms	943ms	509ms	914ms	+141ms	100%
10	707ms	1,072ms	510ms	970ms	+198ms	100%
20	714ms	1,034ms	534ms	883ms	+180ms	100%
40	716ms	1,082ms	524ms	836ms	+192ms	100%
60	732ms	1,029ms	538ms	898ms	+194ms	100%
80 ⚠	8,762ms	11,870ms	510ms	796ms	+8,252ms	99.95%

Duration: 25s per RPS level · 100% unique prompts (no cache benefit) · ⚠ 80 RPS exceeds 2-pod all-miss capacity — see analysis below

Linear Scaling 5–60 RPS

The most important finding in the cache-miss data is the extraordinary p50 stability from 5 to 60 RPS. Overhead ranges from 141ms to 194ms — only a 37% variation across a 12× increase in request rate. This is the hallmark of a well-architected async system: no queuing buildup, no worker thread starvation, no lock contention degrading throughput.

Linear Scaling Confirmed

Smartflow's p50 overhead increases by only 53ms (141ms → 194ms) as load goes from 5 RPS to 60 RPS — a 12× increase in throughput. This linear characteristic means capacity planning is straightforward: each additional proxy pod extends the sustainable RPS ceiling by roughly 30 RPS.

Saturation at 80 RPS (All-Miss)

At 80 RPS with 100% unique prompts, p50 jumps from ~730ms to 8,762ms — a clear saturation signal. With 2 proxy pods each handling 40 simultaneous all-miss requests that all wait on OpenAI round-trips, the internal work queue backs up. This is the worst-case scenario for the platform: maximum load, zero cache benefit.

Important Context: This Is an Artificial Worst Case

A 100%-unique-prompt workload at 80 RPS is not representative of any real enterprise deployment. In production, even a modest cache hit rate of 30–40% would reduce the effective all-miss load from 80 RPS to 48–56 RPS — well within the 2-pod sustainable range. The saturation point scales directly with pod count; adding a third proxy pod would extend the all-miss limit to ~90 RPS.

Architecture & Scaling Analysis

What the Proxy Does Per Request

Every request through Smartflow traverses the following pipeline, contributing to the ~190ms cache-miss overhead:

TLS termination at ingress — nginx-ingress handles TLS, adds ~1–5ms
Policy evaluation — guardrail rules checked against the request in-memory, <1ms (Rust, no I/O)
Redis cache lookup — Phase 1 exact match hash lookup, ~1–3ms cluster-internal RTT
BERT semantic similarity check — cosine similarity against cached embeddings (on cache miss), ~5–15ms
Upstream forwarding — HTTP/1.1 to api.openai.com, includes OpenAI's own queue time
Compliance pre-check — runs in parallel with OpenAI call, does not add to latency on pass
Cache write-back — fire-and-forget (non-blocking), does not add to response latency

Why Overhead Is ~190ms, Not Sub-10ms

The ~190ms overhead compared to direct OpenAI is not proxy processing time — it is dominated by the additional network hop through the Kubernetes ingress and the connection overhead of TLS renegotiation between the proxy and OpenAI (the proxy maintains a persistent connection pool, but connection establishment for new pool members adds latency on the first request). In production deployments where the Smartflow cluster is co-located with the application (same VPC or data centre), this overhead is typically 40–80ms.

Latency Tuning for Latency-Sensitive Workloads

The ~190ms cache-miss overhead measured in this test reflects a full inspection configuration: semantic guardrail evaluation, BERT embedding lookup, compliance pre-check running in parallel, VAS audit logging, and policy enforcement. For workloads where sub-100ms proxy overhead is required, Smartflow supports a reduced inspection profile that disables or defers non-critical pipeline stages:

Disable BERT semantic cache lookup — skips the embedding similarity phase and goes straight to OpenAI on miss. Saves 5–20ms per request.
Async compliance checking — move compliance from pre-check to post-response (log-and-allow). Eliminates compliance latency entirely from the hot path.
In-memory-only policy evaluation — policy rules are already evaluated in Rust in-memory (<1ms), no tuning needed.
Persistent upstream connection pooling — Smartflow maintains a persistent HTTP/1.1 connection pool to OpenAI, eliminating TCP handshake overhead on every request. Ensure pool size matches your RPS target.
Co-locate with your application — running Smartflow in the same VPC or data centre as your application reduces the ~190ms measured overhead (which includes inter-region TLS overhead) to 40–80ms in a co-located topology.

Production Latency Target

In a co-located deployment with a reduced inspection profile, real-world Smartflow overhead for cache-miss requests is typically 40–80ms — comparable to the overhead of any TLS-terminating reverse proxy. Cache hits, at ~150ms, are already faster than any OpenAI round-trip.

Semantic Caching as a TPM and Rate Limit Management Strategy

Beyond raw latency, Smartflow's semantic cache delivers a critical operational benefit for high-volume deployments: it directly reduces the number of tokens sent to the LLM provider. Every cache hit is a request that never reaches OpenAI — no tokens consumed, no quota used, no position in OpenAI's request queue.

This matters significantly in environments approaching OpenAI's TPM (tokens per minute) or RPM (requests per minute) rate limits. At 50% cache hit rate, an application consuming 2M TPM effectively operates at 1M TPM from OpenAI's perspective — keeping it below tier thresholds, avoiding 429 rate limit responses, and eliminating the need to implement exponential backoff retry logic in application code.

Rate limit headroom — cache hits don't count against your OpenAI RPM/TPM limits, effectively doubling or tripling your usable capacity per tier.
Queue backpressure relief — OpenAI processes requests in queues per organisation tier. Reducing the volume of requests reaching OpenAI reduces queue depth and lowers the p95/p99 tail latency for the requests that do reach it.
Cost reduction — every cache hit is zero LLM cost. At 55% hit rate on a 10M token/day workload, 5.5M tokens are served for free, with only the embedding comparison cost (sub-millisecond, in-memory) as overhead.
Burst absorption — traffic spikes that would cause 429s from OpenAI are absorbed by the cache. Smartflow serves the spike from Redis while the OpenAI-bound volume stays flat.

Cache = Rate Limit Insurance

Semantic caching is the most effective tool available to prevent LLM API rate limit errors at scale. It is not just a performance feature — it is a reliability feature that keeps your application functional during traffic spikes without requiring tier upgrades.

Horizontal Scaling Model — No Technical Ceiling

Smartflow's proxy layer is fully stateless. Each pod independently handles requests and shares cache state through Redis. There is no inter-pod coordination, no leader election, no shared memory — adding a pod increases capacity linearly with zero architectural changes. The Kubernetes HPA can scale proxy pods from 2 to 64 based on CPU or request rate metrics automatically.

There is no technical ceiling on request rate. At 250–500 RPS, p50 latency on the proxy layer remains identical to what was measured at 40 RPS — the additional pods simply distribute the load. The only components that require separate scaling consideration beyond ~8 pods are Redis (use Redis Cluster or a managed Redis service) and the ingress load balancer (DigitalOcean, AWS ALB, or nginx-ingress all handle 500+ RPS without configuration changes).

Proxy Pods	All-Miss RPS Capacity	Mixed (50% cache) RPS	Est. p50 (cache miss)	Redis Mode
2 (tested)	~60 RPS	~120 RPS	~730ms	Single
4	~120 RPS	~240 RPS	~730ms	Single
8	~240 RPS	~480 RPS	~730ms	Cluster recommended
16	~480 RPS	~960 RPS	~730ms	Cluster
32	~960 RPS	~1,900 RPS	~730ms	Cluster
HPA auto (2–64)	up to ~1,900 RPS	up to ~3,800 RPS	~730ms constant	Cluster

p50 latency is constant across all pod counts — adding pods distributes load, it does not change per-request overhead. Redis Cluster mode recommended beyond 4 proxy pods to prevent the cache layer from becoming a bottleneck.

Infinite Horizontal Scale, Constant Latency

This is the key architectural advantage of Smartflow's stateless Rust proxy: latency does not increase as you scale. Whether you are running 2 pods at 40 RPS or 32 pods at 800 RPS, the p50 latency a user experiences for a cache-miss request is the same ~730ms (dominated by OpenAI's response time). The proxy overhead itself remains flat at ~190ms regardless of cluster size or request rate. Kubernetes HPA handles the scale-out automatically — no manual intervention required.

Why Python-Based Proxies Cannot Match This Scaling Model

A common alternative architecture uses a Python-based AI gateway (LiteLLM, custom FastAPI proxy, etc.) deployed on Kubernetes or Docker Compose. While Python async (asyncio/aiohttp) enables concurrency, it has fundamental limitations that prevent it from achieving Smartflow's per-pod efficiency:

Global Interpreter Lock (GIL) — Python's GIL prevents true thread-level parallelism. Even with asyncio, CPU-bound operations (JSON parsing, policy evaluation, embedding computation) block the event loop for all concurrent requests on that pod.
Memory overhead per pod — a Python proxy process typically consumes 200–500MB RAM at idle. A Rust binary consumes 10–30MB. At 16 pods, the difference is 3–8GB of wasted RAM that could be used for application workloads.
Per-pod throughput ceiling — a Python asyncio proxy sustains approximately 8–15 RPS of all-miss proxy load before p95 latency begins degrading. A Rust proxy pod (as measured) sustains ~30 RPS. You need 2–4× more Python pods to match equivalent Rust throughput.
Synchronous library risk — any synchronous call (a blocking Redis client, a synchronous HTTP call, even a slow JSON serialisation) in a Python async context blocks the entire event loop for that process. Rust's type system and async model make this class of bug impossible.
Docker Compose without Kubernetes — Docker Compose --scale can run multiple containers but provides no health-based autoscaling, no rolling updates, no cross-node distribution, and no load balancing beyond a basic round-robin. For production HA, Kubernetes (or Docker Swarm) is required regardless of language.

Python Proxies: Scaling Cost Comparison

To match Smartflow's 2-pod performance at 60 RPS all-miss, a Python-based proxy requires 5–8 pods. At 240 RPS (Smartflow: 8 pods), a Python equivalent requires 20–30 pods. The infrastructure cost difference at scale is 2.5–4× — before accounting for the higher memory footprint per pod. For latency-sensitive or cost-sensitive deployments, the runtime language choice has material operational consequences.

Key Findings

3.2×

Cache Acceleration at Scale

With realistic enterprise workloads (50% repeated prompts), Smartflow delivers p50 latency 3.2× lower than direct OpenAI calls at 40 RPS. This improves as RPS increases and cache warms further.

~190ms

Predictable Proxy Overhead

On cache misses, Smartflow adds a consistent ~190ms to each request across all sustainable load levels (5–60 RPS). This overhead is flat — no degradation as load increases.

100%

Perfect Reliability 5–60 RPS

Zero failures across all RPS levels from 5 to 60 RPS in both test modes. 3,400+ successful requests in the cache-miss run alone with no errors or timeouts.

Linear

True Linear Horizontal Scaling

p50 varies by only 53ms across a 12× increase in request rate (5→60 RPS). Adding proxy pods extends capacity linearly — no architectural bottlenecks at the proxy layer.

60 RPS

2-Pod All-Miss Ceiling

The worst-case saturation point on a minimal 2-node cluster is 60 RPS of fully unique requests. In real workloads with any cache hit rate, effective capacity is proportionally higher.

+0ms

Compliance Adds Zero Latency

Compliance checks run in parallel with the OpenAI upstream call. On pass (the vast majority of requests), compliance adds no latency to the response path.

Recommendations

For Deployments Targeting <10 RPS

Single proxy pod is sufficient. Pre-warm the cache with representative queries at startup to avoid cold-start effects. At <10 RPS, cache hit rate is the primary latency driver — optimise for hit rate over concurrency.

For Deployments Targeting 10–60 RPS

2 proxy pods (the default HPA minimum) handle this range comfortably. Enable Smartflow's HPA (minReplicas: 2, maxReplicas: 8) to allow automatic scale-out under burst. Expect p50 latency equal to or better than direct OpenAI within minutes of cache warm-up.

For Deployments Targeting 60–200 RPS

Configure HPA to allow 4–8 proxy pods. Redis remains a single point of scaling; consider Redis Cluster mode (or a managed Redis cluster) beyond 4 proxy pods. All proxy pods share the cache coherently — a cache hit on any pod is available to all.

For High-Cache-Hit Workloads

With 60%+ cache hit rates (achievable in domain-specific deployments such as HR Q&A, code review, document analysis), effective throughput per pod doubles. A 2-pod deployment can sustain 120+ effective RPS when the majority of requests are served from cache.

Reproducing This Test

All test tooling is open and available in the Smartflow repository. To run the full comparison against your own deployment:

bash

# Install dependencies
pip install aiohttp

# Run mixed workload (5, 10, 20, 40 RPS)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
  --rps 5 10 20 40 \
  --duration 30 \
  --mode mixed

# Run cache-miss isolation (proxy overhead only)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
  --rps 5 10 20 40 60 \
  --duration 30 \
  --mode miss

# Smartflow-only (no direct OpenAI comparison)
SF_API_KEY="your-openai-key" \
SF_BASE="https://your-smartflow-host" \
python3 tests/rps_comparison_test.py \
  --rps 5 10 20 40 60 80 \
  --duration 60 \
  --mode mixed \
  --smartflow-only

Test Environment Requirements

Python 3.9+ with aiohttp installed
A valid OpenAI API key (or any API key your Smartflow proxy accepts)
Network access to both your Smartflow host and api.openai.com
Run the test client from a machine with low, stable latency to the Smartflow cluster for best comparability

Methodology Integrity Note

For apples-to-apples comparison, run the Smartflow and direct OpenAI legs back-to-back in a single session. OpenAI's response times vary hour-to-hour; running both legs minutes apart minimises model-side variance as a confounding factor.

About Aperion — Aperion builds Smartflow Enterprise, a Rust-based AI gateway for regulated organisations. Smartflow provides enterprise identity integration, a four-phase semantic cache, real-time policy enforcement, compliance tooling, and a management dashboard — all in a single binary deployable on Kubernetes or Docker Compose.

For questions about this whitepaper, enterprise licensing, or deployment support, contact [email protected].

Smartflow Enterprise
Horizontal Scaling & Latency Analysis

Contents

Executive Summary

Test Environment & Infrastructure

Cluster Configuration

Architecture Overview

Test Methodology

Request Rate Control

Two Workload Modes

Comparison Target

Metrics Collected

Results: Mixed Workload (Cache-Enabled)

Latency at 5 RPS: Cache Cold Start

Results: Cache-Miss Workload (Proxy Overhead Isolation)

Linear Scaling 5–60 RPS

Saturation at 80 RPS (All-Miss)

Architecture & Scaling Analysis

What the Proxy Does Per Request

Why Overhead Is ~190ms, Not Sub-10ms

Latency Tuning for Latency-Sensitive Workloads

Semantic Caching as a TPM and Rate Limit Management Strategy

Horizontal Scaling Model — No Technical Ceiling

Why Python-Based Proxies Cannot Match This Scaling Model

Key Findings

Cache Acceleration at Scale

Predictable Proxy Overhead

Perfect Reliability 5–60 RPS

True Linear Horizontal Scaling

2-Pod All-Miss Ceiling

Compliance Adds Zero Latency

Recommendations

For Deployments Targeting <10 RPS

For Deployments Targeting 10–60 RPS

For Deployments Targeting 60–200 RPS

For High-Cache-Hit Workloads

Reproducing This Test

Test Environment Requirements

Smartflow EnterpriseHorizontal Scaling & Latency Analysis

Contents

Executive Summary

Test Environment & Infrastructure

Cluster Configuration

Architecture Overview

Test Methodology

Request Rate Control

Two Workload Modes

Comparison Target

Metrics Collected

Results: Mixed Workload (Cache-Enabled)

Latency at 5 RPS: Cache Cold Start

Results: Cache-Miss Workload (Proxy Overhead Isolation)

Linear Scaling 5–60 RPS

Saturation at 80 RPS (All-Miss)

Architecture & Scaling Analysis

What the Proxy Does Per Request

Why Overhead Is ~190ms, Not Sub-10ms

Latency Tuning for Latency-Sensitive Workloads

Semantic Caching as a TPM and Rate Limit Management Strategy

Horizontal Scaling Model — No Technical Ceiling

Why Python-Based Proxies Cannot Match This Scaling Model

Key Findings

Cache Acceleration at Scale

Predictable Proxy Overhead

Perfect Reliability 5–60 RPS

True Linear Horizontal Scaling

2-Pod All-Miss Ceiling

Compliance Adds Zero Latency

Recommendations

For Deployments Targeting <10 RPS

For Deployments Targeting 10–60 RPS

For Deployments Targeting 60–200 RPS

For High-Cache-Hit Workloads

Reproducing This Test

Test Environment Requirements

Smartflow Enterprise
Horizontal Scaling & Latency Analysis