June 15, 2026 (Inside AI) - When two AI agents share a single GPU through Kubernetes time-slicing, the cluster dashboard glows green, both pods report “Running,” and throughput barely dips. But beneath that placid surface, the latency-sensitive agent’s p99 response time quietly balloons by 66%, a hidden tax that pod status and scheduler metrics never surface. That is the core finding of a new end-to-end measurement framework that pairs CUDA event timing with DCGM hardware counters to expose the true cost of GPU contention in agentic AI pipelines.
The Illusion of Healthy Pods
In a production Kubernetes cluster, the NVIDIA device plugin’s time-slicing feature can carve one physical GPU into multiple logical slices, allowing several pods to each request nvidia.com/gpu: 1 and land on the same silicon. The scheduler sees four allocatable slots and happily places workloads side by side. But unlike Multi-Instance GPU (MIG) partitioning, time-slicing offers no memory or compute isolation—it merely context-switches kernels on the same streaming multiprocessors. “Kubernetes reports ‘two healthy pods.’ The silicon reports a memory-bus street fight,” the study’s author notes. The damage hides in the latency tail, invisible to pod status checks and average-throughput dashboards.
Two Agents, Opposite Personalities
The experiment pits two containerized PyTorch workers against each other on a single seven-year-old NVIDIA GTX 1080 (8 GB, Pascal) running K3s with the stock device plugin and CUDA time-slicing. The first is a small, twitchy, latency-sensitive agent that continuously runs 4096×4096 complex FFTs—a stand-in for routers, guardrails, or tool-callers that must respond instantly. The second is a heavy, steady, compute-hungry agent that streams large square matrix multiplies, mimicking the GEMM core of a transformer forward pass. Both are configured via environment variables and measured with CUDA events and cudaDeviceSynchronize() to capture true GPU execution time, not host-side launch overhead.
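The timing approach described above can be sketched in PyTorch as follows. This is a minimal illustration, not the study's worker code: the names time_op_ms and make_fft_op are hypothetical, the FFT is assumed to be a 2-D transform over a complex64 tensor, and torch.cuda.synchronize() stands in for the cudaDeviceSynchronize() call the article mentions. When no CUDA device is present, the sketch falls back to a host timer so it can still be exercised, but those numbers are not GPU execution times.

```python
import time

try:
    import torch  # optional, so the sketch still imports where PyTorch is absent
except ImportError:
    torch = None


def time_op_ms(op, iters=100, warmup=10):
    """Return per-iteration latencies in milliseconds for a zero-argument callable.

    Hypothetical helper. Uses CUDA events when a GPU is available,
    otherwise falls back to host-side wall-clock timing.
    """
    use_cuda = torch is not None and torch.cuda.is_available()
    for _ in range(warmup):
        op()
    if use_cuda:
        torch.cuda.synchronize()  # drain warm-up work before timing starts
    samples = []
    for _ in range(iters):
        if use_cuda:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            op()
            end.record()
            torch.cuda.synchronize()  # wait until both events have completed
            samples.append(start.elapsed_time(end))  # GPU-side elapsed time, ms
        else:
            t0 = time.perf_counter()
            op()
            samples.append((time.perf_counter() - t0) * 1000.0)
    return samples


def make_fft_op(n=4096):
    """Build an n x n complex FFT workload (assumed 2-D; needs PyTorch and CUDA)."""
    x = torch.randn(n, n, dtype=torch.complex64, device="cuda")
    return lambda: torch.fft.fft2(x)
```

On a GPU node, time_op_ms(make_fft_op(), iters=1000) would return one latency sample per FFT. Recording events around each call and synchronizing before reading them is what separates GPU execution time from host-side launch overhead.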
The Numbers That Dashboards Hide
When run solo, the FFT worker’s p99 latency sits at 3.68 ms and the GEMM worker’s at 25.67 ms. Under time-sliced sharing, the medians barely flinch—FFT p50 remains 3.59 ms, GEMM p50 19.98 ms. But the tail tells a different story: the FFT’s p99 jumps to 6.10 ms, a 1.66× degradation, while the GEMM’s p99 rises only modestly to 30.30 ms (1.18×). Jitter (p99/p50) for the small agent surges from 1.02 to 1.70. Throughput drops a mere 7.3% for the FFT and 1.4% for the GEMM—numbers that would earn a thumbs-up in a status report. “Averages are where contention goes to hide,” the study warns. “The mean is a kind soul who rounds your worst moments away. The p99 is the friend who remembers everything.”
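To see why an average can look healthy while the tail does not, consider the sketch below. It uses synthetic, purely illustrative samples (not the study's measurements) and a nearest-rank percentile, which is one common definition and may differ from the one the framework uses. The helper names percentile and summarize are hypothetical.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Mean, p50/p95/p99, and jitter (p99 / p50) for a list of latencies in ms."""
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    return {
        "mean": statistics.fmean(samples),
        "p50": p50,
        "p95": percentile(samples, 95),
        "p99": p99,
        "jitter": p99 / p50,
    }


# Synthetic data for illustration only: 3 of 100 requests hit a slow slice.
quiet = [3.6] * 100
contended = [3.6] * 97 + [6.1] * 3

for name, data in (("quiet", quiet), ("contended", contended)):
    s = summarize(data)
    print(f"{name}: mean={s['mean']:.3f} p50={s['p50']} p99={s['p99']} jitter={s['jitter']:.2f}")
```

In this toy data the mean moves from 3.600 ms to 3.675 ms, about 2%, while the p99 moves from 3.6 ms to 6.1 ms and the jitter ratio from 1.00 to roughly 1.69. A dashboard that plots only the mean would show almost nothing.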
Why the Small Agent Suffers Most
The asymmetry is not accidental. The heavyweight GEMM worker grabs its time quantum and bulldozes through, while the lightweight FFT worker keeps getting interrupted mid-kernel, evicted from the SMs, and forced to wait for its next slice. Time-slicing hands out turns but respects no deadlines. For agent swarms—where routers, classifiers, and safety checkers must respond quickly to keep the pipeline moving—this means the most deadline-sensitive components degrade first and worst. The framework computes a Degradation Factor (DF) as shared p99 divided by solo p99, yielding a crisp metric to surface on a slide: 1.66 for the FFT, 1.18 for the GEMM.
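The Degradation Factor is simple enough to check by hand. The sketch below applies the definition given above (shared p99 divided by solo p99) to the p99 figures reported in the article. The function name degradation_factor is a hypothetical stand-in, not necessarily what the framework calls it.

```python
def degradation_factor(solo_p99_ms, shared_p99_ms):
    """DF = shared p99 / solo p99; 1.0 means sharing added no tail latency."""
    if solo_p99_ms <= 0:
        raise ValueError("solo p99 must be positive")
    return shared_p99_ms / solo_p99_ms


# (solo p99, shared p99) in milliseconds, as reported in the article.
reported = {"fft": (3.68, 6.10), "gemm": (25.67, 30.30)}

factors = {
    name: round(degradation_factor(solo, shared), 2)
    for name, (solo, shared) in reported.items()
}
print(factors)  # {'fft': 1.66, 'gemm': 1.18}
```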
Hardware Counters Confirm the Contention
To rule out measurement artifacts, the profiler also scrapes DCGM GPU-utilization counters every 100 ms. During shared execution, the FFT worker’s SM and DRAM activity spike as its cycles overlap with the GEMM hammering the same memory subsystem. The contention appears at two independent layers—application latency and silicon-level counters—leaving no doubt that the tail degradation is real.
From a $150 GPU to Production Swarms
The author deliberately chose an old, budget GPU to make the point that bad scheduling doesn’t vanish on an H100; it just executes bottlenecks faster. “Throwing $30,000 at an orchestration flaw doesn’t fix the contention; it just lets you execute bad architecture in fewer milliseconds.” The findings echo challenges in 5G/6G RAN engineering, where operators co-locate latency-critical baseband processing with heavy LLM inference on edge servers. There, a latency-sensitive workload that misses its deadlines can drop calls or, in extreme cases, disrupt emergency services.
Running the Experiment Yourself
The entire pipeline is open-source and designed to be boringly reproducible. On a node with K3s, NVIDIA drivers, the container toolkit, and the device plugin, three commands build, deploy, and collect results. The repo includes manifests for solo baselines and the shared scenario, plus a log parser that aggregates raw CUDA event timestamps into p50/p95/p99 histograms.
Caveats and Next Steps
The study is a deliberate micro-benchmark, not a datacenter capacity model. It uses only two agents, synthetic kernels, a single GPU generation, and no MPS or MIG. But the core insight holds: time-slicing buys capacity at the cost of isolation, and in this experiment the latency-critical agent pays most of the price. The solution isn’t to ban sharing but to measure the tail, attribute degradation, and schedule with silicon limits in mind. The tool, gpudeg, turns the vague feeling of “the GPU seems slow” into a quantifiable Degradation Factor. Part 3 of the series will tackle the PCIe commute in RAG pipelines by building a custom CUDA Top-K kernel to keep retrieval entirely on the GPU.