Today, we’re launching Kubex support for the KAI Scheduler and automated GPU sharing for inference workloads.
As AI inference moves into production, platform teams are being asked to serve more models, support more teams, and control GPU costs at the same time. But many inference workloads do not need an entire GPU all the time. When teams reserve full GPUs or oversized GPU fractions to stay safe, expensive capacity can sit idle across the cluster.
In one analysis of a Kubernetes inference environment with hundreds of GPUs, Kubex found that only about 20% of containers were using their full GPU capacity at any time of day, while the large majority were using less than half. That kind of utilization profile makes static GPU allocation difficult to justify at scale.
Static allocation often shows up as a one-GPU-per-container pattern: each workload gets a full GPU, or an oversized fraction of one, even when its real usage is much lower. Kubex helps teams move beyond that model by making shared GPU use policy-driven, observable, and responsive to changing demand.
With this launch, platform teams can automatically rightsize GPU fractions, monitor shared GPU usage, upsize and rebalance inference workloads when demand increases, and scale allocations back when usage drops. They can also improve GPU node utilization through KAI-aware bin packing and consolidation analysis.
The scheduling foundation for shared GPU inference
The KAI Scheduler gives Kubernetes teams a powerful foundation for GPU-aware scheduling. It supports shared GPU placement, hierarchical queues with quotas and limits, workload priority and preemption, gang scheduling for multi-pod AI workloads, topology-aware placement, and scheduling-time consolidation to reduce fragmentation when placing workloads.
For platform teams running inference on Kubernetes, KAI provides the scheduling layer needed to place and govern shared GPU workloads. Kubex complements that foundation by continuously observing real GPU usage, rightsizing GPU fractions, and identifying when workloads can be safely rebalanced or consolidated over time.
But once workloads are sharing GPUs, the operational challenge shifts. Teams still need to answer questions like:
How much of a GPU should each inference service request?
When should that request increase?
When is it safe to scale it back?
How do we avoid noisy-neighbor issues?
How do we improve packing across GPU nodes without creating instability?
Kubex is designed to automate that lifecycle.
Introducing Kubex automation for KAI GPU sharing
With this launch, Kubex adds automation for KAI Scheduler GPU sharing, helping platform teams continuously manage shared GPU allocations for inference workloads.
This is especially useful in self-hosted inference environments that rely on GPU models such as T4, L4, and A10, where hardware partitioning (such as NVIDIA MIG) is not available. In those cases, software-level sharing gives teams a practical way to use stranded capacity without redesigning workloads around different hardware.
Kubex can automatically rightsize KAI GPU fractions based on observed usage, monitor shared GPU environments, immediately upsize workloads when demand increases, scale allocations back when usage drops, and rebalance pods across nodes as conditions change.
The result is a more efficient inference environment where teams can make better use of existing GPU infrastructure while reducing the manual effort required to tune fractional GPU allocations.
This launch includes several key capabilities.
Automated rightsizing of KAI GPU fractions
Static GPU fractions are often just educated guesses.
Kubex helps solve this by automatically adjusting KAI GPU fractions based on real usage.
When a workload approaches its current allocation, Kubex can increase the requested GPU fraction to provide additional headroom. When usage remains lower over a sustained period, Kubex can scale the allocation back. This helps inference services receive the GPU capacity they need while reducing long-term over-allocation.
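The upsize and scale-back behavior described above can be sketched in a few lines of Python. This is an illustration only, not Kubex's actual algorithm: the `RightsizePolicy` class, the `recommend_fraction` function, and every threshold and default below are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RightsizePolicy:
    """Hypothetical policy knobs; names and defaults are illustrative only."""
    upsize_threshold: float = 0.85    # upsize once usage reaches this share of the allocation
    downsize_threshold: float = 0.50  # downsize only if every sample stays below this share
    headroom: float = 1.25            # multiplier applied to observed usage when resizing
    min_fraction: float = 0.10        # never request less than this share of a GPU
    max_fraction: float = 1.00        # never request more than one whole GPU


def recommend_fraction(
    current: float,
    usage: Sequence[float],
    policy: RightsizePolicy = RightsizePolicy(),
) -> float:
    """Return a recommended GPU fraction for one workload.

    `current` is the fraction requested today. `usage` holds recent samples,
    oldest first, each expressed as a share of a whole GPU (0.0 to 1.0).
    """
    if not usage:
        return current  # no data, no change

    latest = usage[-1]
    peak = max(usage)

    if latest >= policy.upsize_threshold * current:
        # The workload is approaching its allocation: react to the latest sample.
        target = latest * policy.headroom
    elif peak < policy.downsize_threshold * current:
        # Usage stayed low across the whole window: shrink toward the peak.
        target = peak * policy.headroom
    else:
        return current  # inside the comfortable band, leave it alone

    return min(policy.max_fraction, max(policy.min_fraction, target))
```

Upsizing keys off the most recent sample so that it can react quickly, while downsizing requires the whole window to stay low. That asymmetry is one simple way to express "increase when demand rises, scale back only after sustained low usage."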
Sharing-aware GPU observability
Standard GPU exporters such as the NVIDIA DCGM Exporter provide valuable device-level metrics, but they were not designed to understand fractional GPU allocations or how shared GPU capacity is being consumed by individual workloads.
Kubex includes a sharing-aware GPU metrics exporter that provides visibility into GPU utilization at the workload level, including compute and memory consumption relative to a workload’s allocated GPU fraction. These signals allow Kubex to make more informed rightsizing and rebalancing decisions for KAI-managed workloads.
This is especially important for inference environments, where demand patterns can shift quickly and where individual services may use GPU compute and memory differently.
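To make "consumption relative to a workload's allocated GPU fraction" concrete, here is a minimal sketch of that calculation. The `WorkloadSample` fields and the `relative_utilization` function are assumptions made for illustration; they do not reflect the metric names or schema of the Kubex exporter.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class WorkloadSample:
    """One hypothetical per-workload sample; field names are illustrative."""
    name: str
    allocated_fraction: float    # share of the GPU requested, 0 < f <= 1
    compute_used: float          # share of the whole GPU's compute in use, 0..1
    memory_used_mib: float       # GPU memory held by this workload
    gpu_memory_total_mib: float  # total memory of the physical GPU


def relative_utilization(sample: WorkloadSample) -> Dict[str, float]:
    """Express compute and memory usage as a ratio of the allocated fraction.

    A ratio of 1.0 means the workload is using exactly what it requested;
    values above 1.0 mean it is consuming more than its share.
    """
    if not 0 < sample.allocated_fraction <= 1:
        raise ValueError("allocated_fraction must be in (0, 1]")
    if sample.gpu_memory_total_mib <= 0:
        raise ValueError("gpu_memory_total_mib must be positive")

    compute_ratio = sample.compute_used / sample.allocated_fraction
    memory_share = sample.memory_used_mib / sample.gpu_memory_total_mib
    memory_ratio = memory_share / sample.allocated_fraction

    return {
        "compute": compute_ratio,
        "memory": memory_ratio,
        # The larger ratio is the dimension that constrains rightsizing.
        "binding": max(compute_ratio, memory_ratio),
    }
```

For example, a service that requests half a GPU, uses 10% of the device's compute, and holds 6 GiB of a 24 GiB card has a compute ratio of 0.2 and a memory ratio of 0.5. Memory, not compute, limits how far its fraction could shrink.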
Active rebalancing for changing demand
Kubex actively monitors GPU-backed workloads and can rebalance shared GPU requests as usage changes. This helps teams avoid the two most common outcomes of manual allocation: reserving too much capacity for safety, or undersizing workloads and reacting only after performance issues appear.
With Kubex, GPU sharing becomes a continuous optimization loop rather than a one-time configuration decision.
Fair-use enforcement and noisy-neighbor protection
Kubex helps platform teams enforce fair use by monitoring shared GPU consumption and adjusting allocations based on policy. This helps avoid noisy-neighbor issues that can affect the reliability and performance of inference workloads.
This gives organizations a safer way to increase GPU density without giving up control.
KAI-aware bin packing and GPU node consolidation
Kubex adds KAI-aware GPU bin-packing capabilities to help improve node usage. It can also support GPU node consolidation by identifying underutilized GPU nodes within compatible node pools and determining whether shared GPU workloads can fit elsewhere.
For platform teams, this creates a path to run more inference workloads on the same GPU footprint and reduce waste from underused nodes.
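The question "can the shared GPU workloads on this node fit elsewhere?" is a bin-packing problem. The sketch below uses a generic best-fit-decreasing heuristic to answer it. It is not Kubex's or KAI's placement logic, and the `plan_drain` function and its inputs are hypothetical. It also assumes that a workload's fraction must land on a single GPU and considers only the fraction itself, ignoring memory, topology, and queue constraints.

```python
from typing import Dict, Optional


def plan_drain(
    workloads: Dict[str, float],
    free_by_gpu: Dict[str, float],
) -> Optional[Dict[str, str]]:
    """Try to place every workload from a candidate node onto other GPUs.

    `workloads` maps workload name -> requested GPU fraction.
    `free_by_gpu` maps a GPU identifier on another node -> unallocated fraction.
    Returns a workload -> GPU placement, or None if the heuristic finds no fit.
    """
    eps = 1e-9  # tolerance for floating-point comparisons
    remaining = dict(free_by_gpu)
    placement: Dict[str, str] = {}

    # Place the largest fractions first; they are the hardest to fit.
    for name, fraction in sorted(
        workloads.items(), key=lambda item: item[1], reverse=True
    ):
        candidates = [g for g, free in remaining.items() if free + eps >= fraction]
        if not candidates:
            return None  # this workload cannot move, so the node cannot be drained
        # Best fit: pick the GPU that leaves the least capacity stranded.
        target = min(candidates, key=lambda g: (remaining[g], g))
        remaining[target] -= fraction
        placement[name] = target

    return placement
```

Because this is a heuristic, a `None` result means only that this strategy found no placement, not that none exists. A production system would also need to weigh disruption cost before moving anything.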
Built for practical operations
This launch is designed to help platform teams adopt GPU sharing in a controlled, operationally realistic way.
Platform teams decide which workloads participate, how aggressively allocations can change, and what utilization thresholds should trigger adjustments. This allows organizations to start with a small set of inference services, validate behavior, and expand adoption over time.
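As a rough illustration of those controls, the sketch below gates a recommended change behind an opt-in label and caps how far an allocation may move in one adjustment. Everything here is hypothetical: the `SharingPolicy` class, the `example.com/gpu-rightsizing` label key, and the default limits are invented for the example and are not Kubex configuration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SharingPolicy:
    """Hypothetical guardrails; names and defaults are illustrative only."""
    opt_in_label: str = "example.com/gpu-rightsizing"  # invented label key
    max_step_up: float = 0.25    # largest increase to a GPU fraction per adjustment
    max_step_down: float = 0.10  # largest decrease to a GPU fraction per adjustment
    min_fraction: float = 0.10
    max_fraction: float = 1.00


def apply_policy(
    labels: Mapping[str, str],
    current: float,
    recommended: float,
    policy: SharingPolicy = SharingPolicy(),
) -> float:
    """Return the GPU fraction to apply after policy guardrails."""
    # Workloads that have not opted in are never changed.
    if labels.get(policy.opt_in_label) != "enabled":
        return current

    # Limit how far a single adjustment can move the allocation.
    step = recommended - current
    step = min(step, policy.max_step_up)
    step = max(step, -policy.max_step_down)

    # Keep the result inside the allowed range.
    return min(policy.max_fraction, max(policy.min_fraction, current + step))
```

Starting with a small `max_step_down` and a handful of labeled services is one way to validate behavior before widening the rollout.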
Run more inference on the GPUs you already have
The growth of AI inference is putting more pressure on GPU infrastructure. Buying or renting more GPUs is not always the fastest or most efficient answer, especially when existing clusters still have unused capacity hidden behind static allocations and conservative requests.
With Kubex support for the KAI Scheduler, shared GPU inference becomes easier to operate at scale: KAI provides the scheduling foundation, while Kubex continuously observes, rightsizes, and rebalances workloads to improve GPU utilization over time.
Together, they help teams run more inference workloads on existing GPU infrastructure, reduce manual tuning, and build a more efficient Kubernetes platform for production AI.
To learn more, reach out to see how Kubex can help optimize your inference workloads.