Press Release

Kubex Launches KAI Scheduler Integration to Automate GPU Sharing for AI Inference

June 24, 2026

Kubex today launched support for the KAI Scheduler, adding automated GPU sharing and continuous rightsizing to Kubernetes-based AI inference environments. The integration gives platform teams a practical path to serve more models on existing GPU infrastructure without adding hardware or manual tuning.

GPU capacity is being wasted at scale. In one analysis of a Kubernetes inference environment with hundreds of GPUs, Kubex found that only about 20 percent of containers were using their full GPU allocation at any given time. The large majority were consuming less than half. Static allocation masks that waste. Teams reserve full GPUs or oversized fractions to play it safe, and the idle capacity accumulates.

That pattern is not a technical failure. It is a governance failure. There is no feedback loop connecting real usage to what workloads are actually requesting.

This launch closes that loop.

“KAI Scheduler gives teams the foundation for GPU sharing. Kubex makes it operational. We continuously observe what workloads are actually using and adjust allocations accordingly. The result is better utilization, less manual work, and a cluster that can handle significantly more throughput with the hardware already in place.”

- Andrew Hillier, CTO and Co-Founder, Kubex

How the KAI Scheduler Works

The KAI Scheduler provides the Kubernetes-native foundation for GPU-aware scheduling: shared GPU placement, hierarchical queues with quotas, workload priority and preemption, gang scheduling for multi-pod workloads, and topology-aware placement. Kubex operates on top of that foundation, continuously observing real GPU usage and making adjustments that keep allocations accurate over time.

Together they address the full lifecycle of shared GPU inference. KAI handles placement and governance at scheduling time. Kubex handles observation and adjustment continuously.

What Kubex Adds

Automated rightsizing of KAI GPU fractions

Kubex manages KAI GPU fractions both proactively, using machine learning predictive modeling, and reactively, adjusting fractions in real time as demand changes. When a workload's usage stays low over time, Kubex scales its allocation back. Inference services get what they need without permanent over-allocation.
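As a rough illustration of the reactive side of rightsizing, the sketch below derives a GPU fraction from recent usage samples. It is not Kubex's algorithm: the function name, the 95th-percentile rule, the 20 percent headroom, and the 0.05 step size are all assumptions chosen for the example.

```python
import math

def recommend_fraction(usage_samples, current_fraction,
                       headroom=0.2, step=0.05,
                       min_fraction=0.05, max_fraction=1.0):
    """Suggest a GPU fraction from observed usage (hypothetical policy).

    usage_samples: observed usage as a share of one whole GPU (0.0 to 1.0).
    All thresholds here are illustrative assumptions, not Kubex defaults.
    """
    if not usage_samples:
        # No data yet: leave the current allocation untouched.
        return current_fraction

    ordered = sorted(usage_samples)
    # Nearest-rank 95th percentile, so brief spikes do not drive sizing.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    p95 = ordered[rank - 1]

    # Add headroom, then round up to the next allocation step.
    target = p95 * (1 + headroom)
    stepped = math.ceil(target / step - 1e-9) * step

    # Clamp to the allowed range.
    return round(min(max_fraction, max(min_fraction, stepped)), 2)
```

Under these assumptions, a service holding half a GPU while steadily using about a tenth of one would be recommended a much smaller fraction, and a service running near a full GPU would be capped at 1.0.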

Sharing-aware GPU observability

Standard GPU exporters surface device-level metrics. They do not show how shared capacity is being consumed by individual workloads. Kubex includes a sharing-aware metrics exporter that tracks compute and memory utilization at the workload level, relative to each service’s allocated fraction. Those signals feed directly into rightsizing and rebalancing decisions.
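To make "relative to each service's allocated fraction" concrete, here is a minimal sketch of the arithmetic. The `WorkloadSample` type and its field names are hypothetical and do not reflect the Kubex exporter's actual metric schema.

```python
from dataclasses import dataclass

@dataclass
class WorkloadSample:
    """One observation for a workload on a shared GPU (hypothetical schema)."""
    name: str
    allocated_fraction: float  # share of the GPU the workload requested
    gpu_compute_used: float    # share of whole-GPU compute actually used
    gpu_memory_used: float     # share of whole-GPU memory actually used

def relative_utilization(sample: WorkloadSample) -> dict:
    """Express usage as a share of the workload's own allocation."""
    if sample.allocated_fraction <= 0:
        raise ValueError("allocated_fraction must be positive")
    return {
        "compute": sample.gpu_compute_used / sample.allocated_fraction,
        "memory": sample.gpu_memory_used / sample.allocated_fraction,
    }
```

A device-level metric would report only that this GPU is lightly loaded. The workload-level view shows which service holds an allocation it is not using, and a value above 1.0 would indicate a service consuming more than its share.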

Active rebalancing

Kubex monitors GPU-backed workloads and rebalances shared GPU requests as usage shifts. Teams no longer have to choose between over-reserving capacity for safety and undersizing workloads, then reacting after performance degrades.

Fair-use enforcement

Kubex monitors shared GPU consumption and adjusts allocations based on policy. Noisy-neighbor issues are caught and corrected before they affect other inference services.

KAI-aware bin packing and node consolidation

Kubex adds GPU bin-packing to improve node usage and can identify underutilized GPU nodes where workloads could consolidate. More inference, same GPU footprint.
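The general idea behind GPU bin packing can be sketched with a classic first-fit-decreasing heuristic. This is a textbook technique used here for illustration only; the source does not say which packing strategy Kubex uses, and the function and field names are hypothetical.

```python
def pack_first_fit_decreasing(requests, capacity=1.0):
    """Pack fractional GPU requests onto as few GPUs as the heuristic finds.

    requests: mapping of workload name -> requested GPU fraction.
    Returns a list of GPUs, each with its remaining free share and workloads.
    """
    gpus = []
    # Place the largest requests first; small ones fill the gaps later.
    for name, fraction in sorted(requests.items(),
                                 key=lambda kv: kv[1], reverse=True):
        if fraction > capacity + 1e-9:
            raise ValueError(f"{name} requests more than one GPU")
        for gpu in gpus:
            # Small tolerance guards against floating-point drift.
            if gpu["free"] + 1e-9 >= fraction:
                gpu["workloads"].append(name)
                gpu["free"] -= fraction
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append({"free": capacity - fraction, "workloads": [name]})
    return gpus
```

With rightsized fractions as input, fewer GPUs are needed to hold the same set of services, and any node left without workloads becomes a candidate for consolidation.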

Built for Self-Hosted Inference

This launch is especially relevant for environments running on T4, L4, and A10 GPUs, where hardware partitioning is not available. Software-level GPU sharing gives teams a practical way to recover stranded capacity without redesigning workloads.

Platform teams control which workloads participate, how aggressively allocations can change, and what thresholds trigger adjustments. That makes it possible to start with a small set of services, validate behavior, and expand. 

To learn more about how Kubex can help optimize your inference workloads, visit kubex.ai.