A New Approach to GPU Sharing: Deterministic, SLA-Based GPU Kernel Scheduling for Higher Utilization
Modern GPU-sharing techniques like MIG and time-slicing improve utilization, but they fundamentally operate by partitioning or alternating access to the device—leaving large amounts of GPU compute idle and introducing latency variability. WoolyAI takes a different approach. Instead of slicing the GPU, we orchestrate GPU kernel execution directly, enabling multiple ML jobs to run concurrently inside a shared GPU context with deterministic, SLA-based scheduling. This allows the GPU to be treated as a continuously active compute fabric, maximizing SM and memory utilization while preserving predictable performance for high-priority workloads.
What WoolyAI Deterministic, SLA-Based GPU Kernel Scheduling Is
WoolyAI introduces a global, deterministic GPU scheduler that runs all tenants within a single unified GPU context, rather than letting each application run its own context.
Wooly libraries inside the user's ML containers intercept all framework-level GPU operations (e.g., from PyTorch), convert them into Wooly IR (Intermediate Representation), and route them to the Wooly server scheduler component running on NVIDIA or AMD GPU hosts, which:
- JIT (Just-In-Time) compiles the IR to the target GPU instruction set
- Orders all kernel launches across tenants
- Applies strict priority and fair-share rules
- Controls kernel pacing, concurrency, and interleaving
- Prevents any single tenant from monopolizing GPU compute
- Removes random jitter caused by native driver-level kernel contention
The result is a predictable, SLA-friendly GPU execution model where workload behavior is repeatable and tunable.
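To make the client-side flow concrete, here is a minimal, purely illustrative Python sketch of the pattern described above: intercept a framework-level operation, capture it as an IR record, and route it to a scheduler stream. The actual Wooly IR format and client/server protocol are not public, so every name below (`WoolyIROp`, `submit_op`, `ir_stream`) is a hypothetical stand-in rather than WoolyAI's API.

```python
# Conceptual sketch only: the real Wooly IR format and wire protocol are
# proprietary; every name here is hypothetical.
from dataclasses import dataclass
from queue import Queue
from typing import Any

@dataclass
class WoolyIROp:
    """One framework-level GPU operation, captured as a backend-neutral IR record."""
    tenant_id: str
    op_name: str          # e.g. "matmul" as seen at the framework boundary
    arg_shapes: tuple     # enough metadata for the server to JIT a target kernel
    priority: int = 0     # SLA class assigned to the tenant

# Stand-in for the channel to the Wooly server scheduler on the GPU host.
ir_stream: "Queue[WoolyIROp]" = Queue()

def submit_op(tenant_id: str, op_name: str, *tensors: Any, priority: int = 0) -> None:
    """Intercept a framework op: convert it to IR and route it to the scheduler
    instead of launching a kernel through the native driver."""
    record = WoolyIROp(
        tenant_id=tenant_id,
        op_name=op_name,
        arg_shapes=tuple(getattr(t, "shape", ()) for t in tensors),
        priority=priority,
    )
    ir_stream.put(record)  # the server JIT-compiles and orders the actual launch

# Example: two tenants' ops land in one stream that the server orders globally.
submit_op("tenant-a", "matmul", [[1.0, 2.0]], [[3.0], [4.0]], priority=1)
submit_op("tenant-b", "relu", [0.5, -0.5])
print(ir_stream.qsize())   # 2 ops queued for global scheduling
```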
Comparison With Status Quo GPU Sharing
| Approach | How It Works | Problems |
| --- | --- | --- |
| Native CUDA / ROCm (multi-process) | Each process manages its own context and streams | Idle gaps, uncontrolled concurrency, context switching, noisy neighbors |
| CUDA MPS | Shares SMs across processes | Still best-effort, not deterministic, no SLA guarantees, applies only to similar processes, no isolation |
| MIG | Hard partitions | Rigid, wasteful, low flexibility, still no kernel-level control |
| Time slicing | Shares the GPU by alternating execution time | No SLA guarantees, only applies to controlled jobs, still wasteful |
Factors Leading to Low Utilization (Status Quo)
- GPUs execute short kernels from many apps → constant idle gaps
- Python overhead + batching delays → huge underutilization
- Some tenants spike (large matmuls) → noisy neighbor interference
- No global view of GPU → drivers do non-deterministic scheduling
- VRAM fragmentation → fewer models can co-exist on a single GPU
Typical effective utilization: 20–40%.
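A quick way to see these gaps on an existing NVIDIA host is to sample the driver's utilization counter. The sketch below uses the public pynvml bindings (from the nvidia-ml-py package); the sampling interval and duration are arbitrary choices for illustration. Note that this counter only reports the fraction of time at least one kernel was resident on the device, so real SM utilization is typically even lower than the number it prints.

```python
# Sample GPU utilization once per second for a minute and report the mean.
# Requires nvidia-ml-py (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
for _ in range(60):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)       # percentage of time a kernel was running
    time.sleep(1.0)

pynvml.nvmlShutdown()
print(f"mean GPU active time: {sum(samples) / len(samples):.0f}%")
```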
How WoolyAI Achieves 3× Higher Utilization
WoolyAI solves all these inefficiencies through a unified execution model:
Unified GPU Context
All tenants load into one context on the GPU:
- No context switching
- Shared model weights (dedup) where applicable
- Unified memory allocator
- Near-zero memory fragmentation
This alone gives +30–50% more effective utilization.
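As a rough mental model of the unified allocator with per-tenant accounting, the toy sketch below tracks every tenant's usage against a quota inside one shared pool; `UnifiedVRAMPool` and its methods are hypothetical illustrations, not WoolyAI's allocator.

```python
# Toy model of a single shared VRAM pool with per-tenant quotas.
# Illustrates the accounting idea only.
class UnifiedVRAMPool:
    def __init__(self, total_bytes: int):
        self.total = total_bytes
        self.used = 0
        self.per_tenant: dict[str, int] = {}
        self.quota: dict[str, int] = {}

    def set_quota(self, tenant: str, limit_bytes: int) -> None:
        self.quota[tenant] = limit_bytes

    def alloc(self, tenant: str, size: int) -> bool:
        """Grant an allocation only if it fits both the tenant quota and the pool."""
        tenant_used = self.per_tenant.get(tenant, 0)
        if tenant_used + size > self.quota.get(tenant, self.total):
            return False                       # tenant over its quota
        if self.used + size > self.total:
            return False                       # pool exhausted
        self.per_tenant[tenant] = tenant_used + size
        self.used += size
        return True

    def free(self, tenant: str, size: int) -> None:
        self.per_tenant[tenant] -= size
        self.used -= size

pool = UnifiedVRAMPool(total_bytes=80 * 2**30)        # e.g. one 80 GB GPU
pool.set_quota("tenant-a", 48 * 2**30)
pool.set_quota("tenant-b", 32 * 2**30)
print(pool.alloc("tenant-a", 20 * 2**30))             # True: within quota and pool
```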
Global Kernel Scheduler Removes Idle Gaps
Wooly merges all tenants’ kernel streams into one continuous GPU work queue.
When Tenant A pauses (Python, token sampling, I/O), Wooly server scheduler immediately fills that gap with work from Tenant B or C.
Wooly also tracks a new utilization signal, the Starvation Metric; a higher Starvation Metric indicates the GPU is being utilized fully and efficiently across many jobs.
GPU active time rises from ~30% → ~80–95%. Check out a video demo at https://youtu.be/udps3B0KCnc
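The gap-filling behavior can be sketched as a simple work-conserving dispatch loop over per-tenant kernel queues: whenever the tenant whose turn it is has nothing pending, the scheduler immediately moves on to one that does. This is an illustration of the idea only; the queue names and round-robin policy below are assumptions, not WoolyAI's implementation.

```python
# Work-conserving dispatch: the GPU stays busy while any tenant has pending kernels.
from collections import deque

tenant_queues = {
    "A": deque(["matmul", "softmax"]),        # A goes quiet after two kernels
    "B": deque(["conv", "conv", "relu"]),
    "C": deque(["matmul"]),
}

order = deque(tenant_queues)                  # round-robin order over tenants

def next_kernel():
    """Pick the next pending kernel from any tenant that has work; skip idle ones."""
    for _ in range(len(order)):
        tenant = order[0]
        order.rotate(-1)                      # move this tenant to the back
        if tenant_queues[tenant]:
            return tenant, tenant_queues[tenant].popleft()
    return None                               # every tenant is idle

timeline = []
while (kernel := next_kernel()) is not None:
    timeline.append(kernel)                   # no idle gap while work exists anywhere

print(timeline)
# [('A', 'matmul'), ('B', 'conv'), ('C', 'matmul'), ('A', 'softmax'), ('B', 'conv'), ('B', 'relu')]
```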
Deterministic Priority + Fair Share Scheduling
High-priority tenants receive execution precedence, with low-priority work interleaved only when spare capacity is available:
- High-priority tenants meet consistent latency SLAs
- Background jobs do not introduce tail latency spikes
This eliminates noisy neighbors without hardware slicing.
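A minimal sketch of strict-priority, deterministic dispatch is shown below: pending kernels are ordered first by the tenant's SLA priority and then by arrival order, so high-priority work always launches first and low-priority work fills the remaining capacity. The structure and names are hypothetical, not WoolyAI's scheduler.

```python
# Strict-priority dispatch with a deterministic FIFO tie-breaker per class.
import heapq
import itertools

counter = itertools.count()       # arrival order keeps dispatch deterministic
ready = []                        # min-heap of (priority, seq, tenant, kernel)

def enqueue(tenant: str, kernel: str, priority: int) -> None:
    """Lower number = higher priority; FIFO within a priority class."""
    heapq.heappush(ready, (priority, next(counter), tenant, kernel))

def dispatch():
    """Always launch the highest-priority pending kernel next."""
    if ready:
        priority, _, tenant, kernel = heapq.heappop(ready)
        return tenant, kernel
    return None

# A latency-sensitive tenant (priority 0) shares the GPU with a batch job (priority 1).
enqueue("batch", "conv", 1)
enqueue("online", "matmul", 0)
enqueue("batch", "conv", 1)
enqueue("online", "softmax", 0)

order = []
while (item := dispatch()) is not None:
    order.append(item)
print(order)
# [('online', 'matmul'), ('online', 'softmax'), ('batch', 'conv'), ('batch', 'conv')]
```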
VRAM Quotas With Weight Deduplication
Since all tenants run under one context:
- Shared base model weights are loaded once, which is especially valuable for LoRA adapters tuned on a single base model or SLMs fine-tuned from a common foundation model
- VRAM waste drops significantly
This increases concurrency per GPU. Check out a video demo at https://youtu.be/o0lldESNHR4
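The deduplication idea can be sketched as a content-addressed weight cache: identical base weights hash to the same key and are loaded into the shared context once, while each tenant keeps only its own small adapter. This is an illustrative model under that assumption, not WoolyAI's actual mechanism.

```python
# Content-addressed cache: identical base weights are resident once and shared.
import hashlib

shared_weights: dict[str, bytes] = {}          # content hash -> one resident copy

def load_base_model(weight_blob: bytes) -> str:
    """Return a handle to the shared copy, loading it only if it is new."""
    key = hashlib.sha256(weight_blob).hexdigest()
    if key not in shared_weights:
        shared_weights[key] = weight_blob      # first tenant pays the VRAM cost
    return key                                 # later tenants reuse the same copy

# Three tenants serve LoRA variants of the same base model.
base = b"...weights of a shared base model..."
handles = [load_base_model(base) for _ in range(3)]

print(len(set(handles)), "resident base copy for", len(handles), "tenants")
# 1 resident base copy for 3 tenants
```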
In summary, SLA-based, kernel-level GPU scheduling transforms GPUs from single-tenant, over-reserved devices into fully utilized, multi-tenant accelerators that deliver predictable performance. It lets ML teams run far more experiments, iterate faster, and unlock higher throughput, all without buying more GPUs or rewriting a line of model code.
