A New Approach to GPU Sharing: Deterministic, SLA-Based GPU Kernel Scheduling for Higher Utilization

Modern GPU-sharing techniques like MIG and time-slicing improve utilization, but they fundamentally operate by partitioning or alternating access to the device—leaving large amounts of GPU compute idle and introducing latency variability. WoolyAI takes a different approach. Instead of slicing the GPU, we orchestrate GPU kernel execution directly, enabling multiple ML jobs to run concurrently inside a shared GPU context with deterministic, SLA-based scheduling. This allows the GPU to be treated as a continuously active compute fabric, maximizing SM and memory utilization while preserving predictable performance for high-priority workloads.

What WoolyAI Deterministic, SLA-Based GPU Kernel Scheduling Is

WoolyAI introduces a global, deterministic GPU scheduler that runs all tenants within a single unified GPU context, rather than letting each application run its own context.

Wooly libraries inside the user's ML containers intercept all framework-level GPU operations (e.g., from PyTorch), convert them into Wooly IR (Intermediate Representation), and route them to the Wooly server scheduler component running on NVIDIA or AMD GPU hosts, which:

  • JIT (Just-In-Time) compiles the IR to the target GPU instruction set
  • Orders all kernel launches across tenants
  • Applies strict priority and fair-share rules
  • Controls kernel pacing, concurrency, and interleaving
  • Prevents any single tenant from monopolizing GPU compute
  • Removes random jitter caused by native driver-level kernel contention
 

The result is a predictable, SLA-friendly GPU execution model where workload behavior is repeatable and tunable.
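
To make the flow concrete, here is a minimal, purely illustrative Python sketch. None of these names (IROp, CentralScheduler, submit, drain) are WoolyAI's actual API; they only model the idea of encoding an intercepted framework op as an IR record and handing it to one global scheduler that orders launches deterministically across tenants.

```python
# Purely illustrative sketch: IROp, CentralScheduler, submit, and drain are
# hypothetical names, not WoolyAI's actual API. It only models the flow above:
# an intercepted framework-level op is encoded as an IR record and handed to a
# single global scheduler that orders kernel launches deterministically.
from dataclasses import dataclass
from queue import PriorityQueue
from itertools import count

@dataclass
class IROp:                       # hypothetical IR record for one GPU operation
    tenant: str
    op_name: str                  # e.g. "aten::matmul" captured from the framework
    arg_shapes: tuple

class CentralScheduler:           # hypothetical stand-in for the Wooly server scheduler
    def __init__(self):
        self._queue = PriorityQueue()
        self._seq = count()       # arrival order gives a deterministic tie-break

    def submit(self, priority: int, op: IROp):
        # Lower number = higher priority; ordering is global across all tenants.
        self._queue.put((priority, next(self._seq), op))

    def drain(self):
        # A real scheduler would JIT-compile the IR for the target GPU and pace
        # kernel launches; here we just print the deterministic launch order.
        while not self._queue.empty():
            priority, seq, op = self._queue.get()
            print(f"launch #{seq}: {op.tenant} {op.op_name} (prio={priority})")

scheduler = CentralScheduler()
scheduler.submit(1, IROp("tenant-A", "aten::matmul", ((4096, 4096), (4096, 4096))))
scheduler.submit(2, IROp("tenant-B", "aten::relu", ((8192,),)))
scheduler.submit(1, IROp("tenant-A", "aten::softmax", ((4096,),)))
scheduler.drain()
```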

Comparison With Status Quo GPU Sharing

| Approach | How It Works | Problems |
| --- | --- | --- |
| Native CUDA / ROCm (multi-process) | Each process manages its own context and streams | Idle gaps, uncontrolled concurrency, context switching, noisy neighbors |
| CUDA MPS | Shares SMs across processes | Still best-effort, not deterministic, no SLA guarantees, applies to similar processes, no isolation |
| MIG | Hard partitions | Rigid, wasteful, low flexibility, still no kernel-level control |
| Time Slicing | Shares the GPU by alternating execution time | No SLA guarantees, only applies to controlled jobs, still wasteful |

Factors leading to low utilization (Status Quo)

  1. GPUs execute short kernels from many apps → constant idle gaps

  2. Python overhead + batching delays → huge underutilization

  3. Some tenants spike (large matmuls) → noisy neighbor interference

  4. No global view of GPU → drivers do non-deterministic scheduling

  5. VRAM fragmentation → fewer models can co-exist on a single GPU

Typical effective utilization: 20–40%.
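
If you want to see where your own GPUs sit in that range, a simple sampling loop over NVML is enough. The sketch below assumes an NVIDIA GPU and the pynvml bindings (installed via the nvidia-ml-py package); the sampling window and interval are arbitrary choices for illustration.

```python
# Sample coarse GPU utilization to observe the idle-gap problem on your own
# hardware. Assumes an NVIDIA GPU and the pynvml module (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU on the host

samples = []
for _ in range(60):                              # sample roughly once per second
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)                     # % of time at least one kernel was running
    time.sleep(1.0)

pynvml.nvmlShutdown()
print(f"mean GPU utilization over window: {sum(samples) / len(samples):.1f}%")
```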

How WoolyAI Achieves 3× Higher Utilization

WoolyAI solves all these inefficiencies through a unified execution model:

Unified GPU Context

All tenants load into one context on the GPU:

  1. No context switching

  2. Shared model weights (dedup) where applicable

  3. Unified memory allocator

  4. No memory fragmentation

This alone gives +30–50% more effective utilization.
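
For intuition, the snippet below uses ordinary single-process PyTorch (not WoolyAI's runtime) to show what "one context, one allocator" means: when two tenants' models live in the same process, their kernels share a single CUDA/ROCm context and a single caching allocator, so no context switch separates their launches.

```python
# Ordinary single-process PyTorch (not WoolyAI's runtime), shown only to build
# intuition: models from two "tenants" placed in one process share one
# CUDA/ROCm context and one caching allocator. Assumes a GPU-enabled PyTorch build.
import torch
import torch.nn as nn

device = torch.device("cuda")     # same device, same context for both tenants

tenant_a = nn.Linear(4096, 4096).to(device)
tenant_b = nn.Linear(4096, 4096).to(device)

x = torch.randn(64, 4096, device=device)
with torch.no_grad():
    _ = tenant_a(x)               # both launches flow through the same context
    _ = tenant_b(x)               # and draw from the same caching allocator

# One allocator view covers everything resident in this context.
print(f"{torch.cuda.memory_allocated() / 1e6:.1f} MB allocated in the shared context")
```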

Global Kernel Scheduler Removes Idle Gaps

Wooly merges all tenants’ kernel streams into one continuous GPU work queue.

When Tenant A pauses (Python overhead, token sampling, I/O), the Wooly server scheduler immediately fills that gap with work from Tenant B or C.

Wooly also computes a new utilization metric, the Starvation Metric; a higher Starvation Metric means the GPU is being utilized fully and efficiently across many jobs.

GPU active time rises from ~30% → ~80–95%. Check out a video demo at https://youtu.be/udps3B0KCnc
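
The toy interleaver below illustrates the gap-filling idea (it is not WoolyAI code): all tenants' pending kernels are merged into one launch order, and whenever a tenant has nothing ready, the next launch simply comes from a tenant that does.

```python
# Toy, work-conserving interleaver (illustrative only, not WoolyAI code):
# kernels from all tenants are merged into one continuous launch order, and when
# a tenant is momentarily idle, the next launch comes from whichever tenant has work.
from collections import deque

def merge_streams(tenant_queues):
    """Yield (tenant, kernel) in a work-conserving round-robin order."""
    while any(tenant_queues.values()):
        for tenant, queue in tenant_queues.items():
            if queue:                        # skip tenants with nothing ready
                yield tenant, queue.popleft()

queues = {
    "A": deque(["matmul", "softmax"]),       # tenant A pauses after two kernels
    "B": deque(["conv", "relu", "matmul"]),
    "C": deque(["embedding_lookup"]),
}
for tenant, kernel in merge_streams(queues):
    print(f"GPU executes {kernel:<17} from tenant {tenant}")
```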

Deterministic Priority + Fair Share Scheduling

High-priority tenants get execution priority, with low-priority work interleaved when capacity allows. As a result:

  1. High-priority tenants meet consistent latency SLAs

  2. Background jobs do not introduce tail latency spikes

This eliminates noisy neighbors without hardware slicing.
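
Conceptually, the selection rule looks like the sketch below (illustrative only, with hypothetical tenant names and fields): strict priority decides first, and within a priority level the tenant furthest below its weighted fair share launches next.

```python
# Toy sketch of strict-priority plus fair-share selection (illustrative only):
# the highest-priority tenant with pending work always launches next; among
# tenants at the same priority, the one furthest below its fair share goes first.
def pick_next(tenants):
    """tenants: name -> {'priority': int (lower = higher), 'weight': float,
    'served': float, 'pending': list}. Returns the tenant to launch from."""
    ready = {name: t for name, t in tenants.items() if t["pending"]}
    if not ready:
        return None
    top = min(t["priority"] for t in ready.values())
    same_prio = {name: t for name, t in ready.items() if t["priority"] == top}
    # Tenants that have received the least service relative to their weight
    # are the most "behind" and get the GPU next.
    return min(same_prio, key=lambda n: same_prio[n]["served"] / same_prio[n]["weight"])

tenants = {
    "prod-inference": {"priority": 0, "weight": 3.0, "served": 10.0, "pending": ["k1"]},
    "batch-train":    {"priority": 1, "weight": 1.0, "served": 2.0,  "pending": ["k2"]},
    "notebook":       {"priority": 1, "weight": 1.0, "served": 0.5,  "pending": ["k3"]},
}
print(pick_next(tenants))   # -> "prod-inference": priority wins before fair share
```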

VRAM Quotas With Weight Deduplication

Since all tenants run under one context:

  1. Shared base model weights can be loaded once, which is especially effective with LoRA adapters tuned on a single base model, or SLMs tuned on a common foundational base model.
  2. VRAM waste drops significantly
 

This increases concurrency per GPU. Check out a video demo: https://youtu.be/o0lldESNHR4
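
The plain-PyTorch sketch below (not WoolyAI's allocator) shows why this pays off with LoRA-style serving: the frozen base weights are held once and shared, while each tenant contributes only small low-rank adapter matrices.

```python
# Plain-PyTorch illustration (not WoolyAI's allocator) of weight deduplication
# with LoRA: one frozen copy of the base weights serves every tenant, and each
# tenant adds only small low-rank adapter matrices in VRAM.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

d, rank = 4096, 8
base = nn.Linear(d, d, bias=False).to(device)      # loaded ONCE, shared by all tenants
base.requires_grad_(False)

def make_adapter():
    # Per-tenant LoRA factors: ~2*d*rank parameters vs d*d for the base layer.
    return nn.ParameterDict({
        "A": nn.Parameter(torch.randn(rank, d, device=device) * 0.01),
        "B": nn.Parameter(torch.zeros(d, rank, device=device)),
    })

adapters = {"tenant-A": make_adapter(), "tenant-B": make_adapter()}

def forward(tenant, x):
    a = adapters[tenant]
    return base(x) + x @ a["A"].T @ a["B"].T       # shared base + tenant-specific delta

x = torch.randn(2, d, device=device)
print(forward("tenant-A", x).shape)

base_params = sum(p.numel() for p in base.parameters())
adapter_params = sum(p.numel() for p in adapters["tenant-A"].values())
print(f"base: {base_params:,} params (one copy), per-tenant adapter: {adapter_params:,}")
```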

In summary, SLA-based, kernel-level GPU scheduling transforms GPUs from single-tenant, over-reserved devices into fully utilized, multi-tenant accelerators that deliver predictable performance. It lets ML teams run far more experiments, iterate faster, and unlock higher throughput, all without buying more GPUs or rewriting a line of model code.

Triple Your GPU Utilization in ML Development with WoolyAI

Stop letting idle GPUs drain your budget during the experimentation phase. Machine learning development is expensive, but it doesn’t have to be wasteful. In the typical ML lifecycle, the “experimentation, dev, and test” stages are notorious for low GPU utilization.