Turn one GPU into a multi-model agent execution environment.

WoolyAI turns a single GPU into a controlled multi-model execution environment for agentic workloads: it improves concurrent throughput, protects high-priority model SLAs, and enables larger workflows through intelligent VRAM swap.

up to 1.77×

Higher prefill throughput

Observed in the pinned-KV dual-vLLM benchmark for Qwen3.5-4B under WoolyAI versus native concurrent execution.

0.89–0.94×

Priority-0 TG vs native single-model

The high-priority Mistral model stayed close to native single-model token generation while a second model ran in the background.

~2.4s

Swap-in TTFT probe

Measured for 24B and 27B models when WoolyAI swapped live GPU state back into an H100 80GB GPU.

Concurrent model serving

Native concurrent vLLM leaves utilization on the table.

WoolyAI adds a GPU runtime scheduling layer under existing model servers. In the benchmark, two vLLM servers ran concurrently with pinned KV cache. Relative to native concurrent execution, WoolyAI improved prefill throughput, token generation throughput, and TTFT across most tested configurations.

Equal-priority concurrent run

Summary ranges from the 10-run pinned-KV matrix.

| Model | PP | TG | TTFT |
| --- | --- | --- | --- |
| Mistral-7B-Instruct-v0.3 | 1.38×–1.63× | 1.07×–1.33× | 0.62×–0.76× |
| Qwen3.5-4B | 1.08×–1.77× | 1.05×–1.33× | 0.58×–0.96× |

PP (prefill) and TG (token generation) are measured in tokens/second, where higher is better. TTFT (time to first token) is latency, where lower is better.
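To make the ratio ranges easier to read, the sketch below converts them into percent changes. It assumes each figure is WoolyAI divided by native concurrent execution, so a TTFT ratio below 1 means lower latency. The `percent_change` helper and the `ranges` dictionary are illustrative names, not part of any WoolyAI tooling; the numbers are copied from the summary table.

```python
def percent_change(ratio: float) -> float:
    """Convert a ratio (assumed WoolyAI / native) into a signed percent change."""
    return (ratio - 1.0) * 100.0

# Ratio ranges copied from the equal-priority summary table.
ranges = {
    ("Mistral-7B-Instruct-v0.3", "PP"): (1.38, 1.63),
    ("Mistral-7B-Instruct-v0.3", "TG"): (1.07, 1.33),
    ("Mistral-7B-Instruct-v0.3", "TTFT"): (0.62, 0.76),
    ("Qwen3.5-4B", "PP"): (1.08, 1.77),
    ("Qwen3.5-4B", "TG"): (1.05, 1.33),
    ("Qwen3.5-4B", "TTFT"): (0.58, 0.96),
}

for (model, metric), (low, high) in ranges.items():
    # For PP and TG a positive change is an improvement;
    # for TTFT a negative change (lower latency) is an improvement.
    print(f"{model} {metric}: {percent_change(low):+.0f}% to {percent_change(high):+.0f}%")
```

Read this way, the best Qwen3.5-4B prefill case is roughly 77% more tokens/second, and the best TTFT case is roughly 42% lower latency.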

NVIDIA H100 SXM5 80GB HBM3

What changes with WoolyAI?

Native concurrent vLLM
Independent vLLM processes compete indirectly
No explicit runtime priority or fine-grained process-aware scheduling
WoolyAI Runtime
Model kernels scheduled through WoolyAI
Priority-aware compute allocation and higher effective utilization under contention

Deterministic priority control

Protect the important model while background models keep running.

Agentic workflows often use planner, worker, verifier, summarizer, and tool-calling models. WoolyAI lets teams pack these models onto the same GPU and assign deterministic runtime priorities.

 

Priority split used in the benchmark

| Model | Priority | Role example |
| --- | --- | --- |
| Mistral-7B-Instruct-v0.3 | P0 / highest | Worker or SLA-sensitive model |
| Qwen3.5-4B | P1 / lower | Planner, verifier, or background model |
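As a conceptual illustration only, the toy sketch below shows what strict priority ordering means for queued work: pending P0 items are always dispatched before P1 items, and items at the same priority keep their submission order. This is not WoolyAI's scheduler or API; `dispatch_order` and the model labels are hypothetical names, and a real GPU runtime schedules kernels with far more nuance than a single queue.

```python
import heapq
from itertools import count

def dispatch_order(submissions):
    """Return model labels in strict-priority order (lower number runs first).

    submissions: iterable of (model_label, priority) in arrival order.
    Ties within one priority level are broken by arrival order.
    """
    arrival = count()
    heap = [(priority, next(arrival), model) for model, priority in submissions]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, model = heapq.heappop(heap)
        order.append(model)
    return order

# Hypothetical queue: background (P1) and SLA-sensitive (P0) work interleaved.
pending = [
    ("qwen-p1", 1),
    ("mistral-p0", 0),
    ("qwen-p1", 1),
    ("mistral-p0", 0),
]
order = dispatch_order(pending)
print(order)
```

In this toy model the background work is never dropped; it simply waits until no higher-priority work is pending, which mirrors the intent of the priority split in the table.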

Priority-0 performance

The priority-0 Mistral run achieved close to native single-vLLM token generation while the lower-priority Qwen server continued executing.

| Config | Priority-0 TG vs native single |
| --- | --- |
| 2048/256 d0 c1 | 0.93× |
| 512/64 d0 c4 | 0.94× |
| 2048/256 d0 c4 | 0.89× |
| 2048/256 d0 c4 | 0.93× |
| 2048/256 d0 c4 | 0.90× |
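The five priority-0 ratios above can be summarized with a few lines of arithmetic. The sketch below only restates the listed values; the variable names are illustrative, and the mean is a derived figure rather than a number reported by the benchmark.

```python
from statistics import mean

# Priority-0 TG ratios vs native single-model, in the order listed above.
p0_tg_ratios = [0.93, 0.94, 0.89, 0.93, 0.90]

worst = min(p0_tg_ratios)
best = max(p0_tg_ratios)
avg = mean(p0_tg_ratios)

print(f"range: {worst:.2f}x to {best:.2f}x, mean: {avg:.3f}x")
# Largest observed gap to native single-model token generation.
print(f"worst-case TG reduction: {(1 - worst) * 100:.0f}%")
```

The range matches the 0.89–0.94× headline figure: in these runs the high-priority model gave up at most about 11% of its single-model token generation rate while sharing the GPU.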

Planner

Lower-priority model creates the task plan.

Worker

Priority-0 reasoning model gets compute when it needs it.

Verifier

Background model checks and summarizes using idle capacity.
 

VRAM overcommit and swap

Run workflows larger than physical GPU memory.

WoolyAI swaps the client’s live GPU allocations instead of treating the full GPU device capacity as a monolithic object. This allows inactive model state to be swapped out and restored when needed.

Large-model swap probe

Two larger vLLM servers, each with an 80 GB VRAM reservation, ran concurrently on an NVIDIA H100 SXM5 80GB HBM3 under one WoolyAI server, with an explicit 8 GiB KV cache per model.

| Model | Swapped state | Mean TTFT |
| --- | --- | --- |
| Mistral-Small-24B | 54.48 GiB | 2.385 s |
| Qwen3.6-27B | 60.71 GiB | 2.453 s |
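A rough back-of-the-envelope reading of the swap probe is sketched below. It assumes the full swapped state is restored before the first token is produced, so dividing state size by mean TTFT gives a lower bound on the effective restore rate (TTFT also covers request handling and prefill). It also treats the GPU's nominal 80 GB as an upper bound of 80 GiB. Both are assumptions, and the derived rates are not figures reported by the benchmark.

```python
# (swapped state in GiB, mean TTFT in seconds), copied from the swap probe results.
swap_probe = {
    "Mistral-Small-24B": (54.48, 2.385),
    "Qwen3.6-27B": (60.71, 2.453),
}

# Assumption: nominal capacity treated as an 80 GiB upper bound.
PHYSICAL_VRAM_GIB = 80.0

restore_rate = {}
for model, (state_gib, ttft_s) in swap_probe.items():
    # Lower bound: TTFT includes more than the swap-in itself.
    restore_rate[model] = state_gib / ttft_s
    print(f"{model}: at least {restore_rate[model]:.1f} GiB/s effective restore rate")

combined_gib = sum(state for state, _ in swap_probe.values())
print(f"combined live state: {combined_gib:.2f} GiB "
      f"({combined_gib / PHYSICAL_VRAM_GIB:.2f}x the assumed physical capacity)")
```

Under these assumptions, the two models' live state adds up to about 115 GiB, which cannot be resident at once on a single 80 GB device; that is why one model's state is swapped out while the other is active.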

Swap only what matters

Active model
Model A resident in H100 VRAM
Inactive model
Model B live GPU state swapped out

The swapped state includes model memory and working memory, such as KV cache and graph cache, based on live allocations.

Benchmark setup

Reproducible, pinned-KV comparison.

The benchmark used a pinned-KV, 10-run dual-vLLM matrix. Native and WoolyAI runs used the same manually pinned KV allocation to keep the comparison focused on runtime behavior.

Hardware

NVIDIA H100 SXM5 80GB HBM3

Serving stack

vLLM 0.22.1rc1, CUDA graphs enabled

Benchmark tool

llama-benchy, 10 runs per case

CUDA

CUDA 13.0.2, containerized workload

Note: Benchmark results are workload-specific. Run your own model mix to validate performance under your target SLA, context length, and concurrency profile.