How WoolyAI Works
Architecture at a glance
Unified Container → WIS → JIT on GPU nodes
WoolyAI Client
your ML container
- Wooly Client Container Image: Run your existing CUDA PyTorch / vLLM apps in a Wooly Unified Container on CPU or GPU machines
- Wooly runtime libraries inside the container intercept CUDA kernel launches, convert them to the Wooly Instruction Set (WIS), and dispatch them to a remote GPU host, as illustrated below.
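For illustration, a completely standard PyTorch script (nothing Wooly-specific in the code) is all that is needed inside the Wooly Unified Container; the script name and tensor sizes below are arbitrary:

```python
# plain_pytorch.py - an unmodified PyTorch script.
# Run it inside the Wooly Unified Container, even on a CPU-only machine:
# the Wooly runtime libraries intercept its CUDA kernel launches, convert
# them to WIS, and dispatch them to a remote GPU host.
import torch

device = torch.device("cuda")              # appears as a local GPU to the app
x = torch.randn(4096, 4096, device=device)
y = torch.randn(4096, 4096, device=device)
z = x @ y                                  # matmul kernel launched via the CUDA API
print(z.sum().item())
```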
WoolyAI Controller
orchestrator for multi-GPU environments
- Routes client requests across GPUs: sends CUDA workloads to the best available GPU
- Uses live GPU utilization and saturation metrics for intelligent routing, as sketched below
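As a rough illustration of metric-driven routing (the Controller's actual scoring logic and metric names are internal; everything below is hypothetical):

```python
# Conceptual sketch only: illustrates choosing a GPU from live utilization
# and saturation metrics instead of static assignment. The weights and the
# GpuMetrics fields are made up for this example.
from dataclasses import dataclass

@dataclass
class GpuMetrics:
    gpu_id: str
    utilization: float   # 0.0-1.0, live compute utilization
    saturation: float    # 0.0-1.0, memory/queue pressure

def pick_gpu(metrics: list[GpuMetrics]) -> str:
    """Route the next client request to the least-loaded GPU (hypothetical weights)."""
    return min(metrics, key=lambda m: 0.6 * m.utilization + 0.4 * m.saturation).gpu_id

print(pick_gpu([GpuMetrics("gpu-0", 0.9, 0.7), GpuMetrics("gpu-1", 0.3, 0.2)]))  # gpu-1
```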
WoolyAI Server
on GPU nodes
- The Wooly Server Hypervisor receives WIS and performs just-in-time compilation to the node’s native backend (CUDA on NVIDIA, ROCm on AMD). Kernels execute through the native runtime drivers, so hardware-specific optimizations are retained and performance stays near-native (see the conceptual sketch after this list).
- Wooly Server runs concurrent kernel processes in a single GPU context, giving greater control over resource allocation and isolation.
- The GPU compute-core and VRAM resource manager dynamically allocates resources across concurrent kernel processes, with no context-switching overhead or static time-slicing waste.
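The backend-selection step can be pictured roughly as follows. This is a conceptual Python sketch only, not the Server Hypervisor’s API; the function names and placeholder compile step are invented for illustration:

```python
# Conceptual sketch: pick the node's native backend for a WIS kernel and
# "compile" it. All names here are illustrative, not a real Wooly API.
def jit_compile(wis_kernel: bytes, backend: str) -> bytes:
    """Stand-in for JIT compilation of WIS into a native kernel binary."""
    return wis_kernel  # placeholder: a real implementation emits CUDA or ROCm code

def prepare_kernel(wis_kernel: bytes, vendor: str) -> bytes:
    """Map the node's GPU vendor to its native backend, then JIT-compile."""
    backend = {"nvidia": "cuda", "amd": "rocm"}.get(vendor)
    if backend is None:
        raise ValueError(f"unsupported vendor: {vendor}")
    # The compiled kernel then runs through the native runtime drivers,
    # which is how vendor-specific optimizations are retained.
    return jit_compile(wis_kernel, backend=backend)
```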
Result
- One image runs on both vendors: no config conflicts, no rebuilds.
- Execute from CPU-only dev/CI while kernels run on a shared GPU pool.
- More workloads per GPU with consistent performance.
Integration & Operations
Wooly Controller to manage client kernel requests across multiple GPU clusters – Routes client CUDA kernels to available GPUs based on live utilization and saturation metrics.
Integration with Kubernetes – Use the Wooly Client Docker image and your existing Kubernetes workflow to spin up and manage ML dev environments. Pods are not bound to specific GPUs.
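A minimal sketch with the official Kubernetes Python client, assuming a placeholder image reference for the Wooly Client container (the real image name is not shown here); note that the pod requests no GPU resources, so it schedules like any CPU workload:

```python
# Create an ML dev pod from the Wooly Client image using the kubernetes client.
# The image reference and script name below are placeholders, not documented values.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ml-dev", labels={"app": "wooly-client"}),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/wooly-client:latest",  # placeholder image
                command=["python", "train.py"],
                # No nvidia.com/gpu resource request: the pod is not bound to a
                # specific GPU; CUDA work is served by the shared Wooly GPU pool.
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```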
Ray for orchestration, Wooly for all GPU work – The Ray head and workers run on CPU instances (or a mix), and each worker uses the Wooly Client container; Ray does not bind real GPUs.
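A minimal sketch of a Ray task in this setup, assuming the Ray workers already run inside the Wooly Client container (cluster and container setup are not shown); the task declares no Ray-managed GPUs, yet its PyTorch code still targets a CUDA device:

```python
# Ray schedules this as a CPU task (no num_gpus requested); the CUDA kernels
# inside it are handled by the Wooly runtime in the worker's container.
import ray
import torch

ray.init()  # connect to a Ray head running on a CPU instance

@ray.remote  # no GPU resources requested from Ray
def train_step() -> float:
    x = torch.randn(2048, 2048, device="cuda")  # executes on the Wooly GPU pool
    return (x @ x).sum().item()

print(ray.get(train_step.remote()))
```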