Announcing the Beta Launch of WoolyAI: The Era of Unbound GPU Execution

Today, we’re thrilled to announce the beta launch of WoolyAI Acceleration Service, a revolutionary GPU Cloud service built on WoolyStack, our cutting-edge CUDA abstraction layer.

Reimagining GPU Resource Utilization

GPU resource consumption and management in machine learning today is both constrained and highly inefficient. It is constrained by the dominance of CUDA (Nvidia) in the ML software ecosystem, and it is inefficient because organizations must choose between cost-efficiency, resource utilization, SLA goals, and control when consuming GPUs from cloud service providers or setting up their own managed GPU clusters.

We have built the Wooly Abstraction Layer, which decouples kernel shader execution from the applications that use CUDA. We are launching this first phase for PyTorch applications. In this abstraction layer, applications are compiled to a new binary, and their shaders are compiled into the Wooly Instruction Set. At runtime, a kernel shader launch event transfers the shader over the network from a CPU host to a GPU host, where it is recompiled and its execution is managed to maximize GPU resource utilization, isolate workloads from one another, and remain cross-compatible across hardware vendors, before being handed off to the respective GPU hardware runtime and drivers. The Wooly Abstraction Layer sits at the intersection of application software and hardware, optimizing GPU performance like an operating system for GPUs.
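To make that flow concrete, here is a minimal Python sketch of the runtime path described above, under stated assumptions: every name in it (WoolyRuntime, compile_to_wooly_is, the wire format) is hypothetical and purely illustrative, since WoolyStack's internals are not public.

```python
# Hypothetical sketch of the runtime path: intercept a kernel shader launch
# on the CPU host, ship the shader to a GPU host, and receive the result.
# None of these names are real WoolyAI APIs; they only illustrate the flow.
import pickle
import socket

class WoolyRuntime:
    """Runs on the CPU host and intercepts kernel shader launches."""

    def __init__(self, gpu_host: str, port: int = 9999):
        self.gpu_host = gpu_host
        self.port = port

    def compile_to_wooly_is(self, shader_source: str) -> bytes:
        # Placeholder: a real implementation would lower the shader to a
        # vendor-neutral instruction set.
        return shader_source.encode()

    def launch_kernel(self, shader_source: str, args: tuple):
        # 1. Compile the shader into the (hypothetical) Wooly Instruction Set.
        wooly_is = self.compile_to_wooly_is(shader_source)

        # 2. Ship the compiled shader and its arguments to the GPU host.
        payload = pickle.dumps({"shader": wooly_is, "args": args})
        with socket.create_connection((self.gpu_host, self.port)) as conn:
            conn.sendall(payload)
            # 3. The GPU host recompiles the shader for its local hardware,
            #    schedules it alongside other tenants' workloads, and sends
            #    back the result once execution completes.
            return pickle.loads(conn.recv(65536))
```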

Here’s how it works:

  • Decoupled Execution: Shaders are compiled into the Wooly Instruction Set, allowing for cross-vendor GPU compatibility.
  • Dynamic GPU Allocation: At runtime, kernel shaders (in the Wooly Instruction Set) are transferred over the network from CPU hosts to GPU hosts, where execution is dynamically managed to ensure maximum GPU resource utilization.
  • Multi-Tenant Efficiency: Instead of reserving GPUs in fixed partitions, WoolyAI flexibly assigns GPU memory and processing cycles to workloads based on predefined SLAs.
  • Actual resource consumption metrics: Our service tracks the actual GPU core processing and memory resources consumed during shader execution, ensuring cost-efficient execution (see the sketch after this list).
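To illustrate why metering actual consumption matters, the sketch below compares time-based billing with consumption-based billing. All rates and utilization figures are invented for the example; they are not WoolyAI prices.

```python
# Hypothetical comparison of time-based vs. consumption-based GPU billing.
# Every number here is made up for illustration.
hours = 10.0                 # wall-clock time the workload holds a GPU
avg_core_utilization = 0.25  # GPU cores busy only 25% of the time
avg_mem_utilization = 0.40   # 40% of GPU memory actually used

# Time-based: pay for the whole reservation regardless of utilization.
time_rate = 2.00             # $ per GPU-hour for a reserved instance
time_based_cost = hours * time_rate

# Consumption-based: pay only for core-hours and memory-hours consumed.
core_rate = 1.50             # $ per core-hour actually consumed
mem_rate = 0.50              # $ per memory-hour actually consumed
consumption_cost = hours * (avg_core_utilization * core_rate
                            + avg_mem_utilization * mem_rate)

print(f"time-based:        ${time_based_cost:.2f}")   # $20.00
print(f"consumption-based: ${consumption_cost:.2f}")  # $5.75
```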

Introducing WoolyAI Acceleration Service

Built on WoolyStack, our GPU Cloud service allows users to run PyTorch applications seamlessly without modifying their existing workflows. Unlike traditional cloud GPU solutions that either lock users into expensive, underutilized instances or require disruptive workflow changes, WoolyAI Acceleration Service enables:

  • Data scientists to keep working with their PyTorch applications inside CPU-backed container environments, while shaders execute on GPUs through the WoolyAI Acceleration Service (see the sketch below).
  • GPU usage billing based on the actual GPU core processing and memory resources consumed during execution, not on time used.
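As a concrete example of an unmodified workload, the snippet below is ordinary PyTorch with no WoolyAI-specific code. The one assumption, stated here rather than guaranteed above, is that the WoolyStack runtime inside the CPU-backed container exposes the remote GPU through PyTorch's usual cuda device, so the script runs as-is.

```python
# Ordinary PyTorch code, unchanged. The process runs on a CPU-only host;
# the (assumed) WoolyStack runtime intercepts the CUDA calls and executes
# the kernels on a remote GPU host.
import torch
import torch.nn as nn

device = torch.device("cuda")  # resolved by the abstraction layer, not a local GPU

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).to(device)

x = torch.randn(64, 784, device=device)  # kernel launches shipped to the GPU host
logits = model(x)
print(logits.shape)  # torch.Size([64, 10])
```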

Join the Beta Today

This is just the beginning. While we currently support PyTorch applications, we are actively expanding our capabilities to include other CUDA-based applications.

Be among the first to experience the future of Unbound GPU Execution with WoolyAI Acceleration Service.

Sign up for the beta now!