GPU Consumption Model Based on Core and Memory Usage — Not Time Used

At WoolyAI, we’ve built a technology stack that decouples kernel execution from CUDA by introducing our own abstraction layer. Within this layer, kernels are compiled into a Wooly Instruction Set. At runtime, when a kernel is launched, it is transferred in Wooly Instruction Set format over the network from a CPU host to a Wooly Server running on the GPU. The Wooly Server dynamically recompiles these kernels for the target runtime (e.g., NVIDIA runtime) and manages GPU core and memory allocation for execution.
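
To make that flow concrete, here is a minimal, purely illustrative Python sketch of the client/server hand-off. None of the names below are real WoolyAI APIs; the compile and transfer steps are stand-ins for the proprietary pipeline described above.

from dataclasses import dataclass

@dataclass
class WoolyKernel:
    name: str
    wooly_is: bytes  # kernel body lowered to the (abstract) Wooly Instruction Set

def compile_to_wooly_is(kernel_source: str, name: str) -> WoolyKernel:
    # Stand-in for the client-side step that lowers a kernel into the Wooly Instruction Set.
    return WoolyKernel(name=name, wooly_is=kernel_source.encode())

def run_on_wooly_server(kernel: WoolyKernel) -> dict:
    # Stand-in for shipping the kernel over the network to the Wooly Server, which would
    # recompile it for the target runtime (e.g., NVIDIA) and record the GPU core and
    # memory usage observed while executing it.
    recompiled = kernel.wooly_is  # placeholder for runtime-specific recompilation
    return {"kernel": kernel.name, "core_usage": 0.0, "vram_bytes": 0, "binary": recompiled}

kernel = compile_to_wooly_is("__global__ void add(float *a) { /* ... */ }", "add")
print(run_on_wooly_server(kernel))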

This architecture enables us to measure actual GPU core and memory utilization during model execution in PyTorch. Based on this real utilization data, we calculate Wooly Credits, which are applied to model runs for users of the WoolyAI Acceleration Service.

Unlike traditional GPU cloud service providers that charge based on time used, WoolyAI Acceleration Service charges based on actual GPU cores and memory consumed during model execution.
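
The exact Wooly Credit formula is not spelled out here, but the idea can be illustrated with a small sketch: assume credits accrue from sampled core utilization and VRAM occupancy rather than from wall-clock time alone. The rates and sampling interval below are made-up placeholders, not the service's actual pricing.

CORE_RATE = 1.0   # assumed credits per core-utilization-second (illustrative only)
VRAM_RATE = 0.5   # assumed credits per GB-second of VRAM held (illustrative only)

def wooly_credit_estimate(samples, interval_s=1.0):
    # samples: iterable of (core_utilization_fraction, vram_gb) taken every interval_s seconds
    core_credits = sum(util * interval_s * CORE_RATE for util, _ in samples)
    vram_credits = sum(gb * interval_s * VRAM_RATE for _, gb in samples)
    return core_credits, vram_credits

# A run that keeps the cores 30% busy for 10 seconds accrues roughly a third of the core
# credits of a fully saturated 10-second run, whereas time-based billing charges both the same.
lightly_loaded = [(0.3, 8.0)] * 10
saturated = [(0.9, 8.0)] * 10
print(wooly_credit_estimate(lightly_loaded))
print(wooly_credit_estimate(saturated))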

Environment Details

  • Wooly Client: a Linux container without a GPU, running the PyTorch scripts for all ten models
  • Models were downloaded with the Hugging Face Transformers library from their vendor-specific repositories.
  • Each model was executed 20 times with the same script to collect average Wooly Credits for both GPU core and VRAM usage (a sketch of such a harness follows this list).
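
Below is a minimal sketch of how that 20-run loop could be driven. The get_run_credits helper and the generate.py filename are hypothetical stand-ins; the real service reports credits through its own interface.

import statistics

def get_run_credits(run_id: int) -> tuple[int, int]:
    # Hypothetical helper: return (core_credits, vram_credits) for one completed run.
    return (0, 0)

core_samples, vram_samples = [], []
for run_id in range(20):
    # subprocess.run(["python", "generate.py"], check=True)  # launch the PyTorch script below
    core, vram = get_run_credits(run_id)
    core_samples.append(core)
    vram_samples.append(vram)

print("Average core Wooly Credits:", statistics.mean(core_samples))
print("Average VRAM Wooly Credits:", statistics.mean(vram_samples))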

Models Tested

  1. Llama-3.2-1B
  2. Llama-3.2-1B-Instruct
  3. Llama-3.2-3B
  4. Llama-3.2-3B-Instruct
  5. Mistral-7B-Instruct
  6. Falcon3-7B-Instruct
  7. Llama-3.1-8B-Instruct
  8. Llama-3.1-8B
  9. Dolly-v2-12B
  10. Llama-2-13B-Chat-HF

PyTorch Script

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Fix the random seed for reproducibility
torch.manual_seed(100000)

# Model name or path
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load tokenizer and model onto the GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda")

# Input text
input_text = "What is the capital of United States of America"

# Tokenize input text and move it to the model's device
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate, decode, and print the output (nine generation passes per script run)
for _ in range(9):
    outputs = model.generate(**inputs, max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(generated_text)

GPU Core and Memory Utilization Metrics

Model                   | Core Wooly Credits Used | VRAM Wooly Credits Used
------------------------|-------------------------|------------------------
Llama-3.2-1B            | 46000                   | 31072
Llama-3.2-1B-Instruct   | 94868                   | 60964
Llama-3.2-3B            | 195936                  | 84715
Llama-3.2-3B-Instruct   | 502448                  | 258125
Mistral-7B-Instruct     | 525689                  | 397181
Falcon3-7B-Instruct     | 136094                  | 26528
Llama-3.1-8B            | 283458                  | 167515
Llama-3.1-8B-Instruct   | 574872                  | 403934
Dolly-v2-12B            | 767108                  | 342877
Llama-2-13B-Chat-HF     | 313809                  | 120067
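
As a quick way to read the results, the snippet below (values transcribed from the table above) prints each model's ratio of core credits to VRAM credits.

# Core and VRAM Wooly Credits transcribed from the table above.
credits = {
    "Llama-3.2-1B": (46000, 31072),
    "Llama-3.2-1B-Instruct": (94868, 60964),
    "Llama-3.2-3B": (195936, 84715),
    "Llama-3.2-3B-Instruct": (502448, 258125),
    "Mistral-7B-Instruct": (525689, 397181),
    "Falcon3-7B-Instruct": (136094, 26528),
    "Llama-3.1-8B": (283458, 167515),
    "Llama-3.1-8B-Instruct": (574872, 403934),
    "Dolly-v2-12B": (767108, 342877),
    "Llama-2-13B-Chat-HF": (313809, 120067),
}

for model, (core, vram) in credits.items():
    print(f"{model}: core/VRAM credit ratio = {core / vram:.2f}")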

Share your thoughts with us in our Slack channel at https://woolyaicommunitychat.slack.com.