At WoolyAI, we’ve built a technology stack that decouples kernel execution from CUDA by introducing our own abstraction layer. Within this layer, kernels are compiled into a Wooly Instruction Set. At runtime, when a kernel is launched, it is transferred in Wooly Instruction Set format over the network from a CPU host to a Wooly Server running on the GPU. The Wooly Server dynamically recompiles these kernels for the target runtime (e.g., NVIDIA runtime) and manages GPU core and memory allocation for execution.
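To make that flow concrete, here is a deliberately simplified, purely illustrative sketch in Python. Every name in it (WoolyKernel, serialize_for_network, recompile_for_target, launch) is hypothetical and only mirrors the steps described above; none of it is WoolyAI's actual API or implementation.

from dataclasses import dataclass

# Hypothetical illustration of the launch path described above.
# None of these names correspond to WoolyAI's real interfaces.

@dataclass
class WoolyKernel:
    name: str
    wooly_is: bytes  # kernel body compiled into the Wooly Instruction Set

def serialize_for_network(kernel: WoolyKernel) -> bytes:
    # Client side: the kernel travels as Wooly IS bytes, not as a CUDA binary.
    return kernel.name.encode() + b"|" + kernel.wooly_is

def recompile_for_target(payload: bytes, target_runtime: str) -> str:
    # Server side: the Wooly Server recompiles the IS for the target runtime
    # (e.g., the NVIDIA runtime) before scheduling it on the GPU it manages.
    name, _, _ = payload.partition(b"|")
    return f"{name.decode()}@{target_runtime}"

def launch(kernel: WoolyKernel, target_runtime: str) -> str:
    payload = serialize_for_network(kernel)
    native_kernel = recompile_for_target(payload, target_runtime)
    return f"executing {native_kernel}"

print(launch(WoolyKernel("matmul", b"\x00\x01"), "nvidia"))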
This architecture enables us to measure actual GPU core and memory utilization during model execution in PyTorch. Based on this real utilization data, we calculate Wooly Credits, which are applied to model runs for users of the WoolyAI Acceleration Service.
Unlike traditional GPU cloud service providers that charge based on time used, the WoolyAI Acceleration Service charges based on the actual GPU cores and memory consumed during model execution.
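As a rough, self-contained illustration of the difference (the sampling scheme, capacities, and rates below are assumptions made for this example, not WoolyAI's actual credit formula or pricing), usage-based credits grow with measured core and memory utilization, while time-based billing also charges for idle seconds:

# Hypothetical illustration only: the sampling scheme, rates, and formula
# below are assumptions for this example, not WoolyAI's credit model.

core_util_samples = [0.65, 0.70, 0.10, 0.05]  # fraction of GPU cores busy per 1-second sample
vram_util_samples = [0.40, 0.42, 0.41, 0.40]  # fraction of VRAM in use per 1-second sample
sample_seconds = 1.0

CORE_CREDITS_PER_CORE_SECOND = 100  # assumed rate
VRAM_CREDITS_PER_GB_SECOND = 10     # assumed rate
TOTAL_CORES = 128                   # assumed GPU capacity
TOTAL_VRAM_GB = 24                  # assumed GPU capacity

core_credits = sum(u * TOTAL_CORES * sample_seconds * CORE_CREDITS_PER_CORE_SECOND
                   for u in core_util_samples)
vram_credits = sum(u * TOTAL_VRAM_GB * sample_seconds * VRAM_CREDITS_PER_GB_SECOND
                   for u in vram_util_samples)

# Time-based billing would charge for all four seconds at full price,
# even though the GPU was mostly idle in the last two samples.
print(f"core credits: {core_credits:.0f}, VRAM credits: {vram_credits:.0f}")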
Environment Details
- Wooly Client: a Linux container without a GPU, running the PyTorch script for all ten models
- Models were downloaded from each vendor's repository using the Hugging Face Transformers library.
- Each model was executed 20 times with the same script to collect average Wooly Credits for both GPU core and VRAM usage, as sketched below.
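A minimal sketch of that averaging step, assuming the per-run credit figures have already been collected; the values below are placeholders, not measured results.

from statistics import mean

# Placeholder per-run readings; in practice there would be 20 values per model,
# one from each run of the script. These numbers are illustrative only.
core_credit_runs = [100, 105, 98]
vram_credit_runs = [60, 62, 59]

print(f"average core Wooly Credits: {mean(core_credit_runs):.0f}")
print(f"average VRAM Wooly Credits: {mean(vram_credit_runs):.0f}")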
Models Tested
- Llama-3.2-1B
- Llama-3.2-1B-Instruct
- Llama-3.2-3B
- Llama-3.2-3B-Instruct
- Mistral-7B-Instruct
- Falcon3-7B-Instruct
- Llama-3.1-8B-Instruct
- Llama-3.1-8B
- Dolly-v2-12B
- Llama-2-13B-Chat-HF
PyTorch Script
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
torch.manual_seed(100000)
# Model name or path
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda")
# Input text
input_text = "What is the capital of United States of America"
# Tokenize input text
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
# Generate, decode, and print the output several times
for z in range(1, 10):
    outputs = model.generate(**inputs, max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(generated_text)
GPU Core and Memory Utilization Metrics
Model | Core Wooly Credits Used | VRAM Wooly Credits Used
--- | --- | ---
Llama-3.2-1B | 46000 | 31072
Llama-3.2-1B-Instruct | 94868 | 60964
Llama-3.2-3B | 195936 | 84715
Llama-3.2-3B-Instruct | 502448 | 258125
Mistral-7B-Instruct | 525689 | 397181
Falcon3-7B-Instruct | 136094 | 26528
Llama-3.1-8B | 283458 | 167515
Llama-3.1-8B-Instruct | 574872 | 403934
Dolly-v2-12B | 767108 | 342877
Llama-2-13B-Chat-HF | 313809 | 120067
Share your thoughts with us in our Slack channel at https://woolyaicommunitychat.slack.com.