GPU Consumption Model Based on Core and Memory Usage — Not Time Used

At WoolyAI, we’ve built a technology stack that decouples kernel execution from CUDA by introducing our own abstraction layer. Within this layer, kernels are compiled into a Wooly Instruction Set. At runtime, when a kernel is launched, it is transferred in Wooly Instruction Set format over the network from a CPU host to a Wooly Server running on the GPU. The Wooly Server dynamically recompiles these kernels for the target runtime (e.g., NVIDIA runtime) and manages GPU core and memory allocation for execution.
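
To make that flow concrete, here is a minimal, purely illustrative Python sketch of the client/server hand-off. None of the names below are real WoolyAI APIs; the compile and transfer steps are stand-ins for the proprietary pipeline described above.

from dataclasses import dataclass

@dataclass
class WoolyKernel:
    name: str
    wooly_is: bytes  # kernel body lowered to the (abstract) Wooly Instruction Set

def compile_to_wooly_is(kernel_source: str, name: str) -> WoolyKernel:
    # Stand-in for the client-side step that lowers a kernel into the Wooly Instruction Set.
    return WoolyKernel(name=name, wooly_is=kernel_source.encode())

def run_on_wooly_server(kernel: WoolyKernel) -> dict:
    # Stand-in for shipping the kernel over the network to the Wooly Server, which would
    # recompile it for the target runtime (e.g., NVIDIA) and record the GPU core and
    # memory usage observed while executing it.
    recompiled = kernel.wooly_is  # placeholder for runtime-specific recompilation
    return {"kernel": kernel.name, "core_usage": 0.0, "vram_bytes": 0, "binary": recompiled}

kernel = compile_to_wooly_is("__global__ void add(float *a) { /* ... */ }", "add")
print(run_on_wooly_server(kernel))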

This architecture enables us to measure actual GPU core and memory utilization during model execution in PyTorch. Based on this real utilization data, we calculate Wooly Credits, which are applied to model runs for users of the WoolyAI Acceleration Service.

Unlike traditional GPU cloud service providers that charge based on time used, WoolyAI Acceleration Service charges based on actual GPU cores and memory consumed during model execution.
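
The exact Wooly Credit formula is not spelled out here, but the idea can be illustrated with a small sketch: assume credits accrue from sampled core utilization and VRAM occupancy rather than from wall-clock time alone. The rates and sampling interval below are made-up placeholders, not the service's actual pricing.

CORE_RATE = 1.0   # assumed credits per core-utilization-second (illustrative only)
VRAM_RATE = 0.5   # assumed credits per GB-second of VRAM held (illustrative only)

def wooly_credit_estimate(samples, interval_s=1.0):
    # samples: iterable of (core_utilization_fraction, vram_gb) taken every interval_s seconds
    core_credits = sum(util * interval_s * CORE_RATE for util, _ in samples)
    vram_credits = sum(gb * interval_s * VRAM_RATE for _, gb in samples)
    return core_credits, vram_credits

# A run that keeps the cores 30% busy for 10 seconds accrues roughly a third of the core
# credits of a fully saturated 10-second run, whereas time-based billing charges both the same.
lightly_loaded = [(0.3, 8.0)] * 10
saturated = [(0.9, 8.0)] * 10
print(wooly_credit_estimate(lightly_loaded))
print(wooly_credit_estimate(saturated))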

Environment Details

  • Wooly Client: a Linux container without a GPU, running the PyTorch scripts for all ten models
  • Models were downloaded with the Hugging Face Transformers library from their vendor-specific repositories.
  • Each model was executed 20 times with the same script to collect average Wooly Credits for both GPU core and VRAM usage (a sketch of such a harness follows this list).
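
Below is a minimal sketch of how that 20-run loop could be driven. The get_run_credits helper and the generate.py filename are hypothetical stand-ins; the real service reports credits through its own interface.

import statistics

def get_run_credits(run_id: int) -> tuple[int, int]:
    # Hypothetical helper: return (core_credits, vram_credits) for one completed run.
    return (0, 0)

core_samples, vram_samples = [], []
for run_id in range(20):
    # subprocess.run(["python", "generate.py"], check=True)  # launch the PyTorch script below
    core, vram = get_run_credits(run_id)
    core_samples.append(core)
    vram_samples.append(vram)

print("Average core Wooly Credits:", statistics.mean(core_samples))
print("Average VRAM Wooly Credits:", statistics.mean(vram_samples))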

Models Tested

  1. Llama-3.2-1B
  2. Llama-3.2-1B-Instruct
  3. Llama-3.2-3B
  4. Llama-3.2-3B-Instruct
  5. Mistral-7B-Instruct
  6. Falcon3-7B-Instruct
  7. Llama-3.1-8B-Instruct
  8. Llama-3.1-8B
  9. Dolly-v2-12B
  10. Llama-2-13B-Chat-HF

PyTorch Script

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Fix the random seed for reproducibility
torch.manual_seed(100000)

# Model name or path
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load tokenizer and model onto the GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda")

# Input text
input_text = "What is the capital of United States of America"

# Tokenize input text and move it to the model's device
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate, decode, and print the output (nine generation passes per script run)
for _ in range(9):
    outputs = model.generate(**inputs, max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(generated_text)

GPU Core and Memory Utilization Metrics

Model                   | Core Wooly Credits Used | VRAM Wooly Credits Used
------------------------|-------------------------|------------------------
Llama-3.2-1B            | 46000                   | 31072
Llama-3.2-1B-Instruct   | 94868                   | 60964
Llama-3.2-3B            | 195936                  | 84715
Llama-3.2-3B-Instruct   | 502448                  | 258125
Mistral-7B-Instruct     | 525689                  | 397181
Falcon3-7B-Instruct     | 136094                  | 26528
Llama-3.1-8B            | 283458                  | 167515
Llama-3.1-8B-Instruct   | 574872                  | 403934
Dolly-v2-12B            | 767108                  | 342877
Llama-2-13B-Chat-HF     | 313809                  | 120067
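
As a quick way to read the results, the snippet below (values transcribed from the table above) prints each model's ratio of core credits to VRAM credits.

# Core and VRAM Wooly Credits transcribed from the table above.
credits = {
    "Llama-3.2-1B": (46000, 31072),
    "Llama-3.2-1B-Instruct": (94868, 60964),
    "Llama-3.2-3B": (195936, 84715),
    "Llama-3.2-3B-Instruct": (502448, 258125),
    "Mistral-7B-Instruct": (525689, 397181),
    "Falcon3-7B-Instruct": (136094, 26528),
    "Llama-3.1-8B": (283458, 167515),
    "Llama-3.1-8B-Instruct": (574872, 403934),
    "Dolly-v2-12B": (767108, 342877),
    "Llama-2-13B-Chat-HF": (313809, 120067),
}

for model, (core, vram) in credits.items():
    print(f"{model}: core/VRAM credit ratio = {core / vram:.2f}")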

Share your thoughts with us in our Slack channel at https://woolyaicommunitychat.slack.com.