Fractional GPU Sharing in Kubernetes Keeps Failing During LLM Fine-Tuning

I had four A100s sitting in a Kubernetes cluster and a backlog of LoRA fine-tuning jobs that each needed maybe 8GB of VRAM. Instead of running four jobs in parallel, Kubernetes scheduled one pod per GPU and left the rest of each card’s memory idle. That’s when I went down the fractional GPU sharing rabbit hole, and it broke in more ways than I expected before it actually worked.

If you’re trying to get fractional GPU sharing working in Kubernetes for LLM fine-tuning and you’re hitting OOM errors, pods stuck in “Pending,” or silent performance degradation, this is the troubleshooting guide I wish I’d had.

Why Fractional GPU Sharing Fails in Kubernetes

Kubernetes was never built with GPU memory partitioning in mind. The default NVIDIA device plugin treats a GPU as a single, indivisible resource — you either get the whole card or none of it. That mismatch is the root of almost every issue you’ll run into.

Cause 1: The default device plugin doesn’t support memory slicing

The stock nvidia-device-plugin exposes GPUs as nvidia.com/gpu: 1 resources. There’s no concept of “give me 6GB of this GPU.” When people try to fake fractional sharing using resources.limits without a sharing-aware plugin, Kubernetes still allocates the whole device, and your “fractional” pods silently fight over the same physical memory with no isolation at all.

Cause 2: Time-slicing and MPS get configured incorrectly

NVIDIA offers two real sharing mechanisms: time-slicing (via the device plugin’s config.yaml) and CUDA MPS (Multi-Process Service). Time-slicing shares compute but provides no memory isolation, so two fine-tuning jobs can still OOM each other even though Kubernetes “thinks” they’re sharing nicely. MPS needs a daemon running per-node with specific environment variables (CUDA_MPS_PIPE_DIRECTORY, CUDA_MPS_LOG_DIRECTORY) set consistently across every pod that touches that GPU. Miss one mount path and the MPS client silently falls back to default context creation, defeating the whole point.
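To make the "set consistently across every pod" part concrete, here is a minimal sketch of the kind of check I mean. The helper name, pod names, and the log directory path are made up for illustration; only the two environment variable names come from the MPS setup described above.

```python
# Hypothetical helper: compares the MPS-related env vars of pods sharing one GPU.
REQUIRED_MPS_VARS = ("CUDA_MPS_PIPE_DIRECTORY", "CUDA_MPS_LOG_DIRECTORY")


def mps_env_problems(pod_envs):
    """Return human-readable problems; an empty list means the pods agree."""
    problems = []
    for var in REQUIRED_MPS_VARS:
        values = {}
        for pod, env in pod_envs.items():
            value = env.get(var)
            if not value:
                problems.append(f"{pod}: {var} is not set")
            else:
                values.setdefault(value, []).append(pod)
        # More than one distinct value means the pods point at different daemons.
        if len(values) > 1:
            problems.append(f"{var} differs across pods: {sorted(values)}")
    return problems


# Illustrative input: pod name -> env vars as they appear in the pod spec.
example = {
    "trainer-a": {
        "CUDA_MPS_PIPE_DIRECTORY": "/tmp/nvidia-mps",
        "CUDA_MPS_LOG_DIRECTORY": "/var/log/nvidia-mps",
    },
    "trainer-b": {"CUDA_MPS_PIPE_DIRECTORY": "/tmp/nvidia-mps"},
}
print(mps_env_problems(example))
```

In practice you would feed it the env blocks pulled from your pod specs; it only catches missing or mismatched values, not a missing volume mount.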

Cause 3: Memory fragmentation from PyTorch’s caching allocator

This is the one nobody mentions in setup guides. Even when fractional sharing is configured correctly at the Kubernetes layer, PyTorch’s CUDA caching allocator doesn’t release memory back to the device while the process runs; it holds onto freed blocks for reuse within the process. If you’ve allocated a 10GB slice but your training loop briefly needs 11GB for a single batch (common with variable-length sequences during fine-tuning), the allocator can’t borrow from another tenant’s slice. The slice boundary held exactly as configured; PyTorch just didn’t fit inside it.
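A toy model helps show why "the math adds up" and the allocation still fails. This is not PyTorch's actual allocator, just a sketch that assumes a request needs one contiguous cached block; the block sizes are invented.

```python
def can_allocate(free_blocks_mb, request_mb):
    """Toy rule: a request succeeds only if ONE free block is large enough."""
    return any(block >= request_mb for block in free_blocks_mb)


# Invented numbers: plenty of free memory in total, but split into pieces.
free_blocks_mb = [512, 512, 476]
request_mb = 600

print("total free MB:", sum(free_blocks_mb))
print("request fits:", can_allocate(free_blocks_mb, request_mb))
```

Total free memory (1500 MB) is well above the 600 MB request, yet no single block can hold it. Inside a fixed slice there is nowhere else to get a fresh block from, which is the failure mode described above.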

Cause 4: MIG and time-slicing get conflated

On A100/H100 hardware, Multi-Instance GPU (MIG) creates hardware-level partitions with their own memory and compute, which is a completely different mechanism from software time-slicing. People often copy a time-slicing config onto MIG-capable cards (or vice versa) and end up with a device plugin that reports GPU resources that don’t actually exist, causing scheduling failures that look like quota issues but are actually plugin misconfiguration.

Common Scenarios Where This Breaks

A single H100 needs to run 3–4 small LoRA fine-tuning jobs concurrently, but the cluster only schedules one pod per GPU because nvidia.com/gpu is an extended resource and can only be requested in whole integers.
Pods request fractional GPU memory via a third-party plugin (like the open-source gpu-fractional or vendor plugins), but jobs intermittently crash with CUDA out of memory despite the math adding up on paper.
Multi-tenant clusters where one team’s fine-tuning job degrades another team’s job’s throughput by 5–10x with no error at all — just slow training, because compute time-slicing was configured without compute quotas.
MIG-enabled nodes show GPU resources in kubectl describe node, but the scheduler can’t bind pods to them because the MIG strategy (single vs mixed) doesn’t match how the device plugin was deployed.

Step-by-Step Fixes

Step 1: Confirm what your hardware actually supports

Run this on each GPU node before touching any YAML:

bash

nvidia-smi --query-gpu=name,compute_cap,mig.mode.current --format=csv

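If mig.mode.current returns Enabled or Disabled, you’re on MIG-capable hardware (A100, H100, A30). Enabled means MIG partitioning is already on; Disabled means the card supports MIG but it hasn’t been switched on yet. Either way, use MIG partitioning, not time-slicing, when you need hard memory isolation.
If it returns [N/A] (or another not-supported marker), MIG isn’t available on that card (V100, T4, RTX-series) and you’re limited to time-slicing or MPS, with no hardware partitioning option.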

This single check prevents half the misconfigurations I see, because the two paths require entirely different device plugin configs.
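If you have more than a couple of nodes, it can help to script the decision. Below is a rough sketch that parses the CSV from the command above; the function name and the sample output are my own illustration, and the exact GPU name strings on your nodes may differ.

```python
import csv
import io


def recommend_sharing_mode(nvidia_smi_csv):
    """Map each GPU row from the nvidia-smi CSV to a suggested sharing path."""
    reader = csv.DictReader(io.StringIO(nvidia_smi_csv.strip()), skipinitialspace=True)
    recommendations = []
    for row in reader:
        name = row["name"].strip()
        mig_mode = row["mig.mode.current"].strip()
        if mig_mode == "Enabled":
            path = "mig"
        elif mig_mode == "Disabled":
            path = "mig-capable (currently disabled)"
        else:
            # Anything else (e.g. "[N/A]") is treated as "MIG not supported".
            path = "time-slicing or mps"
        recommendations.append((name, path))
    return recommendations


# Illustrative sample, not captured from a real node.
sample = """
name, compute_cap, mig.mode.current
NVIDIA A100-SXM4-80GB, 8.0, Enabled
Tesla V100-SXM2-16GB, 7.0, [N/A]
"""
for gpu, path in recommend_sharing_mode(sample):
    print(f"{gpu}: {path}")
```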

Step 2: Deploy the correct device plugin for your hardware

For MIG-capable GPUs, use NVIDIA’s MIG-aware device plugin with an explicit strategy:

yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: "mixed"

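Use mixed if some pods need a full GPU and others need MIG slices on the same node; under mixed, each MIG profile is advertised as its own resource (for example nvidia.com/mig-1g.10gb), so pods must request that name instead of nvidia.com/gpu. Use single only if every GPU on that node is partitioned the same way; the MIG devices are then advertised as plain nvidia.com/gpu. Mismatching this is the single most common cause of “GPU resources visible but pods stuck Pending.”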
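The practical consequence of the strategy is which resource name a pod has to request. Here is a small sketch of that mapping as I understand it from the device plugin's behavior; the helper names are hypothetical and the profile string is just an example.

```python
def advertised_resource(strategy, mig_profile=None):
    """Resource name a pod should request for a given MIG strategy.

    mig_profile is a profile string such as "1g.10gb"; None means a full GPU.
    """
    if strategy in ("none", "single"):
        # Under "single", MIG devices are still exposed as plain GPUs.
        return "nvidia.com/gpu"
    if strategy == "mixed":
        if mig_profile is None:
            return "nvidia.com/gpu"
        return f"nvidia.com/mig-{mig_profile}"
    raise ValueError(f"unknown MIG strategy: {strategy!r}")


def pod_gpu_limits(strategy, mig_profile=None, count=1):
    """Build the resources.limits fragment for a pod spec as a plain dict."""
    return {"limits": {advertised_resource(strategy, mig_profile): count}}


print(pod_gpu_limits("mixed", "1g.10gb"))
print(pod_gpu_limits("single", "1g.10gb"))
```

If the name in your pod spec doesn't match what kubectl describe node lists under Allocatable, the pod stays Pending regardless of how many slices are free.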

For non-MIG GPUs, configure time-slicing instead:

yaml

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4

This tells Kubernetes to advertise 4 virtual GPU slots per physical card. Apply it, then restart the device plugin daemonset so it picks up the config:

bash

kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n kube-system

Step 3: Set realistic memory limits inside your training job, not just at the pod level

Kubernetes resource limits don’t enforce CUDA memory boundaries by themselves under time-slicing. You need to cap it inside the training process too:

python

import torch
torch.cuda.set_per_process_memory_fraction(0.25, device=0)

For Hugging Face transformers + accelerate fine-tuning jobs, set this before model loading. To reduce allocator fragmentation, also set:

bash

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

Together these address Cause 3 above. The memory fraction cap stops PyTorch from growing past its share: the process hits an out-of-memory error at the cap instead of spilling into a neighbor’s memory. max_split_size_mb keeps the caching allocator from splitting large blocks into fragments it can’t reuse.
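Here is a sketch of how I'd wire both settings into the top of a training script so they can't be forgotten. The 0.9 headroom factor and the function names are my own assumptions, not anything PyTorch prescribes; the only real PyTorch calls are torch.cuda.is_available and torch.cuda.set_per_process_memory_fraction.

```python
import os


def memory_fraction(replicas, headroom=0.9):
    """Per-process share of the card when `replicas` jobs time-slice one GPU.

    headroom < 1 leaves slack for CUDA context overhead (assumed value).
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return headroom / replicas


def alloc_conf(max_split_size_mb=128):
    """Value for PYTORCH_CUDA_ALLOC_CONF, matching the export shown above."""
    return f"max_split_size_mb:{max_split_size_mb}"


def configure_training_process(replicas, device=0):
    """Call at the very top of the script, before any CUDA work happens."""
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", alloc_conf())
    fraction = memory_fraction(replicas)
    try:
        import torch
    except ImportError:
        return fraction  # lets the sketch run on a machine without PyTorch
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction, device=device)
    return fraction


print(memory_fraction(4), alloc_conf())
```

With replicas: 4 in the time-slicing config, this caps each process slightly below a quarter of the card, so four jobs that all hit their cap at once still fit.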

Step 4: Variations by setup

Single-node dev cluster (no MIG): time-slicing with replicas: 2 or 3 is usually enough for LoRA/QLoRA fine-tuning jobs under 13B parameters.
Multi-tenant production cluster on A100/H100: use MIG with fixed profiles (e.g., 1g.10gb, 2g.20gb) so teams get guaranteed memory and compute, not best-effort sharing.
Cloud-managed Kubernetes (GKE, EKS, AKS): check whether the managed node pool already ships a modified device plugin — GKE’s GPU node pools, for example, require gpu-sharing-strategy set at node pool creation, not just in a ConfigMap, or your changes get silently overwritten on node upgrade.

Advanced Fixes and Edge Cases

If the basic steps above don’t resolve OOM crashes or scheduling failures, here’s where to dig deeper.

Advanced Path 1: Diagnose with `nvidia-smi` inside the pod, not just on the host

A pod can report healthy resource limits in kubectl describe pod while the actual CUDA context is fragmented or shared incorrectly. Exec into the pod during a training run:

bash

kubectl exec -it <pod-name> -- nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv -l 2

If memory.used climbs steadily without plateauing during a fine-tuning run that should have a stable memory footprint (e.g., LoRA with fixed batch size), you’re probably looking at fragmentation rather than a true memory leak. Cross-check from inside the training process itself. A fresh python -c started through kubectl exec is a separate process with its own empty allocator, so it tells you nothing about the running job. Add this to your training loop (or a logging callback) instead:

python

print(torch.cuda.memory_summary(device=0))

This shows reserved vs. allocated memory. A large gap between “reserved” and “allocated” confirms PyTorch’s allocator is holding fragmented blocks it can’t release back to your fractional slice.
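If you'd rather log one number per step than read the full summary, the reserved-versus-allocated gap can be reduced to a ratio. This is a minimal sketch: reserved_gap and current_gap are names I made up, while torch.cuda.memory_reserved and torch.cuda.memory_allocated are the real PyTorch calls it wraps.

```python
def reserved_gap(reserved_bytes, allocated_bytes):
    """Fraction of reserved memory that is not backing live tensors."""
    if reserved_bytes <= 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes


def current_gap(device=0):
    """Read the gap for this process; must run inside the training process."""
    import torch  # imported lazily so the helper above works without PyTorch

    return reserved_gap(
        torch.cuda.memory_reserved(device),
        torch.cuda.memory_allocated(device),
    )


# Invented numbers: 10 GiB reserved, 6 GiB actually in use by tensors.
GIB = 1024**3
print(f"{reserved_gap(10 * GIB, 6 * GIB):.0%} of reserved memory is idle cache")
```

There is no universal threshold, but a gap that keeps widening over a run with a fixed batch size is the pattern to watch for.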

Advanced Path 2: Check for compute starvation masquerading as a sharing bug

Time-slicing shares the GPU’s compute pipeline in time quanta. Under heavy contention, one tenant’s job can appear to “hang” (not crash, just stall) because the GPU switches between processes with no notion of which tenant matters more. Diagnose this with node-level monitoring (DCGM exporter):

bash

kubectl exec -it <dcgm-pod> -- dcgmi dmon -e 1002 -d 5000

Field 1002 is SM activity, sampled here every 5 seconds. dcgmi reports it per GPU (or per MIG instance), so for a per-process view run nvidia-smi pmon -s u on the node. If one pod’s SM activity is near 0% for extended windows while another’s is pegged near 100%, you have unfair time-slicing, not a memory or config bug. Note that priorityClassName only influences pod scheduling and preemption, not how the GPU divides time between running processes, so the reliable fix is moving high-priority fine-tuning jobs to MIG partitions where compute is hardware-isolated instead of software-scheduled.
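Once you have SM activity samples per pod (however you collect them), the "near 0% while another is pegged" rule is easy to automate. The thresholds and pod names below are arbitrary values I picked for illustration, not recommendations from NVIDIA.

```python
def starved_pods(sm_activity, idle_below=5.0, busy_above=90.0, min_share=0.8):
    """Return pods that sit mostly idle while some other pod is mostly busy.

    sm_activity maps pod name -> list of SM activity samples in percent.
    """

    def share(samples, predicate):
        if not samples:
            return 0.0
        return sum(1 for value in samples if predicate(value)) / len(samples)

    mostly_idle = {
        pod for pod, samples in sm_activity.items()
        if share(samples, lambda v: v < idle_below) >= min_share
    }
    mostly_busy = {
        pod for pod, samples in sm_activity.items()
        if share(samples, lambda v: v > busy_above) >= min_share
    }
    # Idle pods only count as starved if a neighbor is hogging the GPU.
    return sorted(mostly_idle) if mostly_busy else []


# Invented samples for two pods sharing one time-sliced GPU.
samples = {
    "team-a-lora": [98, 97, 99, 96, 95],
    "team-b-lora": [1, 0, 2, 0, 40],
}
print(starved_pods(samples))
```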

Edge case: MPS daemon crashes silently on node restart

If you’re using CUDA MPS instead of time-slicing, the MPS control daemon doesn’t automatically restart with the node unless it’s deployed as a DaemonSet with a restart policy. After a node reboot or driver update, pods that should be sharing via MPS instead fall back to exclusive-process mode, and you’ll see GPU utilization drop cluster-wide with no obvious error. Check daemon health with:

bash

kubectl exec <node-debug-pod> -- sh -c 'ps aux | grep [n]vidia-cuda-mps'

If it’s not running, your DaemonSet needs a livenessProbe checking for the MPS pipe directory, not just a process check.
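A probe along those lines could look like the sketch below. It assumes the probe runs in a container that mounts the pipe directory, and that a healthy daemon leaves at least one entry in it; both are assumptions on my part, so verify against your own MPS deployment. /tmp/nvidia-mps is the default pipe directory when CUDA_MPS_PIPE_DIRECTORY is unset.

```python
import os


def mps_pipe_ready(pipe_dir):
    """True if the MPS pipe directory exists and is not empty."""
    return os.path.isdir(pipe_dir) and len(os.listdir(pipe_dir)) > 0


def probe_exit_code(env=None):
    """Exit code for an exec-style livenessProbe: 0 = healthy, 1 = unhealthy."""
    env = os.environ if env is None else env
    pipe_dir = env.get("CUDA_MPS_PIPE_DIRECTORY", "/tmp/nvidia-mps")
    return 0 if mps_pipe_ready(pipe_dir) else 1


print("probe exit code:", probe_exit_code())
```

Wired into a DaemonSet, the probe command would call sys.exit(probe_exit_code()) so the kubelet restarts the container when the directory disappears.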

Tips for Stable Fractional GPU Sharing

Pin a specific MIG profile per workload type instead of mixing profiles freely — inconsistent profiles across a node pool make scheduling unpredictable.
Always set PYTORCH_CUDA_ALLOC_CONF in your base training image, not as a one-off env var per job; people forget it on new pipelines.
Use resource quotas at the namespace level (requests.nvidia.com/gpu; extended resources only accept the requests. prefix in a ResourceQuota) in addition to device plugin sharing config. Sharing config alone doesn’t stop one noisy team from requesting every slice.
Monitor with DCGM exporter + Prometheus from day one. Fractional GPU issues are almost invisible without per-process GPU metrics; kubectl top won’t show you any of this.
Re-test your sharing config after every GPU driver or Kubernetes version upgrade — device plugin compatibility breaks more often than people expect.
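For the namespace quota tip above, here is a rough sketch that generates the ResourceQuota as JSON, which kubectl accepts as well as YAML. The quota name, namespace, and slot count are placeholders; under time-slicing each "GPU" counted here is one advertised replica, not a physical card.

```python
import json


def gpu_quota(namespace, max_gpu_slots, name="gpu-quota"):
    """Build a ResourceQuota capping GPU (or GPU-slice) requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Extended resources are quota'd with the "requests." prefix only.
            "hard": {"requests.nvidia.com/gpu": str(max_gpu_slots)},
        },
    }


print(json.dumps(gpu_quota("team-a", 4), indent=2))
```

Piping the output into kubectl apply -f - creates the quota. For MIG with the mixed strategy, the key would be the profile-specific resource name instead.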

FAQ

Why does kubectl describe node show GPU resources but pods stay Pending? This usually means your MIG strategy doesn’t match your device plugin deployment mode (single vs mixed), or the device plugin daemonset crashed after a config change and never re-registered the node’s extended resources. Check daemonset pod logs first.

Can I run fractional GPU sharing without MIG on non-A100 hardware? Yes, through time-slicing or MPS, but you lose hardware memory isolation. One job’s memory spike can still crash a neighboring job even though Kubernetes shows both within their resource limits.

Why does my LoRA fine-tuning job OOM only after running for 20+ minutes, not immediately? This is almost always allocator fragmentation, not an actual memory leak. Variable-length sequence batches gradually request larger contiguous blocks that the fractional slice can’t satisfy even though total memory in use hasn’t grown much. Setting max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF is the direct fix.

Does GKE/EKS handle fractional GPU sharing automatically? No. Managed Kubernetes still requires you to explicitly configure GPU sharing at node pool creation (GKE) or through a custom device plugin DaemonSet (EKS, AKS). It is not enabled by default on any major cloud provider.

Is fractional GPU sharing even worth it for fine-tuning vs. just buying more GPUs? For small-to-medium fine-tuning jobs (LoRA, QLoRA, adapters under 20GB VRAM), sharing dramatically improves utilization. For full-parameter fine-tuning of large models, the overhead and isolation issues often aren’t worth it — dedicated GPUs are simpler and more predictable.

Editor’s Opinion

honestly this took me way longer to get stable than i expected going in. the docs make MIG and time-slicing sound interchangeable and they really, really aren’t; i learned that the hard way after a weekend of pods just sitting there pending for no clear reason. the allocator fragmentation thing was the sneaky one, it took me a day to even realize that was the issue and not some quota misconfig. if you’re on A100s or H100s just use MIG honestly, don’t bother with time-slicing unless you have to, save yourself the headache. anyway, hope this saves someone a weekend, it definitely cost me one.