Kubernetes handles orchestration very well for stateless web services — but ML workloads have different characteristics: training jobs run for hours then terminate, inference services need consistently low latency, data pipelines carry large memory footprints, and GPUs are scarce resources that must be shared carefully across teams. Default Kubernetes configurations don't handle these characteristics well — and the problems only appear once you scale.

This post covers four design layers: splitting node pools by workload type, proper GPU provisioning, a storage layer for model artifacts, and resource quotas to prevent training jobs from starving the cluster. These are architectural decisions that need to happen before the first workload is deployed — changing them later, once the cluster has running workloads, is significantly more expensive.

Node pool separation — don't let GPU nodes run web servers

The most common mistake when starting with ML on Kubernetes is putting all workloads into a single node pool. The scheduler treats every node with free capacity as a candidate, so expensive GPU nodes end up running Nginx, or a training job stalls mid-run because an inference pod landed on the same node and competes for its resources. Separating node pools by workload type is the correct architectural foundation, and this decision needs to be made when creating the cluster, not after:

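System pool: small nodes dedicated to cluster add-ons (CoreDNS, Ingress controller). On managed Kubernetes such as EKS, GKE, or AKS, etcd and the API server run on the provider's control plane rather than in a node pool; on self-managed clusters they run on dedicated control plane nodes. Taint the pool, for example with `CriticalAddonsOnly=true:NoSchedule`, so that only pods tolerating the taint schedule here, which keeps application workloads from starving system components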
CPU pool: mid-size nodes (8–32 vCPU) for data processing, quantized FastAPI inference servers, batch ETL, and monitoring. This is the pool that scales out the most — the cluster autoscaler adds and removes nodes based on demand
GPU pool: GPU nodes (A100, H100, or L4 depending on budget) with the NVIDIA device plugin. Taint with `gpu=true:NoSchedule` — only pods with a matching toleration can schedule here. Prevents GPU nodes from running non-GPU workloads and wasting expensive resources
Spot/Preemptible pool: nodes that are 60–80% cheaper than on-demand capacity, for training jobs and experiments that can tolerate interruption. Useful when running on AWS, GCP, or Azure. Requires checkpoint logic in training code so a job can resume when a node is preempted mid-run

Each pool has its own node label and taint. Training jobs select the GPU or spot pool with a node selector and declare the matching tolerations; inference servers that use GPUs declare pod anti-affinity against training pods so the two never share a node, which prevents contention that raises inference latency.
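
As a minimal sketch of how workloads opt into a pool, the manifests below show a training Job that targets the GPU pool and an inference Deployment that refuses to share a node with training pods. The `pool: gpu` node label, the `workload: training` pod label, and all names and images are assumptions for illustration; only the `gpu=true:NoSchedule` taint comes from the text above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-example              # hypothetical name
spec:
  backoffLimit: 2
  template:
    metadata:
      labels:
        workload: training         # assumed label, matched by the anti-affinity rule below
    spec:
      restartPolicy: Never
      nodeSelector:
        pool: gpu                  # assumed label applied to every node in the GPU pool
      tolerations:
        - key: gpu                 # tolerates the gpu=true:NoSchedule taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1    # one whole GPU; the request defaults to the limit
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference              # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-inference
  template:
    metadata:
      labels:
        app: gpu-inference
    spec:
      nodeSelector:
        pool: gpu
      tolerations:
        - key: gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  workload: training
              topologyKey: kubernetes.io/hostname   # never co-locate with training pods on one node
      containers:
        - name: server
          image: registry.example.com/server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

A toleration only permits scheduling onto a tainted node; it is the node selector that actually steers the pod to the pool, so the two are normally declared together.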

GPU scheduling — taints, tolerations, and quotas

GPUs are the most expensive and scarce resource in an ML cluster. Without controls, a training job with no resource limit can monopolize all GPUs on a node for hours, blocking inference services from scaling. Four mechanisms must be in place together:

NVIDIA device plugin: a DaemonSet running on every GPU node that exposes `nvidia.com/gpu` as a schedulable resource. Install via Helm with `nvidia/gpu-operator` — includes driver, device plugin, and DCGM exporter for GPU utilization monitoring. Kubernetes cannot see GPUs without this plugin
Mandatory GPU limits: every GPU pod must declare `nvidia.com/gpu` under `resources.limits`. Unlike CPU and memory, GPUs are allocated as whole devices and cannot be overcommitted, and depending on how the container runtime is configured, a pod that declares no GPU limit may still see every GPU on the node, bypassing the scheduler's accounting. Use a LimitRange or an admission policy to reject pods in the GPU namespace that don't declare GPU limits
Namespace quotas: each team or project receives its own GPU quota via `ResourceQuota`. Team A gets a maximum of 4 GPUs, Team B gets 2 — preventing one team from monopolizing the cluster during peak training runs
PriorityClass: inference services (`inference-high`) get higher priority than training jobs (`training-low`) — when the cluster is full, training jobs are preempted first rather than inference pods
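
The quota and priority mechanisms above could be declared roughly as follows. This is a sketch, not a definitive setup: the `team-a` namespace, the object names, and the numeric priority values are assumptions (only their relative order matters), while `inference-high`, `training-low`, and the 4-GPU cap come from the list.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                  # hypothetical namespace for Team A
spec:
  hard:
    requests.nvidia.com/gpu: "4"     # extended resources are quota'd via the requests. prefix
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-high
value: 100000                        # assumed value; higher than training-low
globalDefault: false
description: "Latency-sensitive inference services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-low
value: 1000                          # assumed value
preemptionPolicy: Never              # training pods wait for free capacity instead of evicting others
globalDefault: false
description: "Interruptible training jobs"
```

Pods pick up a class through `priorityClassName` in their spec, for example `priorityClassName: training-low` on a training Job's pod template.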

Storage layer for model artifacts

Model artifacts range from a few hundred MB to tens of GB, need to be accessed from multiple pods simultaneously when inference has multiple replicas, and need version control that is independent of the application code. Models should not live inside container images — every model update would trigger an image rebuild and slow down the deployment pipeline unnecessarily.

Object storage (S3, Cloudflare R2, GCS) is the right primary store for model artifacts: cheap, durable, and reachable from any pod through a sidecar or init container. Common pattern: an init container pulls the model from S3 into an `emptyDir` volume, and the main container reads from that local path. Startup latency grows with model size, but images stay small and model updates are decoupled from code
PersistentVolume with ReadWriteMany (NFS or Amazon EFS) for cases where multiple pods need to share a model file without re-downloading: one pod writes the new model to the PV, inference pods read from it. Requires careful cache invalidation on model updates: write the new version to a separate path and switch a symlink atomically to avoid serving a partial file
Model registry (MLflow, Weights & Biases, or a custom build on top of object storage and a metadata database): tracks version, metrics, and artifact location. The deployment pipeline references the model registry rather than hardcoding a path — rolling back to an older model version is just changing a pointer, not rebuilding an image
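
The init-container pattern from the first item might look like the sketch below. The bucket path, Deployment name, serving image, and the `MODEL_DIR` variable are hypothetical, and the example assumes the pod's service account already has read access to the bucket (for instance through an IAM role binding).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      volumes:
        - name: model-store
          emptyDir: {}               # node-local scratch space, recreated with each pod
      initContainers:
        - name: fetch-model
          image: amazon/aws-cli      # the image's entrypoint is the aws binary
          args:
            - s3
            - cp
            - s3://example-models/recsys/v3/   # hypothetical bucket and version prefix
            - /models/
            - --recursive
          volumeMounts:
            - name: model-store
              mountPath: /models
      containers:
        - name: server
          image: registry.example.com/model-server:latest   # placeholder image
          env:
            - name: MODEL_DIR        # hypothetical variable read by the serving code
              value: /models
          volumeMounts:
            - name: model-store
              mountPath: /models
              readOnly: true         # the server only reads what the init container fetched
```

Because the main container starts only after the init container exits successfully, the server never sees a partially downloaded model. Rolling out a new model version then means changing the source path and restarting pods, with no image rebuild.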

Do not put model artifacts in the container image — it inflates image size, slows the CI/CD pipeline, and couples the model lifecycle to the serving code lifecycle. These two things change at different rates: code updates on sprint cycles, models update on training cycles.

Four commonly overlooked configurations

Beyond the major decisions on node pools and storage, four small configurations are routinely overlooked in new ML clusters but cause significant problems as the workload count grows:

PodDisruptionBudget for inference services: without a PDB, a cluster upgrade or node drain can evict every replica of a Deployment at the same time. `minAvailable: 1` keeps at least one inference pod running through these voluntary disruptions (it does not protect against node failures), which is critical when the SLA requires high uptime
`topologySpreadConstraints`: distribute inference pods evenly across availability zones so a single AZ failure doesn't take down all capacity. Especially important on managed Kubernetes like EKS, GKE, or AKS in a multi-AZ cluster configuration
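LimitRange defaults for namespaces: without a LimitRange, pods that declare no resource requests or limits can still schedule in namespaces that have no compute quota, which makes resource usage unpredictable; in namespaces with a CPU or memory quota, such pods are rejected at admission instead. A LimitRange sets defaults and maximums for all containers in a namespace, so every pod is admitted with known values and counted against the quota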
Preemption policy for training jobs: set `preemptionPolicy: Never` on the PriorityClass used by training jobs so they only schedule when resources are already available, not by evicting other pods. This prevents training jobs from kicking inference pods off nodes and violating the SLA
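
A sketch of the first three items, under stated assumptions: the `model-server` Deployment, the `team-a` namespace, and every CPU and memory figure are illustrative placeholders rather than recommended values.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: model-server-pdb
spec:
  minAvailable: 1                    # at least one replica survives drains and upgrades
  selector:
    matchLabels:
      app: model-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server                 # hypothetical inference Deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                 # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # prefer spreading, but do not block scheduling
          labelSelector:
            matchLabels:
              app: model-server
      containers:
        - name: server
          image: registry.example.com/model-server:latest   # placeholder image
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a                  # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:                # applied when a container declares no requests
        cpu: 500m
        memory: 1Gi
      default:                       # applied when a container declares no limits
        cpu: "1"
        memory: 2Gi
      max:                           # upper bound for any single container
        cpu: "8"
        memory: 32Gi
```

Note that `minAvailable: 1` on a Deployment with a single replica blocks node drains entirely, so the PDB only makes sense once the service runs at least two replicas.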

Scaling inference and batch jobs

Inference services need to scale with traffic; training jobs do not. Two scaling mechanisms fit the different characteristics of these two workload types:

HPA (Horizontal Pod Autoscaler) based on custom metrics (requests per second or GPU utilization, exposed through a metrics adapter): inference pods scale out when traffic rises and scale in when it drops. For GPU model serving, account for cold start time: a new pod needs to load the model into GPU memory (which can take 30–90 seconds) before it can serve. Set a long scale-down stabilization window (`behavior.scaleDown.stabilizationWindowSeconds`, 5–10 minutes) to avoid rapid scale-in followed by immediate scale-out
KEDA (Kubernetes Event-Driven Autoscaling) for batch inference: scales on queue depth (RabbitMQ, Kafka, SQS), from 0 pods when the queue is empty to N pods when N batches are waiting. Scale-to-zero matters for expensive GPU nodes: once KEDA removes the last pod, the cluster autoscaler can remove the idle node, so GPUs don't sit idle between training runs or batch inference windows
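
The two mechanisms could be configured along these lines. This is a sketch that assumes a custom metrics adapter already serves a per-pod metric named `http_requests_per_second` and that KEDA is installed with a `TriggerAuthentication` called `aws-credentials`; those names, the workload names, the queue URL, and all thresholds are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server               # hypothetical inference Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed custom metric
        target:
          type: AverageValue
          averageValue: "50"         # target average per pod; illustrative threshold
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes before scaling in
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-inference
spec:
  scaleTargetRef:
    name: batch-worker               # hypothetical Deployment that consumes the queue
  minReplicaCount: 0                 # scale to zero when the queue is empty
  maxReplicaCount: 4
  cooldownPeriod: 300                # seconds of inactivity before scaling back to zero
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: aws-credentials        # assumed TriggerAuthentication
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/batch-jobs   # placeholder
        queueLength: "1"             # roughly one pod per waiting message
        awsRegion: eu-west-1
```

KEDA creates and manages its own HPA for each ScaledObject, so a given workload should be targeted by either a ScaledObject or a hand-written HPA, not both.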

Real example: cluster for a retail analytics system

In the customer behavior analysis and inventory optimization project for a 12-store chain, the Kubernetes cluster was designed with three pools: system (2 nodes), CPU (4–8 nodes with autoscaling), and GPU (0–2 nodes with L4, KEDA scale-to-zero based on queue). Training jobs run on the GPU pool using spot instances — reducing cost by 65% compared to on-demand. The product recommendation inference endpoint runs on the CPU pool after the model is quantized to INT8 — no GPU needed at serving time. A PDB with `minAvailable: 2` ensures the recommendation service never drops below two replicas during any maintenance window.

Conclusion

Kubernetes doesn't automatically solve the ML infrastructure problem; it provides the primitives to build the right solution. Separating node pools by workload type, GPU taints and namespace quotas, storage decoupled from images, and four small but important configurations are all decisions that need to happen before the first workload is deployed. In KonexForge's Server & Database layer, Kubernetes is the foundation that lets IoT, analytics, and AI workloads run reliably within their SLAs, and getting the design right upfront avoids expensive refactoring as the system scales.

Kubernetes for ML and data workloads: designing a cluster from node pools to GPU scheduling