GitOps and CI/CD for ML pipelines: from notebook experiment to production deployment with Argo CD
ML pipelines have a different development lifecycle from traditional software: code changes less often, while models and data change more. Applying software GitOps directly to ML without adjustment typically produces pipelines that are heavy, slow, and hard to debug. This post covers how to design CI/CD for ML with Argo CD: decouple model artifacts from code deployments, trigger pipelines when a new model finishes training, and roll back automatically when metrics degrade.
Software CI/CD pipelines usually follow a simple logic: code changes → build image → deploy. For ML pipelines this logic is only partially correct, because there is a second, often more important dimension of change: models and data. A new model can be deployed without any code change; conversely, a code change doesn't always require retraining the model. This is why applying software CI/CD directly to ML tends to produce pipelines that are either too rigid (rebuild everything on every change) or too loose (manual deployment).
GitOps for ML requires separating two independent lifecycles: the code lifecycle (written by engineers, deployed via git commits) and the model lifecycle (produced by the training pipeline, deployed when metrics pass). Argo CD is the right tool to orchestrate both — but it requires the right manifest structure and trigger strategy from the start.
Why software CI/CD isn't enough for ML
When ML teams use a pure software CI/CD pipeline, a few common patterns cause problems specific to ML workflows. Recognizing them helps avoid repeating them:
- Model artifact inside the Docker image: bundling the model into the image inflates image size (potentially several GB) and couples the model lifecycle to the code lifecycle. Every retrain requires a rebuild and repush — a CI pipeline that runs 15–30 minutes for a change that only touched model weights
- No model validation gate: the pipeline deploys when tests pass — but tests typically check code logic, not model performance on a real test set. A model with lower accuracy than its predecessor can be deployed if there's no evaluation step in the pipeline
- Rollback via git revert: for software, rollback = revert a code commit. For ML, rollback usually means returning to an older model version — these two are independent, and reverting code doesn't automatically roll back the model
- Training data not versioned: a new model trained on new data may perform worse than the old model trained on old data — but if data isn't versioned alongside the model artifact, the old results can't be reproduced for debugging
ML CI/CD needs at least two separate pipelines: a training pipeline (data → model artifact, triggered on a schedule or when new data arrives) and a deployment pipeline (model artifact → production, triggered when the new model passes evaluation). These two pipelines communicate through the model registry, not through a git branch.
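To make the registry-as-interface idea concrete, here is a minimal sketch of what the training pipeline might write and the deployment pipeline might read. `InMemoryRegistry`, `ModelRecord`, and `fingerprint_rows` are hypothetical names for illustration only; a real setup would use MLflow or a similar registry. The data fingerprint addresses the data-versioning point from the list above: it ties each model version to the exact training snapshot.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint_rows(rows: list[str]) -> str:
    """Return a SHA-256 digest identifying an exact training-data snapshot."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # delimiter so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()


@dataclass
class ModelRecord:
    name: str
    version: str
    metrics: dict[str, float]
    data_fingerprint: str
    stage: str = "staging"


@dataclass
class InMemoryRegistry:
    """Hypothetical stand-in for a real model registry such as MLflow."""

    records: dict[tuple[str, str], ModelRecord] = field(default_factory=dict)

    def register(self, record: ModelRecord) -> None:
        self.records[(record.name, record.version)] = record

    def get(self, name: str, version: str) -> ModelRecord:
        return self.records[(name, version)]


# Training pipeline side: register the model with metrics and data fingerprint.
registry = InMemoryRegistry()
snapshot = ["user_1,item_9,1", "user_2,item_4,0"]  # made-up rows
registry.register(
    ModelRecord(
        name="product-recommendation",
        version="v2.3",
        metrics={"accuracy": 0.91},
        data_fingerprint=fingerprint_rows(snapshot),
    )
)

# Deployment pipeline side: look the model up by name and version only.
record = registry.get("product-recommendation", "v2.3")
```

Neither side touches a git branch to exchange the artifact; the only shared key is the name and version pair.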
Decouple model artifacts from the deployment pipeline
The most important principle in ML GitOps: model artifacts don't go through git and don't live inside Docker images. Model artifacts live in a model registry (MLflow, or object storage with a metadata database) and are referenced by version ID in the Kubernetes manifest:
- The deployment manifest only contains a model version reference: `model_name: product-recommendation`, `model_version: v2.3` — not a file path or weight checksum. An init container in the Pod spec pulls the model from the registry into an emptyDir volume when the Pod starts
- The model registry is the source of truth for the model lifecycle: the training pipeline registers a new model with its metrics (accuracy, F1, latency benchmark). The deployment pipeline only promotes a model version from `staging` to `production` when metrics pass the threshold
- Image builds only trigger when serving code changes: the serving code (FastAPI handler, preprocessing, postprocessing) lives in git. When code changes → rebuild image → deploy. When only the model changes → update the model version in the manifest → Argo CD syncs → Pod restarts with the new model, no image rebuild needed
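The manifest shape described above could be sketched as follows, written as a Python function that returns the Deployment as a plain dict (which a CI job could dump to YAML). The fetcher image `example.com/model-fetcher:1.0`, the `/models` mount path, and the `MODEL_NAME` / `MODEL_VERSION` variable names are assumptions for illustration, not part of any standard.

```python
def build_serving_deployment(model_name: str, model_version: str, image: str) -> dict:
    """Build a Deployment whose init container fetches the model into an emptyDir."""
    # The model reference lives in the Pod template, so changing the version
    # changes the template and triggers a rolling restart on the next sync.
    model_env = [
        {"name": "MODEL_NAME", "value": model_name},
        {"name": "MODEL_VERSION", "value": model_version},
    ]
    model_mount = {"name": "model-store", "mountPath": "/models"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model_name}-serving"},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": {"app": model_name}},
            "template": {
                "metadata": {"labels": {"app": model_name}},
                "spec": {
                    "volumes": [{"name": "model-store", "emptyDir": {}}],
                    "initContainers": [
                        {
                            "name": "fetch-model",
                            # Hypothetical image that downloads the model from the registry.
                            "image": "example.com/model-fetcher:1.0",
                            "env": model_env,
                            "volumeMounts": [model_mount],
                        }
                    ],
                    "containers": [
                        {
                            "name": "server",
                            "image": image,  # rebuilt only when serving code changes
                            "env": model_env,
                            "volumeMounts": [model_mount],
                        }
                    ],
                },
            },
        },
    }


old = build_serving_deployment("product-recommendation", "v2.3", "example.com/serving:1.4.0")
new = build_serving_deployment("product-recommendation", "v2.4", "example.com/serving:1.4.0")
```

Comparing `old` and `new` shows the point of the design: the serving image is identical, and only the version value in the Pod template differs.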
Argo CD for multi-environment deployment
Argo CD manages Kubernetes manifests using the GitOps model: the desired state of the cluster is stored in a git repo, and Argo CD continuously syncs the cluster to that state. ML deployments have a few specific design points to address:
- App-of-apps pattern: a root Argo CD Application manages multiple child Applications — one per model service (recommendation, fraud-detection, churn-prediction). Adding a new model service means adding one Application to the root, not configuring Argo CD from scratch
- Environment promotion through git: staging and production are two git branches (or two directories) in the manifest repo. Promoting from staging to production = merging a PR from the staging branch to the production branch → Argo CD syncs automatically. No manual `kubectl apply` in production
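- Sync policy: for production, use manual sync (leave `syncPolicy.automated` unset) so deployments require a review before executing. For staging, use `automated` sync with `prune: true` so staging always reflects the latest commit, and add `selfHeal: true` so Argo CD reverts any out-of-band cluster changes. `selfHeal` only takes effect under automated sync
- Health checks for ML services: Argo CD judges a Deployment healthy from its rollout status (for example `status.availableReplicas`), which in turn depends on Pod readiness. For ML model serving, add a readiness probe on the `/health` endpoint that succeeds only once the model has finished loading, because a Pod in `Running` state doesn't mean the model is ready to serve. Argo CD's custom health checks are Lua scripts that inspect resource status and cannot call an HTTP endpoint themselves, so Pod readiness is the signal they see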
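As a rough sketch of the per-environment sync policy, the function below builds an Argo CD `Application` as a dict. It assumes the directory-per-environment layout, and the `services/<service>/<environment>` path, the `argocd` namespace, and the destination namespace naming are illustrative choices, not requirements. In Argo CD, manual sync is expressed by simply omitting the `automated` block.

```python
def build_application(service: str, environment: str, repo_url: str) -> dict:
    """Build an Argo CD Application for one model service in one environment."""
    if environment not in ("staging", "production"):
        raise ValueError(f"unknown environment: {environment}")

    app = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{service}-{environment}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": "HEAD",
                # Assumed repo layout: one directory per service and environment.
                "path": f"services/{service}/{environment}",
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": f"{service}-{environment}",
            },
        },
    }

    if environment == "staging":
        # Staging follows git automatically and reverts out-of-band changes.
        app["spec"]["syncPolicy"] = {"automated": {"prune": True, "selfHeal": True}}
    # Production gets no automated block, so a person must trigger each sync.
    return app


repo = "https://example.com/ml-manifests.git"  # hypothetical manifest repo
staging_app = build_application("fraud-detection", "staging", repo)
production_app = build_application("fraud-detection", "production", repo)
```

In an app-of-apps setup, the root Application would point at a directory containing one such manifest per model service.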
Training pipeline and trigger strategy
The training pipeline runs independently from the deployment pipeline and is not triggered by a git commit. Two common trigger strategies:
- Scheduled trigger: retrain the model on a fixed schedule (weekly, monthly) using the latest data from the data warehouse. Appropriate for models less sensitive to data drift where training is expensive. Use an Airflow DAG or Argo Workflows to orchestrate the training steps
- Event-based trigger: the training pipeline fires when data drift exceeds a threshold (from the monitoring pipeline — covered in the drift monitoring post), or when enough new data has accumulated. Better suited to models that need to adapt quickly to real-world changes
Whichever trigger is used, the steps after training are the same: the pipeline automatically runs evaluation against a held-out test set, then registers the model and its metrics in the registry. If metrics pass the threshold, it automatically opens a PR to the staging manifest branch to trigger deployment. If metrics fail, it alerts and stops, so a model worse than the current one is never deployed.
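The evaluation gate can be sketched as a small pure function. This is one possible shape, assuming every metric is higher-is-better (accuracy, F1); a lower-is-better metric such as latency would need the comparison inverted. `evaluate_candidate`, `GateDecision`, and the `max_regression` tolerance are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    promote: bool
    reasons: list[str]


def evaluate_candidate(
    candidate: dict[str, float],
    production: dict[str, float],
    minimums: dict[str, float],
    max_regression: float = 0.0,
) -> GateDecision:
    """Decide whether a freshly trained model may be proposed for staging.

    Assumes all metrics are higher-is-better.
    """
    reasons: list[str] = []

    # Absolute floor: the candidate must clear every configured minimum.
    for metric, floor in minimums.items():
        value = candidate.get(metric)
        if value is None:
            reasons.append(f"{metric}: missing from candidate metrics")
        elif value < floor:
            reasons.append(f"{metric}: {value:.3f} is below the minimum {floor:.3f}")

    # Relative check: the candidate must not regress against production.
    for metric, current in production.items():
        value = candidate.get(metric)
        if value is not None and value < current - max_regression:
            reasons.append(f"{metric}: {value:.3f} regresses from production {current:.3f}")

    return GateDecision(promote=not reasons, reasons=reasons)


production_metrics = {"accuracy": 0.90, "f1": 0.87}
floors = {"accuracy": 0.85}

better = evaluate_candidate({"accuracy": 0.92, "f1": 0.88}, production_metrics, floors)
worse = evaluate_candidate({"accuracy": 0.89, "f1": 0.88}, production_metrics, floors)
```

Here `better.promote` is true and the pipeline would go on to open the staging PR, while `worse.promote` is false and `worse.reasons` carries the text for the alert.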
Rollback and progressive delivery
Rollback in ML GitOps needs to handle two independent dimensions: code rollback (revert a git commit, Argo CD syncs) and model rollback (update the manifest to reference an older model version in the registry). Combine with progressive delivery to reduce risk:
- Canary deployment: route 5–10% of traffic to the new model version for the first few hours. Compare business metrics (CTR, conversion, error rate) between the canary and stable versions. If canary metrics are equal or better → promote fully. If worse → automatically roll back to the stable version. Argo Rollouts and Flagger both integrate with Argo CD to automate this logic
- Automated rollback via metric analysis: define an `AnalysisTemplate` in Argo Rollouts with thresholds (error rate must not exceed 2%, P99 latency must stay under 500ms for the first 15 minutes after deployment). If violated, Argo Rollouts automatically rolls back without requiring human intervention
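The thresholds above could map onto an `AnalysisTemplate` roughly like this, again built as a Python dict for dumping to YAML. The Prometheus metric names (`http_requests_total`, `http_request_duration_seconds_bucket`) and the `service` and `status` labels are assumptions about how the serving layer is instrumented; adjust the queries to whatever the service actually exports.

```python
def build_analysis_template(
    service: str,
    prometheus_address: str,
    max_error_rate: float = 0.02,
    max_p99_seconds: float = 0.5,
    window_minutes: int = 15,
) -> dict:
    """Build an Argo Rollouts AnalysisTemplate with error-rate and P99 latency checks."""

    def metric(name: str, query: str, limit: float) -> dict:
        return {
            "name": name,
            "interval": "1m",
            "count": window_minutes,  # one measurement per minute over the window
            "failureLimit": 0,  # a single failed measurement fails the analysis
            "successCondition": f"result[0] <= {limit}",
            "provider": {"prometheus": {"address": prometheus_address, "query": query}},
        }

    # Hypothetical PromQL; metric and label names depend on your instrumentation.
    all_requests = f'http_requests_total{{service="{service}"}}'
    failed_requests = f'http_requests_total{{service="{service}",status=~"5.."}}'
    latency_buckets = f'http_request_duration_seconds_bucket{{service="{service}"}}'

    error_rate_query = f"sum(rate({failed_requests}[5m])) / sum(rate({all_requests}[5m]))"
    p99_query = f"histogram_quantile(0.99, sum(rate({latency_buckets}[5m])) by (le))"

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "AnalysisTemplate",
        "metadata": {"name": f"{service}-post-deploy"},
        "spec": {
            "metrics": [
                metric("error-rate", error_rate_query, max_error_rate),
                metric("p99-latency", p99_query, max_p99_seconds),
            ]
        },
    }


template = build_analysis_template(
    "product-recommendation", "http://prometheus.example.com:9090"
)
```

A Rollout's canary steps would reference this template by name; when the analysis fails, Argo Rollouts aborts the rollout and traffic returns to the stable version.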
Conclusion
GitOps for ML is not software CI/CD with a training step added — it's a redesign of two independent lifecycles (code and model) with a clear interface between them through the model registry. Argo CD handles deployment orchestration and rollback well, but only when the training pipeline and evaluation gate are working correctly to close the loop. In KonexForge's Development layer, GitOps is the default operating principle for every Pilot Build — not a feature to add later.