Software CI/CD pipelines usually follow a simple logic: code changes → build image → deploy. For ML pipelines, this logic is only partially correct — because there is a second, often more important dimension of change: models and data. A new model can be deployed without any code change; conversely, a code change doesn't always require retraining the model. This is why applying software CI/CD directly to ML tends to produce pipelines that are either too rigid (rebuild everything) or too loose (manual deployment).

GitOps for ML requires separating two independent lifecycles: the code lifecycle (written by engineers, deployed via git commits) and the model lifecycle (produced by the training pipeline, deployed when metrics pass). Argo CD is well suited to deploying changes from both, but it requires the right manifest structure and trigger strategy from the start.

Why software CI/CD isn't enough for ML

When ML teams use a pure software CI/CD pipeline, a few common patterns cause problems specific to ML workflows. Recognizing them helps avoid repeating them:

Model artifact inside the Docker image: bundling the model into the image inflates image size (potentially several GB) and couples the model lifecycle to the code lifecycle. Every retrain requires a rebuild and repush — a CI pipeline that runs 15–30 minutes for a change that only touched model weights
No model validation gate: the pipeline deploys when tests pass — but tests typically check code logic, not model performance on a real test set. A model with lower accuracy than its predecessor can be deployed if there's no evaluation step in the pipeline
Rollback via git revert: for software, rollback = revert a code commit. For ML, rollback usually means returning to an older model version — these two are independent, and reverting code doesn't automatically roll back the model
Training data not versioned: a new model trained on new data may perform worse than the old model trained on old data — but if data isn't versioned alongside the model artifact, the old results can't be reproduced for debugging

ML CI/CD needs at least two separate pipelines: a training pipeline (data → model artifact, triggered on a schedule or when new data arrives) and a deployment pipeline (model artifact → production, triggered when the new model passes evaluation). These two pipelines communicate through the model registry, not through a git branch.
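To make the registry-as-interface idea concrete, here is a minimal in-memory sketch. The `ModelRegistry` class, its method names, and the stage labels are hypothetical stand-ins, not MLflow's API; a real registry would persist this state and track where each artifact is stored.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    name: str
    version: str
    metrics: dict
    stage: str = "none"  # one of: none, staging, production, archived


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry; a real one persists this state."""
    _versions: dict = field(default_factory=dict)

    def register(self, name, version, metrics):
        # Called by the training pipeline once evaluation has produced metrics.
        mv = ModelVersion(name, version, dict(metrics))
        self._versions[(name, version)] = mv
        return mv

    def promote(self, name, version, stage):
        # Called by the deployment pipeline. One version per stage is assumed,
        # so the previous holder of that stage is archived first.
        for mv in self._versions.values():
            if mv.name == name and mv.stage == stage:
                mv.stage = "archived"
        self._versions[(name, version)].stage = stage

    def current(self, name, stage):
        # Returns the version currently holding `stage`, or None.
        for mv in self._versions.values():
            if mv.name == name and mv.stage == stage:
                return mv
        return None


registry = ModelRegistry()
registry.register("product-recommendation", "v2.2", {"f1": 0.81})
registry.promote("product-recommendation", "v2.2", "production")
registry.register("product-recommendation", "v2.3", {"f1": 0.84})
registry.promote("product-recommendation", "v2.3", "staging")
```

The training pipeline only ever calls `register`, and the deployment pipeline only calls `promote` and `current`, so neither needs to know how the other is triggered.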

Decouple model artifacts from the deployment pipeline

The most important principle in ML GitOps: model artifacts don't go through git and don't live inside Docker images. Model artifacts live in a model registry (MLflow, or object storage with a metadata database) and are referenced by version ID in the Kubernetes manifest:

The deployment manifest only contains a model version reference: `model_name: product-recommendation`, `model_version: v2.3` — not a file path or weight checksum. An init container in the Pod spec pulls the model from the registry into an emptyDir volume when the Pod starts
The model registry is the source of truth for the model lifecycle: the training pipeline registers a new model with its metrics (accuracy, F1, latency benchmark). The deployment pipeline only promotes a model version from `staging` to `production` when metrics pass the threshold
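Image builds only trigger when serving code changes: the serving code (FastAPI handler, preprocessing, postprocessing) lives in git. When code changes → rebuild image → deploy. When only the model changes → update the model version in the manifest → Argo CD syncs → Pods are replaced and start with the new model, no image rebuild needed. This works when the version sits in the Pod template (for example as an environment variable on the init container), because Kubernetes only rolls Pods when the template changes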
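The manifest shape described above can be sketched as a small generator. This is an illustration under assumptions: the image names, the `/models` mount path, the `MODEL_NAME` and `MODEL_VERSION` variable names, and the `fetch-model` container are all hypothetical, and the fetcher image is assumed to know how to talk to the registry.

```python
import json


def build_serving_deployment(model_name, model_version, serving_image):
    """Return a Deployment manifest whose init container fetches the model."""
    model_env = [
        {"name": "MODEL_NAME", "value": model_name},
        {"name": "MODEL_VERSION", "value": model_version},
    ]
    model_mount = {"name": "model-store", "mountPath": "/models"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model_name}-serving"},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": {"app": model_name}},
            "template": {
                "metadata": {"labels": {"app": model_name}},
                "spec": {
                    "initContainers": [{
                        "name": "fetch-model",
                        # Hypothetical image wrapping the registry client.
                        "image": "registry.example.com/model-fetcher:1.0",
                        "env": model_env,
                        "volumeMounts": [model_mount],
                    }],
                    "containers": [{
                        "name": "serving",
                        "image": serving_image,
                        "volumeMounts": [model_mount],
                    }],
                    # emptyDir is shared by both containers and lives as long as the Pod.
                    "volumes": [{"name": "model-store", "emptyDir": {}}],
                },
            },
        },
    }


manifest = build_serving_deployment(
    "product-recommendation", "v2.3",
    "registry.example.com/recsys-serving:abc123",
)
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests as JSON as well as YAML, so the output can be committed to the manifest repo as-is. Because the version is an environment variable inside the Pod template, changing only `model_version` changes the template and triggers a rolling update while the serving image stays the same.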

Argo CD for multi-environment deployment

Argo CD manages Kubernetes manifests using the GitOps model: the desired state of the cluster is stored in a git repo, and Argo CD continuously syncs the cluster to that state. ML deployments have a few specific design points to address:

App-of-apps pattern: a root Argo CD Application manages multiple child Applications — one per model service (recommendation, fraud-detection, churn-prediction). Adding a new model service means adding one Application to the root, not configuring Argo CD from scratch
Environment promotion through git: staging and production are two git branches (or two directories) in the manifest repo. Promoting from staging to production = merging a PR from the staging branch to the production branch → Argo CD detects the change and applies it according to that environment's sync policy. No manual `kubectl apply` in production
Sync policy: for production, use manual sync (leave `syncPolicy.automated` unset) so deployments require a review before executing. For staging, use `automated` sync with `prune: true` so staging always reflects the latest commit, and set `selfHeal: true` so Argo CD reverts any out-of-band cluster changes. `selfHeal` only takes effect under automated sync
Health checks for ML services: Argo CD treats a Deployment as healthy once its rollout has completed and the updated replicas are available. For ML model serving, make availability depend on the model: add a readiness probe that calls the `/health` endpoint, and have that endpoint succeed only after the model has finished loading. Argo CD's own custom health checks are Lua scripts that inspect resource status rather than making HTTP calls, so the probe is what carries this signal. A Pod in `Running` state doesn't mean the model is ready to serve
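As a rough illustration of the sync policies above, the following sketch generates one Argo CD `Application` per model service and environment, which is the set of children an app-of-apps root would manage. The repo URL, the `services/<service>/<env>` directory layout, the `ml-<env>` namespaces, and the `default` project are assumptions for the example; manual sync is expressed by leaving out `syncPolicy.automated`.

```python
def build_application(service, env, repo_url):
    """Return an Argo CD Application manifest for one service and environment."""
    app = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{service}-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": "HEAD",
                # Assumed layout: one directory per service and environment.
                "path": f"services/{service}/{env}",
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": f"ml-{env}",
            },
        },
    }
    if env == "staging":
        # Staging follows git automatically and reverts out-of-band changes.
        app["spec"]["syncPolicy"] = {
            "automated": {"prune": True, "selfHeal": True}
        }
    # Production gets no automated policy, so a person must trigger each sync.
    return app


services = ["recommendation", "fraud-detection", "churn-prediction"]
repo = "https://git.example.com/ml/manifests.git"  # hypothetical repo
children = [
    build_application(service, env, repo)
    for service in services
    for env in ("staging", "production")
]
```

Adding a fourth model service then means adding one name to `services` and committing the regenerated manifests.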

Training pipeline and trigger strategy

The training pipeline runs independently from the deployment pipeline and is not triggered by a git commit. Two common trigger strategies:

Scheduled trigger: retrain the model on a fixed schedule (weekly, monthly) using the latest data from the data warehouse. Appropriate for models less sensitive to data drift where training is expensive. Use an Airflow DAG or Argo Workflows to orchestrate the training steps
Event-based trigger: the training pipeline fires when data drift exceeds a threshold (from the monitoring pipeline — covered in the drift monitoring post), or when enough new data has accumulated. Better suited to models that need to adapt quickly to real-world changes
After training completes: the pipeline automatically runs evaluation against a held-out test set, then registers the model and metrics in the registry. If metrics pass the threshold — automatically open a PR to the staging manifest branch to trigger deployment. If metrics fail — alert and stop the pipeline, don't deploy a worse model than the current one
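The evaluation gate in the last step might look like the sketch below. The function name, the metric names, and the threshold values are illustrative assumptions, and every metric is assumed to be higher-is-better; a lower-is-better metric such as latency would need its comparison inverted.

```python
def evaluate_gate(candidate, baseline, min_thresholds, tolerance=0.0):
    """Return (passed, reasons). All metrics are assumed higher-is-better."""
    reasons = []
    # Absolute floor: the candidate must clear every configured minimum.
    for metric, minimum in min_thresholds.items():
        value = candidate.get(metric)
        if value is None:
            reasons.append(f"{metric}: missing from candidate metrics")
        elif value < minimum:
            reasons.append(f"{metric}: {value} is below the minimum {minimum}")
    # Relative check: the candidate must not regress from the current model.
    for metric, base_value in baseline.items():
        value = candidate.get(metric)
        if value is not None and value < base_value - tolerance:
            reasons.append(f"{metric}: {value} regresses from baseline {base_value}")
    return (not reasons, reasons)


baseline = {"accuracy": 0.90, "f1": 0.81}      # current production model
thresholds = {"accuracy": 0.85, "f1": 0.80}    # illustrative floors
passed, reasons = evaluate_gate({"accuracy": 0.91, "f1": 0.84}, baseline, thresholds)
```

When `passed` is true the pipeline would go on to open the PR against the staging manifests; otherwise `reasons` gives the alert something specific to report.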

Rollback and progressive delivery

Rollback in ML GitOps needs to handle two independent dimensions: code rollback (revert a git commit, Argo CD syncs) and model rollback (update the manifest to reference an older model version in the registry). Combine with progressive delivery to reduce risk:

Canary deployment: route 5–10% of traffic to the new model version for the first few hours. Compare business metrics (CTR, conversion, error rate) between the canary and stable versions. If canary metrics are better or equal → promote fully. If worse → automatically roll back to the stable version. Argo Rollouts and Flagger both work alongside Argo CD to automate this logic
Automated rollback via metric analysis: define an `AnalysisTemplate` in Argo Rollouts with thresholds (error rate must not exceed 2%, P99 latency must stay under 500ms for the first 15 minutes after deployment). If violated, Argo Rollouts automatically rolls back without requiring human intervention
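The decision that such an analysis encodes can be sketched in a few lines. This is not Argo Rollouts' API, only the threshold logic from the example above (error rate at most 2%, P99 latency under 500 ms) applied to one analysis window; the function names and the choice to treat an empty window as a failure are assumptions.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, giving a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def analyze_canary(requests, errors, latencies_ms,
                   max_error_rate=0.02, max_p99_ms=500.0):
    """Return 'promote' or 'rollback' for one analysis window."""
    if requests == 0 or not latencies_ms:
        # Conservative assumption: a window with no data counts as a failure.
        return "rollback"
    if errors / requests > max_error_rate:
        return "rollback"  # error rate must not exceed the limit
    if p99(latencies_ms) >= max_p99_ms:
        return "rollback"  # P99 must stay strictly under the limit
    return "promote"


decision = analyze_canary(requests=1000, errors=10, latencies_ms=[120] * 1000)
```

In a real rollout the request counts and latencies would come from the metrics backend, and the check would repeat at intervals for the whole 15-minute window rather than run once.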

Conclusion

GitOps for ML is not software CI/CD with a training step added — it's a redesign of two independent lifecycles (code and model) with a clear interface between them through the model registry. Argo CD handles deployment orchestration and rollback well, but only when the training pipeline and evaluation gate are working correctly to close the loop. In KonexForge's Development layer, GitOps is the default operating principle for every Pilot Build — not a feature to add later.

GitOps and CI/CD for ML pipelines: from notebook experiment to production deployment with Argo CD