FIELD GUIDE · 2024 EDITION

Understanding MLOps

The discipline of deploying, monitoring, and maintaining machine learning models in production — reliably, repeatably, at scale.

87% of ML projects never reach production (a widely cited industry estimate)
10× faster deployment with mature MLOps
3 core pillars: People, Process, Technology
Data → Train → Eval → Deploy → Monitor

What is MLOps?

MLOps (Machine Learning Operations) is a set of practices that combines Machine Learning, DevOps, and Data Engineering to deploy and maintain ML systems in production reliably and efficiently.

It bridges the gap between experimental model development and robust production systems — addressing the unique challenges that arise when software systems learn from data and change behavior over time.

Unlike traditional software, ML systems can silently degrade as the world changes around them. MLOps provides the infrastructure, tooling, and culture to detect and respond to this.

⚙️
DevOps
CI/CD, automation, infrastructure as code, reliability engineering
🧠
Machine Learning
Model development, experimentation, evaluation, versioning
🗄️
Data Engineering
Pipelines, feature stores, data quality, lineage tracking

The MLOps Lifecycle

01 Problem Definition
02 Data Engineering
03 Model Training
04 Evaluation & Validation
05 Deployment & Serving
06 Monitoring & Alerting
07 Feedback & Retraining

Problem Definition

The most critical — and most overlooked — stage. Before writing a single line of code, teams must rigorously define what success looks like.

Key Questions
  • Is ML the right solution?
  • What is the business metric we're optimizing?
  • What data do we have / need?
  • What's the acceptable latency and cost?
  • How will the model be used in production?
Deliverables
  • ML problem framing document
  • Success metrics (offline + online)
  • Data availability assessment
  • Feasibility study
  • Baseline performance target
Common Pitfalls
  • Optimizing proxy metrics, not business goals
  • Skipping feasibility analysis
  • No defined "good enough" threshold
  • Ignoring data collection costs
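
One way to make the "good enough" threshold explicit is to record it next to the baseline before any training starts. The sketch below is illustrative only: `ProblemSpec`, its field names, and every number in it are hypothetical, not part of any standard template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical problem-framing record; all field names are illustrative."""
    business_metric: str       # what the business actually cares about
    offline_metric: str        # proxy metric measured before launch
    baseline_score: float      # offline score of the current heuristic or non-ML baseline
    good_enough_score: float   # minimum offline score that justifies shipping
    max_latency_ms: float      # serving latency budget

    def is_shippable(self, offline_score: float, latency_ms: float) -> bool:
        """True only if the candidate beats the baseline, clears the agreed
        threshold, and fits the latency budget."""
        return (
            offline_score > self.baseline_score
            and offline_score >= self.good_enough_score
            and latency_ms <= self.max_latency_ms
        )


# Made-up example values, written down before the first experiment
spec = ProblemSpec(
    business_metric="chargeback losses per month",
    offline_metric="recall at 1% false-positive rate",
    baseline_score=0.62,
    good_enough_score=0.70,
    max_latency_ms=50.0,
)
```

Agreeing on these numbers up front gives the later evaluation gate something concrete to check against.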

CI/CD for Machine Learning

ML CI/CD extends traditional software pipelines with data validation, model testing, and automated deployment gates.

📝
Code Commit
Push model code, config, or data changes to version control
🧪
Unit Tests
Test data transforms, feature logic, model components
Data Validation
Schema checks, distribution tests, quality gates
🏋️
Model Training
Triggered training run on validated data
📊
Evaluation Gate
Performance vs. baseline, bias checks, latency profiling
🚀
Deploy to Staging
Shadow mode or canary on staging environment
🌐
Production
Gradual rollout with automated rollback triggers
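
The Data Validation stage can be as simple as a function that returns a list of violations and fails the pipeline when the list is non-empty. This is a minimal sketch covering schema and null checks only (no distribution tests); the `SCHEMA` columns and the `validate_rows` helper are hypothetical.

```python
from typing import Any

# Hypothetical schema: column name -> (expected type, whether None is allowed)
SCHEMA: dict[str, tuple[type, bool]] = {
    "user_id": (int, False),
    "amount": (float, False),
    "country": (str, True),
}


def validate_rows(rows: list[dict[str, Any]],
                  schema: dict[str, tuple[type, bool]]) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    errors: list[str] = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        for column, (expected_type, nullable) in schema.items():
            if column not in row:
                continue  # already reported as missing
            value = row[column]
            if value is None:
                if not nullable:
                    errors.append(f"row {i}: {column} is null")
            elif not isinstance(value, expected_type):
                errors.append(
                    f"row {i}: {column} expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return errors
```

A CI step would typically exit with a non-zero status when `validate_rows` returns anything, which stops training from running on a bad batch.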
YAML · GitHub Actions ML Pipeline
name: ML Training Pipeline
on:
  push:
    paths: ['src/**', 'data/**', 'configs/**']

jobs:
  validate-and-train:
    runs-on: ubuntu-latest
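    steps:
      # Check out the repository so the scripts below are available
      - uses: actions/checkout@v4

      - name: Data Validation
        run: python validate_data.py --config configs/schema.yaml

      - name: Run Training
        run: python train.py --experiment ${{ github.sha }}

      # The id lets later steps read this step's outputs; evaluate.py is
      # assumed to append "is_better=true" or "is_better=false" to $GITHUB_OUTPUT
      - name: Evaluate vs Champion
        id: evaluate
        run: python evaluate.py --challenger ${{ github.sha }}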

      - name: Deploy if Better
        if: steps.evaluate.outputs.is_better == 'true'
        run: python deploy.py --strategy canary --traffic 10
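
The `Deploy if Better` condition reads `steps.evaluate.outputs.is_better`, so the evaluation script has to publish that value, and the evaluation step needs `id: evaluate` for the expression to resolve. The source does not show `evaluate.py`; the sketch below is one assumed shape for it, with hypothetical helper names and a hypothetical `min_gain` margin. It relies on GitHub Actions exposing an output file through the `GITHUB_OUTPUT` environment variable.

```python
import os


def is_better(challenger: float, champion: float, min_gain: float = 0.01) -> bool:
    """Promote only if the challenger beats the champion by at least min_gain."""
    return challenger - champion >= min_gain


def write_step_output(name: str, value: str) -> None:
    """Append name=value to the file GitHub Actions exposes via $GITHUB_OUTPUT."""
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path is None:
        # Not running inside GitHub Actions: print instead of failing
        print(f"{name}={value}")
        return
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


def main(challenger_score: float, champion_score: float) -> None:
    decision = is_better(challenger_score, champion_score)
    # Step outputs are strings, so emit lowercase 'true' / 'false'
    # to match the workflow's == 'true' comparison
    write_step_output("is_better", str(decision).lower())
```

Requiring a minimum gain rather than any improvement helps avoid promoting a challenger on evaluation noise alone.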

Monitoring & Drift Detection

The three types of drift that silently kill production models.

Data Drift
aka Covariate Shift
The statistical distribution of input features changes over time, even if the underlying relationship between features and labels remains the same.
Example
A fraud detection model trained on pre-pandemic spending patterns sees completely different transaction distributions post-pandemic.
Detection Methods
KS Test · PSI Score · Chi-Squared · MMD
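
Two of the data drift checks listed above, PSI and the KS statistic, fit in a few lines of standard-library Python. This is a sketch under stated assumptions: equal-width bins over the reference range, a small `eps` to avoid empty bins, and no p-value for the KS statistic. Production code would more likely use a statistics library.

```python
import math
from bisect import bisect_right


def psi(reference: list[float], current: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins

    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the reference range into the edge bins
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(sample), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap
```

A common rule of thumb treats a PSI above roughly 0.25 as a significant shift, but thresholds should be tuned per feature.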
Concept Drift
i.e. P(y | x) changes (distinct from label shift, where P(y) changes)
The relationship between input features and the target variable changes. The model's learned mapping becomes incorrect even if inputs look similar.
Example
A sentiment model trained on 2020 tweets misclassifies 2024 slang because language semantics have evolved.
Detection Methods
DDM · ADWIN · Page-Hinkley · EDDM
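
As an illustration of the detectors listed above, here is a minimal Page-Hinkley sketch that watches a stream of per-prediction errors (0 for correct, 1 for wrong) for a sustained increase in the mean. It assumes ground-truth labels arrive eventually; the default `delta`, `threshold`, and `min_samples` values are illustrative, not recommendations.

```python
from typing import Optional


class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0,
                 min_samples: int = 30) -> None:
        self.delta = delta            # tolerated drift magnitude
        self.threshold = threshold    # alarm level for the test statistic
        self.min_samples = min_samples
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0                # cumulative deviation from the running mean
        self.min_cum = 0.0            # smallest cumulative deviation seen so far

    def update(self, value: float) -> bool:
        """Feed one observation; return True when drift is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum += value - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.n >= self.min_samples and (self.cum - self.min_cum) > self.threshold


def first_alarm(errors: list[float], detector: PageHinkley) -> Optional[int]:
    """Index of the first observation that triggers the detector, or None."""
    for i, err in enumerate(errors):
        if detector.update(err):
            return i
    return None
```

On a stream that is error-free for 200 predictions and then always wrong, `first_alarm` fires a handful of samples after the change rather than at it, which is the usual trade-off between detection delay and false alarms.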
Prediction Drift
aka Output Shift
The distribution of model outputs changes significantly. Often the first observable signal of upstream data or concept drift.
Example
A recommendation model starts suggesting the same 10 items to everyone — prediction distribution collapses.
Detection Methods
Output Histogram · Entropy Monitor · Confidence Tracking
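
The collapse described in the example shows up as a drop in the entropy of the prediction distribution. A minimal entropy monitor might look like the sketch below; the `has_collapsed` helper and its `min_ratio` default are hypothetical.

```python
import math
from collections import Counter
from collections.abc import Hashable, Iterable


def prediction_entropy(predictions: Iterable[Hashable]) -> float:
    """Shannon entropy in bits of the empirical distribution of predicted labels or items."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions supplied")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def has_collapsed(baseline_entropy: float, current_entropy: float,
                  min_ratio: float = 0.5) -> bool:
    """Flag a window whose entropy fell below min_ratio of the baseline window."""
    return current_entropy < min_ratio * baseline_entropy
```

Comparing a recent window against a baseline window needs no labels, which is why output monitoring is often the earliest signal available.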

MLOps Maturity Model

A four-level framework for assessing and evolving your MLOps practice. Levels 0 to 2 follow Google's MLOps maturity levels; Level 3 extends them with full automation.

L0
Manual Process
All steps are manual. Data scientists train models in notebooks, export them, and hand off to engineers for deployment. No automation, no monitoring.
Manual training · No versioning · No monitoring · Rare releases
L1
ML Pipeline Automation
Training pipelines are automated and triggered on new data. Experiment tracking is in place. Models are versioned. Deployment is still semi-manual.
Automated training · Experiment tracking · Model registry · Manual deployment
L2
CI/CD Pipeline Automation
Full CI/CD for ML pipelines. Automated testing, validation gates, and deployment. Models are continuously trained and deployed with minimal human intervention.
Full CI/CD · Auto deployment · Automated testing · Basic monitoring
L3
Full MLOps Automation
End-to-end automation including drift detection, automated retraining triggers, champion/challenger evaluation, and self-healing pipelines. Humans set policy; systems execute.
Drift-triggered retraining · Auto champion/challenger · Full observability · Self-healing pipelines

The MLOps Tools Landscape

The tools that power modern ML systems, listed with their category.

DVC
Data Versioning
Git for data and models. Version datasets, track experiments, and create reproducible ML pipelines.
Great Expectations
Data Validation
Define, document, and validate data quality expectations as code.
Feast
Feature Store
Open-source feature store for operational ML. Consistent features across training and serving.
lakeFS
Data Lake Versioning
Git-like branching and versioning for data lakes at petabyte scale.
MLflow
Experiment Tracking
Open-source platform for the ML lifecycle: tracking, packaging, and deploying models.
Weights & Biases
Experiment Tracking
Developer-first MLOps platform with rich visualizations, sweeps, and model registry.
Ray
Distributed Training
Unified framework for scaling ML workloads from laptop to cluster.
Optuna
Hyperparameter Tuning
Automatic hyperparameter optimization framework with efficient search strategies.
TF Serving
Model Serving
High-performance serving system for TensorFlow models with REST and gRPC APIs.
Triton
Inference Server
NVIDIA's inference server supporting multiple frameworks with GPU optimization.
BentoML
Model Serving
Framework-agnostic model serving with built-in adaptive batching.
Seldon Core
K8s Model Serving
Deploy ML models on Kubernetes with A/B testing, canary deployments, and explainability.
Evidently AI
ML Monitoring
Open-source tool for ML model monitoring, data drift detection, and quality reports.
Arize AI
ML Observability
ML observability platform for monitoring, troubleshooting, and explainability.
whylogs
Data Logging
Lightweight data logging library for ML pipelines with statistical profiling.
Apache Airflow
Workflow Orchestration
Platform to programmatically author, schedule, and monitor data and ML workflows.
Prefect
Workflow Orchestration
Modern workflow orchestration with dynamic DAGs, automatic retries, and observability.
Kubeflow Pipelines
ML Pipelines
Kubernetes-native ML pipeline platform for building and deploying portable ML workflows.
ZenML
MLOps Framework
Open-source MLOps framework for creating portable, production-ready ML pipelines.
SageMaker
AWS ML Platform
Fully managed ML platform covering the entire lifecycle from data prep to deployment.
Vertex AI
GCP ML Platform
Google Cloud's unified ML platform with AutoML, custom training, and model monitoring.
Azure ML
Azure ML Platform
Enterprise-grade ML platform with responsible AI tools and MLOps capabilities.
Databricks
Unified Analytics
Lakehouse platform combining data engineering, ML, and analytics with MLflow integration.

Core Principles & Anti-Patterns

✓ Principles
Reproducibility First
Every model must be reproducible from code + data + config. If you can't reproduce it, you can't debug it.
Automate Everything Repeatable
If you do it more than twice, automate it. Manual steps are error-prone and don't scale.
Monitor Proactively
Don't wait for users to report problems. Instrument everything and set alerts before issues become incidents.
Treat Data as Code
Version your data, test your data, review your data changes. Data bugs are harder to find than code bugs.
Fail Fast, Rollback Faster
Design for failure. Automated rollback should be faster than any human response.
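
Reproducibility First and Treat Data as Code both come down to being able to name the exact code, data, and config behind a model. One lightweight approach, sketched here with a hypothetical `run_fingerprint` helper, is to hash the three together and store the result alongside the model artifact.

```python
import hashlib
import json


def run_fingerprint(code_version: str, data_bytes: bytes, config: dict) -> str:
    """SHA-256 over the code version, a hash of the data, and canonical config JSON."""
    data_hash = hashlib.sha256(data_bytes).hexdigest()
    # sort_keys makes the JSON canonical, so key order does not change the hash
    canonical_config = json.dumps(config, sort_keys=True, separators=(",", ":"))
    payload = "\n".join([code_version, data_hash, canonical_config])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs with the same fingerprint should produce the same model, up to any nondeterminism in training; if they do not, something outside the three inputs (a random seed, a library version) is leaking in and should be added to the config.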
✗ Anti-Patterns
The Notebook Trap
Keeping production models in Jupyter notebooks. Notebooks are for exploration, not production systems.
Training-Serving Skew
Using different feature computation logic in training vs. serving. This is one of the most common silent failure modes in ML systems.
Evaluation on Stale Data
Evaluating models on data that was available during training, which inflates offline scores. Always use truly held-out test sets, split by time when the data is time-dependent.
Ignoring Feedback Loops
Deploying a model that influences the data it will be retrained on without accounting for this in evaluation.
One-Shot Deployment
Deploying directly to 100% traffic without canary or shadow testing. Always validate on a subset first.
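
Validating on a subset first requires a stable way to decide which requests see the new model. A common approach is deterministic hash bucketing, sketched below; the `in_canary` function and the `salt` value are hypothetical, and real serving platforms usually provide traffic splitting of their own.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float,
              salt: str = "model-v2-rollout") -> bool:
    """Deterministically assign a user to the canary via a hash bucket in [0, 10000)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # canary_percent=10 routes buckets 0..999, i.e. about 10% of users
    return bucket < canary_percent * 100
```

Because the assignment depends only on the salt and the user id, a user stays in the same group for the whole rollout, and raising `canary_percent` only adds users to the canary without reshuffling existing ones.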