MLOps — Complete Field Guide

01 — FOUNDATION

What is MLOps?

MLOps (Machine Learning Operations) is a set of practices that combines Machine Learning, DevOps, and Data Engineering to deploy and maintain ML systems in production reliably and efficiently.

It bridges the gap between experimental model development and robust production systems — addressing the unique challenges that arise when software systems learn from data and change behavior over time.

Unlike traditional software, ML systems can silently degrade as the world changes around them. MLOps provides the infrastructure, tooling, and culture to detect that degradation and respond to it.

⚙️

DevOps

CI/CD, automation, infrastructure as code, reliability engineering

🧠

Machine Learning

Model development, experimentation, evaluation, versioning

🗄️

Data Engineering

Pipelines, feature stores, data quality, lineage tracking

02 — LIFECYCLE

The MLOps Lifecycle


01 Problem Definition
02 Data Engineering
03 Model Training
04 Evaluation & Validation
05 Deployment & Serving
06 Monitoring & Alerting
07 Feedback & Retraining (feedback loop)

Problem Definition

The most critical — and most overlooked — stage. Before writing a single line of code, teams must rigorously define what success looks like.

Key Questions

Is ML the right solution?
What is the business metric we're optimizing?
What data do we have / need?
What's the acceptable latency and cost?
How will the model be used in production?

Deliverables

ML problem framing document
Success metrics (offline + online)
Data availability assessment
Feasibility study
Baseline performance target

Common Pitfalls

Optimizing proxy metrics, not business goals
Skipping feasibility analysis
No defined "good enough" threshold
Ignoring data collection costs

03 — AUTOMATION

CI/CD for Machine Learning

ML CI/CD extends traditional software pipelines with data validation, model testing, and automated deployment gates.

📝

Code Commit

Push model code, config, or data changes to version control

→

🧪

Unit Tests

Test data transforms, feature logic, model components

→

✅

Data Validation

Schema checks, distribution tests, quality gates

→

🏋️

Model Training

Triggered training run on validated data

→

📊

Evaluation Gate

Performance vs. baseline, bias checks, latency profiling

→

🚀

Deploy to Staging

Shadow mode or canary on staging environment

→

🌐
Production
Gradual rollout with automated rollback triggers
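
The Data Validation stage above (schema checks and quality gates) can be sketched in plain Python. This is a minimal illustration, not the API of any particular validation library: the `SCHEMA` layout, the column names, the `validate_rows` helper, and the 5% null-rate gate are all hypothetical choices made for the example.

```python
from typing import Any

# Hypothetical schema: column name -> (expected type, min, max). Illustrative only.
SCHEMA: dict[str, tuple[type, float, float]] = {
    "age": (int, 0, 120),
    "amount": (float, 0.0, 1_000_000.0),
}
MAX_NULL_RATE = 0.05  # assumed quality gate: at most 5% missing values per column


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable violations; an empty list means the gate passes."""
    errors: list[str] = []
    for column, (expected_type, low, high) in SCHEMA.items():
        values = [row.get(column) for row in rows]
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > MAX_NULL_RATE:
            errors.append(f"{column}: null rate {nulls / len(rows):.0%} exceeds gate")
        for v in values:
            if v is None:
                continue
            if not isinstance(v, expected_type):
                errors.append(f"{column}: {v!r} is not {expected_type.__name__}")
            elif not low <= v <= high:
                errors.append(f"{column}: {v!r} outside [{low}, {high}]")
    return errors
```

A pipeline step could exit with a non-zero status whenever the returned list is non-empty, which stops the run before any training happens.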

YAML · GitHub Actions ML Pipeline

name: ML Training Pipeline
on:
  push:
    paths: ['src/**', 'data/**', 'configs/**']

jobs:
  validate-and-train:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the scripts below are available
      - uses: actions/checkout@v4

      - name: Data Validation
        run: python validate_data.py --config configs/schema.yaml

      - name: Run Training
        run: python train.py --experiment ${{ github.sha }}

      # The step needs an id so later steps can read its outputs;
      # evaluate.py must write is_better=true or is_better=false to $GITHUB_OUTPUT
      - name: Evaluate vs Champion
        id: evaluate
        run: python evaluate.py --challenger ${{ github.sha }}

      - name: Deploy if Better
        if: steps.evaluate.outputs.is_better == 'true'
        run: python deploy.py --strategy canary --traffic 10
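
The "Deploy if Better" condition reads a step output named `is_better`. As a rough sketch of how an evaluation script might produce it: the metric names (`auc`, `p95_latency_ms`), the `min_gain` and `max_latency_ms` thresholds, and the hard-coded metric values below are assumptions for illustration, not the contents of the real `evaluate.py`. The one piece taken from GitHub Actions itself is that step outputs are set by appending `name=value` lines to the file named by the `GITHUB_OUTPUT` environment variable.

```python
import os


def is_better(challenger: dict[str, float], champion: dict[str, float],
              min_gain: float = 0.01, max_latency_ms: float = 100.0) -> bool:
    """Gate: challenger must beat the champion's AUC by min_gain and stay within the latency budget."""
    gain = challenger["auc"] - champion["auc"]
    return gain >= min_gain and challenger["p95_latency_ms"] <= max_latency_ms


def write_step_output(name: str, value: str) -> None:
    """Append name=value to the file GitHub Actions exposes via GITHUB_OUTPUT."""
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path is None:  # running locally, outside Actions
        print(f"{name}={value}")
        return
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


if __name__ == "__main__":
    # Hard-coded metrics stand in for values loaded from an experiment tracker.
    champion = {"auc": 0.91, "p95_latency_ms": 40.0}
    challenger = {"auc": 0.93, "p95_latency_ms": 55.0}
    write_step_output("is_better", str(is_better(challenger, champion)).lower())
```

Requiring a minimum gain rather than any improvement keeps noise in the evaluation set from promoting a challenger that is not meaningfully better.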

04 — OBSERVABILITY

Monitoring & Drift Detection

The three types of drift that silently kill production models.

Data Drift

aka Covariate Shift

The statistical distribution of input features changes over time, even if the underlying relationship between features and labels remains the same.

Example

A fraud detection model trained on pre-pandemic spending patterns sees completely different transaction distributions post-pandemic.

Detection Methods

KS Test · PSI Score · Chi-Squared · MMD
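
Of the methods listed, the Population Stability Index is the simplest to compute by hand. The sketch below uses only the standard library and makes several assumptions that are not from this guide: bin edges are taken at quantiles of the reference sample, 10 bins are used by default, and empty bins are floored at 1e-6 to avoid taking the log of zero.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a current sample."""
    ref_sorted = sorted(reference)
    # Bin edges at reference quantiles, so each bin holds roughly equal reference mass.
    edges = [ref_sorted[int(len(ref_sorted) * i / bins)] for i in range(1, bins)]

    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bin index = number of edges below x
        # Small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    expected, actual = proportions(reference), proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift, though suitable thresholds depend on the feature and the sample size.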

Concept Drift

aka Concept Shift (distinct from label shift, which is a change in the label distribution alone)

The relationship between input features and the target variable changes. The model's learned mapping becomes incorrect even if inputs look similar.

Example

A sentiment model trained on 2020 tweets misclassifies 2024 slang because language semantics have evolved.

Detection Methods

DDM · ADWIN · Page-Hinkley · EDDM
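
As one illustration of the detectors listed, here is a minimal sketch of the Page-Hinkley test applied to a stream of per-prediction errors (1.0 for a wrong prediction, 0.0 for a correct one). It only watches for an upward shift in the mean, and the default `delta` and `threshold` values are assumptions that would need tuning for a real stream.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream (e.g. error rate)."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0) -> None:
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (often written lambda)
        self.n = 0
        self.mean = 0.0             # running mean of the stream
        self.cumulative = 0.0       # cumulative deviation m_t
        self.minimum = 0.0          # smallest m_t seen so far

    def update(self, x: float) -> bool:
        """Feed one observation; return True when drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold
```

Because the detector consumes errors, it needs ground-truth labels to arrive at some point; with delayed labels, the alarm is delayed by the same amount.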

Prediction Drift

aka Output Shift

The distribution of model outputs changes significantly. Often the first observable signal of upstream data or concept drift.

Example

A recommendation model starts suggesting the same 10 items to everyone — prediction distribution collapses.

Detection Methods

Output Histogram · Entropy Monitor · Confidence Tracking
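
An entropy monitor of the kind listed above can be sketched in a few lines. The collapse described in the example (everyone receiving the same 10 items) shows up as a sharp drop in the entropy of the output distribution. The `has_collapsed` helper and its `min_ratio` of 0.5 are hypothetical; a real monitor would calibrate the threshold against historical variation.

```python
import math
from collections import Counter
from collections.abc import Hashable, Iterable


def prediction_entropy(predictions: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of model outputs."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def has_collapsed(baseline: float, current: float, min_ratio: float = 0.5) -> bool:
    """Flag when current entropy falls below a fraction of the baseline entropy."""
    return current < min_ratio * baseline
```

For example, 1,000 distinct recommendations give an entropy of about 9.97 bits, while cycling through only 10 items gives about 3.32 bits, well under half the baseline.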

05 — MATURITY

MLOps Maturity Model

A four-level framework for assessing and evolving your MLOps practice, adapted from Google's MLOps maturity levels (0 to 2) with an added level for full automation.

L0

Manual Process

All steps are manual. Data scientists train models in notebooks, export them, and hand off to engineers for deployment. No automation, no monitoring.

Manual training · No versioning · No monitoring · Rare releases

L1

ML Pipeline Automation

Training pipelines are automated and triggered on new data. Experiment tracking is in place. Models are versioned. Deployment is still semi-manual.

Automated training · Experiment tracking · Model registry · Manual deployment

L2

CI/CD Pipeline Automation

Full CI/CD for ML pipelines. Automated testing, validation gates, and deployment. Models are continuously trained and deployed with minimal human intervention.

Full CI/CD · Auto deployment · Automated testing · Basic monitoring

L3

Full MLOps Automation

End-to-end automation including drift detection, automated retraining triggers, champion/challenger evaluation, and self-healing pipelines. Humans set policy; systems execute.

Drift-triggered retraining · Auto champion/challenger · Full observability · Self-healing pipelines
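
"Humans set policy; systems execute" can be made concrete with a small sketch in which the policy is plain data and the decision is a pure function. Every field name and default threshold in `RetrainPolicy` is an illustrative assumption, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    """Human-set thresholds; every field here is an illustrative assumption."""
    max_psi: float = 0.25            # drift score above which retraining is triggered
    max_accuracy_drop: float = 0.03  # tolerated drop versus accuracy at deployment
    min_new_labels: int = 1_000      # do not retrain until enough fresh labels exist


def should_retrain(policy: RetrainPolicy, psi: float,
                   accuracy_drop: float, new_labels: int) -> bool:
    """Retrain when drift or degradation breaches policy and enough new data exists."""
    degraded = psi > policy.max_psi or accuracy_drop > policy.max_accuracy_drop
    return degraded and new_labels >= policy.min_new_labels
```

Keeping the policy in a versioned, reviewable object means a threshold change goes through the same review as a code change, while the pipeline applies it without manual intervention.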

06 — ECOSYSTEM

The MLOps Tools Landscape

The tools that power modern ML systems, listed with their categories.

DVC · Data Versioning

Git for data and models. Version datasets, track experiments, and create reproducible ML pipelines.

Great Expectations · Data Validation

Define, document, and validate data quality expectations as code.

Feast · Feature Store

Open-source feature store for operational ML. Consistent features across training and serving.

LakeFS · Data Lake Versioning

Git-like branching and versioning for data lakes at petabyte scale.

MLflow · Experiment Tracking

Open-source platform for the ML lifecycle: tracking, packaging, and deploying models.

Weights & Biases (W&B) · Experiment Tracking

Developer-first MLOps platform with rich visualizations, sweeps, and model registry.

Ray · Distributed Training

Unified framework for scaling ML workloads from laptop to cluster.

Optuna · Hyperparameter Tuning

Automatic hyperparameter optimization framework with efficient search strategies.

TF Serving · Model Serving

High-performance serving system for TensorFlow models with REST and gRPC APIs.

Triton · Inference Server

NVIDIA's inference server supporting multiple frameworks with GPU optimization.

BentoML · Model Serving

Framework-agnostic model serving with built-in adaptive micro-batching.

Seldon Core · K8s Model Serving

Deploy ML models on Kubernetes with A/B testing, canary deployments, and explainability.

Evidently AI · ML Monitoring

Open-source tool for ML model monitoring, data drift detection, and quality reports.

Arize AI · ML Observability

ML observability platform for monitoring, troubleshooting, and explainability.

WhyLogs · Data Logging

Lightweight data logging library for ML pipelines with statistical profiling.

Apache Airflow · Workflow Orchestration

Platform to programmatically author, schedule, and monitor data and ML workflows.

Prefect · Workflow Orchestration

Modern workflow orchestration with dynamic DAGs, automatic retries, and observability.

Kubeflow Pipelines · ML Pipelines

Kubernetes-native ML pipeline platform for building and deploying portable ML workflows.

ZenML · MLOps Framework

Open-source MLOps framework for creating portable, production-ready ML pipelines.

SageMaker · AWS ML Platform

Fully managed ML platform covering the entire lifecycle from data prep to deployment.

Vertex AI · GCP ML Platform

Google Cloud's unified ML platform with AutoML, custom training, and model monitoring.

Azure ML · Azure ML Platform

Enterprise-grade ML platform with responsible AI tools and MLOps capabilities.

Databricks · Unified Analytics

Lakehouse platform combining data engineering, ML, and analytics with MLflow integration.

07 — PRINCIPLES

Core Principles & Anti-Patterns

✓ Principles

Reproducibility First

Every model must be reproducible from code + data + config. If you can't reproduce it, you can't debug it.

Automate Everything Repeatable

If you do it more than twice, automate it. Manual steps are error-prone and don't scale.

Monitor Proactively

Don't wait for users to report problems. Instrument everything and set alerts before issues become incidents.

Treat Data as Code

Version your data, test your data, review your data changes. Data bugs are harder to find than code bugs.

Fail Fast, Rollback Faster

Design for failure. Automated rollback should be faster than any human response.
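
To make the "code + data + config" requirement of Reproducibility First concrete, one option is to hash all three into a single run identifier and store it with the model. The sketch below is one possible approach under stated assumptions: `run_fingerprint` is a hypothetical helper, the code version is taken to be a commit hash string, and the data is supplied as an iterable of byte chunks.

```python
import hashlib
import json
from collections.abc import Iterable


def run_fingerprint(code_version: str, data_chunks: Iterable[bytes], config: dict) -> str:
    """Hash code version + data bytes + config into one reproducibility ID."""
    digest = hashlib.sha256()
    digest.update(code_version.encode("utf-8"))
    digest.update(b"\x00")  # separator so adjacent sections cannot blur together
    for chunk in data_chunks:  # streaming keeps memory flat for large datasets
        digest.update(chunk)
    digest.update(b"\x00")
    # sort_keys makes the hash independent of dict insertion order
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()
```

For a dataset on disk, the chunks could come from `iter(lambda: fh.read(1 << 20), b"")` on a file opened in binary mode. Two runs with the same fingerprint used identical inputs, so any difference in results points to nondeterminism in training rather than to a changed input.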

✗ Anti-Patterns

The Notebook Trap

Keeping production models in Jupyter notebooks. Notebooks are for exploration, not production systems.

Training-Serving Skew

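Using different feature computation logic in training vs. serving. This is one of the most common silent failure modes in ML systems, because the model still returns predictions while its inputs no longer match what it was trained on.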

Evaluation on Stale Data

Evaluating models on data that was available during training. Always use truly held-out, time-ordered test sets.

Ignoring Feedback Loops

Deploying a model that influences the data it will be retrained on without accounting for this in evaluation.

One-Shot Deployment

Deploying directly to 100% traffic without canary or shadow testing. Always validate on a subset first.
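
Validating on a subset first requires a way to decide which requests reach the new model. A minimal sketch, assuming each request carries a stable identifier such as a user ID: hash the identifier into one of 100 buckets and send the lowest buckets to the canary. The function name and the bucket count are illustrative choices, not taken from any serving framework.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically send roughly canary_percent% of IDs to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in [0, 100)
    return bucket < canary_percent
```

Because the bucket depends only on the identifier, a given user sees a consistent model across requests, and raising the percentage from 10 to 25 keeps everyone already on the canary there while adding new buckets.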
