ML Model Serving as REST API Using Clipper

A comprehensive guide to deploying machine learning models as scalable REST APIs using Clipper, the low-latency prediction serving system designed for production ML workloads.

Gonnect Team
January 14, 2024 · 12 min read

Python · Clipper · Docker · Machine Learning · REST API · scikit-learn

The Model Serving Challenge

Building accurate machine learning models is only half the battle. The real challenge lies in deploying those models to production, where they must serve predictions at scale, with low latency and high reliability. At Gonnect, we've seen many organizations struggle with this transition: their models perform brilliantly in notebooks but falter when exposed to real-world traffic.

Clipper addresses this fundamental challenge by providing a prediction serving system that sits between your applications and ML models, managing the complexity of model deployment, versioning, and scaling.

What is Clipper?

Clipper is a low-latency prediction serving system developed at UC Berkeley's RISE Lab. It provides a general-purpose platform that:

  • Exposes ML models as REST APIs without custom server code
  • Supports multiple ML frameworks (scikit-learn, TensorFlow, PyTorch, XGBoost)
  • Enables online model updates and A/B testing
  • Implements intelligent caching and batching for performance
  • Provides fault tolerance and model versioning

The key philosophy behind Clipper is framework agnosticism - data scientists can use any ML library they prefer, and Clipper handles the serving infrastructure.

Architecture Overview

Clipper's architecture separates concerns into distinct layers:

                    ┌─────────────────────────────────────┐
                    │         Application Layer           │
                    │    (REST API / gRPC Clients)        │
                    └─────────────────┬───────────────────┘
                                      │
                    ┌─────────────────▼───────────────────┐
                    │         Clipper Frontend            │
                    │  - Request routing                  │
                    │  - Caching                          │
                    │  - Batching                         │
                    │  - Model selection                  │
                    └─────────────────┬───────────────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         │                            │                            │
┌────────▼────────┐        ┌─────────▼─────────┐        ┌────────▼────────┐
│  Model Container │        │  Model Container  │        │  Model Container │
│  (scikit-learn)  │        │   (TensorFlow)    │        │    (PyTorch)     │
└─────────────────┘        └───────────────────┘        └─────────────────┘

Key Components

  1. Query Frontend: Receives prediction requests via REST API
  2. Model Containers: Docker containers running individual models
  3. Clipper Manager: Handles model deployment and lifecycle management
  4. Selection Policy: Routes requests to appropriate model versions

HR Hiring Prediction: A Practical Example

The Clipper project demonstrates model serving with a practical HR use case - predicting hiring decisions based on candidate attributes. This involves:

  • A Decision Tree classifier trained on HR data
  • REST API endpoints for real-time predictions
  • Docker containerization for deployment

The HR Dataset

import pandas as pd

# Load HR training data
hr_data = pd.read_csv('HR.csv')

# Features typically include:
# - Years of experience
# - Education level
# - Technical skills assessment
# - Interview scores
# - Previous employment history
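Before training, the raw frame has to be split into a feature matrix and a label vector. A minimal sketch, using hypothetical column names that mirror the feature comments above (the actual HR.csv columns may differ, and the inline rows stand in for the real file):

```python
import pandas as pd

# Hypothetical rows mirroring the feature comments above;
# in practice this frame comes from pd.read_csv('HR.csv')
hr_data = pd.DataFrame({
    'years_experience': [5.0, 2.0, 8.0],
    'education_level':  [3.0, 2.0, 4.0],
    'skill_score':      [85.0, 70.0, 95.0],
    'interview_rating': [4.5, 3.0, 5.0],
    'previous_jobs':    [2.0, 1.0, 3.0],
    'hired':            [1, 0, 1],  # assumed label column
})

# Feature matrix X and label vector y for the classifier below
feature_cols = ['years_experience', 'education_level', 'skill_score',
                'interview_rating', 'previous_jobs']
X = hr_data[feature_cols].values
y = hr_data['hired'].values
```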

Training the Decision Tree Model

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pickle

class HRDecisionTree:
    """HR Hiring Decision Tree Classifier"""

    def __init__(self):
        self.model = DecisionTreeClassifier(
            max_depth=10,
            min_samples_split=5,
            min_samples_leaf=2,
            random_state=42
        )

    def train(self, X, y):
        """Train the model on HR data"""
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        self.model.fit(X_train, y_train)

        # Evaluate accuracy
        accuracy = self.model.score(X_test, y_test)
        print(f"Model accuracy: {accuracy:.4f}")

        return self

    def predict(self, features):
        """Make hiring predictions"""
        return self.model.predict(features)

    def save(self, path='hr_model.pkl'):
        """Serialize model for deployment"""
        with open(path, 'wb') as f:
            pickle.dump(self.model, f)
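End to end, the class above boils down to the following train/evaluate/serialize flow - sketched here with synthetic data so it runs standalone (in practice X and y come from the HR dataset):

```python
import pickle
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR feature matrix: 5 features per candidate
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = (X[:, 2] > 0.5).astype(int)  # toy rule: hire when the skill score is high

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(max_depth=10, min_samples_split=5,
                               min_samples_leaf=2, random_state=42)
model.fit(X_train, y_train)
print(f"Model accuracy: {model.score(X_test, y_test):.4f}")

# Serialize for deployment, exactly as HRDecisionTree.save() does
with open('hr_model.pkl', 'wb') as f:
    pickle.dump(model, f)
```

The resulting hr_model.pkl is what gets loaded and wrapped in a prediction function for Clipper in the deployment workflow below.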

Connecting to Clipper

The ClipperConnection class manages the interaction between your application and the Clipper serving system:

from clipper_admin import ClipperConnection, DockerContainerManager

class ClipperModelServer:
    """Manages Clipper model deployment and serving"""

    def __init__(self, host='localhost'):
        self.clipper_conn = ClipperConnection(
            DockerContainerManager()
        )
        self.host = host

    def start_clipper(self):
        """Initialize Clipper cluster"""
        self.clipper_conn.start_clipper()
        print("Clipper started successfully")
        print(f"Query endpoint: http://{self.host}:1337")
        print(f"Management endpoint: http://{self.host}:1338")

    def register_application(self, name, input_type, default_output, slo_micros):
        """Register a new application with Clipper"""
        self.clipper_conn.register_application(
            name=name,
            input_type=input_type,
            default_output=default_output,
            slo_micros=slo_micros  # Service Level Objective in microseconds
        )
        print(f"Application '{name}' registered")

    def deploy_model(self, name, version, input_type, func, pkgs_to_install):
        """Deploy a Python model to Clipper"""
        from clipper_admin.deployers import python as python_deployer

        python_deployer.deploy_python_closure(
            self.clipper_conn,
            name=name,
            version=version,
            input_type=input_type,
            func=func,
            pkgs_to_install=pkgs_to_install
        )
        print(f"Model '{name}' version {version} deployed")

    def link_model_to_app(self, app_name, model_name):
        """Connect a model to an application"""
        self.clipper_conn.link_model_to_app(
            app_name=app_name,
            model_name=model_name
        )
        print(f"Model '{model_name}' linked to app '{app_name}'")

Deploying the HR Model

Here's a complete workflow for deploying the HR hiring prediction model:

import pickle
import numpy as np

# Initialize Clipper connection
server = ClipperModelServer()
server.start_clipper()

# Register the HR application
server.register_application(
    name='hr-hiring',
    input_type='doubles',
    default_output='-1.0',
    slo_micros=100000  # 100ms SLO
)

# Load the trained model
with open('hr_model.pkl', 'rb') as f:
    hr_model = pickle.load(f)

# Define the prediction function
def predict_hiring(inputs):
    """
    Clipper-compatible prediction function

    Args:
        inputs: List of feature arrays

    Returns:
        List of predictions as strings
    """
    predictions = hr_model.predict(inputs)
    return [str(pred) for pred in predictions]

# Deploy the model
server.deploy_model(
    name='hr-decision-tree',
    version='1',
    input_type='doubles',
    func=predict_hiring,
    pkgs_to_install=['scikit-learn']
)

# Link model to application
server.link_model_to_app('hr-hiring', 'hr-decision-tree')

print("HR Hiring model deployed and ready for predictions!")

REST API Usage

Once deployed, the model is accessible via REST API:

Making Predictions

# Predict hiring decision for a candidate
curl -X POST http://localhost:1337/hr-hiring/predict \
  -H "Content-Type: application/json" \
  -d '{
    "input": [5.0, 3.0, 85.0, 4.5, 2.0]
  }'

Response Format

{
  "query_id": 1,
  "output": "1",
  "default": false
}

Python Client Example

import requests
import json

class HRPredictionClient:
    """Client for HR hiring predictions via Clipper"""

    def __init__(self, host='localhost', port=1337):
        self.base_url = f"http://{host}:{port}"

    def predict(self, candidate_features):
        """
        Predict hiring decision for a candidate

        Args:
            candidate_features: List of numeric features
                - years_experience
                - education_level
                - skill_score
                - interview_rating
                - previous_jobs

        Returns:
            dict: Prediction result
        """
        url = f"{self.base_url}/hr-hiring/predict"
        payload = {"input": candidate_features}

        response = requests.post(
            url,
            headers={"Content-Type": "application/json"},
            data=json.dumps(payload)
        )

        return response.json()

    def batch_predict(self, candidates):
        """Predict hiring decisions for multiple candidates"""
        results = []
        for candidate in candidates:
            result = self.predict(candidate)
            results.append(result)
        return results

# Usage
client = HRPredictionClient()

# Single prediction
result = client.predict([5.0, 3.0, 85.0, 4.5, 2.0])
print(f"Hiring decision: {'Hire' if result['output'] == '1' else 'No Hire'}")

# Batch predictions
candidates = [
    [5.0, 3.0, 85.0, 4.5, 2.0],
    [2.0, 2.0, 70.0, 3.0, 1.0],
    [8.0, 4.0, 95.0, 5.0, 3.0]
]
batch_results = client.batch_predict(candidates)

Docker Containerization

Clipper uses Docker containers to isolate and scale model serving:

Model Container Dockerfile

FROM python:3.8-slim

# Install dependencies
RUN pip install --no-cache-dir \
    scikit-learn \
    numpy \
    pandas \
    clipper_admin

# Copy model files
COPY hr_model.pkl /app/
COPY hr_prediction_service.py /app/

WORKDIR /app

# Expose Clipper's default port
EXPOSE 1337

CMD ["python", "hr_prediction_service.py"]

Docker Compose for Full Stack

version: '3.8'

services:
  clipper-query-frontend:
    image: clipper/query_frontend:latest
    ports:
      - "1337:1337"
    networks:
      - clipper-network
    environment:
      - CLIPPER_MANAGEMENT_PORT=1338

  clipper-mgmt-frontend:
    image: clipper/management_frontend:latest
    ports:
      - "1338:1338"
    networks:
      - clipper-network

  redis:
    image: redis:6-alpine
    ports:
      - "6379:6379"
    networks:
      - clipper-network

  hr-model:
    build: ./hr_model
    networks:
      - clipper-network
    depends_on:
      - clipper-query-frontend
      - redis

networks:
  clipper-network:
    driver: bridge

Model Versioning and Updates

Clipper supports seamless model updates without downtime:

def update_model(server, new_model_path):
    """Deploy a new version of the HR model"""

    # Load improved model
    with open(new_model_path, 'rb') as f:
        improved_model = pickle.load(f)

    def improved_predict(inputs):
        predictions = improved_model.predict(inputs)
        return [str(pred) for pred in predictions]

    # Deploy new version (version 2)
    server.deploy_model(
        name='hr-decision-tree',
        version='2',
        input_type='doubles',
        func=improved_predict,
        pkgs_to_install=['scikit-learn']
    )

    # Traffic automatically shifts to new version
    print("Model updated to version 2")
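If the new version misbehaves, traffic can be pointed back at an earlier version without rebuilding anything. A sketch, assuming clipper_admin's set_model_version call (the rollback_model helper name is ours):

```python
def rollback_model(clipper_conn, name='hr-decision-tree', version='1'):
    """Route live traffic back to an earlier model version.

    Only Clipper's routing changes; the old version's containers are
    still registered, so the rollback takes effect immediately.
    """
    clipper_conn.set_model_version(name=name, version=version)
    print(f"'{name}' rolled back to version {version}")
```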

A/B Testing Multiple Models

# Deploy multiple model versions for comparison
def setup_ab_test(server):
    """Configure A/B testing between model versions"""

    # Version 1: Decision Tree
    server.deploy_model(
        name='hr-model-dt',
        version='1',
        input_type='doubles',
        func=decision_tree_predict,
        pkgs_to_install=['scikit-learn']
    )

    # Version 2: Random Forest
    server.deploy_model(
        name='hr-model-rf',
        version='1',
        input_type='doubles',
        func=random_forest_predict,
        pkgs_to_install=['scikit-learn']
    )

    # Clipper can route traffic between models
    # based on configurable policies

Performance Optimization

Clipper includes several built-in optimizations:

Adaptive Batching

# Clipper automatically batches requests for efficiency
# Configure batch parameters when starting Clipper

clipper_conn.start_clipper(
    cache_size=33554432,      # LRU prediction cache size in bytes (32 MB)
    num_frontend_replicas=2,  # Scale out the query frontend
)

Caching Configuration

# Prediction caching is configured cluster-wide, not per application:
# size the LRU cache via cache_size when starting Clipper. Repeated
# queries with identical inputs are then answered from the cache
# without touching the model containers.
clipper_conn.start_clipper(
    cache_size=33554432  # 32 MB prediction cache (the default size)
)

Latency Monitoring

import time

def benchmark_predictions(client, num_requests=100):
    """Measure prediction latency"""

    test_input = [5.0, 3.0, 85.0, 4.5, 2.0]
    latencies = []

    for _ in range(num_requests):
        start = time.time()
        client.predict(test_input)
        latency = (time.time() - start) * 1000  # ms
        latencies.append(latency)

    avg_latency = sum(latencies) / len(latencies)
    p99_latency = sorted(latencies)[int(0.99 * len(latencies))]

    print(f"Average latency: {avg_latency:.2f}ms")
    print(f"P99 latency: {p99_latency:.2f}ms")

Error Handling and Resilience

Default Predictions

Clipper returns default predictions when models fail:

# Register with meaningful default
server.register_application(
    name='hr-hiring',
    input_type='doubles',
    default_output='{"status": "pending_review", "confidence": 0.0}',
    slo_micros=100000
)
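On the client side it pays to check the "default" flag before acting on the output - otherwise a degraded cluster silently looks like a stream of identical predictions. A sketch (the interpret_prediction helper and decision labels are ours):

```python
def interpret_prediction(response_json):
    """Distinguish a real model output from Clipper's fallback default.

    When a model container misses its latency SLO or is down, Clipper
    responds with the registered default_output and sets "default": true.
    """
    if response_json.get('default'):
        # Fall back to a safe business decision instead of trusting the output
        return {'decision': 'pending_review', 'from_model': False}
    return {'decision': 'hire' if response_json['output'] == '1' else 'no_hire',
            'from_model': True}

# Example payloads in Clipper's response format
model_ok = {'query_id': 1, 'output': '1', 'default': False}
model_down = {'query_id': 2, 'output': '-1.0', 'default': True}
print(interpret_prediction(model_ok))
print(interpret_prediction(model_down))
```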

Health Monitoring

def check_clipper_health(conn):
    """Monitor Clipper cluster health"""

    apps = conn.get_all_apps()
    models = conn.get_all_models()
    containers = conn.get_all_model_replicas()

    print(f"Registered applications: {len(apps)}")
    print(f"Deployed models: {len(models)}")
    print(f"Active containers: {len(containers)}")

    # Check each container's status
    for container in containers:
        print(f"  - {container['model_name']}:{container['model_version']} "
              f"Status: {container['status']}")

Production Deployment Best Practices

1. Resource Allocation

# Kubernetes deployment for production
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clipper-hr-model
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: hr-model
          image: hr-decision-tree:v1
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"

2. Logging and Observability

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger('clipper-hr-service')

def logged_prediction(inputs):
    """Prediction function with logging"""
    logger.info(f"Received prediction request: {len(inputs)} samples")

    start_time = time.time()
    predictions = hr_model.predict(inputs)
    inference_time = time.time() - start_time

    logger.info(f"Prediction completed in {inference_time:.4f}s")

    return [str(pred) for pred in predictions]

3. Graceful Shutdown

def cleanup_clipper(conn):
    """Clean shutdown of Clipper cluster"""

    # Unlink models from applications
    conn.unlink_model_from_app('hr-hiring', 'hr-decision-tree')

    # Stop model containers
    conn.stop_models('hr-decision-tree')

    # Stop Clipper
    conn.stop_all()

    print("Clipper shutdown complete")

Clipper vs. Other Serving Solutions

| Feature           | Clipper | Seldon | TensorFlow Serving | TorchServe   |
| ----------------- | ------- | ------ | ------------------ | ------------ |
| Multi-framework   | Yes     | Yes    | TF only            | PyTorch only |
| REST API          | Yes     | Yes    | Yes                | Yes          |
| gRPC              | No      | Yes    | Yes                | Yes          |
| A/B Testing       | Yes     | Yes    | Manual             | Manual       |
| Adaptive Batching | Yes     | Yes    | Yes                | Yes          |
| Caching           | Yes     | No     | No                 | No           |
| Python-native     | Yes     | Yes    | No                 | Yes          |

Conclusion

Clipper provides a practical solution for deploying ML models as REST APIs with minimal infrastructure code. Its key strengths include:

  1. Framework Agnosticism: Deploy models from any Python ML library
  2. Simple REST Interface: Standard HTTP endpoints for predictions
  3. Built-in Optimization: Automatic batching and caching
  4. Model Management: Version control and seamless updates
  5. Container-based Scaling: Docker-native deployment

The HR hiring prediction example demonstrates how to transform a scikit-learn model into a production-ready service. The same patterns apply to any ML use case - from image classification to recommendation systems.

For teams seeking to bridge the gap between data science experimentation and production deployment, Clipper offers a lightweight yet powerful approach. It handles the operational complexity of model serving, allowing data scientists to focus on what they do best - building accurate and impactful models.

The containerized approach ensures portability across environments, from local development to cloud-native Kubernetes deployments. Whether you're serving a single model or managing a fleet of ML services, Clipper provides the foundation for reliable, scalable prediction APIs.


Explore the implementation at github.com/mgorav/clipper and adapt these patterns for your ML model serving requirements.